The Mathematica Journal
Volume 9, Issue 1

XML and Mathematica
Pavi Sandhu

Introduction

Starting with Version 4.2, Mathematica includes comprehensive support for Extensible Markup Language (XML), the metamarkup language developed by the World Wide Web Consortium (W3C) for describing structured documents and data. Using these features, you can do any of the following.

  • Import any XML document in the form of a Mathematica expression.
  • Analyze the contents of an XML document or transform its structure using Mathematica’s sophisticated programming and symbolic manipulation abilities.
  • Export the resulting expression back as an XML document to share it with other users and applications.
  • Save Mathematica notebooks in an XML format using the new NotebookML document type definition (DTD) defined for this purpose.
  • Import, export, and evaluate equations in MathML—the W3C standard for representing mathematical formulas on the web.
  • Export graphics in Scalable Vector Graphics (SVG) format—the W3C standard for representing graphics on the web.

These new features make Mathematica a powerful development tool for creating and processing XML documents. They ensure complete interoperability between Mathematica and other XML applications and between notebooks and other XML document formats. This article provides detailed information on Mathematica's XML-related capabilities in Version 4.2.

Native XML Formats

Mathematica has built-in support for four XML formats: MathML, SVG, NotebookML, and ExpressionML. More details about each of these formats are given below.

MathML

Mathematical Markup Language (MathML) is an XML format developed by the W3C for describing the structure and symbolic meaning of mathematical formulas. It provides a standard way of displaying mathematical notation in web pages. Version 4.1 of Mathematica included limited support for import and export of MathML. In Version 4.2, the support was greatly expanded with many new functions for generating and manipulating MathML and for converting between MathML and the expressions used internally by Mathematica to represent mathematics.

The new MathML features make Mathematica an excellent tool for authoring and editing MathML content. You can, for example, use Mathematica’s powerful typesetting system to create properly formatted equations and then copy and paste them in MathML format into an HTML document for display on the web. You can also import MathML equations from other applications and evaluate them using Mathematica.

SVG

Scalable Vector Graphics (SVG) is an XML format developed by the W3C for describing two-dimensional graphics. SVG images can be rescaled without loss of resolution and are usually much smaller than comparable JPEG or GIF images. SVG files can also be manipulated with a scripting language to produce dynamic and interactive graphics. Using Mathematica, you can export any graphic in a notebook directly to SVG format.

NotebookML

NotebookML is an XML format developed by Wolfram Research for representing Mathematica notebooks. The tags and attributes used in NotebookML are specified by an XML DTD and correspond closely to the structures used in notebook expressions. Using NotebookML, you can save your notebooks as well-formed XML documents and then import them back into Mathematica in a completely lossless way.

NotebookML provides a bridge between Mathematica and other XML applications, allowing you to use notebooks with standard web technologies such as Cascading Style Sheets (CSS) and Extensible Stylesheet Language Transformations (XSLT). For example, you can display a NotebookML document in a web browser by using a CSS style sheet to specify how each element in the document should be rendered. Support for MathML and SVG is built into NotebookML, in that you can choose to save all equations in a notebook as MathML and all graphics as SVG when saving the notebook as NotebookML.

ExpressionML

ExpressionML is a specialized subset of NotebookML. ExpressionML fragments can represent any Mathematica expression in an XML format. NotebookML uses ExpressionML fragments to represent Mathematica expressions that are embedded within a notebook structure.

SymbolicXML

What Is SymbolicXML?

SymbolicXML is the format used by Mathematica for representing XML documents. The conversion from XML to SymbolicXML translates the XML document into a Mathematica expression, while preserving its structure. Since both XML documents and Mathematica expressions have a tree structure, there is a natural mapping from one to the other. You can then manipulate the SymbolicXML expression using the standard techniques of Mathematica programming.

You can import XML data into Mathematica using the standard Import or ImportString functions. You can also control various details of the import process, such as how to treat whitespace, whether to recognize entities, or whether to validate against a DTD, by using the conversion options of the Import function.

The following command imports an XML data file into Mathematica.

Unless the file you are importing is in an XML format recognized by Mathematica, such as MathML, the result is a SymbolicXML expression, expr1. You can then manipulate the expression using standard Mathematica commands to produce another SymbolicXML expression, expr2.

Finally, you can export the result as an XML file using the standard Export function.

You can control various details of the export process, such as the format of the exported XML, using the conversion options of the Export function.
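The import, transform, and export cycle described above can be sketched as follows. This is only an illustration: the file names and the item and entry element names are hypothetical, the sample file is written first just to keep the example self-contained, and FileNameJoin assumes a version of Mathematica more recent than 4.2.

```mathematica
(* Write a small sample file so the example is self-contained (hypothetical data). *)
inFile = FileNameJoin[{$TemporaryDirectory, "items.xml"}];
Export[inFile, "<list><item id='1'>apple</item><item id='2'>pear</item></list>", "Text"];

(* Import the file as a SymbolicXML expression. *)
expr1 = Import[inFile, "XML"];

(* Transform it with an ordinary replacement rule: rename item elements to entry. *)
expr2 = expr1 /. XMLElement["item", attrs_, data_] :> XMLElement["entry", attrs, data];

(* Export the transformed expression as a new XML file. *)
outFile = FileNameJoin[{$TemporaryDirectory, "entries.xml"}];
Export[outFile, expr2, "XML"]
```

The single replacement rule is enough here because item elements are not nested inside one another. ReplaceAll does not descend into a part it has already replaced, so nested renames would need Replace with a level specification or a recursive function.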

The combination of SymbolicXML and Mathematica programming provides a useful alternative to other techniques for manipulating XML documents, such as XSLT transformations or the SAX or DOM APIs used with a low-level programming language such as Java. Mathematica gives you the same level of flexibility and control in processing XML documents, with the added advantage that you can draw on its support for symbolic manipulation and numerical computation to perform transformations that would be difficult or impossible with other methods.

For example, you can use pattern-matching techniques to extract specific parts of an XML document, perform numerical computations on the data, and then convert the results into 3D graphics for easy visualization. You can also define transformations to convert one type of XML application to another. For example, you can import a DocBook document as SymbolicXML and then convert it into XHTML format by defining suitable transformation rules to replace one set of element names with another set. For some specific examples of useful applications of SymbolicXML, see Transforming XML.
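As a small sketch of the first idea, the following extracts numeric data from an XML string by pattern matching and then computes with it. The readings and r element names and the data values are hypothetical, chosen only for illustration.

```mathematica
(* Hypothetical measurement data; the readings/r names are illustrative only. *)
xml = "<readings><r t='0'>1.5</r><r t='1'>2.5</r><r t='2'>4.0</r></readings>";
doc = ImportString[xml, "XML"];

(* Pattern matching: pick the string content of every r element
   and convert it to a number. *)
values = Cases[doc, XMLElement["r", _, {s_String}] :> ToExpression[s], Infinity];

(* Numerical computation on the extracted data. *)
Mean[values]
```

Because the extracted values are ordinary Mathematica numbers, they can be passed to any numerical or plotting function from this point on.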

Support for SymbolicXML is well integrated with NotebookML, ExpressionML, and MathML. For example, when importing an XML document as SymbolicXML, Mathematica recognizes if the document is in NotebookML, ExpressionML, or MathML format and automatically converts it into a notebook expression in the case of NotebookML, an expression in the case of ExpressionML, or a typeset box expression in the case of MathML. You can also override the default behavior and choose to import any of these XML flavors as SymbolicXML if you wish. There are also many kernel functions for quickly and easily converting between strings, boxes, or expressions on the one hand, and NotebookML, MathML, or SymbolicXML on the other.

Note that if you prefer to manipulate XML documents using Java directly, you can still do so using the J/Link add-on package. This package integrates Mathematica fully with Java, enabling you to call Java commands from Mathematica or to call Mathematica kernel functions from Java programs. You thus have access to both the computational abilities of Mathematica and the low-level programming features and classes of Java, and can combine the two as needed.

Representing Elements

Each element in an XML document corresponds to an XMLElement command in SymbolicXML. An XML expression of the form

has the following representation in SymbolicXML:

Each XMLElement command has three arguments:

  • The first argument specifies the name of the element.
  • The second argument specifies the attributes of the element. This argument is a list of zero or more rules, with each rule specifying a single attribute in the form: "attribute"->"value".
  • The third argument specifies the actual data contained in the element. This can be raw character data in the form of a string and/or child elements of the element being represented. Each child element is represented by its own XMLElement command. You can nest multiple XMLElement commands to the level necessary to replicate the nested structure of the original XML expression.

The names of all elements and attributes as well as any character data in the XML document are represented as strings in SymbolicXML. This is to prevent a large number of new symbols from being introduced into the Mathematica session, which could lead to possible naming conflicts.

Here is a simple XML fragment.

Here is the representation of this fragment in SymbolicXML.
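As an illustration, the following sketch uses a hypothetical one-element fragment (the book element and its id attribute are invented for this example) and compares the imported result with the same element written by hand in the XMLElement[name, attributes, data] form described above.

```mathematica
(* Hypothetical fragment; the element and attribute names are illustrative only. *)
xml = "<book id='b1'>Mathematica Guide</book>";

(* ImportString returns an XMLObject["Document"] wrapper;
   its second argument is the root element. *)
doc = ImportString[xml, "XML"];
root = doc[[2]]

(* The same element written by hand: name, list of attribute rules, list of content. *)
manual = XMLElement["book", {"id" -> "b1"}, {"Mathematica Guide"}];

(* Compare the imported element with the hand-written one. *)
root === manual
```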

Here is a more complicated XML expression, showing several levels of nesting.

Here is the corresponding SymbolicXML expression.
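A nested structure can be sketched in the same way. The library, book, title, and author names below are hypothetical; the XML is written without whitespace between elements so that the result does not depend on how whitespace is normalized on import.

```mathematica
(* Hypothetical nested XML, assembled from pieces for readability. *)
xml = "<library>" <>
   "<book id='b1'><title>XML Basics</title><author>A. Writer</author></book>" <>
   "</library>";

root = ImportString[xml, "XML"][[2]]

(* Hand-written equivalent: each child element is its own XMLElement
   inside the third argument of its parent. *)
manual = XMLElement["library", {}, {
    XMLElement["book", {"id" -> "b1"}, {
      XMLElement["title", {}, {"XML Basics"}],
      XMLElement["author", {}, {"A. Writer"}]}]}];

(* Compare the imported tree with the hand-written one. *)
root === manual
```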

Handling Namespaces

If a namespace is specified in an XML element, the syntax of the corresponding SymbolicXML expression is slightly more complex. The exact syntax depends on whether the namespace is specified implicitly, as a default namespace, or explicitly, using a namespace prefix.

Using a default namespace

For any element that lies within a default namespace, the XMLElement expression is the same as it would be if no namespace were specified. However, the element in which the default namespace is declared has its XMLElement expression modified, as shown in the following example.

Here is a simple XHTML document with a default namespace declared on the html element.

Here is the corresponding SymbolicXML expression.

Note that the XMLElement expression representing the html element has a complex structure. Its second argument is:

This statement accomplishes two things:

  • It identifies the attribute xmlns with the namespace defined by the URL http://www.w3.org/2000/xmlns/, as required by the XML specification.
  • It sets the value of the xmlns attribute to the URL http://www.w3.org/1999/xhtml, thus defining the default namespace.

In other words, when declaring a default namespace on an element, the syntax of the corresponding XMLElement structure is:

Here xmlns-url is the URL associated with the namespace of the xmlns attribute, and namespace-url is the URL of the default namespace being declared.
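The default-namespace form can be sketched by building such an element by hand. The page content below is hypothetical; the two URLs are the ones quoted above, and the variable names xmlnsURL and xhtmlURL are introduced only for this example.

```mathematica
xmlnsURL = "http://www.w3.org/2000/xmlns/";  (* namespace of the xmlns attribute itself *)
xhtmlURL = "http://www.w3.org/1999/xhtml";   (* default namespace being declared *)

(* The declaring element carries the attribute {xmlns-url, "xmlns"} -> namespace-url;
   its descendants are written as if no namespace were present. *)
html = XMLElement["html",
   {{xmlnsURL, "xmlns"} -> xhtmlURL},
   {XMLElement["head", {}, {XMLElement["title", {}, {"Test page"}]}],
    XMLElement["body", {}, {XMLElement["p", {}, {"Hello"}]}]}];

(* Serialize the expression back to XML text. *)
ExportString[html, "XML"]
```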

Using an explicit namespace prefix

If the namespace is specified explicitly on an element using a namespace prefix, the syntax of the SymbolicXML expression is modified, as shown in the following example.

Here is an XHTML document with some MathML markup embedded in it. The xmlns:m attribute in the math element binds the MathML namespace to the namespace prefix m. All the MathML element names are then written with this namespace prefix attached.

Here is the corresponding SymbolicXML expression.

There are two features to note here.

  • The first attribute of the XMLElement structure for the top-level math element is {"http://www.w3.org/2000/xmlns/","m"}->"http://www.w3.org/1998/Math/MathML". This associates the MathML namespace with the prefix m.
  • The XMLElement structure for each MathML element is of the form XMLElement[{url, element},{},{data}], where url identifies the MathML namespace. This is the SymbolicXML equivalent of writing an element name with the namespace prefix attached.
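Both features can be sketched with a hand-built fragment. The content (a single mi element) is hypothetical, and the variable names mathmlNS and xmlnsNS are introduced only for this example.

```mathematica
mathmlNS = "http://www.w3.org/1998/Math/MathML";
xmlnsNS = "http://www.w3.org/2000/xmlns/";

(* The attribute {xmlnsNS, "m"} -> mathmlNS binds the prefix m to the MathML namespace.
   Each element name is the pair {namespace URL, local name}. *)
math = XMLElement[{mathmlNS, "math"},
   {{xmlnsNS, "m"} -> mathmlNS},
   {XMLElement[{mathmlNS, "mi"}, {}, {"x"}]}];

(* Serialize the expression back to XML text. *)
ExportString[math, "XML"]
```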

Representing Other Objects

The XMLObject expression is used as a container for parts of an XML document other than elements, such as comments, processing instructions, or declarations. It is also used as a container for the entire document itself. This structure has the syntax XMLObject[object][data], where object describes the type of object being represented and data specifies the details of the object. There are six types of objects that can be specified as the first argument, each corresponding to a specific type of XML construct.

  • Declaration
  • Comment
  • Document
  • Doctype
  • ProcessingInstruction
  • CDATASection

Declaration

The XMLObject["Declaration"] expression is used to represent the XML declaration that typically appears at the start of an XML document. This has the following syntax.

These two options are allowed.

  • "Standalone" takes the value "yes" if the document does not depend on external markup declarations, such as an external DTD, and "no" otherwise.
  • "Encoding" specifies the character encoding used in the document. Not all encodings will be honored on export. If an encoding that Mathematica cannot export is specified, an error message is produced and the encoding is changed in the document.

Here is a typical XML declaration.

Here is the corresponding SymbolicXML expression.
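As a sketch, the declaration of a small hypothetical document can be read from the prolog of the imported expression and compared with a hand-written form. The "Version" option name used below is an assumption based on the usual SymbolicXML representation of the version pseudo-attribute.

```mathematica
(* Hypothetical document: an XML declaration followed by an empty root element. *)
xml = "<?xml version='1.0' encoding='UTF-8'?><root/>";
doc = ImportString[xml, "XML"];

(* The first argument of the document wrapper is the prolog,
   which is where the declaration is stored. *)
Cases[First[doc], XMLObject["Declaration"][___]]

(* A declaration written by hand. *)
decl = XMLObject["Declaration"]["Version" -> "1.0", "Encoding" -> "UTF-8"]
```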

Comment

The XMLObject["Comment"] expression is used to represent XML comments. It has the following syntax.

Here is an example of an XML comment.

Here is the corresponding SymbolicXML expression.
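For instance, a hypothetical comment such as <!-- reviewed on import --> could be written in SymbolicXML and embedded among the children of an element, as in this sketch (the note element is invented for the example).

```mathematica
(* The text between <!-- and --> becomes the single string argument. *)
comment = XMLObject["Comment"][" reviewed on import "];

(* A comment can sit among the children of an element. *)
note = XMLElement["note", {}, {comment, "Remember to validate."}];

(* Serialize the element, including the embedded comment. *)
ExportString[note, "XML"]
```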

Document

The most important XMLObject is XMLObject["Document"]. It is used as a container for the entire document and has the following syntax.

The prolog may contain an XMLObject["Declaration"], followed by optional comments, processing instructions, and DTD declarations. The epilog may contain processing instructions or comments.

Here is an example of a simple document consisting of an XML declaration, a comment, and a single element.

Here is the corresponding SymbolicXML expression.

The only option for XMLObject["Document"] is "Valid". This option is set automatically by the parser. If the document was validated on import and validation succeeded, then the option "Valid"->True will be included in the XMLObject expression. If validation was attempted but failed, then "Valid"->False will be included in the XMLObject. If validation was not attempted, then the option is omitted from the XMLObject expression.
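The three-part structure of a document (prolog, root element, epilog) can be sketched by assembling one by hand. The greeting element and the comment text are hypothetical, and the "Version" option name is an assumption based on the usual SymbolicXML form of the declaration.

```mathematica
(* Prolog: an XML declaration and a comment. *)
prolog = {XMLObject["Declaration"]["Version" -> "1.0"],
   XMLObject["Comment"][" A simple document "]};

(* Root element and empty epilog. *)
root = XMLElement["greeting", {}, {"Hello"}];
epilog = {};

doc = XMLObject["Document"][prolog, root, epilog];

(* Serialize the whole document. *)
ExportString[doc, "XML"]
```

No "Valid" option appears here because the expression was built by hand rather than produced by the parser.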

Doctype

The XMLObject["Doctype"] expression is used to represent XML document type declarations. It has the following syntax.

These three options are allowed.

  • "System" specifies the location of the DTD, either as a pathname in the local file system or as a URL.
  • "Public" specifies a standardized name that is used to publicly identify the DTD.
  • "Internal" specifies an internal DTD subset. Its value is a string that contains the data in the internal DTD subset.

Here is a Doctype declaration with all three options.

Here is the corresponding SymbolicXML expression.
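A sketch of such an expression is given below. It assumes that the first argument is the name of the root element, followed by the three options; the XHTML identifiers and the internal entity declaration are used only as familiar sample values.

```mathematica
(* Hypothetical Doctype with all three options (assumed argument order: name, options). *)
doctype = XMLObject["Doctype"]["html",
   "Public" -> "-//W3C//DTD XHTML 1.0 Strict//EN",
   "System" -> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd",
   "Internal" -> "<!ENTITY author \"A. Writer\">"];

(* Place it in the prolog of a document. *)
doc = XMLObject["Document"][{doctype}, XMLElement["html", {}, {}], {}];

(* Pattern matching retrieves the system identifier from the prolog. *)
Cases[First[doc], XMLObject["Doctype"][_, ___, "System" -> s_, ___] :> s]
```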

For more details on XML Doctype declarations, see the W3C XML specification at www.w3.org/TR/REC-xml.

ProcessingInstruction

The XMLObject["ProcessingInstruction"] expression is used to represent XML processing instructions. This has the following syntax.

It is common to use attribute-like syntax in processing instructions. These pseudo-attributes are not parsed but are returned as raw strings. Here is a processing instruction that specifies a style sheet.

Here is the corresponding SymbolicXML expression. Notice that the double quotes around the attribute values are escaped to distinguish them from the double quotes around the argument as a whole.
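As a sketch, a style sheet instruction such as <?xml-stylesheet type="text/css" href="style.css"?> (the file name style.css is hypothetical) takes the target as its first argument and the unparsed remainder as a single string.

```mathematica
(* Target first, then the raw pseudo-attribute text with its inner quotes escaped. *)
pi = XMLObject["ProcessingInstruction"]["xml-stylesheet",
   "type=\"text/css\" href=\"style.css\""];

(* Processing instructions usually appear in the prolog of a document. *)
doc = XMLObject["Document"][{pi}, XMLElement["page", {}, {"content"}], {}];

(* Serialize the document. *)
ExportString[doc, "XML"]
```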

CDATASection

The XMLObject["CDATASection"] expression is used to represent CDATA sections. CDATA is a W3C abbreviation for Character Data. CDATA sections are used in an XML document as a wrapper for raw character data, so that special characters such as " and < do not have to be escaped as &quot; and &lt;, respectively. They are typically used to enclose character data that would otherwise require a lot of escaping, such as programs or math expressions.

Here is a simple fragment from an XML document containing a CDATA section.

Here is the corresponding SymbolicXML expression.

By default, CDATASection object wrappers are not preserved on import, and only the contents of the CDATA section are retained. To preserve the CDATASection wrappers, you must explicitly set the conversion option "PreserveCDATASections" -> True. More information about conversion options is given below.
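The difference between the two settings can be sketched as follows. The code element and its content are hypothetical. The option is passed directly to ImportString here, which assumes a recent version of Mathematica; in Version 4.2 the same setting would be wrapped as ConversionOptions -> {"PreserveCDATASections" -> True}.

```mathematica
(* Hypothetical element whose content would otherwise need escaping. *)
xml = "<code><![CDATA[a < b && b < c]]></code>";

(* Default import: only the character data is kept. *)
ImportString[xml, "XML"][[2]]

(* With the option set, the CDATASection wrapper is retained. *)
ImportString[xml, "XML", "PreserveCDATASections" -> True][[2]]

(* Hand-written form of an element with wrapped content. *)
wrapped = XMLElement["code", {}, {XMLObject["CDATASection"]["a < b && b < c"]}]
```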



     
Copyright © Wolfram Media, Inc. All rights reserved.