Volume 9, Issue 1

XML and Mathematica

Transforming XML

Introduction

Mathematica is uniquely suited for processing symbolic expressions because of its powerful pattern matching abilities and its large collection of built-in structural manipulation functions. This section provides a few examples to illustrate the use of Mathematica for processing XML data.

When you import an arbitrary XML document into Mathematica, it is automatically converted into a SymbolicXML expression. The advantage of converting XML data into SymbolicXML is that you can directly manipulate SymbolicXML using any of Mathematica’s built-in functions.

The following command converts an XML string into a SymbolicXML expression.
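As a minimal sketch of this step, the snippet below imports a short XML string with ImportString. The sample string and the variable names are illustrative assumptions, not the article's original input.

```mathematica
(* Hypothetical sample input: a list of three color elements *)
xmlString = "<colors><color>red</color><color>green</color><color>blue</color></colors>";

(* Parse the string as XML; the result is a SymbolicXML expression *)
colors = ImportString[xmlString, "XML"];

(* The document wrapper has the head XMLObject["Document"];
   each element inside it is XMLElement[tag, attributes, children] *)
Head[colors]
```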

Here we use a simple transformation rule to remove the unwanted “red” element from the list.
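A rule of this kind might look like the following sketch. The element names are assumptions; the SymbolicXML is written out literally so that the example stands on its own.

```mathematica
(* Hypothetical SymbolicXML for a list of colors *)
colors = XMLElement["colors", {}, {
    XMLElement["color", {}, {"red"}],
    XMLElement["color", {}, {"green"}],
    XMLElement["color", {}, {"blue"}]}];

(* Replace the "red" element with an empty sequence,
   which removes it from the list of children *)
noRed = colors /. XMLElement["color", _, {"red"}] -> Sequence[]
```

Replacing with Sequence[] rather than Null leaves no placeholder behind in the list of children.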

You can use ExportString to convert SymbolicXML into native XML syntax, which was designed to be easy to read.
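For instance, assuming a small hypothetical SymbolicXML expression, the conversion back to XML text could be sketched like this.

```mathematica
(* Hypothetical SymbolicXML expression *)
colors = XMLElement["colors", {}, {
    XMLElement["color", {}, {"green"}],
    XMLElement["color", {}, {"blue"}]}];

(* Serialize the expression as a string of XML markup *)
xmlOut = ExportString[colors, "XML"]
```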

Visualizing the XML Tree

Many XML tools display an XML document as a collapsible tree, where the nodes correspond to the elements of the document. This example shows how to produce a similar visualization using cell grouping in a Mathematica notebook.

We will do this by recursively traversing the SymbolicXML expression, and for each XMLElement object, creating a CellGroupData expression that contains cells for each of that XMLElement object’s attributes and children. Each nested CellGroupData will be indented from the previous one. We start with the function to process an XMLElement object.

Notice that we use the integer m for indentation. When we map XMLNote onto the XMLElement object’s children, we pass a larger value for m, increasing the indentation for the child elements.

A CellGroupData expression contains a list of cells. In the above definition, we have only created a cell for the XMLElement x. However, we have then mapped XMLNote onto the attribute list. Since this returns a list, we need to apply Sequence to the result (using Apply) in order to merge that list into the CellGroupData expression’s list of cells. We then do the same thing to the children of the XMLElement.
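One way such a definition could be written is sketched below. The argument layout of XMLNote, the "Text" cell style, the bold tag name, and the indentation step of 20 are all assumptions made for illustration.

```mathematica
(* Sketch: turn an XMLElement into a cell group indented by m *)
XMLNote[XMLElement[tag_, attributes_List, children_List], m_Integer] :=
 Cell[CellGroupData[{
    (* one cell for the element itself *)
    Cell[ToString[tag], "Text",
     CellMargins -> {{m, Inherited}, {Inherited, Inherited}},
     FontWeight -> "Bold"],
    (* splice in the cells for attributes and children, indented further *)
    Sequence @@ (XMLNote[#, m + 20] & /@ attributes),
    Sequence @@ (XMLNote[#, m + 20] & /@ children)
    }, Open]]
```

For an element with no attributes and no children, both Sequence expressions are empty and the group contains only the element's own cell.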

However, we have not yet defined XMLNote to work on attributes. The attributes of an XMLElement object are stored in SymbolicXML as rules. In most cases, the rule contains two strings: the key and the value. However, when namespaces are involved, the first element of the rule may be a list containing two strings: the namespace and the key. We will need to make two definitions to handle the attributes.

We will need one more definition in order to process simple SymbolicXML expressions. The text nodes in an XML document are stored simply as String objects in SymbolicXML. Thus, we need a definition that handles String objects.
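The two attribute definitions and the string definition might be sketched as follows. The display format ("key = value") and the cell options are assumptions; only the shapes of the matched expressions come from the description above.

```mathematica
(* Attribute with a plain key: "key" -> "value" *)
XMLNote[key_String -> value_String, m_Integer] :=
 Cell[key <> " = " <> value, "Text",
  CellMargins -> {{m, Inherited}, {Inherited, Inherited}}]

(* Attribute with a namespace: {"namespace", "key"} -> "value" *)
XMLNote[{namespace_String, key_String} -> value_String, m_Integer] :=
 Cell["{" <> namespace <> "}" <> key <> " = " <> value, "Text",
  CellMargins -> {{m, Inherited}, {Inherited, Inherited}}]

(* Text node: a bare string *)
XMLNote[text_String, m_Integer] :=
 Cell[text, "Text",
  CellMargins -> {{m, Inherited}, {Inherited, Inherited}}]
```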

With these definitions in place, we can construct a simple notebook to visualize a basic XML document.

Because the default value of the option "IncludeEmbeddedObjects" is None, we did not alter comments, processing instructions, or anything else that would be stored in an XMLObject. Adding definitions for these is not difficult and would be a good exercise in processing SymbolicXML.

The notebook produced as the result of the above evaluation is shown here.

Manipulating XML Data

XML applications are used for more than just document layout. XML is also an excellent format for storing structured data. Many commercial database vendors are now adding XML support to their products, allowing you to work with databases using XML as an intermediate format.

Mathematica is well suited for extracting and manipulating information from XML documents. To illustrate this, let us manipulate an XML file containing data on major league baseball players. You can download this file from www.mathematica-journal.com.

We first import this file into Mathematica as a SymbolicXML expression.

Each player’s information is stored in a PlayerRecord element. We can easily extract this with Cases.

As we can see, the XML document contains records for 294 players. Since we do not want to sift through all the American League hitters, we will just take a look at the Yankees. Inside each PlayerRecord element, there is a TEAM element that specifies a player’s team. By passing a slightly more sophisticated pattern to Cases, we can extract a list of all players on the Yankees team.

The variable yankees now contains a list of SymbolicXML expressions for all the Yankees players. To see the syntax of each PlayerRecord, we extract the first element of yankees.

We can see that the player’s name is stored in the PLAYER element of each PlayerRecord element. Suppose we just want to look at the names of the Yankees hitters we have already extracted. We can extract the name from one PlayerRecord easily enough.

We can then use Map to extract all the names from yankees.

Alternatively, we could have used Cases on yankees with an appropriate pattern.
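The extraction steps described in this passage can be sketched on a miniature, hypothetical stand-in for the data file. The root element name, the player names, and the team codes are invented for illustration; only the PlayerRecord, PLAYER, and TEAM element names come from the text.

```mathematica
(* Hypothetical miniature of the hitters data *)
hitters = XMLElement["Players", {}, {
    XMLElement["PlayerRecord", {}, {
      XMLElement["PLAYER", {}, {"Player A"}], XMLElement["TEAM", {}, {"NYY"}]}],
    XMLElement["PlayerRecord", {}, {
      XMLElement["PLAYER", {}, {"Player B"}], XMLElement["TEAM", {}, {"BOS"}]}],
    XMLElement["PlayerRecord", {}, {
      XMLElement["PLAYER", {}, {"Player C"}], XMLElement["TEAM", {}, {"NYY"}]}]}];

(* All player records, at any depth *)
records = Cases[hitters, XMLElement["PlayerRecord", _, _], Infinity];

(* Only records whose children include a TEAM element with the assumed code "NYY" *)
yankees = Cases[hitters,
   XMLElement["PlayerRecord", _,
    {___, XMLElement["TEAM", _, {"NYY"}], ___}], Infinity];

(* Name of one record, then mapped over the list *)
getName[XMLElement["PlayerRecord", _,
    {___, XMLElement["PLAYER", _, {name_String}], ___}]] := name;
namesViaMap = getName /@ yankees

(* The same names, obtained directly with Cases *)
namesViaCases = Cases[yankees, XMLElement["PLAYER", _, {name_String}] :> name, Infinity]
```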

SymbolicXML is a general-purpose format for expressing arbitrary XML data. In some cases, you may find it more useful to convert SymbolicXML into a different type of Mathematica expression. This type of conversion is easy to do using pattern matching. In the following example, we import an XML file containing data about baseball pitchers and translate the resulting SymbolicXML expression into a list of Mathematica rules. You can download the data file used here from www.mathematica-journal.com.

Here, we have transformed the SymbolicXML expression for a PlayerRecord node into a simpler expression. All the information about the player is stored as Mathematica rules in an expression with Pitcher as the head.
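A conversion of this kind could be sketched as follows. The function name toPitcher, the field names other than PLAYER and TEAM, and the sample values are assumptions.

```mathematica
(* Hypothetical pitcher record *)
record = XMLElement["PlayerRecord", {}, {
    XMLElement["PLAYER", {}, {"Player D"}],
    XMLElement["TEAM", {}, {"NYY"}],
    XMLElement["ERA", {}, {"3.50"}]}];

(* Turn each child element with a single text node into a rule name -> value,
   and collect the rules under the head Pitcher *)
toPitcher[XMLElement["PlayerRecord", _, fields_List]] :=
 Pitcher @@ Cases[fields, XMLElement[name_String, _, {value_String}] :> (name -> value)]

toPitcher[record]
```

Individual fields can then be looked up with an ordinary replacement, for example "TEAM" /. List @@ toPitcher[record].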

In addition to transforming the data into a different expression syntax as above, we can also modify the data and leave the overall expression in SymbolicXML. This way we can alter our data, but still export it to an XML file for use with other applications. As an example, we will work with the salaries of our American League hitters. First, we delete any PlayerRecord entries where the salary is not available.

Next, we create a function to extract name–salary pairs from our PlayerRecord data. We will then sort these pairs by salary and look at the top ten.
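These two steps might look like the sketch below. It assumes a salary element named SALARY that follows the PLAYER element and holds either a string of digits or "#N/A"; the sample records are invented.

```mathematica
(* Hypothetical records *)
hitters = {
   XMLElement["PlayerRecord", {}, {
     XMLElement["PLAYER", {}, {"Player A"}], XMLElement["SALARY", {}, {"500000"}]}],
   XMLElement["PlayerRecord", {}, {
     XMLElement["PLAYER", {}, {"Player B"}], XMLElement["SALARY", {}, {"#N/A"}]}],
   XMLElement["PlayerRecord", {}, {
     XMLElement["PLAYER", {}, {"Player C"}], XMLElement["SALARY", {}, {"1200000"}]}]};

(* Drop records whose salary is not available *)
withSalary = DeleteCases[hitters,
   XMLElement["PlayerRecord", _,
    {___, XMLElement["SALARY", _, {"#N/A"}], ___}]];

(* Extract a {name, salary} pair; FromDigits assumes the salary is a string of digits *)
nameSalary[XMLElement["PlayerRecord", _,
    {___, XMLElement["PLAYER", _, {name_String}], ___,
     XMLElement["SALARY", _, {salary_String}], ___}]] := {name, FromDigits[salary]};

(* Sort by salary, highest first, and keep at most ten *)
pairs = nameSalary /@ withSalary;
sorted = Sort[pairs, #1[[2]] > #2[[2]] &];
topTen = Take[sorted, Min[10, Length[sorted]]]
```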

As a simple example of how to change the data in our SymbolicXML expression, we will create a function that doubles players’ salaries.
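A salary-doubling function could be sketched like this, again assuming an element named SALARY whose content is a string of digits. Anything else, such as "#N/A", is left untouched.

```mathematica
(* Double every all-digit SALARY value, keeping the result as SymbolicXML *)
doubleSalary[expr_] :=
 expr /. (XMLElement["SALARY", attributes_, {s_String}] /;
     StringMatchQ[s, DigitCharacter ..]) :>
   XMLElement["SALARY", attributes, {ToString[2 FromDigits[s]]}]

(* Hypothetical record *)
record = XMLElement["PlayerRecord", {}, {
    XMLElement["PLAYER", {}, {"Player A"}],
    XMLElement["SALARY", {}, {"500000"}]}];

doubleSalary[record]
```

Because the new value is converted back to a string inside a list, the result is still valid SymbolicXML and can be exported.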

Visualizing XML Data

Creating a 3D Graphic from an XML File

The following example illustrates how to use Mathematica programming and SymbolicXML to visualize data in XML format. The molecule description markup language (MoDL) is an XML application that describes molecules. For details, see www.oasis-open.org/cover/modl.html. In this example, we convert a MoDL description of the methane molecule into a Mathematica 3D graphic.

The following is the MoDL file, which contains the description of the methane molecule. You can download this file from www.mathematica-journal.com.

Here, we import the file into Mathematica in the form of a SymbolicXML expression.

In order to convert the resulting SymbolicXML expression into a Graphics3D expression, we will need the standard package Graphics`Shapes`.

The following code defines a function called MoDLToGraphics3D that turns the SymbolicXML expression into a Graphics3D expression. This function relies on a number of auxiliary functions that are defined in a later part of this section, which deals with the details of implementation.

Applying this function to the original SymbolicXML expression generates a 3D graphic representing the methane molecule.

The details of implementation of the MoDLToGraphics3D function, which performs the actual transformation from SymbolicXML to a 3D graphic, are provided here.

Implementation Details

Notice that the original MoDL file contains a head and a body. In the head, a number of definitions are made that are used throughout the body. We have extracted these definitions into the variable defs. We then map the function ProcessDefinition across the list of definitions. The function ProcessDefinition constructs a Mathematica expression out of a definition and stores it in the variable moldef, which is dynamically scoped inside of MoDLToGraphics3D.

A DEFINE element in the head typically defines either an atom or a molecule. First, consider an atom definition.

The DEFINE element essentially associates a unique key (in this case C) with an atom element. The atom element specifies its color and radius. We will turn this into a Mathematica expression of the form Atom[radius, color]. We will then store it in moldef[name], where name is the key specified in the name attribute of the DEFINE element.

In this case, a is the entire sequence of attributes of the atom element. The GetRad and GetColor functions, which we define later, extract the radius and color from this sequence. For now, assume that GetRad returns a number and that GetColor returns an RGBColor expression. We now need to process definitions of molecule elements. Like the atom definitions, molecules are given a unique key in the name attribute of the DEFINE element. The molecule element then contains atom elements and bond elements.

The atom elements contain three attributes: type, id, and position. The type attribute references the key from previous atom definitions. The id attribute is a unique key for this instance of the type of atom defined. In other words, what was defined previously in the atom definitions were types of atoms, like carbon or hydrogen. The atom elements inside a molecule element represent a distinct atom of some previously defined type.

The molecule element also contains bond elements. These have two attributes: atom1 and atom2. These reference the id of the atom elements in that molecule expression.

When we call ProcessDefinition on a molecule definition, we will want to store a list of the atoms and bonds in moldef.

In the definition of ProcessDefinition, subdef is the list of atoms and bonds. We map ProcessSubdef onto this list. That is, what we assign to moldef[name] is a list of the result of ProcessSubdef on each atom and bond element in the molecule. When ProcessSubdef is called on an atom, it extracts that atom’s type from moldef, appends the position to that expression, and stores the results under moldef[id]. When ProcessSubdef is called on a bond, it simply returns a Bond expression containing the positions of the two atoms it references.

We need an auxiliary function before we define GetRad, GetColor, and GetPos. Since positions are written as space-separated lists of numbers in MoDL, we first write a function that turns this string into a Mathematica list.

The functions GetPos, GetRad, and GetColor should each take a sequence of attributes of any length and extract the relevant value. GetPos, for example, creates a list from the position attribute. In SymbolicXML, attributes are stored as Mathematica rules. Both GetPos and GetColor will need to use MolStringListToList. GetRad needs only to convert a string to a number.
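These helper functions might be sketched as follows. The attribute names "radius" and "color", and the assumption that a color is written as three space-separated numbers, are guesses made for illustration; only "position" is named in the text.

```mathematica
(* "0 1.5 -2" -> {0, 1.5, -2}: read whitespace-separated numbers from a string *)
MolStringListToList[s_String] :=
 Module[{stream = StringToStream[s], numbers},
  numbers = ReadList[stream, Number];
  Close[stream];
  numbers]

(* Each function picks its own attribute out of a sequence of rules of any length *)
GetPos[___, "position" -> p_String, ___] := MolStringListToList[p]
GetRad[___, "radius" -> r_String, ___] := First[MolStringListToList[r]]
GetColor[___, "color" -> c_String, ___] := RGBColor @@ MolStringListToList[c]

(* Hypothetical attribute sequences *)
GetPos["type" -> "C", "id" -> "c1", "position" -> "0 1.5 -2"]
GetColor["radius" -> "0.3", "color" -> "1 0 0"]
```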

Here is the definition of MoDLToGraphics3D again for reference.

Block scopes the variables defs, body, moldef, themols, and theatoms. We already discussed moldef, and defs simply contains a list of the DEFINE elements. The variable body just contains the body element of the SymbolicXML expression. That leaves themols and theatoms.

After ProcessDefinition is mapped onto defs, themols is defined. Molecules in the body have a type attribute, which references the key of the molecule type defined in the head. The Cases statement matches the molecules in the body and returns each molecule’s type definition from moldef.

In our example, we only have one molecule in the body. If more molecules existed, the lists of Atom and Bond expressions would be merged together by Flatten. As we will see, the Graphics3D expression is simply made by drawing each Atom and Bond. The body may also contain atoms that are not part of a molecule. The definition of theatoms matches these elements, reads their type from moldef, and appends their positions. Thus, theatoms contains a list of additional atoms to be drawn.

The last line of MoDLToGraphics3D joins the Atom and Bond expressions in themols with the Atom expressions in theatoms. It then maps MolToGraphics onto this list. MolToGraphics is simply a function that returns a sphere for Atom expressions and a line for Bond expressions. Of course, we also need to define MolToGraphics. The definition is straightforward, provided you are familiar with Graphics`Shapes` and Graphics expressions in Mathematica.
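A drawing function along these lines is sketched below under two stated assumptions. First, it uses the built-in Sphere primitive of recent versions rather than the Graphics`Shapes` package. Second, the wrappers are called MolAtom and MolBond, hypothetical names standing in for the article's Atom and Bond, because recent versions define Atom and Bond as built-in symbols.

```mathematica
(* MolAtom[radius, color, position] -> a colored sphere *)
MolToGraphics[MolAtom[radius_, color_, position_List]] :=
 {color, Sphere[position, radius]}

(* MolBond[position1, position2] -> a line between the two positions *)
MolToGraphics[MolBond[p1_List, p2_List]] := Line[{p1, p2}]

(* Hypothetical parts: two atoms joined by one bond *)
parts = {
   MolAtom[0.4, RGBColor[0.3, 0.3, 0.3], {0, 0, 0}],
   MolAtom[0.25, RGBColor[1, 1, 1], {1, 0, 0}],
   MolBond[{0, 0, 0}, {1, 0, 0}]};

Graphics3D[MolToGraphics /@ parts]
```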

The result is a 3D graphic of methane or any other molecule you have defined in MoDL.

Comparing XSLT and Mathematica

In many situations, there is a need to transform a document from one XML format into another. One popular technique used for this purpose is XSLT. However, Mathematica’s pattern matching and transformation abilities allow you to do similar transformations simply by importing the original document and then manipulating the resulting SymbolicXML expression. This section gives examples of some basic XSLT transformations and explains how to do the equivalent transformations in Mathematica.

A Simple Template

Let us consider a very simple example. Say our XML dialect uses the code tag to enclose program code. Typically, this is displayed in a monospace font. If we were to convert such a document to XHTML, we would probably want to use the pre tag for program code. The following XSLT template would do this.

In Mathematica, you can create a function to do the same.
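Such a function might be sketched as follows. The name codeToPre is hypothetical, and dropping the attributes of the code element is an assumption about the intended output.

```mathematica
(* Rewrite every code element as a pre element, keeping its contents *)
codeToPre[expr_] :=
 expr /. XMLElement["code", _, contents_] :> XMLElement["pre", {}, contents]

(* Hypothetical input *)
codeToPre[XMLElement["p", {}, {"Run ", XMLElement["code", {}, {"x = 1"}]}]]
```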

Inserting Attribute Values

Now consider an XML application that uses the termdef element to indicate the definition of a new term. Again, we will convert this to XHTML. We would like to anchor the definition with a named a element so that we can link directly to that location in the document. Assuming we have templates to handle whatever string formatting is inside the termdef element, we can use the following XSLT.

Notice that the name attribute in the resultant XHTML gets the value of the id attribute of the original termdef element. In Mathematica, you can do the following.
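A sketch of that transformation is shown below. The function name termdefToAnchor and the sample attributes are hypothetical.

```mathematica
(* Replace termdef with an a element whose name attribute is the original id *)
termdefToAnchor[expr_] :=
 expr /. XMLElement["termdef", {___, "id" -> id_String, ___}, contents_] :>
   XMLElement["a", {"name" -> id}, contents]

(* Hypothetical input *)
termdefToAnchor[
 XMLElement["termdef", {"id" -> "dt-xml", "term" -> "XML"}, {"Extensible Markup Language"}]]
```

The pattern {___, "id" -> id_String, ___} picks out the id attribute wherever it appears in the attribute list.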

Using Predicates

Let us consider a more complicated example that involves XPath predicates. Assume we would like to match a note element, but only if it either has a role attribute set to example or if it contains an eg element as a child. Let us look at an XSLT template, and then explain what it does.

The first xsl:if element checks to see if the first child element is a p element. If it is, then xsl:apply-templates is called on that child. This is similar to calling Map across the results of Cases. In the second xsl:if element, we check if there are p child elements beyond the first child. If so, xsl:apply-templates is called on those. Here is the corresponding Mathematica code.
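The matching conditions and the selection of p children described above can be sketched with patterns. All function names here are hypothetical, and the sketch stops short of producing output markup, since the target format is not specified.

```mathematica
(* A note qualifies if role = "example", or if it has an eg child *)
exampleNoteQ[XMLElement["note", {___, "role" -> "example", ___}, _]] := True
exampleNoteQ[XMLElement["note", _, {___, XMLElement["eg", _, _], ___}]] := True
exampleNoteQ[_] := False

(* First test: the first child, if it is a p element *)
firstP[XMLElement[_, _, {p : XMLElement["p", _, _], ___}]] := {p}
firstP[_] := {}

(* Second test: any p elements after the first child *)
laterPs[XMLElement[_, _, {_, rest___}]] := Cases[{rest}, XMLElement["p", _, _]]
laterPs[_] := {}

(* Hypothetical input *)
note = XMLElement["note", {"role" -> "example"}, {
    XMLElement["p", {}, {"one"}], "text", XMLElement["p", {}, {"two"}]}];
{exampleNoteQ[note], firstP[note], laterPs[note]}
```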

Traversing Upwards

So far, all the examples we have given in XSLT have had a very simple implementation in Mathematica using SymbolicXML. In each of these cases, however, we were selecting expressions that were nested inside of the given expression. What if we wanted to select an ancestor or sibling? The following examples show how this can be done in Mathematica.

To clarify the problem and find a solution, we have to realize that an XML document is just a stream of characters that follows a specific syntax. Tools for manipulating XML documents treat XML according to some model. In the case of XSLT (and its path-selection language, XPath), this model is that of a tree. Since Mathematica is a list-based language, it treats XML as nested expression lists.

While these two models are similar, they have important differences. Most notably, in nested lists you do not inherently have any concept of the containing list. Technically, any transformation that can be done with axis types, like ancestor, can be done without them. However, it is often convenient to traverse up the XML document.

Let us look at an example and then discuss how to implement the same behavior in Mathematica. While it will involve a slightly different technique than we have used above, it will nonetheless be rather simple. Consider the following XML document.

We will assume we simply want to have a template that matches bibref elements and replaces them with the text inside of the corresponding bibl element. In XSLT, we would write the following template.

The problem with using the same approach in Mathematica is that once we have matched a bibref element, we no longer have any information about the elements containing it. As a remedy, we will instead pass an expression containing the entire SymbolicXML expression. Notice that the bibref element in question can be obtained from

Rather than pass the XMLElement expression, we can pass this expression wrapped in Hold. That way, we can easily obtain the bibref element by calling ReleaseHold, and we can access ancestors by dropping indices from the Part expression. However, we will need to write a pattern matching function so that we can match these in definitions of functions.
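The idea of carrying a held Part expression can be illustrated with a small sketch. The sample document, the attribute names, and the helper name parentOf are assumptions, and the article's own pattern matching function may look different.

```mathematica
(* Hypothetical document with one bibref and one bibl *)
doc = XMLElement["doc", {}, {
    XMLElement["p", {}, {"See ", XMLElement["bibref", {"ref" -> "b1"}, {}]}],
    XMLElement["bibl", {"id" -> "b1"}, {"A Sample Reference"}]}];

(* Children are part 3 of an XMLElement, so the bibref is
   the 2nd child of the 1st child of doc *)
heldRef = Hold[doc[[3, 1, 3, 2]]];

(* Dropping the last two indices (the child index and the 3 that
   selects the list of children) yields the parent element *)
parentOf[Hold[Part[expr_, indices___, 3, _Integer]]] := Hold[Part[expr, indices]]

ReleaseHold[heldRef]            (* the bibref element itself *)
ReleaseHold[parentOf[heldRef]]  (* the enclosing p element *)
```

Because Hold keeps the Part expression unevaluated, the position of the element inside the whole document travels along with it.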

The Mathematica transformation then becomes relatively simple.

Converting a Notebook to HTML

Suppose you need to export a notebook in a specific XML format (apart from standard formats listed under the Save As Special menu). One option would be to export to NotebookML and then use some external tool (e.g., XSLT rules) to transform to the desired form of XML. But often it is just as easy to perform the manipulation within Mathematica, after first converting the notebook expression into SymbolicXML. As an example, let us recreate a simplified version of the File > Save As Special > HTML menu’s functionality.

You can download the sample notebook used in this example from www.mathematica-journal.com. The following commands import the notebook expression so we can manipulate it.

Our method will be to define a recursive function, transform, to process the original (notebook) expression from top to bottom, similar to the templates of XSLT. First, we establish a default definition to discard anything not explicitly matched by other patterns. (Given our “top-down” approach, perhaps this should be the last definition, but we place it here to reduce extraneous output in the intermediate results.)

The above definition uses Sequence[] for the following reason: since transform will be applied recursively, the best “null” result is one that can be dropped in the midst of a list of arguments without disrupting the syntax.
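The effect of returning Sequence[] can be seen in a tiny sketch; the function name skip is hypothetical.

```mathematica
(* A function that discards its argument by returning an empty sequence *)
skip[_] := Sequence[]

(* The empty sequence vanishes from the enclosing list *)
{"kept", skip["dropped"], "also kept"}
```

Had skip returned Null instead, the list would contain a stray Null between the two strings.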

We start with the notebook expression itself.

The following points are worth noting.

• The argument pattern must be robust enough to accept all variants. (Even though the notebook options are discarded in this conversion, a BlankNullSequence (___) is included to allow for them.)
• The only thing done with the contents argument is to pass it back to transform.
• The third argument of XMLElement is always a List. Forgetting this is a common pitfall.
• Notice that we have dropped the head section that is usually included in an HTML document.

The same general theme is followed for the remaining definitions.

Next, we discard cell-grouping information because the HTML has no use for it.

Mathematica sectional heads are translated to their HTML counterparts.

The Text cells introduce a complication: the contents of a Mathematica Text cell can be either a simple string or a TextData-wrapped list, if the text has additional information, such as font changes, specified. Thus, we need a definition for both cases.

Simple strings should just be passed on as is. Once again, in keeping with a top-down style, this should be placed later in the sequence of definitions. But placing it earlier helps make the intermediate results more meaningful.

Finally, we deal with (simple) font changes.
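Putting the definitions described in this section together, a minimal sketch might read as follows. The choice of HTML tags, the handled cell styles, and the sample notebook are assumptions. The catch-all definition is written last here simply to make the order explicit.

```mathematica
ClearAll[transform];

(* Notebook -> html/body; notebook options are accepted and ignored *)
transform[Notebook[contents_List, ___]] :=
 XMLElement["html", {}, {XMLElement["body", {}, transform /@ contents]}]

(* Cell groups carry no meaning in HTML: splice in the transformed cells *)
transform[Cell[CellGroupData[cells_List, ___], ___]] :=
 Sequence @@ (transform /@ cells)

(* Sectional heads *)
transform[Cell[contents_, "Title", ___]] := XMLElement["h1", {}, {transform[contents]}]
transform[Cell[contents_, "Section", ___]] := XMLElement["h2", {}, {transform[contents]}]

(* Text cells: plain string, or TextData wrapping a list *)
transform[Cell[contents_String, "Text", ___]] := XMLElement["p", {}, {contents}]
transform[Cell[TextData[contents_List], "Text", ___]] :=
 XMLElement["p", {}, transform /@ contents]

(* Strings pass through unchanged *)
transform[s_String] := s

(* Simple font changes *)
transform[StyleBox[contents_, ___, FontWeight -> "Bold", ___]] :=
 XMLElement["b", {}, {transform[contents]}]
transform[StyleBox[contents_, ___, FontSlant -> "Italic", ___]] :=
 XMLElement["i", {}, {transform[contents]}]

(* Default: discard anything not matched above *)
transform[_] := Sequence[]

(* Hypothetical sample notebook expression *)
nb = Notebook[{Cell[CellGroupData[{
       Cell["My Title", "Title"],
       Cell["Plain text.", "Text"],
       Cell[TextData[{"Some ", StyleBox["bold", FontWeight -> "Bold"], " text."}], "Text"],
       Cell["1 + 1", "Input"]}, Open]]}];

html = transform[nb]
```

The Input cell in the sample is not matched by any specific definition, so the default rule drops it from the output.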

Here is the final product.

You can get output in a more human-readable form by using ExportString.

We can verify that this is well-formed XML.

And, of course, the SymbolicXML can be exported to a file, suitable for viewing with a web browser.

An alternative to a recursive function is to apply a list of replacement rules using ReplaceRepeated.
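A rule-based version could be sketched as below, under the same assumptions about tags and cell styles as any sketch of the recursive approach. The rule list and the sample notebook are hypothetical.

```mathematica
(* Replacement rules; ReplaceRepeated (//.) keeps applying them until nothing changes *)
rules = {
   Notebook[contents_List, ___] :>
    XMLElement["html", {}, {XMLElement["body", {}, contents]}],
   Cell[CellGroupData[cells_List, ___], ___] :> Sequence @@ cells,
   Cell[contents_, "Title", ___] :> XMLElement["h1", {}, {contents}],
   Cell[contents_, "Text", ___] :> XMLElement["p", {}, {contents}],
   (* The TextData rule is separate from the Cell rule *)
   TextData[contents_List] :> Sequence @@ contents,
   StyleBox[contents_, ___, FontWeight -> "Bold", ___] :>
    XMLElement["b", {}, {contents}]};

(* Hypothetical sample notebook expression *)
nb = Notebook[{Cell[CellGroupData[{
       Cell["My Title", "Title"],
       Cell[TextData[{"Some ", StyleBox["bold", FontWeight -> "Bold"], " text."}], "Text"]},
      Open]]}];

htmlFromRules = nb //. rules
```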

The two methods produce identical results.

Here is how the two methods differ.

• Since the recursion occurs implicitly via ReplaceRepeated, the latter implementation is cleaner in spots. In particular, contrast the handling of Text cells: the TextData rule can be separated from the Cell rule. The same could be accomplished for the recursive function, but at the cost of additional patterns for the various forms that contents might take (e.g., _List versus _String and so on). ReplaceRepeated, by acting on all subexpressions, obviates this need.
• There is no default rule for the second method. Any unhandled parts of the original Mathematica expression will pass through unchanged, probably rendering invalid XML.

Finally, we use Clear to remove the definitions of all the symbols.

Verifying SymbolicXML Syntax

You can use the function XML`SymbolicXMLErrors to find errors with a SymbolicXML expression. This function returns a part specification that you can use with functions like Part or Extract to access the problematic part of your SymbolicXML expression. Let us return to the American League hitters from our earlier example.

We saw earlier that there were PlayerRecord nodes for which the Salary node contained the string "#N/A". Suppose we decide that any player whose salary is not available to us must be making $1,000,000. However, when we do our transformation, we make an easily overlooked mistake.

We have now created incorrect SymbolicXML. We have put the string "1000000" as the third element of the XMLElement expression, rather than using a list containing that string. Suppose we did not know this, though, and that later on we find that Export produces errors when we try to write our modified XML to a file. We can use SymbolicXMLErrors to find the problematic expressions.
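The mistake and its detection can be sketched on a single hypothetical record. The element name SALARY is an assumption, and the exact form of the reported part specifications is not shown here.

```mathematica
(* Hypothetical record with an unavailable salary *)
record = XMLElement["PlayerRecord", {}, {
    XMLElement["PLAYER", {}, {"Player B"}],
    XMLElement["SALARY", {}, {"#N/A"}]}];

(* The mistake: the third argument of the replacement is a bare string, not a list *)
broken = record /.
   XMLElement["SALARY", {}, {"#N/A"}] -> XMLElement["SALARY", {}, "1000000"];

(* Ask for the part specifications of the invalid subexpressions *)
errors = XML`SymbolicXMLErrors[broken]
```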

Notice that the output is preceded by several XMLElement::name error messages. This indicates that something is wrong with the input. The output of SymbolicXMLErrors, however, tells us exactly where something went wrong. ALerrors now contains a list of part specifications where the errors occurred. Here we examine the first error.

This problem is easy enough to fix.

We can see that the rest of the errors are of the same nature.

By using Map, we can fix the rest of the errors in the same way.
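As an alternative sketch that does not rely on part specifications, a single pattern-based replacement can repair every element whose third argument is a bare string. This is a different technique from mapping over the error positions; the sample data and the element name SALARY are hypothetical.

```mathematica
(* Hypothetical broken records: the salary is a bare string instead of a list *)
broken = {
   XMLElement["PlayerRecord", {}, {
     XMLElement["PLAYER", {}, {"Player B"}], XMLElement["SALARY", {}, "1000000"]}],
   XMLElement["PlayerRecord", {}, {
     XMLElement["PLAYER", {}, {"Player E"}], XMLElement["SALARY", {}, "1000000"]}]};

(* Wrap any bare string found in the third argument in a list *)
fixed = broken /.
   XMLElement[tag_, attributes_List, text_String] :> XMLElement[tag, attributes, {text}]
```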

We can then verify that we have fixed the error using SymbolicXMLErrors again.