The Mathematica Journal
Volume 9, Issue 1

XML and Mathematica
Pavi Sandhu

Importing XML

Functions for Importing XML

Import

You can import XML data into Mathematica using the standard Import function, which is called as Import["file", "format"].

The first argument of the function specifies the file to be imported. You can also specify an optional second argument to control the form of the output. For importing XML data, the relevant file formats are: "XML", "NotebookML", "ExpressionML", "MathML", and "SymbolicXML".

With "XML" as the import format, any XML formats that Mathematica does not recognize are returned as SymbolicXML. The formats that Mathematica does recognize—NotebookML, ExpressionML, and MathML—are treated slightly differently. A NotebookML file is imported as a notebook expression. An ExpressionML file is imported as the corresponding cell expression. A MathML file is returned as the corresponding box expression.

With "SymbolicXML" as the import format, even NotebookML, ExpressionML, and MathML files are imported as SymbolicXML. This gives you a way to prevent a recognized XML format from being interpreted automatically.
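As a rough sketch of the two behaviors just described, the calls below import the same file with each format. The file name data.xml is hypothetical and stands for any XML file in the current directory.

```mathematica
(* "data.xml" is a hypothetical file name, assumed to exist in the current directory *)

(* Recognized formats (NotebookML, ExpressionML, MathML) are interpreted;
   any other XML vocabulary is returned as SymbolicXML *)
Import["data.xml", "XML"]

(* Always returns SymbolicXML, even for the recognized formats *)
Import["data.xml", "SymbolicXML"]
```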

Here is a file containing the MathML encoding for an equation.

If you import the file and specify "XML" as the second argument, the equation is converted into a box expression.

If you specify "SymbolicXML" as the second argument, the equation is imported as a SymbolicXML expression rather than a box expression.

If Import is used with only one argument, Mathematica processes the data in the file based on its file extension. Any file with a .xml extension is imported as XML. This means that if it is in one of the XML formats explicitly supported by Mathematica, namely NotebookML, ExpressionML, or MathML, the file will be interpreted in the appropriate way. All other XML formats are imported as SymbolicXML.

Mathematica also recognizes the .mml extension for MathML files and the .nbml extension for NotebookML files. In the following example, we import a file with the .mml extension.

You can display the above box expression as conventional mathematical notation by using DisplayForm.
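A minimal sketch of this workflow, assuming a MathML file named equation.mml (a hypothetical name) is present in the current directory:

```mathematica
(* "equation.mml" is a hypothetical MathML file in the current directory *)

(* With one argument, the import format is chosen from the .mml extension *)
boxes = Import["equation.mml"];

(* Typeset the resulting box expression as conventional mathematical notation *)
DisplayForm[boxes]
```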

You can control the various details of the import process, such as how to treat whitespace, whether to recognize entities, or whether to validate against a DTD, by specifying conversion options to the Import function.

ImportString

You can also use the standard ImportString function to import XML data from a string. This is useful when you want to generate XML data directly within Mathematica instead of importing the data from an external file. The ImportString function is called as ImportString["string", "format"].

With "XML" as the import format, any XML formats that Mathematica does not recognize are returned as SymbolicXML. Here is an example of a simple XML expression converted to SymbolicXML using ImportString.
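For illustration, a call of this kind might look as follows. The XML fragment is invented for this sketch and does not belong to any format that Mathematica interprets specially.

```mathematica
(* An illustrative XML fragment in a vocabulary Mathematica does not recognize *)
ImportString["<book><title>XML and Mathematica</title></book>", "XML"]
```

The result should be an XMLObject["Document"] expression whose body is built from nested XMLElement expressions.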

The formats that Mathematica does recognize—NotebookML, ExpressionML, and MathML—are treated slightly differently. A NotebookML file is imported as a notebook expression. An ExpressionML file is imported as the corresponding cell expression. A MathML file is returned as the corresponding box expression. Here is an example of importing a simple MathML expression. Notice that the MathML markup is automatically converted to a Mathematica box expression.

You can stop the automatic interpretation of recognized formats by specifying "SymbolicXML" as the second argument. In this example, the imported string is returned as SymbolicXML rather than the usual box expression.
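The contrast between the two import formats can be sketched with one small MathML string. The fragment below (a fraction x/y) is an illustrative assumption, not the example from the original article.

```mathematica
(* A small MathML fragment; the namespace declaration marks it as MathML *)
mathml = "<math xmlns='http://www.w3.org/1998/Math/MathML'><mfrac><mi>x</mi><mi>y</mi></mfrac></math>";

(* Interpreted as MathML: a box expression, according to the text above *)
ImportString[mathml, "XML"]

(* Interpretation suppressed: the same markup as SymbolicXML *)
ImportString[mathml, "SymbolicXML"]
```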

You can control the various details of the import process, such as how to treat whitespace, whether to recognize entities, or whether to validate against a DTD, by specifying conversion options to the ImportString function.

XMLGet

The XMLGet function can be used to import an XML document as SymbolicXML. It is very similar to the Import function, in that XMLGet[file] is equivalent to Import[file, "SymbolicXML"]. The advantage of using XMLGet is that, unlike Import, it can retrieve files over the web. Hence, it is useful if you want to import an XML file posted at a URL.

For example, the following command retrieves stock quotes from a website and returns the data as SymbolicXML.

Note that XMLGet exists only in the XML`Parser` context. Hence, you must use the full name of the function, XML`Parser`XMLGet, when doing an evaluation. To use the function without the context name prefix, you must first add the XML`Parser` context to your context path.
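The two ways of calling the function can be sketched as follows. The URL is a placeholder and does not refer to a real data source.

```mathematica
(* Hypothetical URL, used only for illustration *)
url = "http://www.example.com/quotes.xml";

(* Call the function by its full, context-qualified name ... *)
XML`Parser`XMLGet[url]

(* ... or add the context to the context path first, then use the short name.
   Evaluate this before the short name XMLGet is typed anywhere else,
   so that no Global`XMLGet symbol shadows it. *)
AppendTo[$ContextPath, "XML`Parser`"];
XMLGet[url]
```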

XMLGet also accepts an optional second argument, which specifies a preinitialized parser object.

Initializing the parser involves loading a DTD into memory either from a URL or a local file. This only needs to be done once in each kernel session. Subsequent references to the DTD are then processed much faster because the DTD has already been read and parsed.

You can also specify conversion options for XMLGet to control its behavior. The conversion options for XMLGet are the same as the ones for Import, but the syntax is slightly different: the options are given directly as arguments to XMLGet rather than through ConversionOptions. A call to XMLGet with a given option is therefore equivalent to the corresponding call to Import with "SymbolicXML" as the format and the same option supplied through ConversionOptions.
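A sketch of this equivalence, assuming the options are passed to XMLGet as trailing rules. The file name and the choice of option are illustrative only.

```mathematica
file = "data.xml";   (* hypothetical file name *)

(* Option given directly to XMLGet (assumed argument position: after the file) *)
XML`Parser`XMLGet[file, "NormalizeWhitespace" -> False]

(* The same request expressed through Import and ConversionOptions *)
Import[file, "SymbolicXML", ConversionOptions -> {"NormalizeWhitespace" -> False}]
```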

XMLGetString

The XMLGetString function can be used to import an XML string as SymbolicXML. It is very similar to the ImportString function, in that XMLGetString[string] is equivalent to ImportString[string, "SymbolicXML"].

Note that XMLGetString exists only in the XML`Parser` context. Hence, you must use the full name of the function, XML`Parser`XMLGetString, when doing an evaluation. To use the function without the context name prefix, you must first add the XML`Parser` context to your context path.

The advantage of using XMLGetString is that it accepts a preinitialized parser object as its second argument.

Initializing the parser involves loading a DTD into memory either from a URL or a local file. This only needs to be done once in each kernel session. Subsequent references to the DTD are then processed much faster because the DTD has already been read and parsed.

For example, the following command loads the XML package and then preinitializes the parser according to the XHTML DTD located at the specified URL. In this example, the preinitialized parser is given the name XHTMLParser.

Now that the parser is initialized, we import an XML string. The string is validated with respect to the DTD stored in XHTMLParser by setting "ValidateAgainstDTD"->True. The option Valid->True in the output indicates that the XML input string was valid XML with respect to the XHTML DTD.
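The two steps might be sketched as below. Several details are assumptions that the text above does not confirm: the initializer name XML`Parser`InitializeXMLParser, the W3C URL chosen for the XHTML DTD, and the sample XHTML string. Loading the XML package first, as described above, may also be required.

```mathematica
(* ASSUMPTION: the initializer is named XML`Parser`InitializeXMLParser, and the
   DTD URL is the W3C XHTML 1.0 Strict DTD; neither is given in the text *)
XHTMLParser = XML`Parser`InitializeXMLParser[
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"];

(* An illustrative XHTML string *)
xhtml = "<html xmlns='http://www.w3.org/1999/xhtml'><head><title>Test</title></head><body><p>Hello</p></body></html>";

(* Parse the string with the preinitialized parser and request validation *)
XML`Parser`XMLGetString[xhtml, XHTMLParser, "ValidateAgainstDTD" -> True]
```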

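You can also specify conversion options for XMLGetString to control the various details of the import process. The conversion options for XMLGetString are the same as those for ImportString. As with XMLGet, the options are given directly as arguments to XMLGetString rather than through ConversionOptions, so a call to XMLGetString with a given option is equivalent to the corresponding call to ImportString with "SymbolicXML" as the format and the same option supplied through ConversionOptions.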

For more information on the conversion options available for importing XML, see Import Conversion Options.

Entities and Validation

An XML document can contain any characters included in the Unicode character set. When importing an XML document into Mathematica, all numeric Unicode character entity references are automatically resolved into the corresponding Mathematica character.

Other entities that are not built into XML are resolved according to the rules present in the DTD.
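As a small illustration of numeric character references, the string below uses &#945;, the reference for the Greek letter alpha (U+03B1). The element name is invented for this sketch.

```mathematica
(* &#945; is the numeric character reference for the Greek letter alpha *)
ImportString["<symbol>&#945;</symbol>", "XML"]
```

According to the description above, the content of the element should come back as the single Mathematica character \[Alpha] rather than as the literal reference.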

In addition to simply converting an XML document to a SymbolicXML expression, Import can validate the XML data to ensure that it conforms to a content model defined by a DTD. So long as the document is well formed, a SymbolicXML expression will be returned. If the document is not valid, warning messages will be issued and the document wrapper will indicate the invalid nature of the document with the option Valid->False.

You can control the various aspects of how entities are treated and whether the document is validated or not by using the conversion options for the Import function.

Conversion Options

Introduction

The standard ConversionOptions feature of Import gives you more control over the import process. A conversion option is specified in the form Import["file", "format", ConversionOptions->{"option"->value}].

Multiple conversion options can be specified by making the right-hand side of ConversionOptions a list of rules. There are nine conversion options available for importing XML data.

  • "NormalizeWhitespace"
  • "AllowRemoteDTDAccess"
  • "AllowUnrecognizedEntities"
  • "ReadDTD"
  • "ValidateAgainstDTD"
  • "IncludeDefaultedAttributes"
  • "IncludeEmbeddedObjects"
  • "IncludeNamespaces"
  • "PreserveCDATASections"
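A sketch of how one or several of these options might be passed, assuming they are given to ConversionOptions as a list of rules. The XML string is illustrative. Later Mathematica releases may also accept these options directly, without the ConversionOptions wrapper.

```mathematica
xml = "<a>  some   text  </a>";   (* illustrative XML string *)

(* A single conversion option *)
ImportString[xml, "XML", ConversionOptions -> {"NormalizeWhitespace" -> False}]

(* Several conversion options at once *)
ImportString[xml, "XML",
  ConversionOptions -> {"NormalizeWhitespace" -> False,
                        "PreserveCDATASections" -> True}]
```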

NormalizeWhitespace

This conversion option controls how whitespace in the document being imported is processed. Whitespace is defined as a space, tab, or newline character.

If "NormalizeWhitespace"->True, all the whitespace inside an element is normalized. This means that all leading and trailing whitespace is stripped and any interior whitespace is reduced to a single whitespace character. "NormalizeWhitespace"->True is the default setting for this option.

If "NormalizeWhitespace"->False, then all whitespace is preserved as it was in the original XML document.

If "NormalizeWhitespace"->Automatic, then ignorable whitespace is removed and nonignorable whitespace is preserved. Whitespace is ignorable when it occurs in places where character data is not permitted according to the content model specified by the DTD. The primary use of ignorable whitespace is to add indentation for formatting purposes.

The option "NormalizeWhitespace" and its possible values.

Here is an example of whitespace handling with the default setting "NormalizeWhitespace"->True.

Setting "NormalizeWhitespace"->False preserves the whitespace as it appears in the original string.

Note: If the option "NormalizeWhitespace"->False is specified, pattern matching on the resulting SymbolicXML expression may become problematic because of the intervening whitespace.
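The three settings can be compared on one string, as in the following sketch. The sample string is invented for illustration.

```mathematica
(* Leading, trailing, and repeated interior whitespace, including a newline *)
xml = "<a>   one     two\n   three   </a>";

(* Default setting: by the description above, the content should reduce to "one two three" *)
ImportString[xml, "XML", ConversionOptions -> {"NormalizeWhitespace" -> True}]

(* All whitespace kept exactly as written *)
ImportString[xml, "XML", ConversionOptions -> {"NormalizeWhitespace" -> False}]

(* Only ignorable whitespace removed; without a DTD content model,
   the whitespace inside <a> is character data and should be kept *)
ImportString[xml, "XML", ConversionOptions -> {"NormalizeWhitespace" -> Automatic}]
```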

AllowRemoteDTDAccess

This conversion option controls whether the parser may access the network in order to retrieve DTDs. If "AllowRemoteDTDAccess"->True, the parser will automatically access the network to retrieve DTDs. If "AllowRemoteDTDAccess"->False, then remote DTDs will not be retrieved, but local DTDs may still be used.

The option "AllowRemoteDTDAccess" and its possible values.

If "AllowRemoteDTDAccess"->False and the document refers to a remote DTD, the parse will fail and an error message will be generated unless the conversion option "ReadDTD" is also set to False.

AllowUnrecognizedEntities

This option determines what the parser will do if undefined entity references are encountered in the XML document.

The option "AllowUnrecognizedEntities" and its possible values.

The following examples contain an undefined entity called 'dogs'. If "AllowUnrecognizedEntities"->False, then an error message is reported and the parse fails.

If "AllowUnrecognizedEntities"->Automatic, an error message is reported for any unrecognized entity, and the entity is wrapped in special entity delimiter characters. However, this does not interrupt the importing and parsing of the XML data. Automatic is the default setting for this option.

If "AllowUnrecognizedEntities"->True, then any undefined entities are wrapped in special entity delimiter characters and no error messages are reported.
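The three settings can be tried on a string that refers to the undefined entity dogs. The surrounding markup is an assumption made for this sketch.

```mathematica
(* &dogs; is never declared, so it is an undefined entity reference *)
xml = "<pets>cats and &dogs;</pets>";

(* Error message, and the parse fails *)
ImportString[xml, "XML", ConversionOptions -> {"AllowUnrecognizedEntities" -> False}]

(* Default: a message is reported, but parsing continues *)
ImportString[xml, "XML", ConversionOptions -> {"AllowUnrecognizedEntities" -> Automatic}]

(* No message; the entity is kept, wrapped in entity delimiter characters *)
ImportString[xml, "XML", ConversionOptions -> {"AllowUnrecognizedEntities" -> True}]
```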

ReadDTD

This conversion option determines whether an external DTD subset is read or not. The most important uses of a DTD are to define a content model for validation and to define character entities.

The option "ReadDTD" and its possible values.

Since reading the DTD can directly affect the contents of the document, "ReadDTD"->True is the default setting for this option. Setting "ReadDTD"->False can improve the efficiency, but you should only make this change if you are certain that no information is required from the DTD. "ReadDTD" is ignored if you are using a preinitialized parser.

Setting "ReadDTD"->False is the only way to prevent the parser from attempting to read the DTD. Setting "AllowRemoteDTDAccess"->False will prevent network access and setting "ValidateAgainstDTD"->False will prevent validation from happening, but neither of these options will prevent an error caused by the parser failing to read the DTD.
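The interaction between these options can be sketched with a document that refers to a remote DTD. The DTD URL below is hypothetical and is not expected to resolve.

```mathematica
(* Document with an external DTD reference; the URL is a placeholder *)
xml = "<!DOCTYPE note SYSTEM 'http://www.example.com/note.dtd'><note>Hello</note>";

(* Network access is blocked, but the parser still tries to read the DTD,
   so by the description above this parse is expected to fail *)
ImportString[xml, "XML", ConversionOptions -> {"AllowRemoteDTDAccess" -> False}]

(* The DTD is not read at all, so no DTD-related error should arise *)
ImportString[xml, "XML", ConversionOptions -> {"ReadDTD" -> False}]
```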

ValidateAgainstDTD

This conversion option determines whether the XML document is validated or not.

The option "ValidateAgainstDTD" and its possible values.

If the document is valid, the parser will set the XMLObject["Document"] option "Valid"->True. If the document is invalid, the parser will generate validity error messages and will set "Valid"->False.

The following is an example of trying to parse a document that is not valid by setting "ValidateAgainstDTD"->True. The parser generates error messages because the document is not valid.

On the other hand, if the document is valid, then no messages are generated and "Valid"->True is included in the output.

Parsing both of the examples above with "ValidateAgainstDTD"->False generates no error messages, nor does it add a "Valid" option to XMLObject["Document"].

With "ValidateAgainstDTD"->True, validation is attempted even if there is no DOCTYPE declaration.

To turn validation on only when there is a DOCTYPE declaration, use the default setting "ValidateAgainstDTD"->Automatic. In the following example, no DTD is specified so the parser does not attempt to validate the XML string.

Here the parser tries to validate the input string because a DTD is specified explicitly.

Note that even when using a preinitialized parser, "ValidateAgainstDTD"->Automatic will not validate unless there is a DOCTYPE declaration in the document.
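The validation settings can be explored with a DTD embedded in the document itself, which avoids any network access. The note vocabulary below is invented for this sketch.

```mathematica
(* Internal DTD subset: a note must contain a to element followed by a body element *)
dtd = "<!DOCTYPE note [<!ELEMENT note (to, body)><!ELEMENT to (#PCDATA)><!ELEMENT body (#PCDATA)>]>";

valid     = dtd <> "<note><to>Reader</to><body>Hello</body></note>";
invalid   = dtd <> "<note><body>Hello</body></note>";   (* the required to element is missing *)
noDoctype = "<note><body>Hello</body></note>";          (* no DOCTYPE declaration at all *)

(* Validation requested explicitly *)
ImportString[valid,   "XML", ConversionOptions -> {"ValidateAgainstDTD" -> True}]
ImportString[invalid, "XML", ConversionOptions -> {"ValidateAgainstDTD" -> True}]

(* Default setting: validate only when a DOCTYPE declaration is present *)
ImportString[noDoctype, "XML", ConversionOptions -> {"ValidateAgainstDTD" -> Automatic}]
ImportString[invalid,   "XML", ConversionOptions -> {"ValidateAgainstDTD" -> Automatic}]
```

By the description above, the invalid document should produce validity messages and "Valid"->False wherever validation takes place, while the string without a DOCTYPE declaration should not be validated under the Automatic setting.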

IncludeDefaultedAttributes

This conversion option determines whether attributes that are specified by the DTD as default attributes are included in the SymbolicXML expression. "IncludeDefaultedAttributes"->False is the default setting for this option because the default values for the attributes will be known to application developers, and, therefore, it is unnecessary to include the values in the SymbolicXML expression. Setting "IncludeDefaultedAttributes"->True will include them in the SymbolicXML expression.

The option "IncludeDefaultedAttributes" and its possible values.

Here is a simple example to illustrate how this option works. For brevity, let us assign a variable to represent the XML fragment.

This converts the XML fragment into SymbolicXML.

If you want the default attributes to be included in the imported SymbolicXML, set "IncludeDefaultedAttributes" -> True.

Though default attributes are defined in a DTD, including them in the expression is not the same as validation; thus, default attributes can be included even with "ValidateAgainstDTD" -> False.
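A sketch of the option in use, with a defaulted attribute declared in an internal DTD subset. The item element and its unit attribute are invented for illustration.

```mathematica
(* The DTD declares a default value 'kg' for the unit attribute of item;
   the document instance itself does not set the attribute *)
xml = "<!DOCTYPE item [<!ELEMENT item (#PCDATA)><!ATTLIST item unit CDATA 'kg'>]><item>5</item>";

(* Default setting: the defaulted attribute should be left out *)
ImportString[xml, "XML", ConversionOptions -> {"IncludeDefaultedAttributes" -> False}]

(* The defaulted attribute should appear in the attribute list of the element *)
ImportString[xml, "XML", ConversionOptions -> {"IncludeDefaultedAttributes" -> True}]

(* Defaulted attributes can be included even without validation *)
ImportString[xml, "XML",
  ConversionOptions -> {"IncludeDefaultedAttributes" -> True,
                        "ValidateAgainstDTD" -> False}]
```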

IncludeEmbeddedObjects

This conversion option determines the treatment of comments and processing instructions that occur inside the document tree.

The option "IncludeEmbeddedObjects" and its possible values.

As before, we assign a simple XML fragment to a variable to facilitate the examples that follow.

With "IncludeEmbeddedObjects"->All, all the embedded objects will be included in the document tree.

Since comments and processing instructions are not intended to affect applications that use the XML document, they are usually excluded from the document tree. Including them runs the risk of hampering pattern matching. Hence, the default setting is "IncludeEmbeddedObjects"->None.

ProcessingInstructions and Comments

Setting "IncludeEmbeddedObjects"->"ProcessingInstructions" or "IncludeEmbeddedObjects"->"Comments" will include only the embedded processing instructions or comments, respectively. You can also set "IncludeEmbeddedObjects"->{"Comments", "ProcessingInstructions"} so that a list of the embedded comments and processing instructions will be included.
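The settings of this option might be compared as follows. The fragment, with one comment and one processing instruction inside the root element, is an assumption made for this sketch.

```mathematica
(* One comment and one processing instruction inside the document tree *)
xml = "<root><!-- a comment --><?target data?><child>text</child></root>";

(* Default: embedded objects are dropped from the document tree *)
ImportString[xml, "XML", ConversionOptions -> {"IncludeEmbeddedObjects" -> None}]

(* Keep everything *)
ImportString[xml, "XML", ConversionOptions -> {"IncludeEmbeddedObjects" -> All}]

(* Keep only one kind of embedded object *)
ImportString[xml, "XML", ConversionOptions -> {"IncludeEmbeddedObjects" -> "Comments"}]
ImportString[xml, "XML", ConversionOptions -> {"IncludeEmbeddedObjects" -> "ProcessingInstructions"}]
```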

IncludeNamespaces

This conversion option determines how namespaces are handled.

The option "IncludeNamespaces" and its possible values.

We will set a variable to represent a simple XML fragment with namespaces to facilitate our examples.

True

"IncludeNamespaces"->True reports the namespace information for each element and attribute via a list of the form {namespace, localname}. This form is verbose, but it is more faithful to the data model of the XML document. Additionally, this form may be easier to use for pattern matching.

False

"IncludeNamespaces"->False only reports the local name of each element or attribute. While this setting should not be used in any serious XML application, it is useful for applications that have only a single namespace because it makes the SymbolicXML expression easier to read. Note that the names of all the child elements appear to be identical when parsed this way. Consequently, this option value cannot be trusted whenever multiple namespaces are used.

Automatic

With the default value "IncludeNamespaces"->Automatic, the namespace is determined by means of scoping. If the namespace of an element is the same as the default namespace, then the name is represented as a single string for the local name. If the namespace of an element is different, then the name is represented by a list with the structure {namespace, localname}.

Here, we see that the only element whose name is represented by a two-string list is the one in namespace http://anothernamespace.com. The other elements are implicitly contained in the http://mynamespace.com namespace. Attributes are not compacted because according to the W3C specification, the attributes and the elements have different namespace scoping.

Unparsed

Because the XML namespace recommendation, which extends XML, was made after the initial XML recommendation, there are some documents that use names in a non-namespace-compliant fashion. "IncludeNamespaces"->"Unparsed" is provided to allow parsing of these documents. With this value, the name is always represented as a single string: the exact string that appears in the XML file. Unless absolutely necessary, this option value should not be used.
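The four settings can be compared on one fragment. The sketch below reuses the two namespace URIs mentioned above; the element names, the prefix, and the attribute are assumptions made for illustration.

```mathematica
(* Default namespace http://mynamespace.com, plus a prefixed second namespace;
   the two child elements share the local name "child" *)
xml = "<root xmlns='http://mynamespace.com' xmlns:other='http://anothernamespace.com'><child attr='1'/><other:child/></root>";

(* Every name reported as {namespace, localname} *)
ImportString[xml, "XML", ConversionOptions -> {"IncludeNamespaces" -> True}]

(* Local names only: the two child elements become indistinguishable *)
ImportString[xml, "XML", ConversionOptions -> {"IncludeNamespaces" -> False}]

(* Default: the namespace is shown only where it differs from the one in scope *)
ImportString[xml, "XML", ConversionOptions -> {"IncludeNamespaces" -> Automatic}]

(* Names exactly as written in the source, prefix included *)
ImportString[xml, "XML", ConversionOptions -> {"IncludeNamespaces" -> "Unparsed"}]
```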

PreserveCDATASections

This option controls whether the distinction between CDATA sections and regular character data is maintained on import. The value can be either True or False. CDATA sections are meant as a convenience for document authors; for most applications, they should not be treated differently from ordinary data. This means that preserving CDATA sections can make pattern matching difficult. For this reason, "PreserveCDATASections"->False is the default setting.

The option "PreserveCDATASections" and its possible values.

Here is an example of the default behavior of PreserveCDATASections.

To preserve CDATA sections, specify "PreserveCDATASections"->True.
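Both settings can be tried on a string that mixes ordinary character data with a CDATA section. The sample content is invented for this sketch.

```mathematica
(* Ordinary character data followed by a CDATA section containing markup characters *)
xml = "<a>plain text <![CDATA[<raw> & unescaped]]></a>";

(* Default: the CDATA section is treated as ordinary character data *)
ImportString[xml, "XML", ConversionOptions -> {"PreserveCDATASections" -> False}]

(* The distinction between the CDATA section and ordinary data is kept *)
ImportString[xml, "XML", ConversionOptions -> {"PreserveCDATASections" -> True}]
```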



     
Copyright © Wolfram Media, Inc. All rights reserved.