First published by IBM at http://www-106.ibm.com/developerworks/xml/library/x-tipstx3/index.html.
The screening or classification of XML documents is a common problem, especially in XML middleware. Routing XML documents to specific processors may require analysis of both the document type and the document content. The problem here is obtaining the required information from the document with the least possible overhead. Traditional parsers such as DOM or SAX are not well suited to this task. DOM, for example, parses the whole document and constructs a complete document tree in memory before it returns control to the client. Even DOM parsers that employ deferred node expansion, and thus are able to parse a document partially, have high resource demands because the document tree must be at least partially constructed in memory. This is simply not acceptable for screening purposes.
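For contrast, here is a minimal sketch of DOM-based screening. The class name DomScreening and the method kindOf are hypothetical names chosen for illustration, and the document is read from a string to keep the example self-contained. Although only one attribute of the root element is needed, the whole document is parsed and its tree is built before the attribute can be read.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only
class DomScreening {

    /** Returns the root element's "kind" attribute, or null if absent. */
    static String kindOf(String xml) {
        try {
            // parse() returns only after the complete tree has been built
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();
            return root.hasAttribute("kind") ? root.getAttribute("kind") : null;
        } catch (Exception e) {
            // Unchecked wrapper keeps the sketch short
            throw new IllegalStateException("could not screen document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(kindOf("<order kind=\"invoice\"><item/><item/></order>"));
    }
}
```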
Like DOM, SAX parsers control the complete parsing process. By default, a SAX parser starts parsing at the beginning of a document and continues until the end. Client event handlers are informed through callbacks about the events during this parsing process. To avoid unnecessary overhead during document screening, such an event handler may want to stop the parsing process once it has gathered the required information. A common technique for achieving this in SAX is throwing an exception, which is discussed in the developerWorks tip "Stop a SAX parser when you have enough data" by Nicholas Chase. This will cause SAX to stop the parsing process. The information gathered by the event handler must be encoded in an error message that's wrapped in an exception object and posted to the parser's client. A special error handler in the client receives this exception and must parse the parser's error message to retrieve the required information! This may be a solution to the screening problem, but it's a complicated one.
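The exception technique might be sketched as follows. This is a simplified variant, not the code from the tip mentioned above: instead of encoding the result in an error message, it carries the value in a field of a custom exception. The names SaxScreening, StopParsing, and kindOf are hypothetical, and the sketch assumes that the parser passes a SAXException thrown by a handler through to the caller unchanged, as the JDK's built-in parser does.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only
class SaxScreening {

    /** Hypothetical marker exception that carries the result out of the parser. */
    static class StopParsing extends SAXException {
        final String kind;

        StopParsing(String kind) {
            super("enough data gathered");
            this.kind = kind;
        }
    }

    /** Returns the root element's "kind" attribute, or null if absent. */
    static String kindOf(String xml) {
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts)
                    throws SAXException {
                // The first start tag is the root element: abort parsing here
                throw new StopParsing(atts.getValue("kind"));
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return null; // only reached if no start tag was reported
        } catch (StopParsing stop) {
            // The "error" is in fact the normal way out
            return stop.kind;
        } catch (Exception e) {
            throw new IllegalStateException("could not screen document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(kindOf("<order kind=\"invoice\"><item/></order>"));
    }
}
```

Even in this shortened form, normal control flow is routed through an exception handler, which is the complication described above.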
Enter StAX
StAX offers a pull parser that gives client applications full control over the parsing process. A client application may decide at any time to discontinue the parsing process, and no tricks are required to stop the parser. This is ideal for screening purposes.
Listing 1 shows what a simple document classifier might look like. I use the cursor-based StAX API for this example. At the very first start tag of the document (the root element tag), I retrieve the kind attribute from this element. The value of this attribute is then passed back to the client and the parsing process is discontinued. The client may now act upon this returned value.
Listing 1. Screening documents
import java.io.*;
import javax.xml.stream.*;

public class Classifier {
    // Holds factory instance
    private XMLInputFactory xmlif;

    public static void main(String[] args)
            throws IOException, XMLStreamException {
        Classifier router = new Classifier();
        String kind1 = router.getKind("somefile.xml");
        String kind2 = router.getKind("otherfile.xml");
    }

    /**
     * Return the document kind
     * @param filename - the XML file to classify
     * @return the value of the "kind" attribute of the root element,
     *         or null if the root element has no such attribute
     */
    private String getKind(String filename)
            throws IOException, XMLStreamException {
        // Create input factory lazily
        if (xmlif == null) {
            // Use reference implementation (remove this call to
            // use the platform's default StAX implementation)
            System.setProperty(
                "javax.xml.stream.XMLInputFactory",
                "com.bea.xml.stream.MXParserFactory");
            xmlif = XMLInputFactory.newInstance();
        }
        // Open the file; try-with-resources closes it on every return path
        try (Reader in = new FileReader(filename)) {
            // Create stream reader
            XMLStreamReader xmlr = xmlif.createXMLStreamReader(in);
            // Main event loop
            while (xmlr.hasNext()) {
                // Process single event
                switch (xmlr.getEventType()) {
                    // Process start tags
                    case XMLStreamReader.START_ELEMENT :
                        // Check attributes for first start tag
                        for (int i = 0; i < xmlr.getAttributeCount(); i++) {
                            // Get attribute name (local part only)
                            String localName = xmlr.getAttributeLocalName(i);
                            if (localName.equals("kind")) {
                                // Return value
                                return xmlr.getAttributeValue(i);
                            }
                        }
                        // Root element has no "kind" attribute
                        return null;
                }
                // Move to next event
                xmlr.next();
            }
            return null;
        }
    }
}
Note that I use an instance field to hold the XMLInputFactory instance. This is done to improve efficiency. Compared to the actual parsing process (which is very fast), executing XMLInputFactory.newInstance() and xmlif.createXMLStreamReader() causes considerable overhead. While createXMLStreamReader() must be executed once for each new document, you may reuse the XMLInputFactory instance and thus avoid the repeated execution of XMLInputFactory.newInstance().
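The reuse pattern can be sketched in a self-contained form as follows. This is an illustration under a few assumptions, not a replacement for Listing 1: the class name ReusingClassifier and the method kindOf are hypothetical, documents are read from strings rather than files, and the platform's default StAX implementation is used instead of the reference implementation. The factory is created once in a field initializer, while a new stream reader is created for each document.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, for illustration only
class ReusingClassifier {
    // Created once and shared by every call to kindOf()
    private final XMLInputFactory factory = XMLInputFactory.newInstance();

    /** Returns the root element's "kind" attribute, or null if absent. */
    String kindOf(String xml) {
        try {
            // A new reader is still required for each document
            XMLStreamReader reader =
                factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    // Skip comments, processing instructions, and so on
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        // A null namespace URI means the namespace is not checked
                        return reader.getAttributeValue(null, "kind");
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Unchecked wrapper keeps the sketch short
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        ReusingClassifier classifier = new ReusingClassifier();
        String[] documents = {
            "<order kind=\"invoice\"><item/></order>",
            "<?xml version=\"1.0\"?><!-- note --><doc kind=\"receipt\"/>"
        };
        for (String document : documents) {
            // Both documents are screened with the same factory instance
            System.out.println(classifier.kindOf(document));
        }
    }
}
```

Here parsing stops as soon as the root element's start tag has been read, so the rest of each document is never examined.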
Next steps
This tip demonstrated the use of StAX parsers for screening and classification of XML documents. In the next tip, I will show how XML documents can be created through the StAX API.