Transform Legacy Data to XML Using JAXP

The ability to parse and transform XML documents is nothing new for Java developers. There are several packages available that come with the tools to do it. But these products all have incompatible implementations, so once you start using one package, you’re locked in.

Sun Microsystems has attacked this problem by creating interfaces for many different aspects of XML programming. One of the packaged interfaces Sun created is the Java API for XML Processing (JAXP), which is part of its early release XML pack. JAXP abstracts both XML parsing and transformations. Thus, writing your XML processing code using JAXP will make it portable to any other implementation that supports JAXP. As of this writing, the XML pack comes with Xerces2, XalanJ, and Crimson, which all support JAXP. Many people consider Apache’s Xerces2 and XalanJ to be the de facto standards for Java XML processing anyway, so their support for JAXP is no surprise.

In this article I will show you how to create your own legacy data parser using JAXP. Why write a parser for a non-XML document using JAXP? I want to show how to transform a legacy data file into an XML document. By using JAXP to parse the file I will be able to create a DOM representation of it, which can then easily be transformed to XML.

Later in the article, I will demonstrate a simple XML transformation using JAXP. I’ll use the legacy data parser created in the first part of the article to transform a comma-separated value (CSV) file into an XML document. Because the transformation uses JAXP, the code will never have to change so long as the parser you use conforms to JAXP’s interface.

Writing the Parser
Before beginning, I should define exactly the type of file format that I’ll be using. Well before XML came along, developers used various types of ASCII data formats to exchange data, and they often still do. One of the most common of these formats is comma-separated values. The CSV format is a very straightforward way of representing tabular data. In a CSV file each line represents a row of data. The row data is just a list of column values delimited by commas. There are many other ASCII formats similar to CSV, and the difference between them is often as minor as a different delimiter, such as the pipe character.

There are many different approaches to parsing a CSV file. However, I want to parse the CSV file into a DOM document, so I’ll use the Simple API for XML (SAX). SAX, as you may know, is event driven. It differs from a DOM parser in that a DOM parser loads the whole document into memory, whereas a SAX parser simply fires an event every time it encounters a tag in the XML document. I chose SAX because it allows me to parse the CSV file line by line and column by column, firing SAX events for each element. Each of these elements will then be added to the DOM tree, resulting in a DOM representation of my CSV file.
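To make the event model concrete, here is a minimal sketch that records the SAX events a standard JAXP parser fires for a small XML string. The class name SaxEventSketch and the log format are made up for illustration; only the JAXP and SAX calls are real API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects SAX callbacks as readable strings.
class SaxEventSketch {
    static List<String> events(String xml) throws Exception {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                log.add("start:" + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                log.add("end:" + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Ignore whitespace-only text to keep the log short
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    log.add("text:" + text);
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return log;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(events("<line><value>a</value></line>"));
    }
}
```

The CSV parser in this article plays the role of the parser in this sketch: it fires the same kind of callbacks, only its input is not XML.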

In order to write a JAXP parser one needs to implement the XMLReader interface. Because I am going to write a few different parsers I decided to abstract XMLReader by creating an abstract class named AbstractXMLReader. To implement the XMLReader interface I first determined what imports and properties I would need for the required methods. Below is the AbstractXMLReader class with just the imports and properties.

   import java.io.*;
   import java.util.*;
   import org.xml.sax.*;

   public abstract class AbstractXMLReader
      implements org.xml.sax.XMLReader
   {
      private Hashtable handlers = new Hashtable();
      private Hashtable properties = new Hashtable();
      private Hashtable features = new Hashtable();
   }

With the shell of my class ready, I can start implementing the required methods of the XMLReader interface, starting with the basic getters and setters.

   public void setContentHandler(ContentHandler handler) {
      this.handlers.put("ContentHandler", handler);
   }

   public void setDTDHandler(DTDHandler handler) {
      this.handlers.put("DTDHandler", handler);
   }

   public void setEntityResolver(EntityResolver handler) {
      this.handlers.put("EntityResolver", handler);
   }

   public void setErrorHandler(ErrorHandler handler) {
      this.handlers.put("ErrorHandler", handler);
   }

   public ContentHandler getContentHandler() {
      return (ContentHandler) this.handlers.get("ContentHandler");
   }

   public DTDHandler getDTDHandler() {
      return (DTDHandler) this.handlers.get("DTDHandler");
   }

   public EntityResolver getEntityResolver() {
      return (EntityResolver) this.handlers.get("EntityResolver");
   }

   public ErrorHandler getErrorHandler() {
      return (ErrorHandler) this.handlers.get("ErrorHandler");
   }

As you can see, the XMLReader interface requires implementers to have methods to get and set four different handlers: ContentHandler, DTDHandler, EntityResolver, and ErrorHandler. I created a Hashtable named handlers to hold their values. You must be careful to cast objects that you get from Hashtables to their appropriate type because the get method always returns an Object. The XMLReader interface also requires accessors and mutators for supported properties and features, which need to throw specific exceptions in case a requested property or feature is not supported.

   public void setFeature(String name, boolean value)
      throws SAXNotRecognizedException, SAXNotSupportedException
   {
      this.features.put(name, Boolean.valueOf(value));
   }

   public boolean getFeature(String name)
      throws SAXNotRecognizedException, SAXNotSupportedException
   {
      Boolean value = (Boolean) this.features.get(name);
      // A feature that was never set is unknown to this reader
      if(value == null)
         throw new SAXNotRecognizedException(name);
      return value.booleanValue();
   }

   public Object getProperty(String name)
      throws SAXNotRecognizedException, SAXNotSupportedException
   {
      return this.properties.get(name);
   }

   public void setProperty(String name, Object value)
      throws SAXNotRecognizedException, SAXNotSupportedException
   {
      this.properties.put(name, value);
   }

Next comes the all-important parse method. This is an abstract class, so normally you want the parse method to be abstract. That way, any class that extends the abstract class would need to override the parse method. However, there is a certain amount of common work the parse method needs to do, so I added an additional method, parseImplementation, and made that method abstract instead.

   public void parse(String systemId)
      throws IOException, SAXException
   {
      parse(new InputSource(systemId));
   }

   public void parse(InputSource input)
      throws IOException, SAXException
   {
      BufferedReader br = null;

      if(input.getCharacterStream() != null)
         br = new BufferedReader(input.getCharacterStream());
      else if(input.getByteStream() != null)
         br = new BufferedReader(new InputStreamReader
            (input.getByteStream()));
      else if(input.getSystemId() != null) {
         java.net.URL url = new java.net.URL(input.getSystemId());
         br = new BufferedReader(new InputStreamReader(url.openStream()));
      }
      else
         throw new SAXException("Invalid InputSource object");

      this.parseImplementation(br);
   }

   public abstract void parseImplementation(BufferedReader br)
      throws IOException, SAXException;

The XMLReader interface requires an overloaded parse method. My parse method doesn’t actually parse anything; it offloads the actual parsing to the parseImplementation method. It sets up a BufferedReader for the passed-in InputSource, trying the character stream first, then the byte stream, then the system ID, and throws a SAXException if it can’t create a BufferedReader.
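The fallback order can be tried in isolation. The following is a minimal standalone sketch of the same resolution logic; the class name InputSourceSketch is hypothetical, and honoring InputSource.getEncoding() for byte-based sources is an assumption added here, not something the parse method above does.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper mirroring the resolution order used by parse().
class InputSourceSketch {
    static BufferedReader open(InputSource input)
            throws IOException, SAXException {
        // 1. A character stream needs no decoding
        if (input.getCharacterStream() != null) {
            return new BufferedReader(input.getCharacterStream());
        }
        // Assumption: use the declared encoding if there is one,
        // otherwise fall back to the platform default
        Charset charset = input.getEncoding() != null
            ? Charset.forName(input.getEncoding())
            : Charset.defaultCharset();
        // 2. A byte stream is decoded with that charset
        if (input.getByteStream() != null) {
            return new BufferedReader(
                new InputStreamReader(input.getByteStream(), charset));
        }
        // 3. A system ID is treated as a URL and opened
        if (input.getSystemId() != null) {
            java.net.URL url = new java.net.URL(input.getSystemId());
            return new BufferedReader(
                new InputStreamReader(url.openStream(), charset));
        }
        throw new SAXException("Invalid InputSource object");
    }
}
```

Passing an InputSource that wraps a StringReader is a convenient way to exercise a reader in a unit test without touching the file system.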

Abstraction and Reuse
Now that I have my AbstractXMLReader class I can easily extend it to create any kind of parser I need. Because I want to create parsers to handle legacy data formats that represent rows with lines, I created another abstract class to parse each line. Here is the complete class.

   import java.io.*;
   import org.xml.sax.*;
   import org.xml.sax.helpers.*;

   public abstract class AbstractLineReader extends AbstractXMLReader
   {
      public void parseInput(BufferedReader br)
         throws IOException, SAXException
      {
         String line = null;
         while((line = br.readLine()) != null)
         {
            line = line.trim();
            if(line.length() > 0)
               this.parseLine(line);
         }
      }

      public abstract void parseLine(String line)
         throws IOException, SAXException;
   }

This class loops through the input from the BufferedReader and then calls the parseLine method on the resulting String. Whatever class extends the AbstractLineReader must override the parseLine method and ultimately fire a SAX event based on the line.

Finally, I am ready to parse a CSV file. To do that I extend the AbstractLineReader class with a new class named CSVReader. Below is the shell class.

   import java.io.*;
   import java.util.*;
   import org.xml.sax.*;
   import org.xml.sax.helpers.*;

   public class CSVReader extends AbstractLineReader
   {
      private ContentHandler ch = null;
   }

This code first imports all the classes I need and then declares a ContentHandler. The ContentHandler receives the SAX events that describe the tree I want to build. Now that I have the shell class I can override the parseImplementation method.

   public void parseImplementation(BufferedReader br)
      throws IOException, SAXException
   {
      this.ch = getContentHandler();

      ch.startDocument();
      ch.startElement("", "", "csv", new AttributesImpl());

      this.parseInput(br);

      ch.endElement("", "", "csv");
      ch.endDocument();
   }

This code references the ContentHandler, and then begins to create the DOM tree. The CSV file has an element named csv for the root of the tree. This element will have no attributes, so I call the default AttributesImpl constructor. Everything in XML is a container, so I need to fire an event at the start and end of each container.

XML containers are, of course, represented with tags, so when the startElement method is called the resulting tag is <csv>. Later I will call the endElement method to create the closing tag </csv>, but first I will parse each line of the file by calling the parseInput method of the AbstractLineReader. Remember that the parseInput method calls the parseLine method on each line it finds. However, parseLine is an abstract method, so I need to override it in CSVReader.

   public void parseLine(String line) throws IOException, SAXException
   {
      StringTokenizer st = new StringTokenizer(line, ",");
      String curElement = null;

      ch.startElement("", "", "line", new AttributesImpl());
      while(st.hasMoreTokens())
         this.parseElement(st.nextToken());
      ch.endElement("", "", "line");
   }

Because this is a CSV file, I know that each line is just a list of data separated by commas. To pull out each column’s value, I create an instance of the StringTokenizer class. But first I create the tag <line>, a container for each row of data, using a new element named line. With that done, a loop passes each token’s value to the parseElement method. Once all of the tokens have been parsed, I can close the line container by calling endElement, which results in the tag </line>.

   private void parseElement(String element) throws IOException, SAXException
   {
      ch.startElement("", "", "value", new AttributesImpl());
      element = this.cleanQuotes(element);
      ch.characters(element.toCharArray(), 0, element.length());
      ch.endElement("", "", "value");
   }

The String element that is being passed into the method parseElement is the actual value of the column I am looking for, and I want to wrap the column’s value in a value container. To do this, the startElement method creates the tag <value>. CSV files sometimes wrap data values in quotation marks, so I call the method cleanQuotes to strip any surrounding quotation marks and overwrite the element with the result.

With the String element all cleaned up, I can finally put some data in the containers I have made. I do that by calling the characters method, which expects a character array as well as the index it should start with and the length. The String class’s toCharArray method gets the character array and the length method finds its length. Finally, the endElement method creates the closing </value> tag.

   private String cleanQuotes(String element)
   {
      // Require at least two characters so a lone quote is left alone
      if(element.length() >= 2 &&
         element.startsWith("\"") && element.endsWith("\""))
         return element.substring(1, element.length() - 1);
      else
         return element;
   }

The cleanQuotes method checks to see if the element has a quotation mark at the start and end of the String. If it finds a quotation mark at the start and end, it strips them off and returns the String. Otherwise it returns the String untouched.
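One limitation of this approach is worth knowing: StringTokenizer treats consecutive delimiters as a single delimiter, so empty columns disappear, and it has no notion of quoting, so a comma inside a quoted value splits the value in two. If your data needs either, a hand-written splitter is one option. The sketch below is an illustration under stated assumptions, not part of the article's classes: the name DelimitedLineSketch is made up, and the convention that a doubled quote inside a quoted field stands for one literal quote is assumed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical replacement for the StringTokenizer step in parseLine.
class DelimitedLineSketch {
    /**
     * Splits one line on the delimiter, keeping empty fields and
     * ignoring delimiters that appear inside double quotes.
     * Surrounding quotes are removed from the returned fields.
     */
    static List<String> split(String line, char delimiter) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                boolean doubled = inQuotes
                    && i + 1 < line.length()
                    && line.charAt(i + 1) == '"';
                if (doubled) {
                    current.append('"'); // "" inside quotes is a literal quote
                    i++;
                } else {
                    inQuotes = !inQuotes; // opening or closing quote
                }
            } else if (c == delimiter && !inQuotes) {
                fields.add(current.toString()); // end of one field
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString()); // last field, possibly empty
        return fields;
    }
}
```

Because the delimiter is a parameter, the same method would serve a pipe-delimited reader by passing '|' instead of ','.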

Reusing the Parsing Code
Before I use my new CSVReader class to transform CSV files into XML documents, I thought I would implement a PipeReader class to parse pipe-delimited files instead of CSV files. This helps demonstrate other ways you might make use of the parsing code once you’ve created it. Assuming I abstracted things correctly, the PipeReader class should be easy to create. The PipeReader class is the same as the CSVReader class, semantically; I have included the diff of CSVReader and PipeReader below to show how similar they are.

   6c6
   < public class CSVReader extends AbstractLineReader
   ---
   > public class PipeReader extends AbstractLineReader
   15c15
   <               ch.startElement("", "", "csv", new AttributesImpl());
   ---
   >               ch.startElement("", "", "pipe", new AttributesImpl());
   19c19
   <               ch.endElement("", "", "csv");
   ---
   >               ch.endElement("", "", "pipe");
   25c25
   <               StringTokenizer st = new StringTokenizer(line, ",");
   ---
   >               StringTokenizer st = new StringTokenizer(line, "|");

Doing the XML Transformation
Now I’ll actually use the JAXP parser I’ve just created in a transformation. In order to do any type of transformation you need two things: an XML file and an XSL file. The XML file is what you want to transform from and the XSL tells the XSLT engine how to transform it. This parser creates a DOM tree that outputs directly to XML, so you don’t actually need the XSL file. However, there could be instances in which you’d want to transform the DOM tree into another XML format. With that in mind, I am going to create a command-line application that takes an input filename as a parameter and, optionally, an XSL filename.

   if(args.length == 0)
   {
      System.err.println("Usage: java " +
         Processor.class.getName() + " <data file> [xslt file]");
      System.exit(1);
   }
   String dataFile = args[0];

The above code checks to see if the application was run without any parameters and, if so, prints out usage information. Once the user successfully supplies a filename for the input data file, you set the String dataFile to its value. Next, create an instance of the SAXTransformerFactory class.

   SAXTransformerFactory saxTransFact = (SAXTransformerFactory)
      TransformerFactory.newInstance();
   TransformerHandler transHand = null;

I need a TransformerHandler because it will apply the XSL file to the transformation, if one is provided. TransformerHandler is an interface, so I obtain an instance from the factory’s newTransformerHandler method, which can optionally take a StreamSource as an argument.

   if(args.length > 1)
      transHand = saxTransFact.newTransformerHandler
         (new StreamSource(new File(args[1])));
   else
      transHand = saxTransFact.newTransformerHandler();

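This code checks to see if an XSL file was specified on the command line and, if so, passes a new StreamSource instance to newTransformerHandler. Without an XSL file, the no-argument version returns a handler that copies its input to the result unchanged.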
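To show what the XSL branch does, here is a minimal sketch that builds a TransformerHandler from a stylesheet held in a string and feeds it the same kind of SAX events the CSV parser fires. The stylesheet, the class name XslHandlerSketch, and the method firstColumn are all invented for illustration. Note one deliberate difference from the parser code: the sketch passes the element name as both local name and qualified name, on the assumption that some XSLT processors match templates against the local name.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical demo: applies a stylesheet to hand-fired SAX events.
class XslHandlerSketch {
    // Made-up stylesheet: emits the first value of every line,
    // each followed by a semicolon, as plain text
    static final String FIRST_COLUMN_XSL =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'>"
        + "<xsl:for-each select='csv/line'>"
        + "<xsl:value-of select='value[1]'/><xsl:text>;</xsl:text>"
        + "</xsl:for-each>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String firstColumn(String[][] rows) throws Exception {
        SAXTransformerFactory factory =
            (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler(
            new StreamSource(new StringReader(FIRST_COLUMN_XSL)));
        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        // Fire the csv/line/value events by hand
        AttributesImpl none = new AttributesImpl();
        handler.startDocument();
        handler.startElement("", "csv", "csv", none);
        for (String[] row : rows) {
            handler.startElement("", "line", "line", none);
            for (String value : row) {
                handler.startElement("", "value", "value", none);
                handler.characters(value.toCharArray(), 0, value.length());
                handler.endElement("", "value", "value");
            }
            handler.endElement("", "line", "line");
        }
        handler.endElement("", "csv", "csv");
        handler.endDocument(); // the transformation runs once input ends
        return out.toString();
    }
}
```

The handler does not care where the events come from, which is why the same TransformerHandler can be plugged into any XMLReader as its ContentHandler.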

   transHand.setResult(new StreamResult(System.out));
   XMLReader reader = null;

The TransformerHandler also needs to know where to stream its results. In this case I have chosen to use System.out, but any stream would work just fine. Now I need to create an instance of the parser. To make the code more generic, I cast whatever parser I instantiate to the XMLReader interface.

   if(dataFile.endsWith(".csv"))
      reader = (XMLReader) new CSVReader();
   else if(dataFile.endsWith(".pipe"))
      reader = (XMLReader) new PipeReader();
   else
   {
      System.err.println("Invalid file extension");
      System.exit(1);
   }
   InputSource is = new InputSource(new FileReader(dataFile));

The above logic block simply looks at the file extension of the input data file to determine which parser to use. If it can’t match up an extension with a parser, it simply prints an error and exits the program. After creating an instance of an XMLReader I create a new InputSource using the data file.

   reader.setContentHandler(transHand);
   reader.parse(is);
   System.out.println();

I then pass my TransformerHandler to my XMLReader as the ContentHandler. All that is left to do is call the parse method, which takes an InputSource as a parameter. Finally, I print a line break after the output for good measure.
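For readers who want to try the whole pipeline in one file, the following is a compact end-to-end sketch. It is not the article's class hierarchy: the hypothetical MiniCsvReader extends org.xml.sax.helpers.XMLFilterImpl, which already supplies the handler getters and setters, and it reads from a string instead of a file. Omitting the XML declaration and passing the element name as both local name and qualified name are choices made for this sketch only.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.StringTokenizer;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical single-class variant of the CSV reader, for experimenting.
class MiniCsvReader extends XMLFilterImpl {
    @Override
    public void parse(InputSource input) throws IOException, SAXException {
        if (input.getCharacterStream() == null) {
            throw new SAXException("This sketch only accepts character streams");
        }
        ContentHandler ch = getContentHandler();
        BufferedReader br = new BufferedReader(input.getCharacterStream());
        AttributesImpl none = new AttributesImpl();

        ch.startDocument();
        ch.startElement("", "csv", "csv", none);
        String line;
        while ((line = br.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines, as AbstractLineReader does
            }
            ch.startElement("", "line", "line", none);
            StringTokenizer st = new StringTokenizer(line, ",");
            while (st.hasMoreTokens()) {
                String value = stripQuotes(st.nextToken());
                ch.startElement("", "value", "value", none);
                ch.characters(value.toCharArray(), 0, value.length());
                ch.endElement("", "value", "value");
            }
            ch.endElement("", "line", "line");
        }
        ch.endElement("", "csv", "csv");
        ch.endDocument();
    }

    private static String stripQuotes(String s) {
        boolean quoted = s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"");
        return quoted ? s.substring(1, s.length() - 1) : s;
    }

    // Runs the reader through an identity TransformerHandler
    static String toXml(String csv) throws Exception {
        SAXTransformerFactory factory =
            (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer()
            .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        MiniCsvReader reader = new MiniCsvReader();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(csv)));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toXml("id,name\n1,\"Ada\"\n"));
    }
}
```

Running main should print a csv root element containing one line element per input row, each holding one value element per column.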

The parser I created in this tutorial reads legacy data files and creates a DOM representation of them. I created a very simple program to do an XML transformation based on this DOM. I then used this program to test the legacy data parsers I created. I could have focused simply on the transformation code, but I thought a proper way to introduce JAXP was to show how it can be used to do useful, if unexpected, new things.
