Processing EDI Documents into XML with Python

Many companies devote a sizeable portion of their IT infrastructure to converting traditional EDI data to and from the data formats their back office systems use. Typically, they handle this conversion using software packages purchased from EDI software vendors. But as more and more back-end systems become capable of consuming XML, it’s becoming increasingly attractive to avoid all the proprietary formats and simply translate EDI X12 data to and from XML. This article shows you how to create your own tools for parsing, validating, and then translating EDI X12 data to XML. All the code examples for this article are in Python, but you could just as easily use any other programming language.

Introduction
EDI (Electronic Data Interchange) is a generic term used to describe the exchange of electronic business documents between business partners. Specific incarnations of EDI such as ANSI X12, EDIFACT, Tradacom, and TDCC are character-delimited text files that follow a specific format.

Python
Python is an object-oriented, byte-compiled language with a clean syntax, clear and consistent philosophy, and a strong user community. These attributes (both of the language and the community) make it possible to quickly write working, maintainable code, which in turn makes Python an excellent choice for nearly any programming task. Processing any “flavor” of EDI is no exception.

EDI Translation
Traditional EDI data such as X12 is rarely integrated directly into back office systems. While some ERP systems (and certainly some other types of applications) provide direct support for importing EDI data, it’s far more common for developers to convert the EDI data to a format more usable by the back office systems, such as a flat file (either fixed-length, record-based or delimited) or XML. EDI software vendors offer what are basically EDI development environments in which you can create custom data transformation descriptions and push EDI data through the translation tool to complete the conversion.

The approach taken in this article illustrates how to build your own simple EDI-to-XML transformation framework. The advantages of such an approach are:

  • No costs for EDI transformation software, which can be quite expensive.
  • Absolute flexibility in the conversion of data (because you have all the power of a programming language available).
  • Potentially higher staffing productivity with custom rather than vendor development environments.

This approach is not without potential problems, which warrant the following warnings:

  • While you can find excellent help from the Python community, no specific vendor will be available to help you solve problems.
  • This is an approach for more technically advanced individuals, which you should take into account when considering appropriate staffing.
  • This particular framework is not mature enough for production use and other similar freely available frameworks may not be overly mature, either.

Sample X12 Input
You can find all the files and code described in this article in the downloadable sample code. For example, the sample input file used for this article is an X12 Purchase Order (also referred to as a PO or an 850 transaction set). Here’s the sample document:

      ISA*  *          *  *          *ZZ*SENDER         *ZZ*RECEIVER   *041201*1200*U*00305*000000101*1*P*^!GS*PO*   SENDER*RECEIVER*041201*1200*101*X*003050!ST   *850*000000101!BEG*22*NE*101**041201*123456   !FOB*DF*ZZ*JMJ!DTM*037*041205!DTM*038*04121   5!DTM*002*041218!TD1*CNT90*1!TD5****JJ*X!TD   3*40!N1*OB**92*7759!N3*111   Buyer St!N4*Conyers*GA*30094*US!N1*SE*Foo Bar   Sellers!N4****US!REF*DP*101!PO1*100*1*EA***   ZZ*BL47*HD*100!PID*F****Widget!PO4**1*EA!N1   *CT**38*CN!N4****CN!CTT*1*100!SE*22*0000001   01!GE*1*101!IEA*1*000000101!

As you can see, it’s difficult to discern the document’s structure simply by looking at it. Here’s the same document rendered in a “prettified” view (unwrapped at the segment level and indented to show the looping structure):

   ISA*  *          *  *          *ZZ*SENDER               *ZZ*RECEIVER             *041201*1200*U*00305*000000101*1*P*^!
     GS*PO*SENDER*RECEIVER*041201*1200*101*X*003050!
       ST*850*000000101!
         BEG*22*NE*101**041201*123456!
         FOB*DF*ZZ*JMJ!
         DTM*037*041205!
         DTM*038*041215!
         DTM*002*041218!
         TD1*CNT90*1!
         TD5****JJ*X!
         TD3*40!
         N1*OB**92*7759!
           N3*111 Buyer St!
           N4*Conyers*GA*30094*US!
         N1*SE*Foo Bar Sellers!
           N4****US!
           REF*DP*101!
         PO1*100*1*EA***ZZ*BL47*HD*100!
           PID*F****Widget!
           PO4**1*BC!
           N1*ST**9!
             N4****US!
         CTT*1*100!
       SE*22*000000101!
     GE*1*101!
   IEA*1*000000101!

Unwrapping and formatting the input helps to show the structure, but the content of the document is still not likely to be very clear to most people.
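
To make the delimiter structure concrete, here is a minimal sketch that splits a fragment of the sample document into segments and elements. It assumes the delimiters are already known to be "!" (segment terminator) and "*" (element separator), as in the sample. The function name split_x12 is purely illustrative and is not part of the article's downloadable code.

```python
def split_x12(text, segment_terminator="!", element_separator="*"):
    """Split raw X12 text into a list of segments, each a list of elements."""
    segments = []
    for raw in text.split(segment_terminator):
        raw = raw.strip("\r\n")  # tolerate line breaks between segments
        if raw:
            segments.append(raw.split(element_separator))
    return segments


fragment = "ST*850*000000101!BEG*22*NE*101**041201*123456!DTM*002*041218!"
for elements in split_x12(fragment):
    # The first element is the segment tag; the rest are its data elements.
    print(elements[0], elements[1:])
```

Empty strings in the element list (such as the fourth BEG element) correspond to elements that were left blank between two separators.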

A Layered Translation Framework
The framework described in this article has a layered design. The layers are as follows:

  • General Purpose State Machine
  • Generic EDI Parsing Layer
  • EDI X12 Parser Plugin
  • Handlers

While the design is layered, it is also intended as a pluggable component architecture. The EDI X12 parser and the EDI handler are both plugins to a generic parser. It is designed like this to easily allow other parsers (such as those for EDIFACT or Tradacom) and handlers (such as translators for different transaction set types) to be plugged in.

General Purpose State Machine
The topmost layer of the framework is a state machine borrowed from chapter four of David Mertz’s excellent book, “Text Processing in Python.”

Author’s Note: Anyone interested in parsing EDI (or any other form of text) files with Python would greatly benefit from reading Mertz’s book. While the book is not about EDI, it is about processing text—and EDI is simply text.

I decided to use Mertz’s state machine because it is an excellent piece of concise, understandable, and usable code. It was written with the obvious intention of handling a file in a stateful manner, which is an approach well suited for EDI. While writing this EDI parser, I decomposed the parsing into four states:

  1. Looking for the beginning of an EDI document
  2. In a header segment
  3. In a body segment
  4. In a trailer segment

You’ll find the state machine code in the downloadable code that accompanies this article, in the file state_machine.py (see Listing 1).
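
For readers without the download at hand, the following is a rough sketch of a handler-based state machine in the same spirit, wired up with the four states listed above. It is not Listing 1: the class layout, the state function names, and the use of a plain list of segment tags as the "cargo" are all assumptions made for illustration.

```python
class StateMachine:
    """Each state is a function taking cargo and returning (next_state, cargo)."""

    def __init__(self):
        self.handlers = []
        self.start_state = None
        self.end_states = []

    def add_state(self, handler, end_state=False):
        self.handlers.append(handler)
        if end_state:
            self.end_states.append(handler)

    def set_start(self, handler):
        self.start_state = handler

    def run(self, cargo=None):
        if self.start_state is None or not self.end_states:
            raise RuntimeError("set a start state and at least one end state")
        handler = self.start_state
        while True:
            new_state, cargo = handler(cargo)
            if new_state in self.end_states:
                return new_state(cargo)
            if new_state not in self.handlers:
                raise RuntimeError("invalid target state: %r" % new_state)
            handler = new_state


# Hypothetical states; cargo is (remaining segment tags, events seen so far).
def searching(cargo):
    tags, seen = cargo
    while tags:
        if tags[0] == "ISA":
            return header, (tags, seen)
        tags.pop(0)  # skip anything that is not the start of an interchange
    return done, (tags, seen)

def header(cargo):
    tags, seen = cargo
    seen.append(("header", tags.pop(0)))
    return body, (tags, seen)

def body(cargo):
    tags, seen = cargo
    while tags and tags[0] != "IEA":
        seen.append(("body", tags.pop(0)))
    return (trailer if tags else done), (tags, seen)

def trailer(cargo):
    tags, seen = cargo
    seen.append(("trailer", tags.pop(0)))
    return searching, (tags, seen)

def done(cargo):
    return cargo[1]


machine = StateMachine()
for state in (searching, header, body, trailer):
    machine.add_state(state)
machine.add_state(done, end_state=True)
machine.set_start(searching)
events = machine.run((["junk", "ISA", "GS", "ST", "SE", "GE", "IEA"], []))
print(events)
```

Because each state returns the next state explicitly, adding a parser for another EDI flavor amounts to registering more state functions rather than rewriting a central loop.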

Generic EDI Parsing Layer
The next layer of the translation framework is a generic pluggable parser. You can find the code in the file gen_parser.py (see Listing 2). The main purpose of this file is to hold references to the parser and handler plugins (both EDI and non-EDI). The generic parser searches through the input file until it hits something that looks like an EDI document, and then passes it off to the proper parser.

Listing 2 contains a single class named gen_parser. The two most interesting methods in the gen_parser class are run() and searching_header().

The run() method explicitly adds the generic EDI transitional states to the state machine by calling the add_state() method. It also adds all X12-specific transitional states to the state machine by calling the add_transitions() method of the X12 parsing class.

The searching_header() method searches for what may be a header segment by iteratively reading three characters, seeing if the three characters are “ISA” (which are the first three characters of an X12 interchange), backing up two characters if not and repeating. When the code finds an “ISA” sequence, it calls a method in the X12 parsing code to determine whether subsequent characters contain a valid ISA segment. This is an inefficient way of searching for a potential header segment and you may be better off with an algorithm more like:

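      while True:
          chunk = file_obj.read(100)
          ndx = chunk.find('ISA')
          if ndx != -1:
              # reposition just past the "ISA" tag so the
              # header state can read the rest of the segment
              file_obj.seek(ndx + 3 - len(chunk), 1)
              return x12.header_state
          if len(chunk) < 100:
              # short read: end of file, no "ISA" found
              return self.eof, ""
          # back up 2 characters in case the tail end of our
          # chunk was in the middle of the "ISA" sequence
          file_obj.seek(-2, 1)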

However, if there is only a small amount of garbage text before valid X12 interchanges, this inefficiency should have minimal impact on performance.
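
The chunked search can also be written as a small standalone function that is easy to test in isolation. The sketch below is an assumption-laden variant rather than the framework's code: find_isa is a hypothetical name, and it works on a binary stream because Python 3 text-mode files do not allow non-zero seeks relative to the current position.

```python
import io


def find_isa(stream, chunk_size=100):
    """Return the offset of the first b"ISA" in a binary stream, or -1.

    On success the stream is left positioned just past the tag.
    """
    tag = b"ISA"
    if chunk_size < len(tag):
        raise ValueError("chunk_size must be at least %d" % len(tag))
    while True:
        start = stream.tell()
        chunk = stream.read(chunk_size)
        ndx = chunk.find(tag)
        if ndx != -1:
            stream.seek(start + ndx + len(tag))
            return start + ndx
        if len(chunk) < chunk_size:
            return -1  # short read means end of file
        # Re-read the last two bytes next time, in case the chunk
        # boundary fell in the middle of the tag.
        stream.seek(-(len(tag) - 1), io.SEEK_CUR)


# The tag deliberately straddles the first 100-byte chunk boundary.
data = io.BytesIO(b"x" * 98 + b"ISA*rest of the segment")
print(find_isa(data), data.tell())
```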

Extending the Parser Class
You can extend this class to accommodate other EDI parsers, such as Tradacom, EDIFACT, etc., by taking the following steps:

  • Write a parser class for the desired type of EDI following the example of the X12 parsing class (described later in this article).
  • Import the module containing the new parser class in the gen_parser module.
  • Create a reference to a parser object for the desired type of EDI in the run() method of the gen_parser class similar to the following example for EDIFACT:
      self.edifact = edifact_parser.edifact_parser(self)
  • Call the add_transitions() method on the (Edifact) parsing instance created above.
  • Add a check for the new EDI's header segment in the searching_header() method. Following the already established EDIFACT example, you would end up with something like this:
      elif poten_tag == "UNA":
          return (self.edifact.header_seg, (infile, poten_tag))
      elif poten_tag == "UNB":
          return (self.edifact.header_seg, (infile, poten_tag))

While gen_parser.py is a module containing a class definition, it also contains a main section that you can call from the command line. It accepts two parameters: the name of the EDI input file and an output XML file prefix. The parser reads in the specified input file and the EDI translation handler writes a translated XML file for each X12 interchange encountered in the input file using the naming convention _.xml.

EDI X12 Parser Plugin
You'll find the X12 parser plugin implementation in the file x12_parser.py. The parser consists of a single class named x12_parser. This class primarily does two things:

  • Recognizes entire valid X12 interchanges.
  • Tokenizes segments as it recognizes each interchange.

The x12_parser class contains five methods: __init__(), add_transitions(), header_seg(), body_seg(), and end_seg(). The latter three methods pass each tokenized EDI segment to the segment() method of the EDI handler object (discussed below).

The header_seg() method determines whether the potential ISA segment is valid. An ISA segment is a fixed 106 characters long with each element of the ISA segment having a fixed length. Since we have already read in 3 characters ("ISA"), we need to read in another 103 characters and make sure everything is where it should be. If it is, we extract information about the document thus far, such as the characters used for the delimiters (element separator, segment terminator, and sub-element separator) and some interchange identifiers (sender and receiver IDs and qualifiers and interchange date, time, and control number).
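
The fixed-width check can be illustrated with a short standalone function. This is a sketch of the idea, not the code in x12_parser.py: parse_isa and the dictionary keys are made-up names, while the element widths are the standard ISA01 through ISA16 lengths, which add up to the 106 characters mentioned above.

```python
# Fixed widths of ISA01..ISA16; with the tag, 16 separators, and the
# segment terminator they total 106 characters.
ISA_WIDTHS = [2, 10, 2, 10, 2, 15, 2, 15, 6, 4, 1, 5, 9, 1, 1, 1]
ISA_LENGTH = 106


def parse_isa(isa):
    """Validate a 106-character ISA segment and pull out its key fields."""
    if len(isa) != ISA_LENGTH or not isa.startswith("ISA"):
        raise ValueError("not a 106-character ISA segment")
    element_sep = isa[3]
    fields = isa[:-1].split(element_sep)
    if [len(field) for field in fields[1:]] != ISA_WIDTHS:
        raise ValueError("ISA elements are not at their fixed positions")
    return {
        "element_separator": element_sep,
        "subelement_separator": fields[16],
        "segment_terminator": isa[-1],
        "sender": (fields[5], fields[6].strip()),    # qualifier, ID
        "receiver": (fields[7], fields[8].strip()),  # qualifier, ID
        "date": fields[9],
        "time": fields[10],
        "control_number": fields[13],
    }


# Build a correctly padded header from the values in the sample document.
values = ["", "", "", "", "ZZ", "SENDER", "ZZ", "RECEIVER", "041201",
          "1200", "U", "00305", "000000101", "1", "P", "^"]
sample_isa = "ISA*" + "*".join(
    value.ljust(width) for value, width in zip(values, ISA_WIDTHS)) + "!"
print(parse_isa(sample_isa))
```

Splitting on the character found at position 3 and comparing element lengths is one simple way to confirm that "everything is where it should be"; checking each separator position individually would work just as well.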

The body_seg() method reads in a chunk of characters at a time (100 by default) and looks for a segment terminator. If it hits EOF before it can find a segment terminator, it breaks out. If it finds a segment terminator, it passes that segment to the segment() method on the EDI handler and starts looking for the next element separator. The characters between the segment terminator and element separator are the next segment tag. If the next segment tag is not an "IEA", it will keep looping through the document. If it is an "IEA," the code jumps to the end_seg() method.
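
The chunked read-and-split loop described here can be sketched as a generator. This is a simplified stand-in for body_seg(), with a hypothetical name (iter_segments) and the sample document's delimiters assumed. It leaves the decision about what to do with each segment, including the "IEA" check, to the caller.

```python
import io


def iter_segments(stream, segment_terminator="!", chunk_size=100):
    """Yield complete segments (without the terminator) from a text stream."""
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # EOF; any unterminated leftover in buffer is discarded
        buffer += chunk
        while segment_terminator in buffer:
            segment, buffer = buffer.split(segment_terminator, 1)
            yield segment.lstrip("\r\n")


body = io.StringIO(
    "GS*PO*SENDER*RECEIVER*041201*1200*101*X*003050!"
    "ST*850*000000101!SE*2*000000101!GE*1*101!IEA*1*000000101!")
for segment in iter_segments(body, chunk_size=8):
    tag = segment.split("*", 1)[0]
    if tag == "IEA":
        print("trailer reached:", segment)
        break
    print("body segment:", segment)
```

A small chunk_size is used in the example only to show that segments spanning several reads are reassembled correctly.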

The end_seg() method simply reads in the rest of the IEA segment and makes sure everything is where it should be. The IEA is not a fixed-width segment as the ISA is, so you have to do a little more validation work.
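
As an illustration of the kind of validation involved, the sketch below checks an IEA segment that has already been tokenized. The function name and the exact set of checks are assumptions; the article's end_seg() may validate more or less than this. The one rule taken from X12 itself is that the interchange control number in the IEA must match the one in the ISA.

```python
def validate_iea(segment, element_sep, isa_control_number):
    """Return True if segment is a plausible IEA trailer for this interchange.

    segment is the raw segment text without its terminator.
    """
    elements = segment.split(element_sep)
    if len(elements) != 3 or elements[0] != "IEA":
        return False
    group_count, control_number = elements[1], elements[2]
    # IEA01 is the number of functional groups; IEA02 must echo ISA13.
    return group_count.isdigit() and control_number == isa_control_number


print(validate_iea("IEA*1*000000101", "*", "000000101"))
print(validate_iea("IEA*1*000000999", "*", "000000101"))
```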

Building an EDI Handler
EDI handlers in this framework are similar to SAX content handlers. The four essential EDI handler methods are start_interchange(), end_interchange(), segment(), and error(). The parser calls these methods when it encounters the appropriate events.

You'll find the EDI handler code in a file named edi_handler.py. It contains two relevant classes: Translator and NonEDIHandler (there's also a GenericEDIHandler class that simply prints each segment to standard output).
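
The SAX-like callback pattern might look roughly like the sketch below. The four method names come from the article, but their signatures, the CollectingHandler class, and the drive() function that stands in for the parser are all hypothetical.

```python
class CollectingHandler:
    """Records every event it receives, in order."""

    def __init__(self):
        self.events = []

    def start_interchange(self, control_number):
        self.events.append(("start", control_number))

    def segment(self, raw_segment):
        self.events.append(("segment", raw_segment))

    def end_interchange(self):
        self.events.append(("end", None))

    def error(self, message):
        self.events.append(("error", message))


def drive(handler, control_number, segments):
    """Stand-in for the parser: fire handler events for one interchange."""
    handler.start_interchange(control_number)
    for raw_segment in segments:
        if not raw_segment:
            handler.error("empty segment")
            return  # stop processing after an error
        handler.segment(raw_segment)
    handler.end_interchange()


handler = CollectingHandler()
drive(handler, "000000101", ["ST*850*000000101", "SE*2*000000101"])
for event in handler.events:
    print(event)
```

Keeping the parser unaware of what the handler does with each event is what lets a translator, a printer, or a non-EDI handler be swapped in without touching the parsing code.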

The Translator EDI handler's primary functions are to:

  • Validate that segments occur in the proper order based on the order specified in a description file.
  • Maintain state for the current position in the looping structure of the EDI file.
  • Provide a do_() method for each X12 tag to facilitate the creation of the output XML.
  • Build a DOM output object and populate it with data encountered in the input file.
  • Maintain a dictionary of nodes added to the XML output DOM object.

Translator contains the four essential methods mentioned above as well as a number of helper methods.

The start_interchange() method creates a DOM object that will contain the output XML. The end_interchange() method displays the final result of the XML. Translator calls the error() method if a segment appears out of order. It stops all further processing of a file after calling error().

The segment() method splits the segment into elements, calls the validate_seg() helper method to determine if the segment is allowed to appear where it does, and finally calls the appropriate do_() method.
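
A stripped-down version of this split-and-dispatch step is sketched below, using xml.dom.minidom for the output document. It assumes the per-tag methods are named by appending the segment tag to "do_" (for example do_BEG), which is a guess at the convention. The class name, the XML element names, and the fallback method are invented for illustration, and sequence validation is left out entirely.

```python
from xml.dom import minidom


class TinyTranslator:
    def __init__(self, element_sep="*"):
        self.element_sep = element_sep
        self.doc = minidom.Document()
        self.root = self.doc.createElement("interchange")
        self.doc.appendChild(self.root)

    def segment(self, raw_segment):
        elements = raw_segment.split(self.element_sep)
        # Look up a method named after the tag; fall back to a generic one.
        method = getattr(self, "do_" + elements[0], self.do_default)
        method(elements)

    def do_BEG(self, elements):
        # BEG03 is the purchase order number, BEG05 the order date.
        node = self.doc.createElement("purchase_order")
        node.setAttribute("number", elements[3])
        node.setAttribute("date", elements[5])
        self.root.appendChild(node)

    def do_default(self, elements):
        node = self.doc.createElement("segment")
        node.setAttribute("tag", elements[0])
        self.root.appendChild(node)


translator = TinyTranslator()
for raw in ["BEG*22*NE*101**041201*123456", "DTM*002*041218"]:
    translator.segment(raw)
print(translator.doc.toprettyxml(indent="  "))
```

Using getattr for dispatch means supporting a new segment type only requires adding one method, with no central lookup table to maintain.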

The validate_seg() method is responsible for determining if the X12 segment just encountered appears in its proper sequence. It does so by "remembering" the last X12 segment encountered, being "informed of" the latest X12 segment encountered, and traversing up, down, and/or laterally in a DOM description of the EDI file (which was created from the X12 schema file x12_schema.xml) to discover if the latest segment can occur now.

Author's Note: I have not implemented a validator to determine whether the maximum occurrences of a particular segment have been exceeded. However, that should be a minor modification.

The X12 schema XML file mentioned above details the proper looping structure of the X12 document it describes, whether a segment is required or optional, and whether it can occur multiple times. It's worth noting that I haven't described the elements for each segment. Creating the element descriptions and the code to validate them should be another fairly minor modification to the XML file and the Python code.
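
To give a feel for sequence validation, here is a deliberately simplified sketch. Instead of a DOM built from x12_schema.xml, it uses a flat list of (tag, required) pairs and ignores loops altogether, so it is far less capable than the validate_seg() method described above. OrderValidator, PO_SCHEMA, and check() are hypothetical names, and the schema lists only a handful of the segments in an 850.

```python
# Hypothetical, heavily abbreviated description of an 850: (tag, required).
PO_SCHEMA = [("ST", True), ("BEG", True), ("FOB", False),
             ("DTM", False), ("CTT", False), ("SE", True)]


class OrderValidator:
    def __init__(self, schema):
        self.schema = schema
        self.position = -1  # index of the last segment matched

    def validate(self, tag):
        """Return True if tag may legally appear next."""
        # Simplification: any segment may repeat immediately.
        if self.position >= 0 and self.schema[self.position][0] == tag:
            return True
        for index in range(self.position + 1, len(self.schema)):
            name, required = self.schema[index]
            if name == tag:
                self.position = index
                return True
            if required:
                return False  # a required segment would be skipped
        return False  # tag is unknown or appears too late


def check(tags):
    validator = OrderValidator(PO_SCHEMA)
    return [validator.validate(tag) for tag in tags]


print(check(["ST", "BEG", "DTM", "DTM", "CTT", "SE"]))
print(check(["ST", "FOB"]))
print(check(["ST", "BEG", "CTT", "DTM"]))
```

A maximum-occurrence check of the kind mentioned in the Author's Note could be added by storing a limit alongside each tag and counting repeats.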

XML Output
As mentioned in the Generic EDI Parsing Layer section, you can call the gen_parser.py script from the command line. It accepts the name of an EDI input file and an output XML file prefix. For example, to call gen_parser.py on the sample input file, you use a command like this:

   $ python gen_parser.py example_edi_stream.txt xml_output

As the code reads the X12 input file, it creates an output DOM object and builds upon that as it encounters data. It then converts that DOM object to an XML output file. The resultant XML file is shown in Listing 3.

Compare the XML in Listing 3 to the sample documents shown earlier in this article, and you'll instantly see why translating EDI to XML makes sense. Traditional EDI data formats such as X12 are simply text files that can be processed with any modern programming language. This Python-based example of an EDI translator, while not exhaustive, is a good first step in the development of more extensive tools for managing EDI.
