A Layered Translation Framework
The framework described in this article has a layered design. The layers are as follows:
- General Purpose State Machine
- Generic EDI Parsing Layer
- EDI X12 Parser Plugin
While the design is layered, it is also intended as a pluggable component architecture. The EDI X12 parser and the EDI handler are both plugins to a generic parser. This design makes it easy to plug in other parsers (such as those for EDIFACT or Tradacoms) and other handlers (such as translators for different transaction set types).
General Purpose State Machine
The topmost layer of the framework is a state machine borrowed from chapter four of David Mertz's excellent book, "Text Processing in Python."
Author's Note: Anyone interested in parsing EDI (or any other form of text) files with Python would greatly benefit from reading Mertz's book. While the book is not about EDI, it is about processing text, and EDI is simply text.
I decided to use Mertz's state machine because it is an excellent piece of concise, understandable, and usable code. It was written with the obvious intention of handling a file in a stateful manner, which is an approach well suited for EDI. While writing this EDI parser, I decomposed the parsing into four states:
- Looking for the beginning of an EDI document
- In a header segment
- In a body segment
- In a trailer segment
You'll find the state machine code in the downloadable code that accompanies this article, in the file state_machine.py (see Listing 1).
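To make the four states concrete, here is a minimal sketch of a handler-dispatch state machine walking a toy interchange. It is not the code from state_machine.py. The class, the handler names, the tuple-shaped cargo, the pre-split segment list, and the "*" element separator are all assumptions made for illustration.

```python
HEADER_IDS = {"ISA", "GS", "ST"}    # X12 envelope openers
TRAILER_IDS = {"SE", "GE", "IEA"}   # X12 envelope closers


class StateMachine:
    """Tiny dispatch loop; each handler returns (next_state, cargo)."""

    def __init__(self):
        self.handlers = {}
        self.end_states = set()
        self.start_state = None

    def add_state(self, name, handler=None, end_state=False):
        self.handlers[name] = handler
        if end_state:
            self.end_states.add(name)

    def set_start(self, name):
        self.start_state = name

    def run(self, cargo):
        state = self.start_state
        while state not in self.end_states:
            state, cargo = self.handlers[state](cargo)
        return state, cargo


def seg_id(segment):
    """Return the segment identifier, assuming '*' separates elements."""
    return segment.split("*", 1)[0]


def searching(cargo):
    """State 1: skip anything that is not the start of an interchange."""
    segments, pos, trace = cargo
    while pos < len(segments) and seg_id(segments[pos]) != "ISA":
        pos += 1
    state = "header" if pos < len(segments) else "eof"
    return state, (segments, pos, trace)


def consume(name):
    """Build the handler for the header, body, or trailer state."""
    def handler(cargo):
        segments, pos, trace = cargo
        trace.append((name, seg_id(segments[pos])))  # record state and segment
        pos += 1
        cargo = (segments, pos, trace)
        if pos >= len(segments):
            return "eof", cargo
        upcoming = seg_id(segments[pos])
        if upcoming in HEADER_IDS:
            return "header", cargo
        if upcoming in TRAILER_IDS:
            return "trailer", cargo
        # Unknown segment: garbage after a trailer, body content otherwise.
        return ("searching" if name == "trailer" else "body"), cargo
    return handler


def trace_states(segments):
    """Run the machine and return a list of (state, segment id) pairs."""
    machine = StateMachine()
    machine.add_state("searching", searching)
    for name in ("header", "body", "trailer"):
        machine.add_state(name, consume(name))
    machine.add_state("eof", end_state=True)
    machine.set_start("searching")
    _, (_, _, trace) = machine.run((list(segments), 0, []))
    return trace
```

Calling trace_states() on a list such as ["junk", "ISA*00", "GS*PO", "ST*850", "BEG*00", "SE*2", "GE*1", "IEA*1"] shows each segment being handled in the header, body, or trailer state, with the leading junk skipped by the searching state.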
Generic EDI Parsing Layer
The next layer of the translation framework is a generic pluggable parser. You can find the code in the file gen_parser.py (see Listing 2). The main purpose of this file is to hold references to parser and handler plugins (both EDI and non-EDI). The generic parser searches through the input file until it hits something that looks like an EDI document, and then passes it off to the proper parser.
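The hand-off to "the proper parser" might be sketched as a small registry that asks each plugin whether the text at a given offset looks like the start of a document it understands. The class names and the looks_like_header() and find_document() methods below are hypothetical and are not taken from gen_parser.py. The prefix tests are also a simplification: a real plugin would go on to validate the whole header segment.

```python
class X12Sniffer:
    """Hypothetical plugin: claims text that starts an X12 interchange."""
    name = "x12"

    def looks_like_header(self, text, offset):
        return text.startswith("ISA", offset)


class EdifactSniffer:
    """Hypothetical plugin: EDIFACT interchanges open with UNA or UNB."""
    name = "edifact"

    def looks_like_header(self, text, offset):
        return text.startswith(("UNA", "UNB"), offset)


class GenericParser:
    """Holds references to parser plugins and dispatches to the first match."""

    def __init__(self):
        self.parsers = []

    def register(self, parser):
        self.parsers.append(parser)

    def find_document(self, text):
        """Return (offset, plugin name) for the first claimed position, else None."""
        for offset in range(len(text)):
            for parser in self.parsers:
                if parser.looks_like_header(text, offset):
                    return offset, parser.name
        return None


generic = GenericParser()
generic.register(X12Sniffer())
generic.register(EdifactSniffer())
```

Because the generic layer only ever calls the plugin interface, supporting another format means registering one more object rather than editing the search loop.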
Listing 2 contains a single class named gen_parser. The two most interesting methods in the gen_parser class are run() and searching_header().
The run() method explicitly adds the generic EDI transitional states to the state machine by calling the add_state() method. It also adds all X12-specific transitional states to the state machine by calling the add_transitions() method of the X12 parsing class.
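The registration step could look something like the sketch below. Only the method names add_state() and add_transitions() come from the description above. The state names, the stub handlers, the build_machine() helper, and the stand-in state machine class are assumptions made so the example runs on its own.

```python
class TinyStateMachine:
    """Stand-in for the real state machine; it only records registrations."""

    def __init__(self):
        self.handlers = {}
        self.end_states = set()

    def add_state(self, name, handler=None, end_state=False):
        self.handlers[name] = handler
        if end_state:
            self.end_states.add(name)


class X12Plugin:
    """Hypothetical X12 parsing class that contributes its own states."""

    def add_transitions(self, machine):
        machine.add_state("x12_header", self.header)
        machine.add_state("x12_body", self.body)
        machine.add_state("x12_trailer", self.trailer)

    # Stub handlers: each returns (next_state, cargo).
    def header(self, cargo):
        return "x12_body", cargo

    def body(self, cargo):
        return "x12_trailer", cargo

    def trailer(self, cargo):
        return "searching_header", cargo


class GenParserSketch:
    """Registers the generic states first, then lets each plugin add its own."""

    def __init__(self, plugins):
        self.machine = TinyStateMachine()
        self.plugins = plugins

    def searching_header(self, cargo):
        return "eof", cargo  # stub: a real version would scan the input

    def build_machine(self):
        self.machine.add_state("searching_header", self.searching_header)
        self.machine.add_state("eof", end_state=True)
        for plugin in self.plugins:
            plugin.add_transitions(self.machine)
        return self.machine


machine = GenParserSketch([X12Plugin()]).build_machine()
```

Letting the plugin register its own states keeps format-specific knowledge out of the generic parser, which only needs to know that every plugin exposes add_transitions().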
The searching_header() method searches for a possible header segment by repeatedly reading three characters and checking whether they are "ISA" (the first three characters of an X12 interchange). If they are not, it backs up two characters and tries again. When the code finds an "ISA" sequence, it calls a method in the X12 parsing code to determine whether the subsequent characters form a valid ISA segment. This is an inefficient way of searching for a potential header segment, and you may be better off with an algorithm more like:
    chunk = file_obj.read(100)
    if len(chunk) < 100:
        return self.eof, ""
    # find() returns -1 on a miss, whereas index() would raise ValueError
    ndx = chunk.find('ISA')
    if ndx == -1:
        # back up 3 characters just in case the tail end of our chunk
        # was in the middle of the "ISA" sequence
        file_obj.seek(-3, 1)
However, if there is only a small amount of garbage text before valid X12 interchanges, this inefficiency should have minimal impact on performance.
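A self-contained version of the chunked search might look like the sketch below. It is one possible approach, not the code from the listing, and the function name and parameters are made up for illustration. Instead of seeking backward, it carries the last two characters of each chunk forward in memory, because Python 3 text-mode file objects do not allow nonzero seeks relative to the current position.

```python
import io


def find_isa_offset(file_obj, chunk_size=100, marker="ISA"):
    """Return the offset of the first marker in file_obj, or -1 if absent."""
    overlap = len(marker) - 1   # characters that could begin a split marker
    tail = ""                   # end of the previous window, carried forward
    consumed = 0                # characters read before the current chunk
    while True:
        chunk = file_obj.read(chunk_size)
        if not chunk:
            return -1           # end of input without a match
        window = tail + chunk
        ndx = window.find(marker)
        if ndx != -1:
            # window begins len(tail) characters before the current chunk
            return consumed - len(tail) + ndx
        consumed += len(chunk)
        tail = window[-overlap:] if overlap else ""


# The marker straddles the boundary between the first and second chunks.
sample = io.StringIO("x" * 99 + "ISA*00*")
offset = find_isa_offset(sample)
```

As with the three-character search, a match only marks a candidate. The caller would still hand the text at that offset to the X12 code to check for a valid ISA segment.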