Login | Register   
LinkedIn
Google+
Twitter
RSS Feed
Download our iPhone app
TODAY'S HEADLINES  |   ARTICLE ARCHIVE  |   FORUMS  |   TIP BANK
Browse DevX
Sign up for e-mail newsletters from DevX


advertisement
 

Convert a Text File to XML

The fundamental problem with any text-to-XML conversion is that without tags, an XSL filter or schema sees such files as undifferentiated blobs. However, here's one case where you can harness the XML Document Object Model itself.


advertisement

ere's an interesting conundrum. You have an application that doesn't have an object model (or has one that's so hideously complex that you'd just as soon not pay a programmer three months' wages just to build a matching automation interface). The details of the program are unimportant, except that it can output a comma or tab-separated text file; it can produce a text file with fields of fixed length; or it can work with an INI file or some similar minimally structured text file.

These sort of text files are the poor cousins of the database world. They often contain sufficiently small numbers of records, which makes them ideal to work with as XML files. With the data in an XML file, you can perform queries, sort and filter the data, and even handle multi-layer formats—all with a common set of tools. This data file could be sent to a different machine without worrying about such coding conventions as whether you're using a comma and I'm using a tab to separate data. Once in an XML file, you can easily transform the data using XSL (Extensible Stylesheet Language) into tables, combo boxes, or far more sophisticated controls or behaviors. In other words, the record becomes vastly easier to manipulate anywhere.

If this is the case, then it should be easy to create an XSL filter or perhaps a schema to load this data into an XML file. Unfortunately, as useful as this would be, the fundamental problem that any text to XML conversion faces is that without tags, an XSL filter or schema sees such files as undifferentiated blobs. However, here's one case where you can harness the XML Document Object Model itself to create a few fairly generic functions for dealing with straight text.



Delimited Text to XML
Consider a simple text database that contains information about employees in your company:


name,dateOfBirth,dept,jobTitle
John,1962-11-24,accounting,senior accountant
Tina,1962-09-26,administration,manager
Karen,1972-01-10, marketing,graphic designer
Michael,1978-02-11,research,programmer
Sandra,1976-10-26,marketing,account manager

This simple text database called employees.txt shows a prime example of a comma delimited file. A routine that converts text to XML should make use of as much information as it can from the records. In general, it's easier to filter out specific pieces of information using XSL query on the created XML than it is to put extensive logic in the code that creates the XML in the first place. The XML file that's generated should use the header information provided at the top of the text file.

The File System Object (FSO) has turned out to be one of the more recent useful additions to Visual Basic and scripting. It encapsulates a number of standard text I/O functions into a single object (with ProgID "Scripting.FileSystemObject") replacing the bewildering collection of legacy Visual Basic commands (such as the highly irregular syntax of the VB Open command). If you are working with text files, I would highly recommend that you take advantage of the features provided with the FSO.

In this particular case, the primary purpose of the FSO is to create a text stream object. A stream is a generic interface for handling data input/output without needing to worry about its source or destination. To read the sample file line by line with the FSO, you would create a text stream object as a read source, iterate through the file line by line until you run out of lines, and then close the text stream:


Dim Fs as FileSystemObject 
Dim TS as TextStream
Set Fs = CreateObject("Scripting.FileSystemObject")
Set Ts = fs.OpenTextFile(textFilePath, ForReading, False,
TristateUseDefault)
While not Ts.AtEndOfStream
	Buffer=Ts.ReadLine
	' Process the buffer
Wend
Ts.Close
Fs.Close

Next, you need to find a way to parse the fields. A general-purpose routine should let you parse on any character, although you most likely will end up using either comma-separated value files (CSV), or tab-separated files (TSV or TAB). In any of these cases, you can take advantage of two useful VB6 (or VBS5) string functions—split() and join()—to parse the strings. Split takes a string and a delimiter and converts that into a variant array, while Join takes a variant array and a delimiter and concatenates each array item as a string using the delimiter as a separator.


	Dim myArray as Variant
	' Split a comma separated string, setting the comma as the 
' delimiter:
	MyArray=split("a,b,c,d,e",",")
' Print the fourth (zero-based) array element ("d")
	Debug.print MyArray(3) 
	' Print the index of the lower and upper bounds of the array
	Debug.print lbound(MyArray),uBound(MyArray)
	' Join the array elements together with a pipe ("|")
	Debug.print join(MyArray,"|")
	' This prints a|b|c|d|e to the debug window.

The Text2XML() function takes advantage of both of these functions, along with the XML DOM, to do the mapping. Because an XML structure requires a single root node, and each object should be encapsulated, the function also takes an "objectName" parameter (for example, "user") and a "collectionName" parameter (for example, "users"). For the sample table, the Text2XML would produce this XML structure:


<employees>
	<employee>
		<name>John</name>
		<dateOfBirth>1962-11-24</dateOfBirth>
		<dept>accounting</dept>
		<jobTitle>senior accountant</jobTitle>
	</employee>
	<employee>
		<name>Tina</name>
		<dateOfBirth>1962-09-26</dateOfBirth>
		<dept>administration</dept>
		<jobTitle>manager</jobTitle>
	</employee>
	<!-- and so forth -->
</employees>

Note the introduction of both an object label ("employee") and a collection label ("employees").

The function itself works by creating an XML object template using the fields defined in the first line of the text field (see Listing 1). This template is then repeatedly copied (or cloned), populated with a new line worth of data, and added to the collection node. You could also create the element manually by buffering the XML structure as a string, but by taking this route you lose the advantage of possible error checking. (Note: I haven't implemented an error handling mechanism here, but if it were critical, you would use the XML DOM parseError object to output to a log or perform other error handling). Thus, to generate the XML file for the sample user list, you'd make the function call:


Dim xmlDoc
Set xmlDoc=Text2XML("employees.txt","employee","employees",",")

Working With Fixed Length Fields
Older databases, EDI messages, and similar applications often either maintain or generate data in a fixed data text format. In this format, the fields themselves are of specific lengths, and to be able to read the record, you need to parse each record at well-defined points in the string. Here is a sample of such a field set, based upon the record file previously defined:


123456789012345678901234567890123456789012345678901234567890
John      1962-11-24accounting      senior accountant
Tina      1962-09-26administration  manager
Karen     1972-01-10marketing       graphic designer
Michael   1978-02-11research        programmer
Sandra    1976-10-26marketing       account manager

This example of a fixed length data field is contained in the file "employees_fl.txt", with fields starting at position 1, 11, 21, and 37. (The numeric line isn't included and is shown for reference only.)

The function FixedText2XML looks complicated, but most of the routines primarily handle the assignment of the field names and the specific parsing of the fields themselves (see Listing 2). In either case, the XML portion of this function is nearly identical—create a collection node and a template XML tree, pull the data from each line of text and assign it to a clone of the template, then export the XML file that is created. The one other difference between the Text2XML and FixedText2XML function stems from the need to trim white space in the latter (you can set this option with the TrimWhiteSpace Boolean value, normally defaulted to true).

You can then call FixedText2XML() by passing fieldnames and lengths, along with the object and collections names. The output is the same as before: an XML document with each node corresponding to either containers or field header names:


<employees>
<employee>
		<name>John</name>
		<dateOfBirth>1962-11-24</dateOfBirth>
		<dept>accounting</dept>
		<jobTitle>senior accountant</jobTitle>
	</employee>
</employees>


dim xmlDoc
set xmlDoc=FixedText2XML("employees_fl.txt",_
          "employee",_
          "employees",_
          Array("name,dateOfBirth,dept,jobTitle"),_
          Array(1,11,21,37), _
          True)

Going the Other Way
It's also worth examining the reverse process, or how to create a standard text file from an XML file. In many respects, this process is much simpler because the XML DOM contains a mechanism to convert any subset of an XML structure into a linear list—the selectNodes property.

In the XML2Text function, you pass the XML document as a DOM, the XML query to generate the list, and the name of the output file as a string to the function (see Listing 3). SelectNodes walks through the XML tree and copies pointers to respective nodes into a list. The nodes themselves stay in their original positions, so you don't need to worry about causing your XML structure to self-destruct when using SelectNodes—the list is just a linked list that can be iterated like any other collection.

Again, a significant portion of this code is housekeeping—keeping the function reasonably generic. It uses two query strings: a record query for retrieving the nodes corresponding to the data at the query level (relative to the initial node, not the overall document) and a property query for determining which query to include. The default action is to include all children of the starting node as record nodes and all children of these nodes as property nodes, although in general you really should specify these explicitly.

If you wanted to specify that the function should retrieve a list only of those employers working in accounting, and then only retrieve the name and date of birth to be stored in employeesSub.txt as tab delimited text, then you'd call the function as follows (assuming EmployeeXML contained the XML structure displayed earlier):


XML2Text  EmployeeXML,_
        "//employee[dept='Accounting']",_
        "name|dateOfBirth",_
        employeesSub.txt",_
        vbTab

I leave it up to you to convert an XML structure to a fixed length text, with the hint that the code is nearly identical to the previous code, save for passing field length parameters.

So Why Do This?
Being able to convert XML into text and vice versa seems like an interesting exercise in working with the DOM, but what are the advantages for moving the text into and out of an XML format? To understand this, you have to realize that the comma- (or tab-) delimited file format suffers from a number of shortcomings. First, you must manually parse the file and build the storage mechanisms. Also, you must write any search routines on the text itself, making all but the simplest searches an expensive proposition. With a comma-delimited file, you can't represent hierarchical data in a list (easily). In addition, text files generally don't support Unicode. Finally, to display a text format in something like HTML, you have to write customer "viewer" programs.

On the other hand, with XML, parsing is performed automatically, as is error handling if the parsing fails. XML supports a rich set of searching capabilities in the form of XPath (or XSL Query, depending upon which marketing department you're listening to). XML easily represents multiple levels of hierarchy without complex indexing schemes. In addition, XML intrinsically supports Unicode and the full ISO character set standards.You can use either the Document Object Model or XSL to transform the XML into a different format, even non-XML formats such as Comma Separated Values (but that's for a different article).

Many programs—spreadsheets, databases, tables in word processing documents, accounting packages, and so forth—give you the option of saving out to a delimited text file. Using the routines in this article gives you the advantage of both this generic form of output coupled with the robustness and sophistication of XML.

Other applications could also benefit from a similar approach. For example, initialization files (INI files) are text files that give a first-level pass to producing hierarchical data. Converting keys and sub-keys into XML is a straightforward process. Similarly, many other application standards (such as a POP3 e-mail format) have similar name/value structures, which you can readily convert into and out of an XML. Think of the possibility of an application that could intercept incoming e-mails, convert them into XML before passing the originals on, and then perform basic search and sorting on the entries.

These sorts of applications will be important in the future as document management moves beyond simply providing directories and pointer structures and into building cohesive information structures based upon meaning. This is one of the most significant reasons for looking at XML, and the methods mentioned here help open the door.





   
Kurt Cagle is the author or co-author of twelve books and several dozen articles on web technologies, XML and web services. He is the president of Cagle Communications (Olympia, WA), which specializes in the production of training materials for XML and Web Services education. He can be reached at kurt@kurtcagle.net.
Comment and Contribute

 

 

 

 

 


(Maximum characters: 1200). You have 1200 characters left.

 

 

Sitemap