Using YAML to Decrease Data Transfer Bandwidth Requirements

XML is a wonderful thing. It has empowered a whole new class of application (loosely coupled Web services cooperating to form applications), with XML as the glue that binds them together, either through well-known, easy-to-parse data documents or through well-known, easy-to-understand commands in SOAP (an XML-based messaging protocol).

The power behind XML lies in the fact that XML data is both well-structured and (to some degree) self-describing, thanks to tag names and attributes. Powerful parsers are available to deserialize XML documents, and XML schemas let you define how your XML should appear, so a validating parser can “prove” that an XML document is a “good” document that meets the schema criteria.

But XML has a size problem. The simplest XML document can look something like this:

   

Even a minimal example uses many characters to represent a simple value, such as:

   

The preceding line uses 19 characters to store a text representation of the integer value 1. And that doesn’t include the open and close tags for the document, nor any schema references or other tags that may be necessary.

With XML, increased usability tends to lead to increased file size, particularly when it also involves schemas, taxonomies, XLink pointers, rollups, etc. Possibly the single most important use of XML in the future will be XBRL (eXtensible Business Reporting Language), which is revolutionizing the way that businesses interpret financial information, but XBRL carries incredible overhead. Take a look at this Microsoft SEC filing in XBRL as an example. The ratio of overhead to content in that link has to be at least 5:1.

Solving the XML overhead problem is where YAML (which stands for “YAML Ain’t Markup Language”) is attempting to carve a niche. There are many cases, particularly for smaller, simpler, well-known data documents where XML’s high overhead is unnecessary, and its bandwidth expense can be prohibitive. Many Web sites run by smaller companies have caps on their bandwidth allotment that they don’t want to waste. For them, YAML can provide a great alternative.

Comparing YAML and XML
You should note that YAML isn’t intended to compete with XML, as there is no direct correlation between them. Instead, YAML is intended primarily as a data serialization language. It avoids much of XML’s overhead because it wasn’t designed to carry the backward compatibility that XML’s designers wanted. In addition, while XML is designed to support generalized structured documents, YAML is targeted specifically at data structures and messaging. There are ongoing efforts to define XML/YAML mappings; a good resource for finding them is http://yaml.org/xml.html.

In XML you create a document using hierarchical tags and child tags to describe data. A simple XML document could look something like:

                  20.00      21.23      21.34      19.92                 21.20      21.35      21.37      21.00        ...       

In YAML, the same information could be rendered as:

   ---
   Day: "1 January 2004"
   values:
     - open: 20.00
     - close: 21.23
     - high: 21.34
     - low: 19.92
   Day: "2 January 2004"
   values:
     - open: 21.20
     - close: 21.35
     - high: 21.37
     - low: 21.00

Not only is this less verbose, but it’s also easier for humans to read. In this article you’ll learn how to turn your data into YAML, and use an open source Java parser to read it.
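To put a rough number on the difference, the sketch below counts the characters in one day's record rendered both ways. The XML element names (`day`, `open`, and so on) and the `date` attribute are assumptions chosen for illustration, not the markup from the original listing; the YAML follows the layout shown above. The class and method names are hypothetical.

```java
final class SizeComparison {
    // Hypothetical XML rendering of one day's prices (element names are assumed).
    static String asXml(String date, String open, String close, String high, String low) {
        return "<day date=\"" + date + "\">"
             + "<open>" + open + "</open>"
             + "<close>" + close + "</close>"
             + "<high>" + high + "</high>"
             + "<low>" + low + "</low>"
             + "</day>";
    }

    // YAML rendering in the Day/values layout used in this article.
    static String asYaml(String date, String open, String close, String high, String low) {
        return "Day: \"" + date + "\"\n"
             + "values:\n"
             + "  - open: " + open + "\n"
             + "  - close: " + close + "\n"
             + "  - high: " + high + "\n"
             + "  - low: " + low + "\n";
    }

    public static void main(String[] args) {
        String xml = asXml("1 January 2004", "20.00", "21.23", "21.34", "19.92");
        String yaml = asYaml("1 January 2004", "20.00", "21.23", "21.34", "19.92");
        System.out.println("XML characters:  " + xml.length());
        System.out.println("YAML characters: " + yaml.length());
    }
}
```

The size of the saving depends heavily on tag names, indentation, and any schema or namespace declarations, so treat the counts as indicative only.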

A Whistlestop Tour of YAML Grammar
You can read the full YAML specification, but here’s a very brief introduction to YAML’s grammar. Each item in a sequence is introduced by a dash and a space (‘- ’), as follows:

   - Monday
   - Tuesday
   - Wednesday
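As a minimal sketch of how a client might read such a sequence, the following Java collects every line that begins with a dash and a space. It assumes a flat list of plain scalars (no nesting, quoting, or comments), and `SequenceSketch` and `parseSequence` are hypothetical names, not part of any YAML library.

```java
import java.util.ArrayList;
import java.util.List;

final class SequenceSketch {
    // Collects the items of a flat block sequence: lines that start with "- ".
    static List<String> parseSequence(String yaml) {
        List<String> items = new ArrayList<>();
        for (String line : yaml.split("\\R")) {   // \R matches any line break
            String trimmed = line.trim();
            if (trimmed.startsWith("- ")) {
                items.add(trimmed.substring(2).trim());
            }
        }
        return items;
    }

    public static void main(String[] args) {
        System.out.println(parseSequence("- Monday\n- Tuesday\n- Wednesday\n"));
    }
}
```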

Mappings, which associate one piece of information with another, use a colon and a space (‘: ’), as shown below:

   Age: 34
   Height: 200
   Weight: 170
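A flat mapping like this can be read into a Java `Map` by splitting each line at the first colon-and-space. This is only a sketch under the assumption that keys and values are plain, unquoted scalars; `MappingSketch` and `parseMapping` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class MappingSketch {
    // Splits each "key: value" line at the first ": " and keeps insertion order.
    static Map<String, String> parseMapping(String yaml) {
        Map<String, String> map = new LinkedHashMap<>();
        for (String line : yaml.split("\\R")) {
            int sep = line.indexOf(": ");
            if (sep > 0) {
                map.put(line.substring(0, sep).trim(), line.substring(sep + 2).trim());
            }
        }
        return map;
    }

    public static void main(String[] args) {
        System.out.println(parseMapping("Age: 34\nHeight: 200\nWeight: 170\n"));
    }
}
```

Values are kept as strings here; converting them to numbers is left to the caller.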

You can easily associate sequences with a piece of information:

   United:
   - Freddy Adu
   - Troy Perkins
   - Earnie Stewart

   Metrostars:
   - Eddie Pope
   - Eddie Gaven
   - Jonny Walker

Or you can create a sequence of value mappings:

   -
      Name: Freddy Adu
      Age: 15
      Goals: 7
   -
      Name: Earnie Stewart
      Age: 34
      Goals: 2
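Going the other way, a sequence of mappings can be produced from ordinary Java collections. The sketch below writes each map as a dash on its own line followed by indented key/value pairs, matching the layout above. It assumes values need no quoting or escaping, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SequenceOfMappings {
    // Writes each map as one sequence entry: a "-" line, then indented "key: value" lines.
    static String toYaml(List<Map<String, Object>> records) {
        StringBuilder out = new StringBuilder();
        for (Map<String, Object> record : records) {
            out.append("-\n");
            for (Map.Entry<String, Object> field : record.entrySet()) {
                out.append("   ").append(field.getKey()).append(": ")
                   .append(field.getValue()).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> adu = new LinkedHashMap<>();
        adu.put("Name", "Freddy Adu");
        adu.put("Age", 15);
        adu.put("Goals", 7);
        List<Map<String, Object>> players = new ArrayList<>();
        players.add(adu);
        System.out.print(toYaml(players));
    }
}
```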

These few examples just scratch the surface of the grammar set for YAML, but they should suffice to give you a good feel for the sheer simplicity of the language, and how easily human-readable it is.

A Use Case for YAML
By now you’ve heard of how Web services, using XML as their binding, allow disparate clients on different machines to come together to form distributed applications. The key to this is that Web services use XML for the data and XML/SOAP for the control of that data. In many cases, particularly for smaller, simpler applications, a Web service is overkill, but you’ll probably still use one because the nice thing about XML is that it is easy to generate, and you don’t need to write a parser to accept it. There are parsers available for every programming language and operating system that you can think of!

Similarly, for YAML to be worthwhile, you need a YAML parser that can handle this serialization format; otherwise, why would you bother? It would be much easier to write a custom data format optimized for your needs than it would be to write a YAML parser. Fortunately, you don’t have to, because there are YAML parsers, in various states of construction, for most programming languages, with .NET being the exception.

At this moment in time, YAML is probably best supported in Ruby, thanks to the Syck parser, which is lightweight, fast, and built into later versions of the language. Syck is also available for PHP, and there are open-source parsers for many other languages. For this article I used the Java language parser, available at http://homepages.ihug.com.au/~zenaan/zenaan/files/. Interestingly enough, if you want to use .NET, the Java Language Conversion Assistant from Microsoft converts this parser quite painlessly, although you should contact the source owner for permission to distribute it if you take that route.

 
Figure 1. The Time Series Database: The figure shows a Query Window view of the PriceHistory.Prices database table.

The rest of this article describes how to build a simple PHP YAML-over-HTTP service that reads data from a MySQL database and renders it in YAML. A Java client can then read this YAML with the aid of the parser noted above.

Included with the downloadable code is a SQL script called CreatePriceHistory.sql that creates and populates a database with thirty days’ worth of time-series data for the stock symbol MSFT. Figure 1 shows the database table structure generated by the CreatePriceHistory.sql script.

Writing the PHP Script
For this article, the goal is to create a PHP script that queries this database and generates YAML from it. The brevity of the following PHP code reflects the ease and brevity of YAML itself.

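   <?php
   // The host, user, password, and database name below are placeholders
   // (assumed for illustration); substitute your own settings.
   // The mysql_* functions used here were removed in PHP 7; mysqli_* replaces them.
   $db = @mysql_connect('localhost', 'user', 'password');
   if (!$db || !mysql_select_db('PriceHistory', $db)) {
       die('Unable to locate the main ' .
           'database at this time.');
   }

   $strSql = "select * from Prices";
   $result = mysql_query($strSql) or die(mysql_error());

   // Start the YAML document.
   echo("---\r\n");
   if ($result) {
       // Emit one Day block per row, each line ending in CR LF.
       while ($row = mysql_fetch_array($result)) {
           echo("Day: \"" . $row['date'] . "\"\r\n");
           echo("values:\r\n");
           echo("  - open: " . $row['open'] . "\r\n");
           echo("  - close: " . $row['close'] . "\r\n");
           echo("  - high: " . $row['high'] . "\r\n");
           echo("  - low: " . $row['low'] . "\r\n");
       }
   }
   ?>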

The code is quite simple: it outputs a YAML document with the format shown in Table 1.

Table 1. Output YAML Document: The table shows a single record from the YAML document created by querying the database and formatting the results in PHP, along with comments explaining the YAML syntax.

YAML                    Comments
---                     Start the document.
Day: "2004-12-27"       Start a data section for this day.
values:                 We'll be populating values with a series of data.
  - open: 27.01         Use '-' for a series of data. Therefore, the values
  - close: 26.85        for open, close, high, and low are in the values
  - high: 27.10         series.
  - low: 26.82
Day: ...                Start another day series for the next day.

What the PHP Script Does
Each line in the YAML document must end with a line break; this script uses a carriage return and line feed (\r\n). To generate this output, your PHP simply queries the database for the data using the query ‘select * from Prices’ (an expanded service would parameterize this for specific stock tickers and date ranges).

It then does the following:

  • Output ‘---’ to indicate the start of the document.
  • For each record in the result set:
      • Output ‘Day: ’ followed by the value of the date in the current row. The date must be enclosed in double quotation marks ("), and the line must end with "\r\n".
      • Output the string "values:" followed by "\r\n".
      • For each of the data columns (open, close, high, and low), output a dash and the descriptive string (e.g. "  - open: "), followed by the value, followed by "\r\n".
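For readers who would rather generate the document from Java than PHP, the steps above can be sketched as follows. This is an illustration only: `PriceYamlWriter` and `PriceRow` are hypothetical names, the rows are supplied in memory rather than read from MySQL, and prices are held as strings so they are written exactly as stored.

```java
import java.util.Arrays;
import java.util.List;

final class PriceYamlWriter {
    private static final String EOL = "\r\n";   // CR LF, as the PHP script emits

    // One row of the Prices table; fields mirror the columns the script reads.
    static final class PriceRow {
        final String date, open, close, high, low;

        PriceRow(String date, String open, String close, String high, String low) {
            this.date = date;
            this.open = open;
            this.close = close;
            this.high = high;
            this.low = low;
        }
    }

    // Renders the rows in the Day/values layout described in the steps above.
    static String render(List<PriceRow> rows) {
        StringBuilder out = new StringBuilder("---").append(EOL);
        for (PriceRow row : rows) {
            out.append("Day: \"").append(row.date).append('"').append(EOL);
            out.append("values:").append(EOL);
            out.append("  - open: ").append(row.open).append(EOL);
            out.append("  - close: ").append(row.close).append(EOL);
            out.append("  - high: ").append(row.high).append(EOL);
            out.append("  - low: ").append(row.low).append(EOL);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<PriceRow> rows = Arrays.asList(
            new PriceRow("2004-12-27", "27.01", "26.85", "27.10", "26.82"));
        System.out.print(render(rows));
    }
}
```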

Running the query will return a page that looks similar to Figure 2.

 
Figure 2. Running the YAML Generator: When you run the YAML Generator, the resulting page doesn’t display the line breaks in the generated YAML.

The results in Figure 2 may not look too good, but don’t worry; the browser renders the output as HTML, which collapses the “\r\n” line breaks. Save the output to a text file called yamlgen.yaml, and then open it with WordPad, and it will look much more like what you’d expect.

Parsing the Data
The Java YAML parser doesn’t currently explode the data into a usable set of data structures; instead, it simply outputs the values to the Java console as it scans through the data, though it does collect them properly within their intended structures. You can see the output of the parser in Listing 1.

Next Steps
There are many useful scenarios for simplified data formats such as YAML; one example might be to retrieve data from a database on the server side and chart it with a client. If you don’t want to use XML, then YAML is a good solution, and the YAML Java parser gives you a head start in parsing the retrieved data. It’s not a complete solution yet, as it doesn’t allow you to load data from a URL, and it doesn’t load the data into a DOM that you can work with. You’ll have to take the data from the various elements as sliced out by the parser, and instead of outputting them to the console (as in Listing 1), do something useful with them (such as charting).
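As one possible starting point, the sketch below reads the Day/values layout generated earlier into a list of maps that a charting routine could consume. It is not the parser discussed in this article: it is a line-oriented reader written for this one layout, and `PriceYamlReader` and `read` are hypothetical names. A line-by-line approach is used because the generated document repeats the Day and values keys, which a strict YAML loader may reject or collapse into a single entry.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PriceYamlReader {
    // Each "Day:" line starts a new record; "- key: value" lines fill it in.
    static List<Map<String, String>> read(Reader source) {
        List<Map<String, String>> days = new ArrayList<>();
        Map<String, String> current = null;
        try (BufferedReader lines = new BufferedReader(source)) {
            String line;
            while ((line = lines.readLine()) != null) {   // readLine handles CR LF
                String text = line.trim();
                if (text.startsWith("Day:")) {
                    current = new LinkedHashMap<>();
                    current.put("Day", unquote(text.substring(4).trim()));
                    days.add(current);
                } else if (text.startsWith("- ") && current != null) {
                    int sep = text.indexOf(": ");
                    if (sep > 2) {
                        current.put(text.substring(2, sep).trim(),
                                    text.substring(sep + 2).trim());
                    }
                }
                // "---" and "values:" carry no data of their own and are skipped.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return days;
    }

    // Strips one pair of surrounding double quotes, if present.
    private static String unquote(String s) {
        if (s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"")) {
            return s.substring(1, s.length() - 1);
        }
        return s;
    }

    public static void main(String[] args) {
        String doc = "---\r\nDay: \"2004-12-27\"\r\nvalues:\r\n"
                   + "  - open: 27.01\r\n  - close: 26.85\r\n";
        System.out.println(read(new StringReader(doc)));
    }
}
```

Because `read` accepts any `Reader`, the same method could be handed an `InputStreamReader` wrapped around a stream opened from the service's URL, which would cover the load-from-URL gap mentioned above.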

The biggest challenge facing YAML is the lack of good parser support in the .NET and Java platforms. YAML is already well supported in Ruby, Python, and Caml, and has emerging support in PHP. Still, if you need a serialization format to pass data between systems and don’t want to start from scratch, you could do a whole lot worse than YAML!
