XML 1.0 Superset Makes XML Concise

Because XML was not designed for data, it has serious ambiguities and constraints. These limitations are hard for many to understand because most articles and books never address them. A good analogy is how we have perceived the earth. Hundreds of years ago the prevailing view was that the world was flat, because that is how people experienced it. Even people who saw the complications of the flat-world view continued to defend it after the logic of the round-world view was quite clear. Imagine the difficulty Columbus had trying to convince people that the world was round. People thought he was crazy.

How is this discussion about a flat versus a round world related to XML? The vast majority of people working with XML today believe in the equivalent of an XML flat world: XML as a syntax for documents. This flat-world view is everywhere. Supported by standards, committees, books, and seminars, it is the dominant view of XML. The round-world model treats XML as a syntax for data and objects. The differences are slight, but the implications are huge.

A small but growing group of people understands that the document-centric model has outlived its usefulness and that a data-centric model is required for progress. This article aims to convert people to a simpler, round-world view, leading them toward seeing XML in a whole new way. It also introduces ConciseXML, a superset of XML and the syntax of the Water programming language, which aims to address not only the limitations of XML but also its verbosity.

XML is a Syntax
XML is wrapped in a number of misconceptions. Many articles define XML as one of the following:

  1. A standard for describing data in a machine-understandable way
  2. The solution for application integration problems
  3. A programming language

XML is none of these things. XML is simply a syntax. The XML 1.0 standard does not define any tag names or attributes that carry meaning. XML only describes the syntax for representing elements and attributes. XML does not even specify a standard way to represent objects and data, because there is no standard for how elements and attributes should be used to represent objects and fields.

Hundreds of standards, including XHTML, WSDL, UDDI, SVG, and ebXML, use an XML syntax for describing data. Because no single way exists for representing data in XML, every standard chooses its own way. As a result, standards that share an XML syntax still cannot easily share data. This problem is not yet well understood in the industry and has led to significant complexity when trying to deploy solutions based on XML.

Document Versus Data
XML is commonly used for representing data structures. A data structure is simply a way to represent data that obeys some well-defined structure. The Water language, using ConciseXML, can formally describe the structure of data by using Water Type and Water Contract. Using Water, you also can unambiguously represent static data.

Representing static data might seem straightforward, but XML 1.0 has design constraints carried over from the document markup world that can make representing data in XML quite confusing. The quandary between elements and attributes is a common example of this confusion.

Most programming languages and other technologies for representing data employ the concept of a data structure or object. This article, by convention, uses the term object. The word object is similar to other terms such as a record, structure, or tuple from other technologies.

In most programming languages, an object has fields, and those fields hold values that are also objects. Water objects have this property as well: An object is a collection of fields; each field has a key and a value; and the value can be any object.

The following ConciseXML is an example of an item object:

  <item id="xx283" color="blue" size=10/>

The preceding ConciseXML could be described as creating an instance of an item object. The instance has three fields: id, color, and size. The value of the id field is the string "xx283", the value of the color field is the string "blue", and the value of the size field is the number 10.

The type or class of the object appears as the element's name, immediately following the opening angle bracket (<). The fields of the object are represented as key-value pairs within the element's opening tag. An opening angle bracket is syntactically the start of an XML element, but semantically it performs a call: either a call to a method or a call to the constructor of an object. Fields of an object have a clear and unambiguous key and value:

  <item id="xx283" color="blue" size=10/>

In the preceding line, the instance of item has three fields. "id" is the key of the first field, and "xx283" is its value. "color" is the key of the second field, and "blue" is its value. "size" is the key of the third field, and the integer 10 is its value.

It is very common, though, to see the following XML to represent the instance of item above:

  <item>
    <id>xx283</id>
    <color>blue</color>
    <size>10</size>
  </item>

To the vast majority of people, the above XML is normal and easily understood, but this is an example of XML in the flat-world model. The round-world model sees this as an ambiguous, poorly constructed XML data object. One problem (which is described in detail later in this article) is that the syntax of an XML element is used to represent two very different things: an object and a field of an object. Having one syntax to represent two different concepts presents a serious ambiguity problem. This ambiguity leads to a serious problem when a machine tries to interpret the meaning of the XML data.

For a data structure to be useful, the distinction between objects and fields is extremely important. How, for example, do you know that <color>blue</color> represents a field of item and not an instance of type color? As humans, we use our gift of pattern recognition to deduce that color must be a field of item because it occurs within the content of item and its own content is simply the text blue.

To emphasize the ambiguity, what if you wrapped the item within another color element? Is item now a field of color? Did the meaning of item radically change because it moved to a different level in the structure? Consider the following example:

  <color>
    <item>
      <id>xx283</id>
      <color>blue</color>
      <size>10</size>
    </item>
  </color>

If a serious ambiguity appears in such a small example, imagine the scope of the problem when objects and data structures get more complex. At a minimum, data structures need to be unambiguous and not depend on any other knowledge for interpreting a data structure.
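The ambiguity can also be seen from a parser's point of view. The following is a minimal sketch in Python, using the standard xml.etree.ElementTree module, that parses the element-only form of item with and without a wrapping color element. The helper name color_elements is hypothetical and exists only for this illustration. In both cases the parser reports color as a plain element; nothing in the parse result says whether it is a field or an object.

```python
import xml.etree.ElementTree as ET

# Element-only encoding of the item, as in the flat-world model.
FLAT = "<item><id>xx283</id><color>blue</color><size>10</size></item>"
# The same item wrapped in another element that is also named "color".
WRAPPED = "<color>" + FLAT + "</color>"


def color_elements(xml_text):
    """Describe every element named 'color' as (node type, child count, text)."""
    root = ET.fromstring(xml_text)
    return [
        (type(el).__name__, len(el), (el.text or "").strip())
        for el in root.iter("color")
    ]


# Both the "field" and the "object" come back as the same kind of node.
print(color_elements(FLAT))
print(color_elements(WRAPPED))
```

Any rule that tells the two color elements apart (for example, "an element with only text is a field") has to be supplied by the application; it is not part of the XML itself.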

Water’s use of XML makes a clear separation between objects and fields. An XML element represents an object. XML attributes represent fields of an object. The ConciseXML syntax allows any type of object as the value of an attribute; therefore, Water supports fields that can store any type of object, not just strings.

ConciseXML and XML 1.0
XML was derived from SGML and contains constructs from a document-centric, not data-centric, world. The XML 1.0 syntax has no standard representation for objects with fields. Field key, data, and type information might be encoded in a dozen different ways in a dozen different standards. A substantial number of developers find XML 1.0 syntax verbose and cumbersome and avoid using it for these reasons. ConciseXML is efficient and readable without the use of special tools.

ConciseXML can be as concise as the comma-separated values (CSV) format for data. When used for programming, ConciseXML is as concise as Java or C++ syntax.

ConciseXML has one consistent representation for encoding data: it uses attributes for fields. XML 1.0 specifies that attributes must be quoted strings, but ConciseXML allows the value of a field to be any object, not just a string. The key of a field can also be any object, not just a string. This simple extension significantly reduces the ambiguity of XML 1.0 documents and makes XML much more suited for representing data and logic.
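The string-only restriction on XML 1.0 attributes can be illustrated with a short Python sketch using the standard xml.etree.ElementTree module. The names FIELD_TYPES, typed_fields, and is_well_formed_xml10 are hypothetical and introduced only for this example; this is not how Water or ConciseXML is implemented. It shows that an XML 1.0 parser hands back every attribute value as a string, so the application must carry its own type table, and that an unquoted value such as size=10 is not accepted by an XML 1.0 parser.

```python
import xml.etree.ElementTree as ET

# XML 1.0 form: every attribute value must be a quoted string.
item = ET.fromstring('<item id="xx283" color="blue" size="10"/>')
raw_size = item.attrib["size"]  # the parser returns text, not a number

# Assumption for illustration: the application keeps its own field-type table,
# because XML 1.0 itself carries no type information for attributes.
FIELD_TYPES = {"id": str, "color": str, "size": int}


def typed_fields(element, field_types):
    """Convert string attribute values to objects using an external type table."""
    return {key: field_types[key](value) for key, value in element.attrib.items()}


def is_well_formed_xml10(text):
    """Return True if a standard XML 1.0 parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


fields = typed_fields(item, FIELD_TYPES)
print(type(raw_size).__name__, fields)
print(is_well_formed_xml10('<item size=10/>'))
```

With a syntax in which the value of a field can be any object, the type table in this sketch would no longer be needed for simple cases such as numbers.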

ConciseXML is a superset of XML 1.0 and is both forward and backward compatible with XML 1.0. ConciseXML supports both the XML 1.0 form and a concise form. ConciseXML supports three syntactic extensions to XML 1.0 that reduce the verbosity of XML 1.0 (see Sidebar: ConciseXML Makes Eight Extensions to XML 1.0).

First, ConciseXML permits the closing tag name to be omitted. The closing tag name is redundant because a machine reading the XML can simply match opening tags with closing tags. Requiring the ending tag name encourages developers to choose short, abbreviated tag names. When calling a method in a normal programming language, you don’t expect to type the method name twice:
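The claim that a machine can match opening tags with closing tags on its own can be sketched with a stack. The Python example below is a minimal illustration, not the Water parser. It assumes that a closing tag with its name omitted is written </>, which is a spelling chosen here for the example, and it uses a simplified regular expression that does not handle a > character inside attribute values. The function name resolve_closing_tags is hypothetical.

```python
import re

# Simplified tag pattern: optional "/", optional name, attributes, optional "/".
TAG = re.compile(r"<(/?)([A-Za-z_][\w.\-]*)?[^<>]*?(/?)>")


def resolve_closing_tags(text):
    """Fill in the name of every name-less closing tag from a stack of open tags."""
    stack = []   # names of elements that are currently open
    out = []     # pieces of the rewritten text
    pos = 0
    for match in TAG.finditer(text):
        out.append(text[pos:match.start()])
        pos = match.end()
        closing, name, self_closing = match.groups()
        if closing:
            if not stack:
                raise ValueError("closing tag with no open element")
            expected = stack.pop()
            if name is not None and name != expected:
                raise ValueError(f"expected </{expected}>, found </{name}>")
            out.append(f"</{expected}>")
        else:
            if name is None:
                raise ValueError("opening tag without a name")
            if not self_closing:
                stack.append(name)
            out.append(match.group(0))
    out.append(text[pos:])
    if stack:
        raise ValueError(f"unclosed element <{stack[-1]}>")
    return "".join(out)


concise = "<item><id>xx283</><color>blue</><size>10</></>"
print(resolve_closing_tags(concise))
```

Because the stack always knows which element is open, the name in the closing tag adds no information; when a name is present, the sketch only uses it as a consistency check.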

XML 1.0: ConciseXML: