Because XML was not designed for data, it has serious ambiguities and constraints when used to represent data. These limitations are hard for many to understand because most articles and books never address them. A good analogy is how we have perceived the earth. Hundreds of years ago the prevailing view was that the world is flat, because people experience a flat world. Even people who saw the complications of the flat-world view continued to justify it after the logic of the round-world view was quite clear. Imagine the difficulty Columbus had trying to convince people that the world was round. People thought he was crazy.
How is this discussion about a flat versus a round world related to XML? The vast majority of people working with XML today believe in the equivalent of an XML flat world: XML as a syntax for documents. This flat-world view is everywhere. Supported by standards, committees, books, and seminars, it is the dominant view of XML. The round-world model treats XML as a syntax for data and objects. The differences are slight, but the implications are huge.
A small but growing group of people understands that the document-centric model has outlived its usefulness and that a data-centric model is required for progress. This article aims to convert people to a simpler, round-world view, leading them toward seeing XML in a whole new way. It also introduces ConciseXML, a superset of XML and the syntax of the Water programming language, which is intended to address not only the limitations of XML but also its verbosity.
XML is a Syntax
XML is wrapped in a number of misconceptions. Many articles define XML as one of the following:
- A standard for describing data in a machine-understandable way
- The solution for application integration problems
- A programming language
XML is none of these things. XML is simply a syntax. The XML 1.0 standard does not define any tag names or attributes that carry meaning. XML only describes the syntax for representing elements and attributes. XML does not even specify a standard way to represent objects and data, because there is no standard for how elements and attributes should be used to represent objects and fields.
Hundreds of standards, including XHTML, WSDL, UDDI, SVG, and ebXML, use an XML syntax for describing data. Because no single way exists for representing data in XML, every standard chooses its own way. As a result, standards with an XML syntax cannot easily share data with one another. This problem is not yet well understood in the industry and has led to significant complexity when trying to deploy solutions based on XML.
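The lack of a single representation can be illustrated with a short sketch. It assumes two hypothetical vocabularies that encode the same record, one with attributes and one with child elements. The tag names and helper functions are illustrative only and are not taken from any of the standards named above.

```python
import xml.etree.ElementTree as ET

# Two hypothetical vocabularies encoding the same record (names are illustrative).
AS_ATTRIBUTES = '<item id="xx283" color="blue" size="10"/>'
AS_ELEMENTS = "<item><id>xx283</id><color>blue</color><size>10</size></item>"


def read_attribute_style(xml_text):
    """Read fields that a vocabulary stores as attributes."""
    return dict(ET.fromstring(xml_text).attrib)


def read_element_style(xml_text):
    """Read fields that a vocabulary stores as child elements."""
    return {child.tag: child.text for child in ET.fromstring(xml_text)}


record_a = read_attribute_style(AS_ATTRIBUTES)
record_b = read_element_style(AS_ELEMENTS)

# Both readers recover the same fields, but only when matched to their own style.
# A reader written for one style finds nothing in the other.
wrong = read_attribute_style(AS_ELEMENTS)
```

Both documents are well-formed XML, yet each needs its own reader. Nothing in the syntax itself tells a program which convention is in use.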
Document Versus Data
XML is commonly used for representing data structures. A data structure is simply a way to represent data that obeys some well-defined structure. The Water language, using ConciseXML, can formally describe the structure of data by using Water Type and Water Contract. Using Water, you also can unambiguously represent static data.
Representing static data might seem straightforward, but XML 1.0 has design constraints carried over from the document markup world that can make representing data in XML quite confusing. The quandary between elements and attributes is a common example of this confusion.
Most programming languages and other technologies for representing data employ the concept of a data structure or object. This article, by convention, uses the term object. The word object is similar to other terms such as a record, structure, or tuple from other technologies.
In most programming languages, an object has fields, and those fields hold values that are also objects. Water objects have this property as well: An object is a collection of fields; each field has a key and a value; and the value can be any object.
The following ConciseXML is an example of an item object:
<item id="XL283" color="blue" size=10/>
The preceding ConciseXML could be described as creating an instance of an item object. The instance has three fields: id, color, and size. The value of the id field is the string "XL283", the value of the color field is "blue", and the value of the size field is the number 10.
The type or class of the object appears as the element’s name, immediately following the opening angle bracket (<):
<item id="xx283" color="blue" size=10/>
In the preceding line, the instance of item has three fields. "id" is the key of the first field, and "xx283" is the value of the field. "color" is the key of the second field and "blue" is its value. "size" is the key of the third field and the integer 10 is its value.
It is very common, though, to see the following XML used to represent the instance of item:
<item><id>xx283</id><color>blue</color><size>10</size></item>
To the vast majority of people, the above XML is normal and easily understood, but this is an example of XML in the flat-world model. The round-world model sees this as an ambiguous, poorly constructed XML data object. One problem (which is described in detail later in this article) is that the syntax of an XML element is used to represent two very different things: an object and a field of an object. Having one syntax for two different concepts creates a serious ambiguity when a machine tries to interpret the meaning of the XML data.
For a data structure to be useful, the distinction between objects and fields is extremely important. How, for example, do you know that the color element represents a field of item and not an instance of type color? As humans, we use our gift of pattern recognition to deduce that color must be a field of item because it occurs within the content of item and it has blue in the content of the element.
To emphasize the ambiguity, what if you wrapped the item within another color element? Is item now a field of color? Did the meaning of item radically change because it moved to a different level in the structure? Consider the following example:
<color><item><id>xx283</id><color>blue</color><size>10</size></item></color>
If a serious ambiguity appears in such a small example, imagine the scope of the problem when objects and data structures get more complex. At a minimum, data structures need to be unambiguous, and interpreting one should not depend on any outside knowledge.
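A machine faces the same question a human reader does, but without the pattern recognition. The following sketch assumes one common guess for element-only XML: an element without children is a field value, and an element with children is an object whose children are its fields. The function name and the rule itself are assumptions made for illustration, not part of XML 1.0 or Water.

```python
import xml.etree.ElementTree as ET


def naive_to_object(element):
    """Guess the meaning of element-only XML: leaf elements become field
    values, and any element with children becomes an object keyed by tag."""
    if len(element) == 0:
        return element.text
    return {child.tag: naive_to_object(child) for child in element}


plain = ET.fromstring(
    "<item><id>xx283</id><color>blue</color><size>10</size></item>"
)
wrapped = ET.fromstring(
    "<color><item><id>xx283</id><color>blue</color><size>10</size></item></color>"
)

plain_result = naive_to_object(plain)
# {'id': 'xx283', 'color': 'blue', 'size': '10'}

wrapped_result = naive_to_object(wrapped)
# {'item': {'id': 'xx283', 'color': 'blue', 'size': '10'}}
```

Under this guess, item silently becomes a field of a color object once it is wrapped, and color is read as an object at one level and as a field at another. The guess is not wrong so much as unverifiable: the markup carries no information that would confirm or reject it.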
Water’s use of XML makes a clear separation between objects and fields. An XML element represents an object. XML attributes represent fields of an object. The ConciseXML syntax allows any type of object as the value of an attribute; therefore, Water supports fields that can store any type of object, not just strings.
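The element-is-object, attribute-is-field convention can be approximated even with a standard XML 1.0 parser. The sketch below is a rough Python illustration of that convention, not Water's implementation. The dictionary layout and the function name are assumptions, and because the input is plain XML 1.0, every field value is still a string.

```python
import xml.etree.ElementTree as ET


def to_object(element):
    """Read an element as an object: the tag is the type, attributes are
    the fields, and child elements are nested objects, never fields."""
    return {
        "type": element.tag,
        "fields": dict(element.attrib),
        "content": [to_object(child) for child in element],
    }


# Wrapping item in another element no longer changes what item means.
wrapped = ET.fromstring('<color><item id="xx283" color="blue" size="10"/></color>')
result = to_object(wrapped)

item = result["content"][0]
# item["type"] is "item"; item["fields"]["color"] is "blue"
```

With this rule there is nothing to guess. The outer color is an object with no fields, item is an object in its content, and the inner color is a field of item.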
ConciseXML and XML 1.0
XML was derived from SGML and contains constructs from a document-centric, not data-centric, world. The XML 1.0 syntax has no standard representation for objects with fields. Field key, data, and type information might be encoded in a dozen different ways in a dozen different standards. A substantial number of developers find XML 1.0 syntax verbose and cumbersome and avoid using it for these reasons. ConciseXML is efficient and readable without the use of special tools.
ConciseXML can be as concise as the comma separated value (CSV) format for data. When used for programming, ConciseXML is as concise as Java or C++ syntax.
ConciseXML has one consistent representation for encoding data: it uses attributes for fields. XML 1.0 specifies that attributes must be quoted strings, but ConciseXML allows the value of a field to be any object, not just a string. The key of a field can also be any object, not just a string. This simple extension significantly reduces the ambiguity of XML 1.0 documents and makes XML much more suited for representing data and logic.
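The quoted-string rule is easy to observe with any conforming XML 1.0 parser. The following check uses Python's standard library parser. The helper name is an assumption, and the unquoted form is written the way this article describes a numeric field value in ConciseXML.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(text):
    """Return True if a standard XML 1.0 parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


quoted = is_well_formed_xml('<item id="xx283" color="blue" size="10"/>')
unquoted = is_well_formed_xml('<item id="xx283" color="blue" size=10/>')
# quoted is True; unquoted is False, because XML 1.0 requires quoted values
```

In the quoted form the parser hands back the string "10", and it is left to each application to decide whether that string means a number.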
ConciseXML is a superset of XML 1.0 and is both forward and backward compatible with XML 1.0. ConciseXML supports both the XML 1.0 form as well as a concise form. Three of ConciseXML's syntactic extensions to XML 1.0 that reduce its verbosity are described here (see Sidebar: ConciseXML Makes Eight Extensions to XML 1.0).
First, ConciseXML permits the closing tag name to be omitted. The closing tag name is redundant because a machine reading the XML can simply match opening tags with closing tags. Requiring the ending tag name encourages developers to choose short, abbreviated tag names. When calling a method in a normal programming language, you don’t expect to type the method name twice:
XML 1.0: ConciseXML:
Making the closing tag name optional is actually a requirement for Water because of its dynamism. In some cases, the object of the call is calculated during runtime and therefore is not known when writing the code:
some_text. hypertext.TEXTAREA else hypertext.INPUT />
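The claim that a machine can match opening and closing tags without the closing name can be sketched as a small preprocessor. This is an illustration only, not Water's parser. It assumes the nameless closing tag is written as </>, and it uses a simplified regular expression that ignores comments, CDATA sections, and attribute values containing angle brackets.

```python
import re
import xml.etree.ElementTree as ET

# Matches an opening, closing, or self-closing tag; the name is optional so
# that a nameless closing tag (assumed here to be written "</>") also matches.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)?([^<>]*)>")


def expand_nameless_close_tags(text):
    """Fill in omitted closing tag names by tracking open tags on a stack."""
    stack = []

    def fix(match):
        closing, name, rest = match.group(1), match.group(2), match.group(3)
        if closing:
            opened = stack.pop()  # the most recently opened tag closes first
            return f"</{name or opened}>"
        if not rest.rstrip().endswith("/"):  # self-closing tags stay off the stack
            stack.append(name)
        return match.group(0)

    return TAG.sub(fix, text)


expanded = expand_nameless_close_tags('<item id="xx283"><note>blue</></>')
# '<item id="xx283"><note>blue</note></item>'

root = ET.fromstring(expanded)  # the expanded text parses as XML 1.0
```

The stack is all the information needed, which is the sense in which the closing tag name is redundant for a machine reader.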
Second, ConciseXML supports by-position arguments. This means the attribute key can be omitted. When calling a method in a typical programming language, you do not expect to name the arguments in a call. This eliminates ambiguity because the definition of the method defines the exact ordering of arguments. In this respect, ConciseXML supports the calling convention of popular programming languages:
"abe" ConciseXML:"abe". bar .bar
ConciseXML also eliminates the need for character entities and replaces them with a standard object reference. See the www.waterlanguage.org site for more details.
For every ConciseXML extension to XML 1.0, there is a corresponding representation in the XML 1.0 syntax. A single file can include both ConciseXML and XML 1.0 forms. Many different XML 1.0 representations for ConciseXML are possible, but ConciseXML uses a simple representation where attributes that have non-string keys or values are put within an attributes element. The following two examples create the same object:
Backward-compatible XML Solution
ConciseXML is an innovation on XML 1.0 that enables XML to better handle data and logic. The constraints of XML 1.0 have limited its use because of significant ambiguity and verbosity. ConciseXML removes those constraints and provides a backward-compatible solution.