The Extensible Markup Language, or XML, is big. But is it too big? And if so, should we do anything about it?
The World Wide Web Consortium says that XML “is a simple, very flexible text format,” but in reality, non-trivial XML documents can be quite complex. Parsing an XML document takes a lot of code and a lot of CPU horsepower — it’s actually more difficult to parse a large document than to create one.
If an XML document is damaged or malformed, software can become very confused. A conforming parser must treat even a trivial well-formedness error, such as one mistyped closing tag, as fatal, so minor corruption can stop processing entirely. Working with schema extensions can be difficult, and older documents that rely on DTDs (Document Type Definitions) can be incomprehensible, as can code written against the Document Object Model (DOM).
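To make the "trivial errors stop processing" point concrete, here is a minimal sketch using Python's standard-library xml.etree.ElementTree parser. The two sample documents and the try_parse helper are made up for illustration; the only difference between the documents is one mistyped closing tag.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: BAD differs from GOOD by one mistyped closing tag.
GOOD = "<order><item>widget</item><qty>2</qty></order>"
BAD = "<order><item>widget</item><qty>2</qyt></order>"


def try_parse(text):
    """Return (root_tag, None) on success or (None, error_message) on failure."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError as exc:
        # The parser gives up at the first well-formedness error;
        # no partial tree is returned.
        return None, str(exc)
    return root.tag, None


good_tag, good_err = try_parse(GOOD)
bad_tag, bad_err = try_parse(BAD)

print("good:", good_tag, good_err)
print("bad: ", bad_tag, bad_err)
```

The good document yields its root element, while the bad one yields nothing but an error message, even though every other byte is intact.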
XML, however, is crucial for exchanging data, including documents. Modern file formats, such as Microsoft’s DOCX and XLSX, are XML-based updates of the old Microsoft Word and Excel spreadsheet formats. Similarly, the OpenDocument Format used by the non-Microsoft world is also an XML-based format.
Still, XML is complex — hard to understand, difficult to validate, requiring extensive resources for parsing and creating documents. That has led to suggestions for a simplified version of the spec, such as MicroXML, proposed by James Clark and others.
Clark’s thoughts about MicroXML, published on his blog in December 2010, lay out a solid set of requirements, ditching “problematic” parts of XML like the DOCTYPE declaration, namespaces, encodings other than UTF-8, XML declarations, attribute value normalization, and CDATA sections.
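As a rough way to see how many of those constructs a given document actually uses, the sketch below scans raw text for a few telltale markers. This is a crude lexical check, not a MicroXML parser or validator: the marker strings, the MARKERS table, and the find_dropped_features helper are assumptions made for illustration, and the scan can be fooled by ordinary text that happens to contain a marker. It also ignores encodings and attribute value normalization, which cannot be detected this way.

```python
# Crude lexical markers for some of the constructs listed above.
# Assumption for illustration only: a real check would need a proper parser.
MARKERS = {
    "DOCTYPE declaration": "<!DOCTYPE",
    "XML declaration": "<?xml ",
    "CDATA section": "<![CDATA[",
    "namespace declaration": "xmlns",
}


def find_dropped_features(text):
    """Return the names of listed constructs whose marker appears in text."""
    return sorted(name for name, marker in MARKERS.items() if marker in text)


# Hypothetical sample documents.
FULL = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE note SYSTEM "note.dtd">\n'
    '<note xmlns="http://example.com/ns"><![CDATA[a < b]]></note>'
)
SIMPLE = "<note><body>a &lt; b</body></note>"

print(find_dropped_features(FULL))
print(find_dropped_features(SIMPLE))
```

Run over a directory of real files, a check like this gives a quick, if imprecise, feel for how much of everyday XML would already fit inside a subset.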
What has happened since then? In mid-2011, John Cowan built on Clark’s requirements with a draft spec for MicroXML.
What prompted today’s musing is a two-part series of articles by Uche Ogbuji, published on IBM developerWorks in mid-June 2012: Explore the Basic Principles of MicroXML and Process MicroXML with MicroLark.
What do you think about XML and MicroXML — and would you welcome a subset?