Model XML to Please Humans and Computers Alike

“Heuristics” is a word to inspire fear and loathing among the non-technical. But actually, it’s one of the friendliest design-related concepts out there. This article suggests five heuristics for ensuring the human readability of your XML documents. These aren’t iron-clad rules or complicated design patterns, so use them when you need to, and throw them out when you don’t.

Why Bother?
That’s the first question: When designing XML, why worry about humans at all? After all, SOAP makes few concessions to human readability, and it’s a W3C standard.

The answer is: human authorship. SOAP is designed to be written and read by applications. But there are lots of XML dialects designed to be written by humans: XML Schema and XSLT spring to mind. Less ambitious examples of XML documents that need to be human-writable might include a complex config file, or an XML-based macro language.

Creating a dialect for human authorship isn’t easy, but XML has some big advantages, and the biggest is platform support. XML is so popular, with so many class libraries and XML-aware editors, that I don’t want to reinvent the wheel by writing my own parser and my own editing software. Working with XML gives you a lot of things for free (or at least at a bargain-basement price). Here are just a few:

  • Syntax-checking based on an XML Schema
  • Searching, whether programmatically or with XQuery
  • Processing with XSLT
  • Writing with one of the many good, cheap, XML-aware editors out there

What You Need
Familiarity with XML and XML Schema, and a good XML-aware text editor.

Introducing NoteML
As mentioned earlier, this article will teach you how to write XML that maximizes human readability. To do that, I’ll create a sample application: an XML dialect to organize the output of my own human authorship. I wanted a way to manage all of my writing in a single place but to emphasize human readability and writability. In a typical day, I might write:

  • Personal e-mail
  • Work-related e-mail
  • Test cases
  • User stories
  • API documentation
  • Notes
  • Weblog entries

And maybe a whole lot of other stuff as well. The trouble is that, despite similarities in form and content, each of the categories above potentially has its own application and its own file format. That makes finding old pieces of writing difficult, since the search could involve several different applications: Outlook, Word, and TextEdit at the very least.

It’s true that my penmanship is genuinely shameful, but there’s another reason why I don’t take notes on paper: computers make text-management easier. I call this dialect NoteML. There may be more practical ways to meet the same requirements. But an XML dialect for marking up notes does a good job of illustrating the tradeoffs between the needs of humans and the needs of software. This dialect should enable:

  • Searching. The trouble with notes is finding them months later when I actually need them. A proper model should make searching for relevant notes by date or by subject easier.
  • Processing. E-mail and blog entries don’t do anyone any good if they just stay on my hard drive. Processing them with XSLT or application logic should make it easy to get data out of NoteML and into a final format ready for consumption by the outside world.
  • Writing. At the very least, this format should be able to take advantage of an XML editor. But, at some point in the future, maybe I’ll get around to writing an application based on this format.
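
The searching and processing requirements can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `notes` wrapper element, the `date` and `subject` attributes, and the sample data are all hypothetical, and plain application logic stands in for XQuery and XSLT.

```python
import xml.etree.ElementTree as ET

# Hypothetical NoteML document: the <notes> wrapper, the date/subject
# attributes, and the sample data are assumptions made for illustration.
NOTES_XML = """
<notes>
  <generic date="2004-03-01" subject="groceries">Milk, eggs, coffee.</generic>
  <generic date="2004-03-02" subject="xml">Research Schematron next.</generic>
  <generic date="2004-03-09" subject="xml">Patterns with generics.</generic>
</notes>
"""

def find_notes(root, subject=None, since=None):
    """Return notes matching an optional subject and an optional start date."""
    matches = []
    for note in root.iter("generic"):
        if subject is not None and note.get("subject") != subject:
            continue
        # ISO 8601 dates (YYYY-MM-DD) compare correctly as plain strings.
        if since is not None and note.get("date", "") < since:
            continue
        matches.append(note)
    return matches

def to_html(notes):
    """Render notes as an HTML list (application logic in place of XSLT)."""
    ul = ET.Element("ul")
    for note in notes:
        li = ET.SubElement(ul, "li")
        li.text = f"{note.get('date')}: {(note.text or '').strip()}"
    return ET.tostring(ul, encoding="unicode")

root = ET.fromstring(NOTES_XML)
xml_notes = find_notes(root, subject="xml", since="2004-03-05")
html = to_html(xml_notes)
print(html)
```

Because the notes are ordinary XML, the search needs no custom parser; the same document could just as well be queried with XQuery or transformed with a stylesheet.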

Listings 1 and 2 exemplify the two extremes. Listing 1 is a human-friendly format that requires a custom parser. Listing 2 shows well-formatted SOAP, which no person should really be expected to write by hand.

The requirements of NoteML are simple. Eventually, this format should be extensible enough to include all kinds of writing: e-mail, spec documents, and everything else. But for right now, I’ll just worry about three types of entry:

  • Generic. Sometimes, I just want to write a simple text note, but I may not have a fancy text editor handy. Maybe it’s a grocery list. Or maybe it’s a great idea that came to me in the shower and I want to get it down before I forget.
  • Notes on a book or a Web site. If I take notes on a specific topic, NoteML should be able to catalog my reactions as well as the sources I used.
  • Blog entries. I should be able to write a format-neutral post in NoteML, and transform it into HTML using XSLT.
Author’s Note: This article borrows its approach from Arthur J. Riel’s classic design textbook, Object-Oriented Design Heuristics. If you haven’t read it, check it out. The book is filled with fundamental advice that even the most seasoned designer occasionally forgets.

Heuristics
The central challenge in designing a schema for NoteML will be to apply the semantic transparency that makes XML so useful without compromising human readability.

Heuristic 1: For balancing competing goals, consider creating subsets of the language.

When using NoteML, sometimes I want to write simple notes with a no-frills text editor; sometimes I want to write complex notes with an XML-aware editor. One way to meet my two different needs would be to create two separate languages. I could create, say, SimpleNoteML in addition to standard NoteML and then use XSLT to transform between the two.

But an even better approach is to create a sub-language. The English language does this beautifully: Scientists at a physics conference use a different set of words and phrases than do musicians at a rehearsal.

XML Schema can enable both simplicity and complexity through the use of extension. Here’s the schema type definition for a generic note:

                       

This format makes a generic note very easy to compose even with Notepad. Here’s an example:

  Patterns with generics.

But XML Schema allows the developer to extend types for more specific applications, like this:

                                                                                                                        
Author’s Note: Be careful when deriving new types by extension: It’s perfectly legal in XML Schema, but not all editors are smart enough to work with it.

An instance of the StudyNoteType is complex enough that I wouldn’t want to write it in Notepad. Here’s one example:

  There are some complaints-- what's (exactly) the distinction between data and metadata?
  Information represented tags, elements, attributes, hierarchy, and sequence
  Research schematron next.
  XML Schema Part 0 Primer Second Edition
  http://www.w3.org/TR/xmlschema-0/
  Fallside David
  Walmsley Priscilla

Think of this as the XML equivalent of the object-oriented principle of data hiding. Of course, the data isn’t really hidden (in fact, it’s right there in plain text), but if the user doesn’t have to care about the complexity, it might as well be hidden. Extension can give you power when you need it and simplicity when you don’t.
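
As a rough illustration of derivation by extension, the sketch below embeds a small schema and uses Python to report which type extends which. It is not the article’s own listing: `StudyNoteType` and `revisionStatus` come from the text, while `GenericNoteType`, `source`, `title`, and `url` are hypothetical names. The Python only checks that the schema is well-formed and reads the derivation; it does not validate anything.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# A sketch only. StudyNoteType and revisionStatus appear in the text;
# GenericNoteType, source, title, and url are hypothetical names.
SCHEMA = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Base type: free text plus one optional attribute. -->
  <xs:complexType name="GenericNoteType" mixed="true">
    <xs:attribute name="revisionStatus" type="xs:string" default="InProgress"/>
  </xs:complexType>

  <!-- Derived type: everything in the base, plus an optional source. -->
  <xs:complexType name="StudyNoteType" mixed="true">
    <xs:complexContent>
      <xs:extension base="GenericNoteType">
        <xs:sequence>
          <xs:element name="source" minOccurs="0">
            <xs:complexType>
              <xs:sequence>
                <xs:element name="title" type="xs:string"/>
                <xs:element name="url" type="xs:anyURI"/>
              </xs:sequence>
            </xs:complexType>
          </xs:element>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
</xs:schema>
"""

def extension_bases(schema_text):
    """Map each top-level complex type to the base type it extends, if any."""
    root = ET.fromstring(schema_text)
    bases = {}
    for ctype in root.findall(f"{{{XS}}}complexType"):
        ext = ctype.find(f"{{{XS}}}complexContent/{{{XS}}}extension")
        if ext is not None:
            bases[ctype.get("name")] = ext.get("base")
    return bases

bases = extension_bases(SCHEMA)
print(bases)
```

In a design like this, someone writing a plain note never needs to know that `source` exists, while the richer type stays available to anyone working in an XML-aware editor.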

Heuristic 2: To increase human-writability, minimize the number of required tags and attributes.

Object-oriented design gurus sometimes say, “limit the number of data-members in each class to around six.” The idea is based on studies suggesting that a typical human can hold about seven items, plus or minus two, in short-term memory.

For writing without the aid of an XML-aware text editor, I like to keep the number of required tags and attributes even lower, say three or four. When writing, it’s dependent clauses and prepositional phrases that I’m trying to juggle in my short-term memory, not markup tags. Consequently, NoteML should let users focus (as much as possible) on the notes instead of on the markup. If the price is extensibility, then I’m willing to pay it.

Using defaults is one way to limit the number of required attributes. The NoteML schema defines a generic element with a revisionStatus attribute. If I’m writing in a simple text editor, the odds are that the note isn’t yet ready for public consumption, so revisionStatus has a default of “InProgress” and I can leave it out.
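
One practical consequence is worth sketching. A non-validating parser such as Python’s `xml.etree.ElementTree` never reads the schema, so it will not supply a schema-declared default; the consuming code has to apply the same default itself. The `revision_status` helper below is hypothetical, and the note contents are made up for illustration.

```python
import xml.etree.ElementTree as ET

# "InProgress" is the default named in the text. The helper is a hypothetical
# reader that mirrors the schema default for parsers that ignore the schema.
DEFAULT_STATUS = "InProgress"

def revision_status(note):
    """Return the note's revisionStatus, falling back to the default."""
    return note.get("revisionStatus", DEFAULT_STATUS)

# The terse form is all a writer has to type in a plain text editor.
terse = ET.fromstring("<generic>Patterns with generics.</generic>")
explicit = ET.fromstring('<generic revisionStatus="Abandoned">Old idea.</generic>')

print(revision_status(terse))
print(revision_status(explicit))
```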

Heuristic 3: Avoid dependencies between siblings.

This is a trap that’s easy to fall into when observing earlier heuristics. The trouble is that humans are very good at contextualizing information. So there’s often a temptation to bend software into accommodating human-style contextualization.

For example, take this fragment:

                

What if I want to extend the revisionStatus attribute to include a reason for the abandonment?

One option is to add another attribute called reasonAbandoned, like this:

  

But adding the new attribute creates a dependency between sibling nodes. In other words, the reasonAbandoned attribute only has meaning if the value of revisionStatus is “Abandoned”. Documents like this are a hassle to parse. Even worse, they can allow you to create valid markup that is in fact meaningless, like this example:

  

Adding context-sensitivity leans too far toward human-readability.
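
To make the cost concrete, here is a sketch of the extra check every consumer would have to carry. Plain XML Schema 1.0 attribute declarations can’t say “this attribute is allowed only when that one has a particular value”, so the rule ends up in application code. The `dependency_errors` function and the sample notes are hypothetical; the attribute names and values are the ones used in the text.

```python
import xml.etree.ElementTree as ET

def dependency_errors(note):
    """Report sibling-attribute combinations that are valid but meaningless."""
    errors = []
    status = note.get("revisionStatus", "InProgress")
    reason = note.get("reasonAbandoned")
    # reasonAbandoned only makes sense when the note was actually abandoned.
    if reason is not None and status != "Abandoned":
        errors.append(f"reasonAbandoned given but revisionStatus is {status!r}")
    return errors

consistent = ET.fromstring(
    '<generic revisionStatus="Abandoned" reasonAbandoned="Project cancelled."/>'
)
meaningless = ET.fromstring(
    '<generic revisionStatus="InProgress" reasonAbandoned="Project cancelled."/>'
)

print(dependency_errors(consistent))
print(dependency_errors(meaningless))
```

Every tool that reads the format needs a check like this, which is exactly the kind of hidden rule that makes a dialect harder to process.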

Heuristic 4: Use attributes to improve human writability.

It used to be that XML gurus would advise users to avoid attributes altogether, but now there seems to be consensus: Attributes have their place, but they sacrifice extensibility for syntactic simplicity.

Figure 1. Roots and Leaves: In the tree model of data, attributes are guaranteed to be leaves. Syntax can be cleaner because they never have to store complex data.

Clean syntax matters because the limitation of human memory is only one of the challenges when dealing with human data input (see Figure 1). Another challenge is human fallibility. We make mistakes, little ones especially, and lots of them. Attributes can help to minimize the mistakes.

Here’s a slightly modified version of an earlier example:

          Project cancelled.        

Setting aside the preponderance of different tags, a document with lots of nested elements can be difficult to write correctly by hand because the complexity can mask little errors. Using attributes can help in at least three ways:

  • Attributes don’t need a closing tag.
  • Attribute data makes less use of whitespace.
  • Attribute data tends to be shorter.

Here’s a different version of the same information. Notice how much easier it is to check visually:

           Final note: Forget it. Project got cancelled.  

But, of course, we’re giving up a certain amount of semantic transparency.
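
The trade-off can be sketched by writing the same note both ways and reading each back. The `subject` and `body` names are assumptions made for illustration; only `generic` and `revisionStatus` come from the text.

```python
import xml.etree.ElementTree as ET

# Element-heavy form: every field needs an opening and a closing tag.
# The subject and body names are hypothetical.
ELEMENT_FORM = """
<generic>
  <revisionStatus>Abandoned</revisionStatus>
  <subject>xml</subject>
  <body>Final note: Forget it.</body>
</generic>
"""

# Attribute form: the same fields, with a single pair of tags to get right.
ATTRIBUTE_FORM = (
    '<generic revisionStatus="Abandoned" subject="xml">'
    "Final note: Forget it."
    "</generic>"
)

def from_elements(root):
    """Read the element-heavy form into a plain dict."""
    return {
        "revisionStatus": root.findtext("revisionStatus"),
        "subject": root.findtext("subject"),
        "text": root.findtext("body"),
    }

def from_attributes(root):
    """Read the attribute form into the same dict shape."""
    return {
        "revisionStatus": root.get("revisionStatus"),
        "subject": root.get("subject"),
        "text": root.text,
    }

elem_root = ET.fromstring(ELEMENT_FORM)
attr_root = ET.fromstring(ATTRIBUTE_FORM)

same_data = from_elements(elem_root) == from_attributes(attr_root)
elem_tags = len(list(elem_root.iter()))  # elements a writer must open and close
attr_tags = len(list(attr_root.iter()))
print(same_data, elem_tags, attr_tags)
```

Both forms carry the same data, but the attribute form leaves far fewer places for a hand-typed mistake. What it gives up is room to grow: an attribute value cannot later sprout child elements the way `body` could.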

Heuristic 5: Research before you design.

This final principle isn’t specific to writing human-friendly XML formats. But basic research is such a good idea, and so many people just skip the step entirely that I’m throwing it in, too.

With the explosion of XML dialects, the odds are decent that there’s already one out there for whatever you want to build. So whether you’re into math, music, or cave exploration, make sure you do your research before investing a lot of time in design. You may find that someone else has done your work for you, and at the very least, you’ll get a few ideas.

Designing for Extensibility, Readability, and Writability
This may be an extreme case. I may have an exceptionally limited short-term memory. I may be the only person who gets irritated at the proliferation of applications designed for writing English prose. But there are certainly other reasons to increase the human-friendliness of your XML documents: editing config files, debugging, and spot-checking data. The good news is that accommodating grey matter doesn’t have to come at an exorbitant price.
