Model XML to Please Humans and Computers Alike : Page 2

Modeling XML documents is often a balancing act between human readability and extensibility. But you can build an XML schema that gives you the best of both worlds by following these five heuristics.





The central challenge in designing a schema for NoteML will be to apply the semantic transparency that makes XML so useful without compromising human readability.

Heuristic 1: For balancing competing goals, consider creating subsets of the language.

When using NoteML, sometimes I want to write simple notes with a no-frills text editor; sometimes I want to write complex notes with an XML-aware editor. One way to meet these two needs would be to create two separate languages. I could create, say, SimpleNoteML in addition to standard NoteML and then use XSLT to transform between the two.

But an even better approach is to create a sub-language. The English language does this beautifully: Scientists at a physics conference use a different set of words and phrases than do musicians at a rehearsal.

XML Schema can enable both simplicity and complexity through type extension. Here's the schema type definition for a generic note:

<xsd:complexType name="NoteType">
  <xsd:sequence>
    <xsd:element name="Content" type="ContentType" />
    <xsd:element name="Comment" type="Comment" minOccurs="0" maxOccurs="unbounded" />
  </xsd:sequence>
  <xsd:attribute name="subject" type="xsd:string" use="required" />
  <xsd:attribute name="date" type="xsd:date" use="required" />
  <xsd:attribute name="revisionStatus" type="RevisionStatus" default="InProgress" />
</xsd:complexType>

This format makes a generic note very easy to compose even with Notepad. Here's an example:

<Note subject="Think about..." date="2005-01-01"> <Content>Patterns with generics.</Content> </Note>

But XML Schema allows the developer to extend types for more specific applications, like this:

<!-- Remarks: NameType and ReferenceType are used by StudyNoteType -->
<xsd:complexType name="NameType">
  <xsd:sequence>
    <xsd:element name="Family" type="xsd:string" />
    <xsd:element name="Given" type="xsd:string" />
  </xsd:sequence>
</xsd:complexType>

<xsd:complexType name="ReferenceType">
  <xsd:sequence>
    <xsd:element name="Title" type="xsd:string" />
    <xsd:element name="URI" type="xsd:anyURI" />
    <xsd:element name="Contributor" type="NameType" maxOccurs="unbounded" />
  </xsd:sequence>
</xsd:complexType>

<xsd:complexType name="StudyNoteType">
  <xsd:complexContent>
    <xsd:extension base="NoteType">
      <xsd:sequence>
        <xsd:element name="Reference" type="ReferenceType" maxOccurs="unbounded" />
      </xsd:sequence>
    </xsd:extension>
  </xsd:complexContent>
</xsd:complexType>

Author's Note: Be careful when deriving new types by extension: It's perfectly legal in XML Schema, but not all editors are smart enough to work with it.

An instance of the StudyNoteType is complex enough that I wouldn't want to write it in Notepad. Here's one example:

<Note xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:type="StudyNoteType"
      date="2005-01-01"
      subject="Xml Schema Notes"
      revisionStatus="Finished">
  <Content>There are some complaints-- what's (exactly) the distinction between data and metadata?
Information represented tags, elements, attributes, hierarchy, and sequence</Content>
  <Comment subject="TODO" date="2005-01-20"> Research schematron next. </Comment>
  <Reference>
    <Title>XML Schema Part 0 Primer Second Edition</Title>
    <URI>http://www.w3.org/TR/xmlschema-0/</URI>
    <Contributor>
      <Family>Fallside</Family>
      <Given>David</Given>
    </Contributor>
    <Contributor>
      <Family>Walmsley</Family>
      <Given>Priscilla</Given>
    </Contributor>
  </Reference>
</Note>

Think of this as the XML equivalent of the object-oriented principle of data hiding. Of course the data isn't really hidden—in fact it's right there in plain text—but if the user doesn't have to care about the complexity, it might as well be hidden. Extension can give you power when you need it and simplicity when you don't.
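
Selecting the extended type from inside an instance relies on two pieces that the listings here don't show: a global element declaration for Note, and the XMLSchema-instance namespace bound to the xsi prefix. The fragment below is a minimal sketch of both, assuming the schema has no target namespace, that Note is declared with the base type, and that ContentType accepts plain text; none of these is confirmed by the article's schema.

```xml
<!-- In the schema (assumed): a global element declared with the base type. -->
<xsd:element name="Note" type="NoteType" />

<!-- In the instance: xsi:type names the derived type to validate against.
     The xsi prefix must be bound to the XMLSchema-instance namespace. -->
<Note xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:type="StudyNoteType"
      subject="Xml Schema Notes"
      date="2005-01-01">
  <Content>Derivation by extension appends to the base content model.</Content>
  <Reference>
    <Title>XML Schema Part 0 Primer Second Edition</Title>
    <URI>http://www.w3.org/TR/xmlschema-0/</URI>
    <Contributor>
      <Family>Fallside</Family>
      <Given>David</Given>
    </Contributor>
  </Reference>
</Note>
```

A note that omits xsi:type is validated as a plain NoteType, so the simple form keeps working unchanged alongside the extended one.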

Heuristic 2: To increase human-writability, minimize the number of required tags and attributes.

Object-oriented design gurus sometimes say, "limit the number of data-members in each class to around six." The idea is based on studies suggesting that a typical human has about seven to 10 short-term memory slots.

For writing without the aid of an XML-aware text editor, I like to keep the number of required tags and attributes even lower—say three or four. When writing, it's dependent clauses and prepositional phrases that I'm trying to juggle in my short-term memory, not markup tags. Consequently, NoteML should let users focus (as much as possible) on the notes instead of on the markup. If the price is extensibility, then I'm willing to pay it.

Using defaults is one way to limit the number of required attributes. The NoteML schema defines a revisionStatus attribute on the generic note, but if I'm writing in a simple text editor, the odds are that the note isn't yet ready for public consumption. So the revisionStatus attribute has a default of "InProgress" and can simply be left out.
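
As a rough sketch of how that default might be backed by a type, the fragment below pairs the revisionStatus attribute from NoteType with an enumerated simple type. The three values are the ones that appear in this article's examples; treating them as the complete list is an assumption, because the actual RevisionStatus definition is not shown.

```xml
<!-- Hypothetical definition of the RevisionStatus type referenced by NoteType.
     The value list is assumed from the examples in this article. -->
<xsd:simpleType name="RevisionStatus">
  <xsd:restriction base="xsd:string">
    <xsd:enumeration value="InProgress" />
    <xsd:enumeration value="Finished" />
    <xsd:enumeration value="Abandoned" />
  </xsd:restriction>
</xsd:simpleType>

<!-- Inside NoteType: no use="required", so the attribute may be omitted,
     and the default applies when it is. -->
<xsd:attribute name="revisionStatus" type="RevisionStatus" default="InProgress" />
```

The default is supplied only by a schema-validating processor. A plain XML parser that never reads the schema will report the attribute as absent, so code that consumes hand-written notes should either validate first or treat a missing revisionStatus as "InProgress" itself.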

Heuristic 3: Avoid dependencies between siblings.

This is a trap that's easy to fall into when observing earlier heuristics. The trouble is that humans are very good at contextualizing information. So there's often a temptation to bend software into accommodating human-style contextualization.

For example, take this fragment:

<Note revisionStatus="Abandoned"> <Content><!--Content here--></Content> </Note>

What if I want to extend the revisionStatus attribute to include a reason for the abandonment?

One option is to add another attribute called reasonAbandoned, like this:

<Note revisionStatus="Abandoned" reasonAbandoned = "Project cancelled" > <Content><!--Content here--></Content> </Note>

But adding the new attribute creates a dependency between sibling nodes. In other words, the reasonAbandoned attribute only has meaning if the value of revisionStatus is "Abandoned". Documents like this are a hassle to parse. Even worse, they can allow you to create valid markup that is in fact meaningless, like this example:

<Note revisionStatus="InProgress" reasonAbandoned = "Project cancelled" > <Content><!--Content here--></Content> </Note>

Adding context-sensitivity leans too far toward human-readability.
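
One way to remove the dependency is to make the reason a child of the state it describes, so the meaningless combination cannot be written at all. The fragment below is only a sketch of that idea: RevisionStatusType and the per-status element names are hypothetical and are not part of the NoteML schema shown earlier.

```xml
<!-- Hypothetical alternative: each status is an element, and only
     Abandoned is allowed to contain a Reason. -->
<xsd:complexType name="RevisionStatusType">
  <xsd:choice>
    <xsd:element name="InProgress">
      <xsd:complexType />
    </xsd:element>
    <xsd:element name="Finished">
      <xsd:complexType />
    </xsd:element>
    <xsd:element name="Abandoned">
      <xsd:complexType>
        <xsd:sequence>
          <xsd:element name="Reason" type="xsd:string" minOccurs="0" />
        </xsd:sequence>
      </xsd:complexType>
    </xsd:element>
  </xsd:choice>
</xsd:complexType>

<!-- A matching instance fragment, assuming an element named
     RevisionStatus is declared with the type above. -->
<RevisionStatus>
  <Abandoned>
    <Reason>Project cancelled.</Reason>
  </Abandoned>
</RevisionStatus>
```

Because InProgress and Finished are declared with empty content, a Reason placed inside either one fails validation. The price is more markup to type by hand, which is the trade-off the next heuristic takes up.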

Heuristic 4: Use attributes to improve human writability.

It used to be that XML gurus would advise users to avoid attributes altogether, but now there seems to be consensus: Attributes have their place, but they sacrifice extensibility for syntactic simplicity.

Figure 1. Roots and Leaves: In the tree model of data, attributes are guaranteed to be leaves. Syntax can be cleaner because they never have to store complex data.
Clean syntax matters because the limitation of human memory is only one of the challenges when dealing with human data input (see Figure 1). Another challenge is human fallibility. We make mistakes, little ones especially, and lots of them. Attributes can help to minimize the mistakes.

Here's a slightly modified version of an earlier example:

<Note <RevisionStatus> <Abandoned> <Reason>Project cancelled.<Reason> </Abandoned> <RevisionStatus> <Content><!--Content here--></Content> </Note>

Setting aside the preponderance of different tags, a document with lots of nested elements can be difficult to write correctly by hand because the complexity can mask little errors. Using attributes can help in at least three ways:
  • Attributes don't need a closing tag.
  • Attribute data makes less use of whitespace.
  • Attribute data tends to be shorter.
Here's a different version of the same information. Notice how much easier it is to check visually:

<Note revisionStatus="Abandoned"> <Content> <!--Content here--> Final note: Forget it. Project got cancelled. </Content> </Note>

But, of course, we're giving up a certain amount of semantic transparency.

Heuristic 5: Research before you design.

This final principle isn't specific to writing human-friendly XML formats. But basic research is such a good idea, and so many people just skip the step entirely that I'm throwing it in, too.

With the explosion of XML dialects, the odds are decent that there's already one out there for whatever you want to build. So whether you're into math, music, or cave exploration, make sure you do your research before investing a lot of time in design. You may find that someone else has done your work for you, and at the very least, you'll get a few ideas.

Designing for Extensibility, Readability, and Writability
This may be an extreme case. I may have an exceptionally limited short-term memory. I may be the only person who gets irritated at the proliferation of applications designed for writing English prose. But there are certainly other reasons to increase the human-friendliness of your XML documents: editing config files, debugging, and spot-checking data. The good news is that accommodating grey matter doesn't have to come at an exorbitant price.

Eric McMullen is a director at Falstaff Solutions, a Denver-based consulting house which specializes in data-centric .NET applications. Check out Falstaff's Web site at www.falstaffsolutions.com.