Tagging an XML document is, in many ways, similar to tagging an HTML document. Here are some of the most important guidelines to follow.
Rule #1: Remember the XML declaration
This declaration goes at the beginning of the file and alerts the browser or other processing tools that this document contains XML tags. The declaration looks like this:
<?xml version="1.0" encoding="UTF-8"?>
You can leave out the encoding attribute and the processor will use the UTF-8 default.
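To see the UTF-8 default in action, here is a minimal sketch using Python's standard-library parser. The sample documents and variable names are hypothetical; the point is only that the parser reads the same bytes the same way whether or not the encoding attribute is present.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents; only the declaration differs.
# The bytes \xc3\xa9 are the UTF-8 encoding of the letter e with an acute accent.
with_encoding = b'<?xml version="1.0" encoding="UTF-8"?>\n<note>caf\xc3\xa9</note>'
without_encoding = b'<?xml version="1.0"?>\n<note>caf\xc3\xa9</note>'

# Both byte strings are decoded as UTF-8, so the text content is identical.
text_a = ET.fromstring(with_encoding).text
text_b = ET.fromstring(without_encoding).text
print(text_a == text_b)
```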
Rule #2: Do what the DTD instructs
If you are creating a valid XML file, one that is checked against a DTD, make sure you know what tags are part of the DTD and use them appropriately in your document. Understand what each does and when to use it. Know what the allowable values are for each. If you follow those rules, the XML document will validate against the specified DTD.
Rule #3: Watch your capitalization
XML is case-sensitive. <P> is not the same as <p>. Be consistent in how you define element names. For example, use ALL CAPS, or use Initial Caps, or use all lowercase. It is very easy to create mismatched-case errors.
Also, make sure starting and ending tags use matching capitalization. If you start a paragraph with the <P> tag, you must end it with the </P> tag, not a </p> tag.
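A quick way to convince yourself of the capitalization rule is to hand both versions to a parser. This is a small sketch using Python's standard library; the helper name is_well_formed and the sample strings are our own, not part of any XML tool.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses without a syntax error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

matching = "<P>Some text</P>"      # start and end tags use the same case
mismatched = "<P>Some text</p>"    # end tag uses a different case

print(is_well_formed(matching))
print(is_well_formed(mismatched))
```

The first call succeeds; the second fails because the parser treats P and p as two different element names.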
Rule #4: Quote attribute values
In HTML there is some confusion over when to enclose attribute values in quotes. In XML the rule is simple: enclose every attribute value in quotes. Either single or double quotes will do, as long as the opening and closing quotes match.
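The quoting rule can be checked the same way. In this sketch the image element and its attribute values are made up for illustration; only the presence or absence of quotes matters.

```python
import xml.etree.ElementTree as ET

def parses(xml_text):
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

double_quoted = '<image src="logo.gif" width="100"/>'
single_quoted = "<image src='logo.gif' width='100'/>"
unquoted = "<image src=logo.gif width=100/>"   # acceptable to many HTML browsers, not to XML

print(parses(double_quoted), parses(single_quoted), parses(unquoted))
```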
Rule #5: Close all tags
In XML you must close all tags. This means that paragraphs must have corresponding end paragraph tags. Anchor names must have corresponding anchor end tags. A strict interpretation of HTML says we should have been doing this all along, but in reality, most of us haven’t.
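Here is a short sketch of what happens when a paragraph is left open, HTML-style. The doc and p element names are placeholders we chose for the example.

```python
import xml.etree.ElementTree as ET

# HTML habit: start a new paragraph without ending the previous one.
unclosed = "<doc><p>First paragraph.<p>Second paragraph.</doc>"
# XML requirement: every start tag has a matching end tag.
closed = "<doc><p>First paragraph.</p><p>Second paragraph.</p></doc>"

def error_or_none(xml_text):
    """Return the parser's error message, or None if the text is well-formed."""
    try:
        ET.fromstring(xml_text)
        return None
    except ET.ParseError as err:
        return str(err)

print(error_or_none(unclosed))
print(error_or_none(closed))
```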
Rule #6: Close empty tags, too
In HTML, empty tags, such as <br> or <meta>, do not close. In XML, empty tags do close. You can close them either by adding a separate close tag (</br>) or by combining the open and close tags into one tag. You create the combined open/close tag by adding a slash, /, just before the closing angle bracket, like this: <br/>
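The two ways of closing an empty tag are equivalent to a parser, as this sketch suggests. It assumes a break element named br inside a placeholder doc element; the helper name child_tags is ours.

```python
import xml.etree.ElementTree as ET

combined = "<doc>line one<br/>line two</doc>"        # open and close in one tag
separate = "<doc>line one<br></br>line two</doc>"    # separate close tag
html_style = "<doc>line one<br>line two</doc>"       # never closed

def child_tags(xml_text):
    """Return the names of the root's child elements, or None on a syntax error."""
    try:
        return [child.tag for child in ET.fromstring(xml_text)]
    except ET.ParseError:
        return None

print(child_tags(combined))
print(child_tags(separate))
print(child_tags(html_style))
```

Both closed forms produce the same single child element, while the HTML-style version is rejected.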
Examples
This table shows some common HTML tags and how they would be treated in XML.
Tag | Comment | End-Tag |
 | Technically, in HTML, you’re supposed to close this tag. In XML, it’s essential to close it. | |
 | All elements in XML must have a start-tag and an end-tag. | |
 | This tag must be closed in XML in order to ensure a well-formed XML document. | |
 | META tags are considered empty elements in XML, and they must close. | |
 | Break tags are considered empty elements. | |
 | This is an empty element tag. | |
Element and Attribute Rules
The first table contains the basic guidelines for creating element rules in an XML DTD.
The second contains attribute value types.
The third contains attribute default options.
Element Rules:
Attribute Values:
Type | Meaning | Example |
CDATA | Character data, text. | The COMMENT element has an attribute named category. This attribute contains letters, numbers, or punctuation symbols. |
NMTOKEN | Name token, text with some restrictions. The value can contain letters and digits, but no whitespace, and the only symbols it can contain are _, -, ., and :. | The COMMENT element has an attribute named category. This attribute contains a name token. |
(value-1 | value-2 | value-3) | A value list provides a set of acceptable options for the attribute to contain. In general, you should always include “other” as one of the options. | The COMMENT element has an attribute named category. The category can be “red,” “green,” “blue,” or “other.” The default value is “other.” |
ID | The keyword ID means that this attribute has an ID value that identifies this particular element. | The COMMENT element has an attribute named category. The category will contain an ID value. ID and IDREF work together to create cross-references. |
IDREF | The keyword IDREF means that this attribute has an ID reference value that points to another instance’s ID value. | The COMMENT element has an attribute named category. The category will contain an IDREF value. ID and IDREF work together to let you cross-reference elements. |
ENTITY | The keyword ENTITY means that this attribute’s value is an entity. An entity is a value that has been defined elsewhere in the DTD to have a particular meaning. | The COMMENT element has an attribute named category. The category will contain an entity name rather than text. |
NOTATION | The keyword NOTATION means that this attribute’s value is a notation. A notation names the format of non-XML data, such as an image type, so that an application knows how the information should be processed. | The COMMENT element has an attribute named category. The category attribute will contain a notation name. |
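To make the attribute types concrete, here is a rough sketch of a DTD that declares several of them on the COMMENT element, read back with Python's standard expat bindings. The attribute names other than category (code, color, label, see) are hypothetical, and the DTD is placed inside the document only to keep the example in one piece.

```python
import xml.parsers.expat

document = """<!DOCTYPE COMMENT [
  <!ELEMENT COMMENT (#PCDATA)>
  <!ATTLIST COMMENT
    category CDATA                  #REQUIRED
    code     NMTOKEN                #IMPLIED
    color    (red|green|blue|other) "other"
    label    ID                     #IMPLIED
    see      IDREF                  #IMPLIED>
]>
<COMMENT category="any text, 123!" code="A-1" label="c1">Hello</COMMENT>"""

declared_types = {}

def record_attribute(element, attribute, attr_type, default, required):
    # expat calls this once for every attribute declared in an ATTLIST.
    declared_types[attribute] = attr_type

parser = xml.parsers.expat.ParserCreate()
parser.AttlistDeclHandler = record_attribute
parser.Parse(document, True)

for name, attr_type in declared_types.items():
    print(name, attr_type)
```

Note that expat is a non-validating parser: it reports the declarations but does not check that each attribute value actually obeys its declared type.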
Attribute Default Options:
Type | Meaning | Example |
#REQUIRED | The attribute must always be included when the element is used. | The COMMENT element has an attribute named category. This attribute contains letters, numbers, or punctuation symbols. The attribute must always be used with the element. If you omit the attribute, the parser will give you an error message. |
#IMPLIED | The attribute is optional. If you see the keyword #IMPLIED, you know that this attribute will be ignored unless it is included in the element tag. It won’t take on any default values. | The COMMENT element has an attribute named category. You may use the attribute or omit the attribute, as the instance requires. |
#FIXED | The attribute is optional in the element tag, but it always has one specific value. If you see the keyword #FIXED, you know that this attribute will always have the specified value; entering any other value is an error. | The COMMENT element has an attribute named confirm. If it is used, its value must be “yes.” If it is omitted, the processor supplies the value “yes.” |
“value” | A value in quotes is the default value of this attribute. If you don’t enter the attribute in the element tag, the processor will assume the attribute has this default value. | The COMMENT element has an attribute named category. If you don’t use the attribute in the element tag, the attribute will automatically receive the value “other.” |
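The default options are easiest to understand by watching what a parser hands back when the tag carries no attributes at all. This sketch again uses Python's expat bindings with an in-document DTD; the note attribute is a hypothetical addition to show the #IMPLIED case.

```python
import xml.parsers.expat

document = """<!DOCTYPE COMMENT [
  <!ELEMENT COMMENT (#PCDATA)>
  <!ATTLIST COMMENT
    category (red|green|blue|other) "other"
    confirm  CDATA #FIXED "yes"
    note     CDATA #IMPLIED>
]>
<COMMENT>No attributes are written on this tag.</COMMENT>"""

seen = {}

def start_element(name, attributes):
    # attributes includes any values the parser filled in from the DTD.
    seen.update(attributes)

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.Parse(document, True)

print(seen)
```

The category and confirm attributes arrive with their declared values even though the tag never mentions them, while the #IMPLIED attribute is simply absent.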
Interaction Between Components
XML, CSS, script, the DOM, and the browser work together to let you create interactive presentations of your content.
XML Parsers
Parsing is the process of checking the syntax of your document and creating the “tree structure.” If you are using a validating parser, the process will also compare the XML file to its DTD.
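As an illustration of the “tree structure” a parser builds, the following sketch parses a small made-up catalog and prints one indented line per element. The document content and the outline helper are assumptions for the example, and the standard-library parser used here is non-validating, so it checks syntax only.

```python
import xml.etree.ElementTree as ET

document = """<catalog>
  <fish name="clownfish"><tank>A</tank></fish>
  <fish name="angelfish"><tank>B</tank></fish>
</catalog>"""

def outline(element, depth=0):
    """Return one indented line per element, mirroring the tree structure."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(document)   # raises ParseError if the syntax is wrong
print("\n".join(outline(root)))
```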
On-line Parsers
There are a number of online parsers. To use these, you typically type in the URI of your file and tell the process to begin.
- Online validating parser, from the W3C
The W3C offers an online parser. Type the URL of the file into the form and the XML file is both parsed and validated.
- Validating Parser from Brown University Scholarly Technology Group
This is the most easily accessible and understandable presentation of the online parsers.
Downloadable Parsers
There are many parsers that you can download and run on your local machine. Most of these require you to have either a Windows or UNIX machine. They are written in a variety of languages; this is a cross section of the many that are available.
- James Clark’s expat parser
James Clark is almost a brand in the SGML/XML world. His rendition of an XML parser is widely used.
- Java-based Validating XML Parser
From IBM’s AlphaWorks group, this parser claims to be 100% pure Java.
- Microsoft XML Parser Version 3.0 Release
A parser from Microsoft.
- XML Parser written in JavaScript
This parser is non-validating and checks XML syntax only.
- SiRPAC, Simple RDF Parser and Compiler
From the W3C.
XML Resources
There are many resources about XML out there on the Web. Here are a few of our favorites.
- Tim Bray’s Annotated XML Spec
You’ll want to read the XML spec, but the best way to do so is with a helping hand beside you. That’s exactly what Tim Bray has created in this very useful dual-frame presentation of the spec. And if anyone knows the inside scoop on the spec, it is Tim, who was one of the key people in its development.
- W3C XML Recommendation
This is the XML Recommendation in its full form.
Additional Resources
There is a range of trade groups and publications that focus on XML issues. Here are a few that are most useful for XML in a Web design/development context.
- W3C’s XML Activity Page
This is XML central, the place where new development is recorded.
- OASIS
Formerly known as SGML Open, this group covers structured documents, SGML and XML issues. It also hosts Robin Cover’s XML Overview.
- XML.com
An O’Reilly publication that focuses on XML.
- The DevX XML Zone
XML news and features related to XML development.
- Simon St. Laurent’s site
Simon is involved in various XML development projects and is the author of The XML Primer. He hosts a site with ongoing XML development information.
- Microsoft’s MSDN XML section
Microsoft is jumping full force onto the XML bandwagon. These are the company’s tutorials on the topic.
XML Vendors
There are a number of companies working within the XML tools space. This section contains links to company information.
Project Cool does not endorse these products; the items included here are some of the offerings on the market today or companies who are developing for the marketspace.
If you have a machine capable of it, however, we do recommend you try downloading IE5 and Gecko and take a look at some of the XML demos, to get a sense of how the XML plus CSS feels in a real browser.
XML-capable browsers
The 5.X browsers support XML documents.
- IE 6.X
Now in public release and available for download.
- Gecko
The layout engine that is part of Netscape Navigator.
Document Authoring Tools
If there is a weakness to implementing large-scale XML projects, it is the lack of good authoring tools. Hand-coding is possible, of course, but structured editors make the task much easier and less error-prone. These companies are working on XML document authoring products. They are listed in alphabetical order.
- Adobe – FrameMaker plus SGML is being adapted for use as an XML editor as well.
- Arbortext – This SGML tools company is leaping into the XML game.
- Macromedia – The popular Dreamweaver tool is supporting the creation of XML documents.
- Corel XMetaL
XML and Microsoft Office 2000
We see frequent questions on the relationship between Office 2000 and XML. Our research shows that there is a relationship, but it isn’t quite what rumor holds it to be:
Microsoft is an active supporter of XML and various XML initiatives. It is also incorporating XML support and XML structure in its various products.
Microsoft is a perfect example of a company that needs – as an end user – a solution like XML. It has data that needs to move across different platforms without losing its meaning.
One place this is very obvious is in its Office suite. Its customers want to move data between applications and also to share data with other users who may or may not be using the same applications or the same versions of applications.
Remember, XML is a markup language. All a markup language does is identify pieces of a document so that another application can do something with those pieces. All word processors have a markup language. In the early days of text processing, WordStar, XyWrite, and WordPerfect let you see and edit their markup code; Word and MacWrite usually didn’t.
Traditionally, markup languages were specific to an application. But what if you want to see a document and don’t own the exact same application in which it was created? There’s always ASCII, but that strips out most of the meaning. So we saw the rise of interchange formats. For text, Microsoft turned to Rich Text Format (.rtf) as its solution. The .rtf format provided a structure for opening up, say, a Word/Macintosh file in a Word/PC program or a WordPerfect program, but it was hardly the ideal solution.
Three years ago, Microsoft decided an emerging markup language called XML, in combination with HTML and CSS, provided a better option for marking up data.
The goal, says Marc Olson, Microsoft Group Program Manager, was “to use HTML to make Office documents universally viewable by anyone with a browser on any platform. Embedded XML tags are used as a means for Office to re-open the HTML document with no loss of information or quality.” Microsoft is using the phrase “document round-tripping” to describe this back and forth process.
Given the seemingly endless buzz around both XML and Microsoft, word on the street was that Office 2000 was to be an XML application. This notion quickly segued into a series of questions and complaints that Office 2000 doesn’t “support” XML correctly. As with many other XML stories, there’s a bit of truth and a lot of confusion in the XML/Office 2000 relationship.
Microsoft is itself using an application of XML to create the underlying structure for the data in Office 2000 documents. It is not offering up Word as an XML editing tool or a means of creating well-formed documents. It is not saying that its XML application is a “standard.” Rather, the Office 2000 example is a good case study of how a company can apply the extensibility and metadata capabilities of XML to find a solution for itself. In the Office 2000 case, Microsoft is the “customer” of the technology, not the vendor.
“Creating a standard format wasn’t the goal for Office 2000” says Olson. “It was to find a way to let people view a document, regardless of whether they own Program X. If they have a browser that supports HTML, they can view any Office 2000 file. HTML provides the viewing framework while XML provides the framework for data stored within that HTML document.”
Office uses XML in a very specific way: to structure the non-viewable contents of Word, PowerPoint, and Excel files. It has developed a set of tags and a data schema that defines the Office 2000 document set, much as you or I might create a set of tags and data schema for our “Flying Widget documentation set” or our inventory of tropical fish.
Within the Office 2000 document is a namespace tag that identifies the schema. When a browser or other program that can display HTML sees the Office 2000 document, it uses the schema and its associated style information to first process and then display the document. When Office 2000 applications open a document, they use the schema to access the underlying XML data structure.
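The idea of a namespace marking off application data inside an otherwise ordinary page can be sketched as follows. This is a heavily simplified, hypothetical fragment, not real Office output: actual Office-generated files are HTML rather than strict XML, and the element names and author value here are illustrative. The namespace URI is the one commonly seen in Office-generated HTML.

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:schemas-microsoft-com:office:office"

# Simplified, hypothetical fragment: viewable HTML plus namespaced data.
document = f"""<html xmlns:o="{OFFICE_NS}">
  <head>
    <o:DocumentProperties>
      <o:Author>Pat Example</o:Author>
    </o:DocumentProperties>
  </head>
  <body><p>Visible text</p></body>
</html>"""

root = ET.fromstring(document)

# A program that knows the schema looks up elements by namespace-qualified name.
author = root.find(f".//{{{OFFICE_NS}}}Author")
# A program that only displays HTML can ignore the namespaced elements.
visible = root.find(".//p")

print(author.text)
print(visible.text)
```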
Some schemas are private, while others are publicly published. Microsoft has published its Office 2000 documents at http://www.msdn.microsoft.com/library. You have to navigate a little to get to it, but the material is under Office Developer Documentation, Office 2000 Documentation, Microsoft Office HTML and XML reference.
So does Office 2000 use XML? Yes and no. It is an example of how one document set applies XML to meet a specific goal and that’s very exciting. But it isn’t the magic XML bullet – it is an application that shows that the “extensible” concept does indeed live up to its name.