Using RDFa with DITA and DocBook

The RDF data model gives you a way to add attribute name/value pairs to any resource that you can reference with a URI. This makes it easy to create metadata about nearly anything. The W3C’s RDFa standard is an increasingly popular syntax for storing RDF statements inside HTML documents, but according to the RDFa in XHTML: Syntax and Processing W3C Recommendation, “RDFa is a specification for attributes to express structured data in any markup language” (my emphasis).

The recommendation goes on to state: “this specification deals specifically with the use of RDFa in XHTML, and defines an RDF mapping for a number of XHTML attributes, but RDFa can be easily imported into other XML-based markup languages.” This flexibility can come in very handy when you work with publishing systems based on the DITA or DocBook XML specifications. The flexibility of the DocBook and DITA Document Type Definitions (DTDs) makes it easy to add RDFa to any documents that conform to these standards, and perhaps even to reduce your need to further customize these DTDs for your own company’s publishing system.

This article explains how to create valid DocBook and DITA documents that incorporate the RDFa metadata demonstrated in the W3C’s RDFa Primer. The RDFa Primer demonstrates how to add machine-readable metadata to simple HTML examples by placing RDFa attributes in the appropriate places. If you’re new to RDFa, read the W3C’s RDFa Primer first and treat this article as a refresher and sequel that explains using RDFa outside of XHTML, particularly in DocBook and DITA. The attached source code download combines the embedded data from the RDFa Primer examples and a few new ones with a DocBook document and a DITA document. It also includes add-on DTD modules to make these sample documents valid.

Inside the RDF Data Model

The RDF data model stores information in a simple data structure known as a triple, so named because it has three parts: a subject, a predicate, and an object. In more database-oriented terms, think of these three parts as a resource ID, an attribute name, and an attribute value. For example, a triple could store the statement “index.html has a title of ‘My Home Page’.”

Figure 1. Connecting Two Triples: When the same resource (in this case, index.html) is the object of one triple and the subject of another, you can combine statements in ways that let you answer new questions.

RDF requires that the subject and predicate in a triple be represented by URIs. After all, many web pages have the filename index.html, and the word “title” could mean a job title, the deed to a piece of property, or the title of a work. So, a triple consisting of {http://www.snee.com/bob/index.html, http://purl.org/dc/elements/1.1/title, “My Home Page”} makes it clear exactly which index.html the URI is referring to, and that “title” is in the Dublin Core sense of the term: the title of a work.

The “My Home Page” part of this triple demonstrates that the third part need not be a URI. If it is a URI, though, you can more easily connect triples together to learn more from the combination (see Figure 1). For example, if another triple says {http://www.someclub.org/memberID/4329, http://xmlns.com/foaf/0.1/homepage, http://www.snee.com/bob/index.html} (or, in English, “someclub.org member 4329 has a home page at http://www.snee.com/bob/index.html”), then the two triples together tell you that the person represented by the URI http://www.someclub.org/memberID/4329 has a home page with the title “My Home Page.”
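To make the connection concrete, here is a minimal Python sketch that stores the two triples from this section as plain tuples and follows the link between them. The helper names (objects, homepage_titles) are made up for illustration; a real system would more likely use an RDF store and a SPARQL query.

```python
# Each triple is a (subject, predicate, object) tuple, as described above.
DC_TITLE = "http://purl.org/dc/elements/1.1/title"
FOAF_HOMEPAGE = "http://xmlns.com/foaf/0.1/homepage"

triples = [
    ("http://www.snee.com/bob/index.html", DC_TITLE, "My Home Page"),
    ("http://www.someclub.org/memberID/4329", FOAF_HOMEPAGE,
     "http://www.snee.com/bob/index.html"),
]


def objects(triples, subject, predicate):
    """Return every object stored for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def homepage_titles(triples, member):
    """Follow member -> home page -> title.

    This works because the object of the homepage triple is also the
    subject of the title triple.
    """
    titles = []
    for page in objects(triples, member, FOAF_HOMEPAGE):
        titles.extend(objects(triples, page, DC_TITLE))
    return titles


print(homepage_titles(triples, "http://www.someclub.org/memberID/4329"))
```

Neither triple alone says what the member's home page is called; the answer only appears when the two are combined.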

More complex inferencing from larger data sets is making RDF and related standards such as OWL and SPARQL popular in biopharmaceutical research and other domains looking to draw connections among disparate sets of data.

RDF + Attributes = RDFa

Various syntaxes such as RDF/XML, N3, Turtle, and RDFa enable you to represent RDF triples so that programs can read this data, store it in databases, query it, and do inferencing with it. The “a” in RDFa refers to attributes, because RDFa lets you embed triples into non-RDF XML by simply adding a few attribute values.

When adding these attributes, RDFa’s design lets you minimize redundancy with two nice tricks:

  • You can treat data that’s already part of the XML file as a triple’s object by adding the rest of the triple as attributes in an element that wraps that data.
  • If you specify only the predicate and object of a triple, a program that extracts those triples from the document assumes that the document itself is the subject. This is quite handy, because RDFa is often used to add metadata about the containing document, such as workflow, provenance, and rights re-use information.

Combining these two tricks, if the document http://www.snee.com/bob/index.html starts off with the following:

<title>My Home Page</title>

You can add the triple {http://www.snee.com/bob/index.html, http://purl.org/dc/elements/1.1/title, “My Home Page”} to the page by simply adding this single attribute:

<title property="dc:title">My Home Page</title>

A program looking for RDFa triples will treat “My Home Page” as the triple’s object and the containing document as the subject. (It will also look for that “dc:” prefix to be properly associated with a namespace; this is the prefix traditionally used with the Dublin Core http://purl.org/dc/elements/1.1/ URL.)
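The following Python sketch imitates, in a very reduced form, what such a program does with that one attribute. It assumes the wrapping element is an HTML title element, hard-codes the dc prefix in a PREFIXES dictionary rather than reading namespace declarations, and handles only the property attribute; it is not a conforming RDFa processor.

```python
import xml.etree.ElementTree as ET

# Assumed prefix map; a real RDFa processor reads this from xmlns declarations.
PREFIXES = {"dc": "http://purl.org/dc/elements/1.1/"}


def expand(curie):
    """Turn a prefixed name such as dc:title into a full URI."""
    prefix, _, local = curie.partition(":")
    return PREFIXES[prefix] + local


def property_triples(xml_text, document_uri):
    """Yield one triple per element that carries a property attribute.

    No about attribute is consulted here, so the containing document
    is always used as the subject.
    """
    root = ET.fromstring(xml_text)
    for element in root.iter():
        curie = element.get("property")
        if curie is not None:
            yield (document_uri, expand(curie), element.text or "")


page = '<html><head><title property="dc:title">My Home Page</title></head></html>'
for triple in property_triples(page, "http://www.snee.com/bob/index.html"):
    print(triple)
```

The document URI is passed in separately because nothing in the markup states it; the extractor supplies it as the default subject.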

Table 1 lists some of the most popular attributes that RDFa offers to identify subjects, predicates, and objects in an XML document.

Table 1. Popular RDFa Attributes

Attribute    Used to Identify
about        subject
property     predicate, when object is a string of text
rel          predicate, when object is a URI
content      object, when it is a string of text
href         object, when it is a URI
typeof       class of subject

While some of these attributes are new, you’ll recognize href as an existing HTML attribute. The content and rel attributes, although never popular, have also been part of HTML for years.
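As a rough illustration of the roles in Table 1, the Python sketch below reads the attributes of a single element and assembles triples from them. The element_triples function, the sample markup, and the example.com URIs are all hypothetical; prefixed names are left unexpanded, nesting is ignored, and typeof is left out to keep the sketch short.

```python
import xml.etree.ElementTree as ET


def element_triples(element, document_uri):
    """Apply the Table 1 roles to one element and return its triples."""
    # about names the subject; without it, the document is the subject.
    subject = element.get("about", document_uri)
    found = []

    # property: the object is text, taken from content if present,
    # otherwise from the text the element wraps.
    predicate = element.get("property")
    if predicate is not None:
        literal = element.get("content")
        if literal is None:
            literal = element.text or ""
        found.append((subject, predicate, literal))

    # rel: the object is a URI, taken from href.
    link = element.get("rel")
    target = element.get("href")
    if link is not None and target is not None:
        found.append((subject, link, target))
    return found


DOC = "http://example.com/doc.xml"  # hypothetical document URI
samples = [
    '<phrase property="dc:creator">Alice</phrase>',
    '<phrase property="dc:date" content="2008-11-20">last Thursday</phrase>',
    '<phrase about="http://example.com/alice" rel="foaf:homepage" '
    'href="http://example.com/alice/home">home page</phrase>',
]
for markup in samples:
    print(element_triples(ET.fromstring(markup), DOC))
```

The second sample shows why content is useful: the human-readable text can stay informal while the machine-readable value is precise.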

RDFa in DocBook and DITA: Why Bother?

One reason the DocBook and DITA standards have been popular for storing XML documents is their adaptability. If you want to add new information that the original DTDs don’t provide for, the DocBook and DITA architectures let you define additional elements or attributes in an orderly, structured way that will survive upgrades to the standards with minimal fuss. Both DTDs offer slots for arbitrary metadata but nothing with the flexibility and structure of RDFa, because adding specialized elements or attributes to the DocBook and DITA DTDs requires you to write specialized code to extract their values.

When you’ve added a brief module to either standard’s DTD to allow RDFa attributes, the RDF data model’s flexibility means that storing new kinds of information in the future may not require additional DTD modifications. For example, if you add an RDFa module to DocBook or DITA today to let you store the employee ID of the editor who reviewed a document, and next year you want to add a workFlowStage value so that a staff member can quickly identify which work has been done on the document, you won’t need to modify the DTD at all. The same set of RDFa attributes lets you accommodate any RDF triples.

Software that can extract RDFa triples from a document (see the rdfa.info site’s Implementations and Tools pages for some good lists) lets you do all the things you want to do with document metadata, including loading this data into a database, querying for documents that meet certain metadata conditions, and creating reports on an aggregated set of such data.
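Once triples have been extracted, the database, query, and report steps can be quite small. The Python sketch below loads a few already-extracted triples into an in-memory SQLite table. The mypubco.com namespace URI, the document URIs, and the "first draft" value are invented for illustration, and a dedicated RDF store queried with SPARQL would be the more typical choice.

```python
import sqlite3

MPC = "http://mypubco.com/ns/"  # hypothetical namespace for MyPubCo terms

# Triples as an RDFa extractor might hand them over (hypothetical data).
extracted = [
    ("http://mypubco.com/docs/guide.xml", MPC + "workFlowStage", "final review"),
    ("http://mypubco.com/docs/guide.xml", MPC + "editor",
     "http://mypubco.com/empid/53234"),
    ("http://mypubco.com/docs/faq.xml", MPC + "workFlowStage", "first draft"),
]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
db.executemany("INSERT INTO triples VALUES (?, ?, ?)", extracted)


def documents_at_stage(db, stage):
    """Query: which documents have reached the given workflow stage?"""
    rows = db.execute(
        "SELECT subject FROM triples "
        "WHERE predicate = ? AND object = ? ORDER BY subject",
        (MPC + "workFlowStage", stage),
    )
    return [subject for (subject,) in rows]


def stage_report(db):
    """Report: how many documents sit at each workflow stage?"""
    rows = db.execute(
        "SELECT object, COUNT(*) FROM triples "
        "WHERE predicate = ? GROUP BY object ORDER BY object",
        (MPC + "workFlowStage",),
    )
    return dict(rows.fetchall())


print(documents_at_stage(db, "final review"))
print(stage_report(db))
```

Because every statement has the same three-column shape, adding a new kind of metadata next year means new rows, not a new table layout.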

The next section shows how the RDFa Primer examples look in DocBook.

RDFa in DocBook

The DocBook sample points at a DTD called DocbookRDFa.dtd (shown in Listing 1), which does three things:

  1. It references RDFaAttributes.mod, a small file I created by copying the declarations of the special RDFa attributes from Appendix A of RDFa in XHTML: Syntax and Processing, which has a DTD for XHTML+RDFa. (The RDFaAttributes.mod file and all the others mentioned in this article are included in the attached code download.)
  2. It redefines the DocBook db.common.attributes parameter entity, which defines various pieces of metadata that can be added to nearly any DocBook element, to include the href attribute, the attributes declared in RDFaAttributes.mod, and the namespace declarations needed for the sample document.
  3. It references a copy of the DocBook 5.0 DTD.

A document pointing at this DTD, like the one shown in Listing 2, can be a valid DocBook document and still store the extra attributes necessary to embed RDF triples.

Author’s Note: Besides being a DocBook document, Listing 2 differs from the Primer’s examples in one other way. The Primer mentions the mythical “Bob” and “Alice” several times. As the first paragraph of the dbrdfasample.xml DocBook version of the Primer says, it includes numbers after each use of these names (for example, Bob1 and Bob2) to make it easier for you to see exactly which metadata triples get extracted from where in the document.

When adding RDFa attributes to HTML, you can add them to nearly any element. However, as you’ll see in the RDFa Primer, the span element is the most popular, being HTML’s most flexible element. DocBook’s phrase element is similarly flexible; using it to store RDFa attributes lets you add triples nearly anywhere in a DocBook document. Still, developers accustomed to the DocBook DTD know that some elements are more logical places for metadata than others. For example, to identify the document editor’s employee ID and the workFlowStage values for the mythical MyPubCo publishing company, the bibliomisc child of DocBook’s info element is a sensible place, so I stored these triples there.

You will see the predicate and object but not the subject of the mpc:editor and mpc:workFlowStage triples because of the second trick mentioned earlier: when an RDFa parser doesn’t see a subject, it assumes that the document itself is the subject. In this case, the parser will know that the document containing this content has an editor identified by http://mypubco.com/empid/53234 and that the document’s workFlowStage status is “final review.”
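The Python sketch below shows one way such markup and its extraction could look. The DocBook fragment is a guess at the shape of the sample, not a copy of Listing 2: the mpc namespace URI, the document URI, and the choice of rel/href for the editor and property for the stage are assumptions. The code reads prefix declarations from the document and always uses the document as the subject.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical fragment; the namespace URI for mpc is made up.
SAMPLE = """<article xmlns="http://docbook.org/ns/docbook"
         xmlns:mpc="http://mypubco.com/ns/" version="5.0">
  <info>
    <title>Sample</title>
    <bibliomisc rel="mpc:editor" href="http://mypubco.com/empid/53234"/>
    <bibliomisc property="mpc:workFlowStage">final review</bibliomisc>
  </info>
</article>"""


def expand(curie, prefixes):
    """Replace the prefix of a name like mpc:editor with its namespace URI."""
    prefix, local = curie.split(":", 1)
    return prefixes[prefix] + local


def extract(xml_text, document_uri):
    """Collect triples whose subject defaults to the document itself."""
    prefixes = {}
    triples = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "end")):
        if event == "start-ns":
            # payload is a (prefix, uri) pair from an xmlns declaration.
            prefix, uri = payload
            prefixes[prefix] = uri
            continue
        element = payload
        if element.get("rel") and element.get("href"):
            triples.append((document_uri,
                            expand(element.get("rel"), prefixes),
                            element.get("href")))
        if element.get("property"):
            triples.append((document_uri,
                            expand(element.get("property"), prefixes),
                            (element.text or "").strip()))
    return triples


DOC = "http://mypubco.com/docs/dbrdfasample.xml"  # hypothetical document URI
for triple in extract(SAMPLE, DOC):
    print(triple)
```

Neither bibliomisc element has an about attribute, so both triples end up describing the document.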

Author’s Note: As you compare the DocBook sample with the W3C Primer, note the examples from section 2.3 of the Primer. The W3C’s HTML version uses HTML’s second-most flexible element, div, to create containers around content that can hold the RDFa metadata attributes. However, DocBook has no equivalent to div (what with this standard’s stronger adherence to semantically meaningful names), so I used phrase elements again.

To demonstrate image metadata, something valuable to publishers, I added a bit more metadata that you won’t find in the Primer. Screenshots can be a pain in software documentation because if the software gets upgraded after you take your screenshot, your screenshot may be out of date. So, I added triples to indicate the lastScreenShotDate and softwareRelease associated with the screenshot. Unlike the other metadata examples, these are not metadata about the containing document, but rather about a different resource referenced by the document: http://example.com/bob/photos/sunset.jpg. Because the DocBook inlinemediaobject element can have an info child element for metadata, just as the document itself can, I stored the image metadata there, where the info element’s about attribute indicates the subject.
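Here is a small Python sketch of how an about attribute changes the subject for everything nested inside it. The DocBook fragment only approximates the sample document, the date and release values are invented, and prefixed names are left unexpanded to keep the focus on subject handling.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the date and release number are made-up values.
SAMPLE = """<article xmlns="http://docbook.org/ns/docbook"
         xmlns:mpc="http://mypubco.com/ns/" version="5.0">
  <info>
    <title>Sample</title>
    <bibliomisc property="mpc:workFlowStage">final review</bibliomisc>
  </info>
  <para>Sunset:
    <inlinemediaobject>
      <info about="http://example.com/bob/photos/sunset.jpg">
        <bibliomisc property="mpc:lastScreenShotDate">2008-09-01</bibliomisc>
        <bibliomisc property="mpc:softwareRelease">2.1</bibliomisc>
      </info>
      <imageobject>
        <imagedata fileref="http://example.com/bob/photos/sunset.jpg"/>
      </imageobject>
    </inlinemediaobject>
  </para>
</article>"""


def walk(element, subject, found):
    """Visit elements in document order, tracking the current subject."""
    # An about attribute replaces the subject for this element and
    # for everything nested inside it.
    subject = element.get("about", subject)
    predicate = element.get("property")
    if predicate is not None:
        found.append((subject, predicate, (element.text or "").strip()))
    for child in element:
        walk(child, subject, found)
    return found


DOC = "http://mypubco.com/docs/dbrdfasample.xml"  # hypothetical document URI
for triple in walk(ET.fromstring(SAMPLE), DOC, []):
    print(triple)
```

The workFlowStage triple describes the document, while the two triples under the inner info element describe sunset.jpg.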

RDFa in DITA

DITA XML documents usually have a document type of task, reference, or concept. The ditarefsample.xml sample file is a DITA reference document with content that parallels dbrdfasample.xml.

Author’s Note: Like dbrdfasample.xml, ditarefsample.xml includes numbers after proper names to make the correspondence between the content and extracted triples clearer.

One RDFa attribute not previously mentioned (because it’s not used in the W3C RDFa Primer) is rev. This attribute expresses an object-predicate-subject relationship as the “reverse” of the subject-predicate-object relationship described with a rel attribute. DITA already has a rev attribute that you can add to most elements. Although it’s short for “revision level,” if you want to use standardized RDFa software to extract metadata and still track the revision level of specific elements, you can use rev for RDFa and declare a revision attribute to fill the role formerly held by the DITA rev attribute.
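The difference between rel and rev is easiest to see side by side. In the Python sketch below, the link_triples helper and the ph markup are hypothetical, and prefixed names are left unexpanded; the only point is that rev swaps the subject and object positions.

```python
import xml.etree.ElementTree as ET


def link_triples(attributes, subject):
    """Build triples for an element whose object is the URI in href."""
    found = []
    target = attributes.get("href")
    if target is None:
        return found
    if "rel" in attributes:
        # rel: current subject -> predicate -> href
        found.append((subject, attributes["rel"], target))
    if "rev" in attributes:
        # rev: href -> predicate -> current subject
        found.append((target, attributes["rev"], subject))
    return found


PAGE = "http://www.snee.com/bob/index.html"
MEMBER = "http://www.someclub.org/memberID/4329"

# Hypothetical DITA ph element inside the page identified by PAGE.
element = ET.fromstring(
    '<ph rev="foaf:homepage" '
    'href="http://www.someclub.org/memberID/4329">member 4329</ph>'
)
print(link_triples(element.attrib, PAGE))
```

Placed in the home page itself, this rev attribute states that the club member has this page as a home page, which is the same statement as the homepage triple discussed earlier, written from the other end.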

The sample DITA reference document that incorporates RDFa metadata is called ditarefsample.xml, and it points to the DITARefRDFa.dtd DTD shown in Listing 3. Like DocbookRDFa.dtd, this DTD references the RDFaAttributes.mod module that declares the RDFa attributes, redefines a parameter entity from the standard DTD (in the DITA case, base-attribute-extensions) to include these new attributes, and includes the standard DITA DTD for reference documents: reference.dtd. By pointing at that DTD, the DITA reference document shown in Listing 4 ensures that its XML is valid and that it can also store RDFa triples.

As with the DocBook example, you can now add the new attributes just about anywhere you want in the DITA document, but whenever possible the document has them in elements provided by the standard for metadata. For example, the mpc:editor and mpc:workFlowStage values are stored in a prolog element right after the document’s title. Unlike DocBook, DITA has no specialized elements for wrapping references to images, so I used ph elements (DITA’s equivalent of DocBook’s phrase element) to hold the attributes for the sunset.jpg image’s lastScreenShotDate and softwareRelease values.

What About Existing DocBook and DITA File Processing?

You may wonder how the metadata additions described in the previous sections affect the processing of DocBook and DITA files. In short, they don’t. Popular frameworks for converting DocBook and DITA files into HTML, PDF, and other output formats (for example, the DocBook XSL Stylesheets and the DITA Open Toolkit) process the documents you give them by pulling values from the elements and attributes they need and ignoring the other attributes. In fact, this was part of the design of both DTDs: to let you add new attributes without worrying about backward-compatibility. The “a” in RDFa helps keep this new add-on module from being too intrusive when added to other DTDs and schemas because you don’t need to incorporate any new elements into the target DTD’s content models.
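A toy example makes the point. The Python function below is only a stand-in for a real formatter (it is not how the DocBook XSL Stylesheets or the DITA Open Toolkit work internally): it turns a para into an HTML p element and never looks at attributes, so the RDFa version and the plain version produce the same output.

```python
import xml.etree.ElementTree as ET

PLAIN = '<para>Written by <phrase>Alice</phrase>.</para>'
WITH_RDFA = '<para>Written by <phrase property="dc:creator">Alice</phrase>.</para>'


def render(xml_text):
    """Render a para as an HTML paragraph, reading text only.

    Attributes are never consulted, so unknown ones have no effect.
    """
    root = ET.fromstring(xml_text)
    return "<p>" + "".join(root.itertext()) + "</p>"


print(render(PLAIN))
print(render(WITH_RDFA))
```

A processor built this way needs no changes when RDFa attributes appear, which is the behavior described above for the standard toolchains.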

Speaking of the add-on module, in this article’s examples the syntax for including DTD declarations for the RDFa attributes is a bit simplified. In a production system, you should use DocBook or DITA conventions to incorporate the full XHTML Metainformation Attributes Module into your system. As the official DTD module from the standard, this module makes your system more compliant with that standard. The DITA DTDs already declare a datatype attribute for the data element and a content attribute for the othermeta element. To avoid warning messages about the RDFa declarations making those attributes redundant, your production version should also re-declare those elements’ attribute lists without those attributes.

What Have You Learned?

Adding RDFa to your DocBook or DITA documents has a nice payoff: easier addition of metadata that can be extracted by existing tools that follow an open standard. And it comes at a minimal cost.

Try adding some of the sample metadata shown in this article to your own documents, and then modify it to reflect metadata your system needs but that doesn’t have an obvious place in your existing DTD. As tools for aggregating and manipulating RDF triples proliferate, you’ll find an increasing amount of technology that can help you get more out of your own content and metadata.

Read the actual RDFa specification, the “RDFa in XHTML: Syntax and Processing Recommendation,” to learn more about the RDFa attributes discussed here and several others.
