Creating and Managing RDF Vocabularies

Do you like to relax in the morning over a cup of coffee and the “zoopepe”? Do you like to put “kapatz” on your “hogatz”? These are obviously meaningless terms to most people, but it is easy to imagine a young child using them in place of “newspaper,” “ketchup,” and “hotdog” because of how similar they sound. Parents have the option of correcting or playing along with their children when they make up their own words. In choosing the latter, they are accepting the terms as legitimate concepts within their child’s vocabulary. While you’ll find proponents of making children learn the right way to pronounce words from the beginning, you are just as likely to find an even larger number of people willing to let the children communicate freely without fear of getting it wrong, and confident that new and correct terms can be substituted at a later stage of development.

On the web and in the enterprise we frequently need to use terms from specific domains and vocabularies to convey meaning in a business context. Here, too, there is a tension between using terms that make sense to a particular party and adopting the concepts and relationships from a larger body. Most of the time our nouns, verbs, and relationships are caught up in our applications and business models. The whole thrust of the object-oriented methodology was conceived to model business terms in software. However, now there is a growing demand for modeling these terms and concepts outside of the software silos in which they are created to increase the opportunity for reuse.

As organizations make the transition toward this external modeling approach, they may feel like children acquiring new language skills. There will be questions about whether to use their own terms or to spend the time digesting what is available from industry organizations, partners, and so on. The effort to reach consensus on the appropriate names and relationships is significant, if not intractable, and going it alone may often be the right choice.

One of the biggest misconceptions about the semantic web initiative is that it requires everyone to adopt the same terms to be able to work together. This notion could not be further from the truth. While there is certainly benefit from the reuse of existing vocabularies, nothing prevents an organization from “rolling their own.” There is a common set of technologies that allow you, at worst, to agree to disagree. When an existing vocabulary suits your needs, reuse it. When you need a collection of terms that caters more specifically to your immediate context, it may be worth creating your own.

Like the children who move on from their own made-up words, however, creating your own vocabularies doesn’t mean you will be permanently ostracized from more central terminological activity. One of the beautiful things about the technology stack made up of the Resource Description Framework (RDF), RDF Vocabulary Description Language 1.0: RDF Schema (RDFS), and Web Ontology Language (OWL) is that at any time you can start relating your terms to those from other vocabularies. This approach provides an unprecedented ability to start small and focused, and adopt or connect to standards as they emerge.
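As a rough sketch of what such a connection can look like, the RDF/XML below relates a home-grown property to Dublin Core's creator term with rdfs:subPropertyOf. The http://example.org/comics# namespace and the author property are hypothetical names used only for illustration; the link could equally be added long after the property was first published.

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <!-- Hypothetical home-grown property; the namespace is a placeholder -->
  <rdf:Property rdf:about="http://example.org/comics#author">
    <rdfs:label xml:lang="en">author</rdfs:label>
    <!-- Under RDFS semantics, any statement using this property
         also implies the same statement with Dublin Core's creator -->
    <rdfs:subPropertyOf rdf:resource="http://purl.org/dc/elements/1.1/creator"/>
  </rdf:Property>
</rdf:RDF>
```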

Existing RDF Vocabularies
Examples of existing vocabularies include the usual suspects: Friend-of-a-Friend (FOAF) Project, Description of a Project (DOAP), RDF Site Summary (RSS 1.0), and the ubiquitous Dublin Core. The FOAF vocabulary is designed to establish a decentralized language for describing personal and professional interests and social network linkage. It may be usurped eventually by the recent activity on Google’s OpenSocial networking efforts, but the benefits over closed proprietary networks and strictly hierarchical modeling languages are clear. DOAP is a vocabulary for describing open source projects. Dublin Core is a vocabulary for expressing publication metadata.

Beyond these canonical examples, new vocabularies are emerging to capture things such as geotagging information, Creative Commons licensing, life sciences terminology, calendaring data, key Wikipedia facts, the CIA Factbook, temporal relationships, and so on. Some of these vocabularies are consensus based and designed to model a domain; others are made by independent individuals or groups looking to express existing content in a more machine-readable format. In either case, people who consider creating vocabularies like these will probably need some guidance. Even for seasoned data modelers these skills are new.

This discussion provides a series of practical recommendations to help you with this process.

This fragment from the Dublin Core vocabulary describes the term creator and will be referenced throughout the rest of the article.

Creator
An entity primarily responsible for making the resource.
Examples of a Creator include a person, an organization, or a service. Typically, the name of a Creator should be used to indicate the entity.
1999-07-02
2006-12-04

Mint Persistent URIs
Before you start to define the classes, properties, and constraints of your RDF vocabulary, begin with a commitment to use good names, and decide where your vocabulary will be hosted initially. RDF predicates are usually grounded in resolvable contexts. Don’t simply throw a vocabulary up without considering the potential lifetime of its use. Systems that reference your vocabulary terms will break if you move or restructure this location.

The reality is that any URL-based system is likely to change eventually. One way to get around this issue is to mint persistent URIs using the open source software infrastructure developed by the Online Computer Library Center (OCLC) and recently updated by Zepheira. The URI is grounded within a resolvable URL context, but supports user-editable redirection rules that allow the hosted location to change over time without affecting clients of the URIs. Here’s an example:

...

The rdf:about attribute references the persistent URL (PURL) http://purl.org/dc/elements/1.1/creator, which currently resolves to http://dublincore.org/2006/12/18/dces.rdf#creator. Notice how the named element represents a logical structure that is unlikely to ever change. The resolved URL references a fragment on an RDF file located somewhere else. That file can move safely to another location as long as the rewrite rule is updated. Any references to the PURL in other RDF statements will remain valid in the face of this kind of a migration, which makes facts expressed on the web that much more resilient and universal.
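To make the client side of this concrete, here is a minimal sketch of an RDF statement that depends on the PURL. The described resource (http://example.org/reports/annual-report) and the literal value are invented for illustration; only the Dublin Core namespace comes from the discussion above.

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- Hypothetical resource being described -->
  <rdf:Description rdf:about="http://example.org/reports/annual-report">
    <!-- dc:creator expands to http://purl.org/dc/elements/1.1/creator,
         the persistent URI, not the file it currently redirects to -->
    <dc:creator>Example Corporation</dc:creator>
  </rdf:Description>
</rdf:RDF>
```

Because the statement names only the PURL, it is unaffected if the schema file behind that PURL is later moved and the redirection rule is updated.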

Use Human-Readable RDFS Elements
The semantic web initiative is about making data on the web more accessible for machine processing. This laudable goal shouldn’t exclude humans as consumers of metadata as well, however. The RDF and OWL ontologies may be designed for processing by software, but the terms expressed should be well documented so that people can evaluate the intent of the terms for possible reuse and extension. It may not be at all obvious what a term is supposed to mean in a domain context.

Vocabulary authors must be explicit by using the RDFS constructs rdfs:label and rdfs:comment. These allow for human-meaningful, machine-processable metadata about the terms being discussed.

Notice as well that the prior example specifies an xml:lang attribute to indicate the cultural context under which the rdfs:label applies:

<rdfs:label xml:lang="en-US">Creator</rdfs:label>

Because we want the human-readable descriptions to be accessible both to people reading our vocabulary files directly and to people working with the parsed and processed results, we cannot simply use XML comments to indicate intention.
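The difference can be sketched as follows. The inker property and its namespace are hypothetical, and the wording of the labels is only an example. The XML comment is not part of the RDF graph, so it does not survive parsing, while the rdfs:label and rdfs:comment values become statements that any RDF tool can query and display.

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <!-- This XML comment is visible only to someone reading the file itself -->
  <rdf:Property rdf:about="http://example.org/comics#inker">
    <!-- One label per language, distinguished by xml:lang -->
    <rdfs:label xml:lang="en">Inker</rdfs:label>
    <rdfs:label xml:lang="fr">Encreur</rdfs:label>
    <rdfs:comment xml:lang="en">The person who inked the pencilled artwork of a comic.</rdfs:comment>
  </rdf:Property>
</rdf:RDF>
```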

Do Not Insist on Rigor Up Front
Scientist, professor emeritus, and author Donald Knuth has for years warned about the evils of prematurely optimizing software. Until you understand the run-time characteristics of your software, you will not know where to expend the effort to get the biggest performance improvements. Spending time working on performance improvements without this knowledge is likely to be a wasted effort.

There is a similar problem with overconstraining terms in an RDF vocabulary. RDFS includes predicates to indicate domain and range constraints for the applicability of a property to certain classes. This approach is undoubtedly helpful for production vocabularies, but spending the time on this endeavor in the early stages of the development of a vocabulary is possibly wasted effort and is almost guaranteed to slow you down. Get the terms right, get some examples of using them under your belt, consider any feedback from external parties, and only then go about the effort of constraining your vocabularies. By then you will likely understand the constraints well enough to make good choices.
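For reference, this is roughly what the later constraining step looks like. The Comic class and inker property are hypothetical; the sketch assumes you have decided, after some real use, that inker applies to comics and points at people.

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <!-- Hypothetical class in a placeholder namespace -->
  <rdfs:Class rdf:about="http://example.org/comics#Comic"/>

  <rdf:Property rdf:about="http://example.org/comics#inker">
    <!-- Subjects of inker statements are comics -->
    <rdfs:domain rdf:resource="http://example.org/comics#Comic"/>
    <!-- Values of inker statements are people, reusing the FOAF class -->
    <rdfs:range rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
  </rdf:Property>
</rdf:RDF>
```

Note that under RDFS semantics these statements lead a reasoner to infer the types of subjects and values rather than to reject nonconforming data, which is one more reason to add them only once you are sure of them.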

Use Metadata to Describe Your Metadata
While you certainly want to avoid any “turtles all the way down” meta trips, it is a great idea to add metadata to your metadata. RDF vocabularies are themselves information resources that deserve suitable annotations.

Vocabularies will not always be consumed directly from the files in which they are created. Services like Swoogle parse known vocabularies to make their terms and concepts accessible through search. This parsing can be enabled by applying the rdfs:isDefinedBy predicate. The prior Dublin Core example demonstrates this link back to the source:

This approach makes it easier to track the definitions back to their origins if they are found in the wild.
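In RDF/XML such a link back is typically a single rdfs:isDefinedBy element on the term. The sketch below assumes the link targets the vocabulary's namespace URI, which is a common convention; treat the exact target as an assumption rather than a quotation from the published schema.

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Property rdf:about="http://purl.org/dc/elements/1.1/creator">
    <!-- Points from the term to the vocabulary that defines it
         (target URI assumed to be the namespace) -->
    <rdfs:isDefinedBy rdf:resource="http://purl.org/dc/elements/1.1/"/>
  </rdf:Property>
</rdf:RDF>
```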

Additionally, as vocabularies evolve, it is helpful to indicate the stability of specific terms, which gives consumers either confidence or a warning that dependence on a term might not be the best idea. The World Wide Web Consortium (W3C) has a set of terms that is useful for this very purpose.

There are three terms defined: term_status, moreinfo, and userdocs. The metadata on the term_status property tells you that it is itself an unstable term, although it should be safe enough to use:

term status
the status of a vocabulary term, one of 'stable','unstable','testing'.
unstable
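Written out as RDF/XML, the self-description would look something like the following sketch. It assumes the W3C vocabulary status namespace http://www.w3.org/2003/06/sw-vocab-status/ns#, conventionally bound to the vs prefix, and the property name term_status; the label, comment, and status values are the ones quoted above.

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:vs="http://www.w3.org/2003/06/sw-vocab-status/ns#">
  <rdf:Property rdf:about="http://www.w3.org/2003/06/sw-vocab-status/ns#term_status">
    <rdfs:label>term status</rdfs:label>
    <rdfs:comment>the status of a vocabulary term, one of 'stable','unstable','testing'.</rdfs:comment>
    <!-- The property is applied to itself to report its own stability -->
    <vs:term_status>unstable</vs:term_status>
  </rdf:Property>
</rdf:RDF>
```

The same vs:term_status element can be added to any term in your own vocabulary to tell consumers how much to rely on it.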

Dublin Core extends the idea of metadata for metadata to include when terms were defined, when they were last modified, and what version they represent currently. Here’s the relevant portion from the prior Dublin Core example:

1999-07-02
2006-12-04
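A sketch of how these annotations might be expressed is shown below. It assumes the Dublin Core terms namespace (http://purl.org/dc/terms/) with its issued, modified, and hasVersion properties, and it assumes the earlier date is the issue date and the later one the modification date. The version URI is a made-up placeholder, since the actual target is not shown here.

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Property rdf:about="http://purl.org/dc/elements/1.1/creator">
    <!-- When the term was first defined -->
    <dcterms:issued>1999-07-02</dcterms:issued>
    <!-- When the term was last changed -->
    <dcterms:modified>2006-12-04</dcterms:modified>
    <!-- Placeholder URI standing in for the current version record -->
    <dcterms:hasVersion rdf:resource="http://example.org/history#creator-current"/>
  </rdf:Property>
</rdf:RDF>
```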

Reuse and Extend Existing Terms
When the terms in a vocabulary are well described as in the foregoing discussion, it makes it easier for someone else to reuse them appropriately. Your vocabulary may need to introduce some new concepts, but that doesn’t mean you must invent all new terms.

As an example, Edd Dumbill, noted columnist, author, and creator of the DOAP vocabulary, chose to reuse foaf:Person in DOAP to refer to the maintainers of a project:

      Edd Dumbill      

He certainly could have created a new notion of a person in this role, but there was simply no need to. RDF quite ably supports this mixing and matching of terms from different vocabularies and namespaces; it is one of its chief charms.
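A minimal sketch of this kind of mixing follows. It assumes the DOAP namespace http://usefulinc.com/ns/doap# and the FOAF namespace http://xmlns.com/foaf/0.1/; the project URI and project name are invented for illustration.

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:doap="http://usefulinc.com/ns/doap#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <!-- Hypothetical project described with DOAP terms -->
  <doap:Project rdf:about="http://example.org/projects/sample">
    <doap:name>Sample Project</doap:name>
    <doap:maintainer>
      <!-- The maintainer is described with FOAF terms, not new DOAP ones -->
      <foaf:Person>
        <foaf:name>Edd Dumbill</foaf:name>
      </foaf:Person>
    </doap:maintainer>
  </doap:Project>
</rdf:RDF>
```

Any tool that already understands FOAF can work with the maintainer description without knowing anything about DOAP.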

Even if it is necessary to introduce a new term, it is a reasonable approach to tie it back into an existing vocabulary. You might want to extend an existing class through a subclass relationship, defining your own more specific classes, to model the world of comic book authors.
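One way this could look, as a sketch only: the class names Writer and Artist and the http://example.org/comics# namespace are hypothetical, and foaf:Person is assumed as the class being extended.

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <!-- Hypothetical class: every Writer is also a foaf:Person -->
  <rdfs:Class rdf:about="http://example.org/comics#Writer">
    <rdfs:label xml:lang="en">Writer</rdfs:label>
    <rdfs:subClassOf rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
  </rdfs:Class>

  <!-- Hypothetical class: every Artist is also a foaf:Person -->
  <rdfs:Class rdf:about="http://example.org/comics#Artist">
    <rdfs:label xml:lang="en">Artist</rdfs:label>
    <rdfs:subClassOf rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
  </rdfs:Class>
</rdf:RDF>
```

Software that knows only FOAF can still treat instances of these classes as people, while applications aware of the new vocabulary get the finer distinctions.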

While defining these files with nothing more than a good text editor is convenient, most people will want better tool support for creating and managing RDF vocabularies and their attendant metadata. There are several tools available to assist you with this process (see the sidebar, “Vocabulary Management Tools”).

This discussion covered some good strategies for deciding whether to create your own vocabularies or to seek consensus with others from your domains of interest. The W3C’s semantic web technologies are designed to help keep it relatively easy to start with an approach that makes sense to you and your organization and consider external vocabularies at some future date.
