Gleaning Information From Embedded Metadata

Put GRDDL-enabled agents to the task of extracting valuable information from machine-processable metadata embedded in documents, courtesy of prevailing semantic web standards.

One of the fundamental visions of the semantic web is the ability to provide improved technologies for machine-processable data. The current web is a swell place for people, but absent a series of open, global standards for metadata, it is difficult to imagine the interoperability necessary to link software, data, and documents in all their various forms. (Note that only the standards are required to be global; the specific terms, relationships, and concepts can be as diverse as the communities they reflect.)

While the vision eventually spirals off into spiders, agents, and bots, you do not have to go quite that far to imagine how vitally useful this automated data processability will be. Right now, the only real metadata available everywhere is the address of each document you browse and the date and time when you browsed it. You can collect this data in your browser or on sites like del.icio.us so that you can find the documents again later through tags that you create. In doing so, you are externalizing the metadata about the document either into a taxonomy (that is, browser history menus) or through keyword tags. It is the browsing experience that directly provides the where and when.

These two fundamental pieces of information are important, but the entire spectrum of expressible metadata offers a compelling promise of semi-automated data gathering that is just now beginning to be appreciated. Imagine passively tracking who wrote each of the pages you visit and where these authors work and live, what they are interested in, and who they know. Consider the efficiency of looking back at what you have perused and being told which documents are Creative Commons licensed in ways that allow you to directly mine what you have read as long as you attribute accordingly. Or, how about hitting a band's web page and capturing when they are going to be playing in your town?

One of the biggest complaints about this vision, however, comes from critics who do not believe people will be willing to put in the effort to produce and maintain quality metadata. Their argument is that without a solid foundation, the whole house of cards will fall or fail to emerge in the first place. Sites like del.icio.us and Flickr, similar folksonomy-based approaches, and the rampant success of Atom/RSS feeds seem to disprove these concerns. For the purposes of this article, the assumption is simply that at least some publishers will be willing to produce and maintain such metadata.

The question is, how do you go about embedding this information into your web pages?

HTML and XHTML traditionally have had only modest support for metadata tags. They also have structural guidelines that make directly adding metadata more difficult than you might expect. Historically, developers and publishers have played some clever games to put domain-specific metadata into HTML by using microformats. (For more information on microformats, see the article, "Discover Microformats for Embedding Semantics" (DevX, July 4, 2007).) While useful, these specific formats fail to support an open-ended metadata language like the Resource Description Framework (RDF), which allows the use, reuse, and mixture of open-ended vocabulary spaces. You should not ignore microformats, because they have been adopted successfully and extensively, but they simply do not paint a complete picture.

The World Wide Web Consortium (W3C) is working on including richer metadata support in HTML/XHTML with emerging standards such as RDF in attributes (RDFa), embedded RDF (eRDF), and so on. These standards allow more specific metadata to be attached to different structural and presentation elements, so that a single document serves as a unified information resource. Key goals of these efforts are to avoid duplicating data and to avoid forking an information resource into separate text/data and metadata. The work is still in progress and will likely result in very compelling strategies to solve this problem. Now it is time to take a look at what is available and widely usable.
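
To make the idea of attribute-based metadata concrete, here is a minimal Python sketch that pulls simple statements out of an XHTML fragment marked up in the style of RDFa. It is an illustration only, not a conforming RDFa processor and not GRDDL: it looks just at the `property`, `content`, `rel`, and `href` attributes, treats the page address as the subject of every statement, and does not handle nested elements or resolve namespace prefixes. The class and function names, the sample markup, the author name, the date, and the example.org address are all invented for this example.

```python
from html.parser import HTMLParser


class MetadataSketch(HTMLParser):
    """Collects (subject, property, value) tuples from RDFa-style attributes."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url  # assumed subject of every statement
        self.triples = []
        self._pending = None      # property whose value is the element text
        self._text = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        # A link with rel/href states a relationship to another resource.
        if "rel" in a and "href" in a:
            self.triples.append((self.page_url, a["rel"], a["href"]))
        if "property" in a:
            if "content" in a:
                # The value is given explicitly in the content attribute.
                self.triples.append((self.page_url, a["property"], a["content"]))
            else:
                # Otherwise the visible text of the element is the value.
                self._pending = a["property"]
                self._text = []

    def handle_data(self, data):
        if self._pending is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Simplification: the first closing tag ends the pending property.
        if self._pending is not None:
            value = "".join(self._text).strip()
            self.triples.append((self.page_url, self._pending, value))
            self._pending = None


def extract_triples(markup, page_url):
    parser = MetadataSketch(page_url)
    parser.feed(markup)
    parser.close()
    return parser.triples


# Hypothetical page fragment; every value here is made up.
SAMPLE = """
<div xmlns:dc="http://purl.org/dc/elements/1.1/">
  <h1 property="dc:title">Example Page</h1>
  <span property="dc:creator">Jane Example</span>
  <meta property="dc:date" content="2007-01-01" />
  <a rel="license" href="http://creativecommons.org/licenses/by/3.0/">CC BY</a>
</div>
"""

if __name__ == "__main__":
    for triple in extract_triples(SAMPLE, "http://example.org/page"):
        print(triple)
```

The point of the sketch is that the same markup a person reads also carries the title, author, date, and license in a form a program can collect, without a second copy of the data living somewhere else.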
