Gleaning Information From Embedded Metadata : Page 2

Put GRDDL-enabled agents to the task of extracting valuable information from machine-processable metadata embedded in documents, courtesy of prevailing semantic web standards.

GRDDL Support
Gleaning Resource Descriptions from Dialects of Languages (GRDDL, pronounced griddle) offers a solution to the embedded metadata problem in a flexible, inclusive, and forward-compatible way. It allows the extraction of standard forms of metadata (RDF) from a variety of sources within a document. People usually associate XHTML with GRDDL (as will this article), but it is worth noting that GRDDL is useful for extracting standardized RDF metadata from other XML structures as well.

In principle, GRDDL supports a range of naming conventions and standard transformations, but it does not require everyone to agree on a particular markup strategy. It lets you normalize metadata extraction from documents that use RDFa, microformats, eRDF, or even custom markup schemes. The trick is to identify the document as a GRDDL-aware source by specifying an HTML metadata profile:

<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>My Cool Document</title>
    . .
</html>

This profile tells any GRDDL-aware agent that the standard GRDDL profile applies. An agent that wants to extract metadata from the document should find every <link> tag whose rel attribute is transformation and apply each referenced transformation to the document itself. This approach avoids the conventional problem of screen scraping, where the client has to figure out how to extract information. With GRDDL, the publisher indicates a simple, reusable mechanism to extract relevant information.
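The discovery step described above can be sketched in a few lines of Python. This is only an illustration using the standard library, not how any particular GRDDL agent is implemented; the function name find_transformations and the sample document are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"


def find_transformations(xhtml_text):
    """Return the href of every rel="transformation" link.

    Returns an empty list when the document does not declare the
    GRDDL profile on its <head> element.
    """
    root = ET.fromstring(xhtml_text)
    head = root.find(f"{{{XHTML_NS}}}head")
    if head is None:
        return []
    # The profile attribute may hold several space-separated URIs.
    if GRDDL_PROFILE not in head.get("profile", "").split():
        return []
    hrefs = []
    for link in head.findall(f"{{{XHTML_NS}}}link"):
        # rel may also hold several space-separated tokens.
        rel_tokens = link.get("rel", "").split()
        if "transformation" in rel_tokens and link.get("href"):
            hrefs.append(link.get("href"))
    return hrefs


# Hypothetical sample document, assembled from the snippets in this article.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>My Cool Document</title>
    <link href="http://www.w3.org/2000/06/dc-extract/dc-extract.xsl" rel="transformation"/>
  </head>
  <body/>
</html>"""

print(find_transformations(SAMPLE))
```

A real agent would then fetch each stylesheet and run it against the page with an XSLT processor; that step is left out here because the Python standard library does not include one.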

While it is certainly possible to create custom transformations, you will likely want to reuse existing transformations and the markup conventions they rely on. As an example, Dublin Core, maintained by the Dublin Core Metadata Initiative (DCMI), is a widely used RDF vocabulary for describing publication metadata. To extract it, you can use the XSL file dc-extract.xsl, which the publisher references in a transformation link. To enable this extraction, mark up your HTML with conventions such as this:

<meta name="DC.Date" content="2008-01-03" />
<meta name="DC.Creator" content="Brian Sletten" />
<meta name="DC.Description" content="This article is about why the new Okkervil River album rocks." />

Then apply this transformation to the document itself. The publisher can specify the transformation through a link statement such as this one:

<link href="http://www.w3.org/2000/06/dc-extract/dc-extract.xsl" rel="transformation"/>
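
To make the result of such an extraction more concrete, here is a rough Python approximation of the idea. It is not the dc-extract.xsl stylesheet and does not claim to reproduce its exact output: it simply maps each DC.* meta tag to a (subject, predicate, object) statement in the Dublin Core element namespace. The function name dc_statements, the subject URL, and the simple lower-casing rule are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace


def dc_statements(xhtml_text, subject):
    """Turn <meta name="DC.*"> tags into (subject, predicate, object) tuples.

    Assumption: the part after "DC." is lower-cased to form the property
    name. The real stylesheet may handle more cases than this sketch.
    """
    root = ET.fromstring(xhtml_text)
    statements = []
    for meta in root.iter(f"{{{XHTML_NS}}}meta"):
        name = meta.get("name", "")
        content = meta.get("content")
        if name.startswith("DC.") and content is not None:
            predicate = DC_NS + name[len("DC."):].lower()
            statements.append((subject, predicate, content))
    return statements


# Hypothetical document and subject URL, used only for this example.
DOC = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>My Cool Document</title>
    <meta name="DC.Date" content="2008-01-03" />
    <meta name="DC.Creator" content="Brian Sletten" />
  </head>
  <body/>
</html>"""

for statement in dc_statements(DOC, "http://example.org/article"):
    print(statement)
```

The point of the sketch is the shape of the output: plain name/content pairs in the page become RDF-style statements about the document, which is what a GRDDL transformation delivers to the consuming agent.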

Other dialects function similarly. Investigate their profiles to see how you might specify metadata for different transformations.

GRDDL-Enabled Agents
While there is currently no direct support for GRDDL in any major browser, that situation is likely to change in the near future. Until then, it is not at all difficult to put a GRDDL-aware proxy between your browser and GRDDL-enabled pages, as the Piggy Bank Firefox extension from MIT's SIMILE Project does.

The rest of this article demonstrates this scenario using NetKernel, a dual-license open source development environment from 1060 Research Limited that you can use as a proxy for handling GRDDL extraction on the fly. You certainly do not need NetKernel to perform this task; however, it offers a very efficient environment for doing so. As you will see, quite a lot gets done in fewer than 200 lines of code. Feel free to translate the walkthrough into whatever language you prefer to use. (See the sidebar "Installing NetKernel and the Proxy Example" for more details on getting this proxy running.)

As an example of a GRDDL-able page, take a look at a human-friendly bio web page (see Figure 1). It includes a photograph, background information, a few current projects, and some news items. The page reads well for humans, but as the page source shows, it serves software agents just as well.

Figure 1. Worthy of GRDDL: This bio web page offers a good example of a GRDDL-able page that includes a photograph, background information, and some news items set up well for software agents.
Looking at the source, you can see that this is a GRDDL-able document that announces five transformation links being served up locally:

<link href="../../transforms/grokFOAF.xsl" rel="transformation"/>
<link href="../../transforms/grokCC.xsl" rel="transformation"/>
<link href="../../transforms/grokGeoURL.xsl" rel="transformation"/>
<link href="../../transforms/dc-extract.xsl" rel="transformation"/>
<link href="../../transforms/home2rss.xsl" rel="transformation"/>

These stylesheets extract Friend of a Friend (FOAF) social networking data, Creative Commons license information, geocoding information, Dublin Core metadata, and RSS feed information from the page.
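
Because these links are relative, an agent has to resolve them against the URL of the page before it can fetch the stylesheets. A minimal Python sketch of that step follows; the page URL is hypothetical (the article does not give the real one), and resolve_transformations is a name invented for this example.

```python
from urllib.parse import urljoin

# Hypothetical page URL; substitute the address the bio page is served from.
PAGE_URL = "http://example.org/people/bio/index.html"

# The five relative hrefs announced by the page.
TRANSFORM_HREFS = [
    "../../transforms/grokFOAF.xsl",
    "../../transforms/grokCC.xsl",
    "../../transforms/grokGeoURL.xsl",
    "../../transforms/dc-extract.xsl",
    "../../transforms/home2rss.xsl",
]


def resolve_transformations(page_url, hrefs):
    """Resolve each transformation href against the page's own URL."""
    return [urljoin(page_url, href) for href in hrefs]


for url in resolve_transformations(PAGE_URL, TRANSFORM_HREFS):
    print(url)
```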
