Gleaning Information From Embedded Metadata

One of the fundamental visions of the semantic web is the ability to provide improved technologies for machine-processable data. The current web is a swell place for people, but absent a series of open, global standards for metadata, it is difficult to imagine the interoperability necessary to link software, data, and documents in all their various forms. (Note that only the standards are required to be global; the specific terms, relationships, and concepts can be as diverse as the communities they reflect.)

While the vision eventually spirals off into spiders, agents, and bots, you do not have to go quite that far to imagine how vitally useful this automated data processability will be. Right now, the only real metadata available everywhere is the address of each document you browse and the date and time when you did so. You can collect this data in your browser or on sites like del.icio.us so that you can find the documents again later through tags that you create. In doing so, you are externalizing the metadata about the document either into a taxonomy (that is, browser history menus) or through keyword tags. It is the browsing experience that directly provides the where and when.

These two fundamental pieces of information are important, but the entire spectrum of expressible metadata offers a compelling promise of semi-automated data gathering that is just now beginning to be appreciated. Imagine passively tracking who wrote each of the pages you visit and where these authors work and live, what they are interested in, and who they know. Consider the efficiency of looking back at what you have perused and being told which documents are Creative Commons licensed in ways that allow you to directly mine what you have read as long as you attribute accordingly. Or, how about hitting a band’s web page and capturing when they are going to be playing in your town?

One of the biggest complaints about this vision, however, is that critics do not believe people will be willing to put in the effort to produce and maintain quality metadata. Their complaint is that without a solid foundation, the whole house of cards will fall or fail to emerge in the first place. While sites like del.icio.us, Flickr, and similar folksonomy-based approaches—and the rampant success of Atom/RSS feeds—seem to disprove these concerns, for the purposes of this article the assumption is that at least some publishers will be willing to do so.

The question is, how do you go about embedding this information into your web pages?

HTML and XHTML traditionally have had only modest support for metadata tags. They also have structural guidelines that make directly adding metadata more difficult than you might expect. Historically, developers and publishers have played some clever games to put domain-specific metadata into HTML by using microformats—for more information on microformats, see the article, “Discover Microformats for Embedding Semantics” (DevX, July 4, 2007). While useful, these specific formats fail to support an open-ended metadata language like the Resource Description Framework (RDF), which allows the use, reuse, and mixture of open-ended vocabulary spaces. You cannot ignore microformats, and you shouldn’t, because they have been adopted successfully and extensively, but they simply do not paint a complete picture.

The World Wide Web Consortium (W3C) is working on including richer metadata support in HTML/XHTML with emerging standards such as RDF in attributes (RDFa), embedded RDF (eRDF), and so on. These standards allow more specific metadata to be attached to different structural and presentation elements, which provides a unified information resource. Key goals of these efforts are to avoid duplicating data and to avoid forking information resources into text/data and metadata. The work is still in progress and will likely result in very compelling strategies to solve this problem. In the meantime, it is time to take a look at what is available and widely usable now.

GRDDL Support
Gleaning Resource Descriptions from Dialects of Languages (GRDDL, pronounced griddle) offers a solution to the embedded metadata problem in a flexible, inclusive, and forward-compatible way. It allows the extraction of standard forms of metadata (RDF) from a variety of sources within a document. People usually associate XHTML with GRDDL (as will this article), but it is worth noting that GRDDL is useful for extracting standardized RDF metadata from other XML structures as well.

GRDDL theoretically supports a series of naming conventions and standard transformations, but it does not require everyone to agree to particular markup strategies. It allows you to normalize metadata extraction from documents using RDFa, microformats, eRDF, or even custom mark-up schemes. The trick is to identify the document as a GRDDL-aware source by specifying an HTML metadata profile:

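    <html xmlns="http://www.w3.org/1999/xhtml">
      <head profile="http://www.w3.org/2003/g/data-view">
        <title>My Cool Document</title>
        .
        .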

This profile indicates to any GRDDL-aware agents that the standard GRDDL profile applies. Anyone wishing to extract metadata from the document should identify any relevant tags with a rel attribute of transformation and apply it to the document itself. This approach avoids the conventional problem of screen scraping, where the client has to figure out how to extract information. With GRDDL, the publisher indicates a simple, reusable mechanism to extract relevant information.
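To make the discovery step concrete, here is a minimal Python sketch of what a GRDDL-aware agent might do first: check for the GRDDL profile and collect the transformation links. The sample page and its href values are hypothetical, and the sketch only looks at link elements in the head; a fuller agent would presumably also honor other places where a transformation can be declared.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"

# Hypothetical sample page; the href values are illustrative only.
SAMPLE_PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>My Cool Document</title>
    <link rel="transformation" href="transforms/dc-extract.xsl" />
    <link rel="stylesheet" href="site.css" />
  </head>
  <body><p>Hello</p></body>
</html>"""


def find_transformations(xhtml_text):
    """Return the href of every rel="transformation" link in the head,
    or an empty list if the document does not declare the GRDDL profile."""
    root = ET.fromstring(xhtml_text)
    head = root.find(f"{{{XHTML_NS}}}head")
    if head is None:
        return []
    # The profile attribute is a space-separated list of URIs.
    if GRDDL_PROFILE not in head.get("profile", "").split():
        return []
    hrefs = []
    for link in head.iter(f"{{{XHTML_NS}}}link"):
        # rel is also a space-separated list of tokens.
        if "transformation" in link.get("rel", "").split() and link.get("href"):
            hrefs.append(link.get("href"))
    return hrefs


print(find_transformations(SAMPLE_PAGE))  # ['transforms/dc-extract.xsl']
```

The ordinary stylesheet link is ignored because its rel value is not transformation, which is the whole point: the publisher, not the client, says which links describe how to extract metadata.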

While it is certainly possible to create custom transformations, you will likely want to reuse existing transformations and the markup conventions they rely on. As an example, Dublin Core, maintained by the Dublin Core Metadata Initiative (DCMI), is the ubiquitous, canonical RDF vocabulary for describing publication metadata. To extract it, you may use the XSL file dc-extract.xsl, which is specified in the link statement for the transformation. To enable this extraction, mark up your HTML with conventions such as this:

    <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
    <meta name="DC.title" content="My Cool Document" />

Then apply this transformation to the document itself. The publisher can specify the transformation through a link statement such as this one:

    <link rel="transformation" href="http://www.w3.org/2000/06/dc-extract/dc-extract.xsl" />

Other dialects function similarly. Investigate their profiles to see how you might specify metadata for different transformations.
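To get a feel for what a transformation such as dc-extract.xsl produces, the following Python sketch approximates the idea by hand. It is not the XSLT itself. It assumes the common Dublin Core in HTML convention, in which a link with rel="schema.DC" declares the namespace for a prefix and meta elements named DC.something carry the values; the sample markup and the page URI are hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical sample page using the assumed schema.DC / DC.* convention.
SAMPLE_DC_PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>My Cool Document</title>
    <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
    <meta name="DC.title" content="My Cool Document" />
    <meta name="DC.language" content="en" />
    <meta name="keywords" content="not, dublin, core" />
  </head>
  <body />
</html>"""


def extract_prefixed_metadata(xhtml_text, page_uri):
    """Return (subject, predicate, literal) triples for every meta element
    whose name uses a prefix declared by a rel="schema.PREFIX" link."""
    root = ET.fromstring(xhtml_text)
    head = root.find(f"{{{XHTML_NS}}}head")
    if head is None:
        return []
    # Map each declared prefix to its namespace URI.
    schemas = {}
    for link in head.iter(f"{{{XHTML_NS}}}link"):
        rel = link.get("rel", "")
        if rel.startswith("schema.") and link.get("href"):
            schemas[rel[len("schema."):]] = link.get("href")
    triples = []
    for meta in head.iter(f"{{{XHTML_NS}}}meta"):
        prefix, dot, term = meta.get("name", "").partition(".")
        if dot and prefix in schemas:
            triples.append((page_uri, schemas[prefix] + term, meta.get("content", "")))
    return triples


for triple in extract_prefixed_metadata(SAMPLE_DC_PAGE, "http://example.org/doc"):
    print(triple)
```

The plain keywords meta element is skipped because it has no declared prefix, so only the two Dublin Core statements come out as triples about the page.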

GRDDL-Enabled Agents
While there is currently no direct support for GRDDL in any major browser, that situation is likely to change in the near future. Until then, it is not at all difficult to put a GRDDL-aware proxy between your browser and GRDDL-enabled pages, which is what the Piggy Bank Firefox extension from MIT’s SIMILE Project does.

The rest of this article will demonstrate this scenario by using NetKernel, a dual-license open source development environment from 1060 Research Limited that you can use as a proxy for handling GRDDL extraction on the fly. You certainly do not need NetKernel to perform this task; however, it offers a very efficient environment for doing so. As you will see, quite a lot gets done in less than 200 lines of code. Feel free to translate the walkthrough into whatever language you prefer to use. (See the sidebar “Installing NetKernel and the Proxy Example” for more details on getting this proxy running.)

As an example of a GRDDL-able page, take a look at a human-friendly, bio web page (see Figure 1). It includes a photograph, background information, a few current projects, and some news items. This page is fine for human digestion, but as you can see when looking at the page source, it is a fine page for software agents as well.


Figure 1. Worthy of GRDDL: This bio web page offers a good example of a GRDDL-able page that includes a photograph, background information, and some news items set up well for software agents.

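Looking at the source, you can see that this is a GRDDL-able document that announces five transformation links being served up locally:

    <link rel="transformation" href="../../transforms/grokFOAF.xsl" />
    <link rel="transformation" href="../../transforms/grokCC.xsl" />
    <link rel="transformation" href="../../transforms/grokGeoURL.xsl" />
    <link rel="transformation" href="../../transforms/dc-extract.xsl" />
    <link rel="transformation" href="../../transforms/home2rss.xsl" />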

These stylesheets will extract Friend of a Friend (FOAF) social networking data, Creative Commons license information, geocoding information, Dublin Core metadata, and RSS feed information from the page.

Configure the Proxy
If you have downloaded NetKernel, installed the modules as described in the sidebar, and started NetKernel, you should be able to configure your browser to use an HTTP proxy at localhost (or 127.0.0.1) on port 8082. For example, in Firefox, select Tools > Options, select Advanced, select the Network tab, and then click the Settings… button. From there, select “Manual proxy configuration” and enter the host and port given above. Note that this proxy is intended for demonstration purposes only at this point and should not be left on when you are done experimenting. Do not forget to set it back to “Direct connection to the Internet” or whatever is the default setting for your browser.

With the proxy activated, when you reload the page you should see some diagnostic print statements in the console window from which you started NetKernel. If you wait a few moments, you should see a big block of RDF dumped to the screen. On slow networks, this result may take some time.

Author’s Note: You may need to use Shift-click to force the page’s reload.
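If you would rather exercise the proxy from a script than from a browser, a sketch along the following lines should work with Python's standard library. The host and port are the ones given above; the page URL in the commented-out fetch is a placeholder, and the fetch itself only succeeds while NetKernel is running.

```python
import urllib.request

PROXY_HOST = "127.0.0.1"
PROXY_PORT = 8082


def build_proxy_opener(host=PROXY_HOST, port=PROXY_PORT):
    """Build a URL opener that routes plain HTTP requests through the proxy."""
    proxy_url = f"http://{host}:{port}"
    handler = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(handler), handler


opener, handler = build_proxy_opener()
print(handler.proxies)  # {'http': 'http://127.0.0.1:8082'}

# To fetch a page through the proxy (requires NetKernel to be running;
# the URL is a placeholder):
# html = opener.open("http://example.org/bio.html", timeout=30).read()
```

Because the opener is a local object, nothing else on your machine is affected, which sidesteps the need to remember to switch the browser setting back.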

Under the hood, a NetKernel HTTP fulcrum is listening on port 8082. If you look at the /modules/active-proxy-0.0.2/active-proxy-services/module.xml file, you will see a rewrite rule that routes everything to a BeanShell script:

    <rewrite>
      <rule>
        <match>(.*)</match>
        <to>active:beanshell+operator@ffcpl:/resource/main.bsh+base@$1</to>
      </rule>
    </rewrite>
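The effect of such a rule is easy to mimic in a few lines of Python. This sketch assumes the rule matches the whole request URI and substitutes the first capture group for $1; how NetKernel applies the match internally is not shown in this article, so treat the matching semantics as an assumption.

```python
import re

# The pattern and target from the rewrite rule. "$1" in the rule corresponds
# to "\1" in Python's replacement-template syntax.
MATCH = re.compile(r"(.*)")
TARGET = r"active:beanshell+operator@ffcpl:/resource/main.bsh+base@\1"


def rewrite(request_uri):
    """Return the rewritten URI, or None if the pattern does not match."""
    m = MATCH.fullmatch(request_uri)
    return m.expand(TARGET) if m else None


print(rewrite("http://example.org/bio.html"))
# active:beanshell+operator@ffcpl:/resource/main.bsh+base@http://example.org/bio.html
```

Every incoming URL therefore ends up as the base argument of the same script, which is what lets one script act as the proxy for all traffic.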

Without going into too many details of this script, NetKernel interprets the script as part of the request handling and fetches the specified resource by making a call to the HTTP Client module:

    req = context.createSubRequest("active:httpGet");
    req.addArgument("url", url);
    res = context.issueSubRequest(req);

In a proxy scenario, you do not want to block on returning the result to the client, so you cannot harvest the metadata synchronously. If you determine that you want to look for GRDDL transformations, do so asynchronously:

    req = context.createSubRequest();
    req.setURI("active:beanshell");
    req.addArgument("operator", "harvest.bsh");
    req.addArgument("url", url);
    context.issueAsyncSubRequest(req);

The results are currently not captured here, although you could certainly imagine wanting to do so at this level.
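The same fire-and-forget pattern can be sketched outside NetKernel with a thread pool. In the Python sketch below, fetch and harvest are hypothetical stand-ins for the upstream HTTP GET and the metadata extraction; the point is only that the response body is returned without waiting for the harvest to finish.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)
harvested = []                      # URLs whose metadata has been harvested
harvested_lock = threading.Lock()


def fetch(url):
    """Hypothetical stand-in for the upstream HTTP GET."""
    return f"<html><!-- body of {url} --></html>"


def harvest(url, body):
    """Hypothetical stand-in for metadata extraction; just records the URL."""
    with harvested_lock:
        harvested.append(url)


def handle_request(url):
    """Fetch the page, schedule the harvest in the background, and return
    the body right away. The future is returned only so that a caller can
    capture the harvest outcome later if it wants to."""
    body = fetch(url)
    future = executor.submit(harvest, url, body)
    return body, future


body, pending = handle_request("http://example.org/bio.html")
print(len(body) > 0)        # the client already has its response
pending.result(timeout=5)   # optional: wait for the harvest to complete
print(harvested)
```

Keeping the future around is one way to capture results at this level, as suggested above, without making the browser wait.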

At the moment, the harvest.bsh script only passes the resource reference to the GRDDL module, which routes the request to the /modules/grddl-0.0.2/grddlAccessor.bsh script. This script receives an argument called operand that represents the resource to mine for metadata.

Listing 1 shows the majority of the functionality for this example, including the grddlAccessor.bsh file. This script first asks the NetKernel environment to source the document (fetch it from the URL, and make it available for processing) and to interpret the URL argument as a java.net.URI instance. This latter step will allow local URI references to be contextualized against the source URI later:

    sourceDoc = context.sourceAspect("this:param:operand", IAspectSAX.class);
    sourceURL = context.getThisRequest().getArgument("operand");
    sourceURI = new java.net.URI(sourceURL);
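Resolving a local reference against the source URI works the same way in most languages. As an illustration, here is the equivalent step in Python using urljoin; the page URL is hypothetical, and the relative paths are the kind a page might use for transformations it serves itself.

```python
from urllib.parse import urljoin


def resolve_transforms(source_url, hrefs):
    """Resolve each (possibly relative) transform reference against the
    URL of the document that declared it."""
    return [urljoin(source_url, href) for href in hrefs]


source_url = "http://example.org/people/jane/index.html"  # hypothetical page URL
hrefs = ["../../transforms/grokFOAF.xsl", "../../transforms/dc-extract.xsl"]

for absolute in resolve_transforms(source_url, hrefs):
    print(absolute)
# http://example.org/transforms/grokFOAF.xsl
# http://example.org/transforms/dc-extract.xsl
```

An href that is already absolute passes through unchanged, so the same function handles pages that point at transformations hosted elsewhere.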

Run the getTransformsXML.xsl style sheet over your sourceDoc instance to obtain your list of transformations to apply. If you are interested in only certain types of transformations, you could restrict the extraction to just those. Listing 2 represents the contents of this style sheet.

    uriList = syncTransform(sourceDoc, "ffcpl:/getTransformsXML.xsl", null);

This function does a synchronous XSLT conversion and captures the result:

    http://www.w3.org/1999/xhtml
    http://www.w3.org/2003/g/data-view
    ../../transforms/grokFOAF.xsl
    ../../transforms/grokCC.xsl
    ../../transforms/grokGeoURL.xsl
    ../../transforms/dc-extract.xsl
    ../../transforms/home2rss.xsl

The script then sources an RDF document template that will be used to accumulate any RDF statements that are extracted for each transformation.

    // Create the RDF template
    rdf = context.sourceAspect("ffcpl:/template.rdf", IAspectXDA.class);

Transform Discovery
The IAspectXDA interface provides the ability to iterate over XML nodes selected by an XPath expression. The script counts how many transforms were discovered, and then creates arrays to hold the asynchronous request handles and their eventual results:

    count = Integer.valueOf(
      uriXda.eval("count(/uris/transform)").getStringValue()).intValue();
    handles = new INKFAsyncRequestHandle[count];
    results = new IURRepresentation[count];
    idx = 0;   // index of the next free slot in handles

    IXDAReadOnlyIterator transforms = uriXda.readOnlyIterator("/uris/transform");
    while (transforms.hasNext()) {
      transforms.next();
      nextTransform = transforms.getText(".", true);         // a java.lang.String
      // Get the canonical URI for the transform.
      nextTransformURI = sourceURI.resolve(nextTransform);   // a java.net.URI
      // Perform the XSL transformations asynchronously.
      handles[idx++] = asyncTransform(sourceDoc, nextTransformURI.toString(), null);
    }

To reduce the amount of time required to harvest the results, the script issues all of the transformation requests without blocking. Only after every request has been issued does it join on the handles and wait for the results:

    // join on the results
    for (int i = 0; i < count; i++) {
      results[i] = handles[i].join();
    }
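The issue-everything-then-join pattern is not specific to NetKernel. A rough Python equivalent using futures might look like the following, where apply_transform is a hypothetical stand-in for an XSLT run (the standard library has no XSLT processor, so the sketch only simulates the work).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def apply_transform(source_doc, transform_uri):
    """Hypothetical stand-in for running one XSLT transformation."""
    time.sleep(0.05)  # simulate work
    return f"result of {transform_uri} on {source_doc}"


def harvest_all(source_doc, transform_uris):
    """Issue every transformation first, then join on the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(transform_uris))) as pool:
        # submit() returns immediately, so all requests are in flight
        # before anything blocks.
        handles = [pool.submit(apply_transform, source_doc, uri)
                   for uri in transform_uris]
        # Join in submission order so results line up with transform_uris.
        return [handle.result() for handle in handles]


results = harvest_all("bio.html", ["grokFOAF.xsl", "dc-extract.xsl"])
print(results)
```

With this shape, the total wait is roughly that of the slowest transformation rather than the sum of all of them.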

After all the results are available, they are accumulated into the resource created previously from the RDF template:

    for (int i = 0; i < count; i++) {
      ...
    }

Here is the style sheet that performs this accumulation:

            

Finally, the accumulated results are run through the xmltidy accessor (see the NetKernel documentation for more information on accessors) and then tagged with an appropriate MIME type:

    // clean up the XML
    req = context.createSubRequest("active:xmltidy");
    req.addArgument("operand", rdf);
    rdf = context.issueSubRequest(req);

    // create response
    response = context.createResponseFrom(rdf);
    response.setMimeType("application/rdf+xml");
    context.setResponse(response);
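For readers following along in another language, the accumulation and response steps can be approximated as below. This Python sketch simply copies the children of each extracted rdf:RDF document into one empty rdf:RDF root and pairs the serialized result with the MIME type. It is an assumption about what the accumulation amounts to, not a port of the article's style sheet, and the sample fragments are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)

# Hypothetical outputs of two different transformations.
FRAGMENT_A = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:dc="{DC_NS}">
  <rdf:Description rdf:about="http://example.org/doc">
    <dc:title>My Cool Document</dc:title>
  </rdf:Description>
</rdf:RDF>"""
FRAGMENT_B = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:dc="{DC_NS}">
  <rdf:Description rdf:about="http://example.org/doc">
    <dc:language>en</dc:language>
  </rdf:Description>
</rdf:RDF>"""


def accumulate(rdf_documents):
    """Merge the top-level statements of several rdf:RDF documents."""
    merged = ET.Element(f"{{{RDF_NS}}}RDF")   # plays the role of the template
    for text in rdf_documents:
        root = ET.fromstring(text)
        if root.tag != f"{{{RDF_NS}}}RDF":
            continue  # skip anything that is not an rdf:RDF document
        merged.extend(list(root))
    return merged


def build_response(rdf_documents):
    """Serialize the merged RDF and tag it with its MIME type."""
    body = ET.tostring(accumulate(rdf_documents), encoding="unicode")
    return {"mime_type": "application/rdf+xml", "body": body}


response = build_response([FRAGMENT_A, FRAGMENT_B])
print(response["mime_type"])
print(response["body"])
```

A transformation that returns something other than RDF/XML is skipped here rather than aborting the whole harvest, which seems like the friendlier default for a passive proxy.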

The harvest.bsh script does not do anything useful with the results, but you could store them in an RDF triple store such as Mulgara, which was discussed in the article, "Storing and Using RDF in Mulgara" (DevX, August 30, 2007).

While some of the NetKernel concepts in the example may seem a little strange, hopefully it is obvious that passively harvesting RDF metadata with GRDDL profiles is not a difficult task. (Go ahead and dig deeper into NetKernel's powerful resource-oriented environment! View the documentation page at http://localhost:1060/ep+name@app_fulcrum_backend_documents after you have started NetKernel.)

There is still a bit of a bootstrap problem to get GRDDL used more extensively around the web. After people see how useful and easy this process can be, it seems like it will only be a matter of time until the tools will enable you to take advantage of metadata embedded in browsed documents. With users already willing to provide quality metadata through tags on social-oriented web sites, if the bar is lowered on how to discover and reuse terms from formal vocabularies, it seems likely that people will do so for semantic markup just like they do for presentation markup.
