Gleaning Information From Embedded Metadata : Page 3

Put GRDDL-enabled agents to the task of extracting valuable information from machine-processable metadata embedded in documents, courtesy of prevailing semantic web standards.

Configure the Proxy
If you have downloaded NetKernel, installed the modules as described in the sidebar, and started NetKernel, you should be able to configure your browser to use an HTTP proxy at localhost on port 8082. For example, in Firefox, select Tools ⇒ Options, select Advanced, select the Network tab, and then click the Settings... button. From there, select "Manual proxy configuration" and enter the host and port given above. Note that this proxy is intended for demonstration purposes only and should not be left on when you are done experimenting. When you finish, set the browser back to "Direct connection to the Internet" or whatever its default setting is.
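If you would rather exercise the proxy from code than from a browser, the same settings can be applied to a single connection. The following is a minimal Java sketch, assuming the proxy is listening at localhost on port 8082 as described above. The class and method names are hypothetical and are not part of NetKernel or the article's modules.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;
import java.net.URLConnection;

public class ProxyClientSketch {
    // Assumption: the demonstration proxy listens at localhost:8082.
    static Proxy demoProxy() {
        return new Proxy(Proxy.Type.HTTP, new InetSocketAddress("localhost", 8082));
    }

    // Prepares a connection that will be routed through the proxy.
    // Nothing is sent over the network until the caller reads from
    // or explicitly connects the returned URLConnection.
    static URLConnection openThroughProxy(String pageUrl) throws IOException {
        return URI.create(pageUrl).toURL().openConnection(demoProxy());
    }
}
```

Because the proxy is passed per connection, nothing needs to be reset afterward, unlike the browser-wide setting.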

With the proxy activated, when you reload the page you should see some diagnostic print statements in the console window from which you started NetKernel. If you wait a few moments, you should see a big block of RDF dumped to the screen. On slow networks, this result may take some time.

Author's Note: You may need to use Shift-click to force the page's reload.

Under the hood, a NetKernel HTTP fulcrum is listening on port 8082. If you look at the <NetKernel-install-dir>/modules/active-proxy-0.0.2/active-proxy-services/module.xml file, you will see a rewrite rule that routes everything to a BeanShell script:

<rewrite>
  <rule>
    <match>(.*)</match>
    <to>active:beanshell+operator@ffcpl:/resource/main.bsh+base@$1</to>
  </rule>
</rewrite>
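To make the effect of the rule concrete, the sketch below mimics the substitution in plain Java: the pattern (.*) captures the whole incoming URI, and the captured text replaces $1 in the target. It only illustrates the substitution and is not how NetKernel implements rewriting. The class name and the example URL are hypothetical, and it assumes the rule must match the entire request URI.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RewriteRuleSketch {
    // Values copied from the <match> and <to> elements of the rule.
    static final Pattern MATCH = Pattern.compile("(.*)");
    static final String TO =
        "active:beanshell+operator@ffcpl:/resource/main.bsh+base@$1";

    // Returns the rewritten URI, or the input unchanged if the rule
    // does not match the whole request URI (assumed semantics).
    static String rewrite(String requestUri) {
        Matcher m = MATCH.matcher(requestUri);
        if (!m.matches()) {
            return requestUri;
        }
        // Literal replacement of the $1 placeholder with the captured group.
        return TO.replace("$1", m.group(1));
    }

    public static void main(String[] args) {
        // prints active:beanshell+operator@ffcpl:/resource/main.bsh+base@http://example.org/index.html
        System.out.println(rewrite("http://example.org/index.html"));
    }
}
```

Every request therefore reaches main.bsh, with the originally requested URL available to the script as the base argument.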

Without going into too much detail: NetKernel interprets the script as part of request handling, and the script fetches the specified resource by making a call to the HTTP Client module:

req = context.createSubRequest("active:httpGet");
req.addArgument("url", url);
res = context.issueSubRequest(req);

In a proxy scenario, you do not want to block on returning the result to the client, so you cannot harvest the metadata synchronously. If you determine that you want to look for GRDDL transformations, do so asynchronously:

req = context.createSubRequest();
req.setURI("active:beanshell");
req.addArgument("operator", "harvest.bsh");
req.addArgument("url", url);
context.issueAsyncSubRequest(req);
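The shape of this pattern, fetch synchronously and harvest in the background, can be sketched in plain Java without any NetKernel classes. This is an analogy only: CompletableFuture stands in for issueAsyncSubRequest, and the class, the Handled holder, and the fetch and harvest callbacks are all hypothetical names introduced for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;
import java.util.function.Function;

public class AsyncHarvestSketch {
    // Holds the response body plus a handle on the background harvest.
    static final class Handled {
        final String body;
        final CompletableFuture<Void> harvest;

        Handled(String body, CompletableFuture<Void> harvest) {
            this.body = body;
            this.harvest = harvest;
        }
    }

    static Handled handle(String url,
                          Function<String, String> fetch,
                          Consumer<String> harvest) {
        // The client is waiting for this, so fetch on the calling thread.
        String body = fetch.apply(url);
        // Metadata harvesting must not delay the response, so hand it
        // to a background thread and return without waiting for it.
        CompletableFuture<Void> pending =
            CompletableFuture.runAsync(() -> harvest.accept(url));
        return new Handled(body, pending);
    }
}
```

Unlike the proxy script, this sketch returns a handle on the background work, which is one way a caller could capture the harvest outcome if it wanted to.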

The results are currently not captured here, although you could certainly imagine wanting to do so at this level.

At the moment, the harvest.bsh script only passes the resource reference to the GRDDL module, which routes the request to the <NetKernel-install-dir>/modules/grddl-0.0.2/grddlAccessor.bsh script. This script receives an argument called operand that represents the resource to mine for metadata.

Listing 1 shows the majority of the functionality for this example, including the grddlAccessor.bsh file. This script first asks the NetKernel environment to source the document (fetch it from the URL, and make it available for processing) and to interpret the URL argument as a java.net.URI instance. This latter step will allow local URI references to be contextualized against the source URI later:

sourceDoc = context.sourceAspect("this:param:operand", IAspectSAX.class);
sourceURL = context.getThisRequest().getArgument("operand");
sourceURI = new java.net.URI(sourceURL);
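Holding the source as a java.net.URI matters because transformation references in a page are often relative. A minimal sketch of the later contextualization step, using the standard URI.resolve method, might look like this. The class name, method name, and example.org source address are hypothetical; the relative reference is one of the transform paths that the extraction step produces.

```java
import java.net.URI;

public class ResolveTransformSketch {
    // Resolves a possibly relative reference against the source document's URI.
    // Absolute references are returned unchanged by URI.resolve.
    static URI resolveAgainst(String sourceUrl, String reference) {
        return URI.create(sourceUrl).resolve(reference);
    }

    public static void main(String[] args) {
        // prints http://example.org/transforms/grokFOAF.xsl
        System.out.println(resolveAgainst(
            "http://example.org/people/alice/index.html",
            "../../transforms/grokFOAF.xsl"));
    }
}
```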

Run the getTransformsXML.xsl style sheet over your sourceDoc instance to obtain your list of transformations to apply. If you are interested in only certain types of transformations, you could restrict the extraction to just those. Listing 2 represents the contents of this style sheet.

uriList = syncTransform(sourceDoc, "ffcpl:/getTransformsXML.xsl", null);

This function does a synchronous XSLT conversion and captures the result:

<uris xmlns:dataview="http://www.w3.org/2003/g/data-view#">
  <rootNamespace>http://www.w3.org/1999/xhtml</rootNamespace>
  <profile>http://www.w3.org/2003/g/data-view</profile>
  <transform>../../transforms/grokFOAF.xsl</transform>
  <transform>../../transforms/grokCC.xsl</transform>
  <transform>../../transforms/grokGeoURL.xsl</transform>
  <transform>../../transforms/dc-extract.xsl</transform>
  <transform>../../transforms/home2rss.xsl</transform>
</uris>
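For readers without NetKernel at hand, the extraction step can be approximated with the XSLT support in the standard Java library. The sketch below is not the article's getTransformsXML.xsl: it embeds a much smaller, hypothetical style sheet that only collects the href of each XHTML link element whose rel value includes "transformation" (the GRDDL convention) and omits the rootNamespace and profile elements. The class and method names are likewise assumptions.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class TransformListSketch {
    // Hypothetical, simplified stand-in for getTransformsXML.xsl.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:h='http://www.w3.org/1999/xhtml'"
      + " exclude-result-prefixes='h'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<uris>"
      // rel may hold several space-separated tokens, so test for the token.
      + "<xsl:for-each select=\"//h:link[contains("
      + "concat(' ', normalize-space(@rel), ' '), ' transformation ')]\">"
      + "<transform><xsl:value-of select='@href'/></transform>"
      + "</xsl:for-each>"
      + "</uris>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Runs the style sheet over an XHTML string and returns the result as text.
    static String listTransforms(String xhtml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xhtml)),
                        new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }
}
```

Narrowing the select expression in such a style sheet is one way to restrict extraction to only the transformation types you care about.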

The script then sources an RDF document template that will be used to accumulate any RDF statements that are extracted for each transformation.

// Create the RDF template
rdf = context.sourceAspect("ffcpl:/template.rdf", IAspectXDA.class);
