The Art of Narrative and the Semantic Web

As the Internet continues to evolve, Semantic Web technologies are beginning to emerge, but widespread adoption is likely to still be two to three years out.


So what does this have to do with journalism? Ultimately, it may in fact determine what journalism is in the twenty-first century. In a sense, much of the publishing function of most newspapers and magazines has been taken over by content management systems, and in the last half decade such CMSs have increasingly become web-facing, blog-style platforms. While the shift from stand-alone word processing documents (such as those of Microsoft Word or OpenOffice Writer) to using a web page for story entry may seem like a step backwards in terms of functionality, in practice it represents the increasing move towards the document/data merger.

Most web content management systems store their stories as records rather than formal documents, and as such are able to store, in addition to the text content of the articles themselves, both attribute-driven metadata -- who wrote it, when it was written, who the publisher was and so forth -- and categorization information. Typically in the latter case the user chooses a category term from a drop-down list or related interface widget, though in some "open" taxonomies users can also add one or more terms of their own. Such systems can also maintain multiple sets of categories -- one, say, for the type of story (general news, bio, event coverage, op-ed and so forth), a second devoted to a particular topic (baseball, basketball, the Olympics, etc.), a third perhaps focusing on the location of the event.
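
To make this concrete, here is a minimal sketch of how a CMS might represent such a story record as XML. The element names, attribute names and category schemes are hypothetical, invented purely for illustration:

    <story id="20100612-0042" published="2010-06-12T09:30:00Z">
       <title>Blackhawks Win the Stanley Cup</title>
       <author>Jane Doe</author>
       <publisher>Example Sports Daily</publisher>
       <category scheme="story-type">event coverage</category>
       <category scheme="topic">hockey</category>
       <category scheme="location">Philadelphia, PA</category>
       <body>The full narrative text of the story goes here ...</body>
    </story>

Because the categories live in their own fields rather than in the prose, the CMS can index, filter and cross-reference stories along each scheme independently.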

Note that there is a certain degree of blurring between what constitutes an attribute and what constitutes a category term. In general, a good working definition of a category term is a combination of a category "name-space" and a corresponding sequence of words that collectively can partition a set of entries into a distinct group. A street address isn't a good category term, but city and state (or city and province) may very well be, and "story type" almost certainly is.



This categorization is critical because it makes it easier to organize resources, especially along multiple dimensions. Once you have a system of organization in place, it also becomes easier to locate them (or, put another way, you are enhancing discovery). Moreover, one upshot of creating web-based records is that each record ends up with its own address, its own URL. Put another way, categorization makes it possible to create "feeds" of article content that are all semantically similar to one another in that they share the categorization; they are related articles.

An RSS or Atom "news" feed traditionally provides one such categorization feed -- a list of resources (with associated metadata content and links for each entry) usually associated with a given blog author or publishing site, sorted to present the most recently published entry first and the rest in reverse chronological order. However, while in practice this "category" is hard-wired, the next generation of distributed web applications is shifting to a mode where you can filter such feeds by category (or search term, which is just another type of category) and even change the sort order. Languages such as XQuery -- a query language specifically designed for working with XML -- may very well supplant more traditional web application languages such as Ruby or PHP for generating such specialized feeds.
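
As a sketch of that idea, the following XQuery builds an Atom feed of all stories matching a requested category term, working against the hypothetical story records shown earlier (the stories.xml document and its structure are assumptions for illustration):

    declare namespace atom = "http://www.w3.org/2005/Atom";

    (: The category term to filter on, supplied by the caller. :)
    declare variable $term external;

    <atom:feed>
       <atom:title>Stories tagged "{$term}"</atom:title>
       {
          (: Select matching stories, newest first. :)
          for $story in doc("stories.xml")//story[category = $term]
          order by xs:dateTime($story/@published) descending
          return
             <atom:entry>
                <atom:title>{string($story/title)}</atom:title>
                <atom:updated>{string($story/@published)}</atom:updated>
                <atom:link href="{concat('http://example.com/stories/',
                                         $story/@id)}"/>
             </atom:entry>
       }
    </atom:feed>

Changing the category term, the sort key or the output format then becomes a matter of changing a parameter rather than rewriting the application.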

For instance, consider the sports journalism feeds again. Here, a publisher may provide a feed consisting of all stories in "blog" order. However, it is increasingly becoming possible to customize such feeds -- to provide only articles that deal with a particular sport, a particular team, or a particular player -- either by passing a category term as a query (say, /feeds?sport=hockey) or by binding that query to a specific URL (say, /feeds/hockey). Keep in mind that such feeds consist of collections of references (a link plus enough metadata to describe that link), though there's nothing stopping a particular feed provider from including the entire content of the article as part of that metadata.

Categorization can be accomplished manually -- the writer of an article could add each of these properties from drop-down menus, folksonomy lists or similar mechanisms -- but as the history of the Internet illustrates, once categorization moves beyond a few distinct categories, the likelihood that all of the participants in the system will tag content deeply and consistently is very low. For this reason, the process of document enrichment deserves a closer look.

With Document Enrichment (DE), the narrative content of a document is run through a semantic analyzer which attempts to identify personal names, institutions, famous events, locations and other more specialized terms (such as legal terms for a legal semantic parser or drug names for a medical parser) in the document, then wraps XML markup around these terms. OpenCalais (from Reuters) is perhaps the largest such effort, but document enrichment has also spawned a lively industry of start-ups and services.
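
The output of such a pass might look something like the following. The <entity> element and its type vocabulary are invented here for illustration; each vendor defines its own markup:

    <p>Before the game, <entity type="Person">Jane Doe</entity> spoke
    with reporters in <entity type="City">Philadelphia</entity> about
    the <entity type="SportsEvent">Stanley Cup Finals</entity>.</p>

The narrative text is untouched; the enrichment simply makes the entities within it machine-addressable.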

These matched terms are, of course, category terms, with the wrappers identifying the category. This involves more than just keyword matching -- most such systems actually attempt to build some formal context for the words and then map to the closest category in the semantic engine. Thus, such engines are capable of distinguishing between Paris, France; Paris the Trojan; Paris Hilton; and plaster of Paris.

Document enrichment by itself is of only moderate utility -- you are simply adding attributes to HTML elements to identify the category of a given word. With CSS, for instance, you could highlight the matched terms by category, visually distinguishing place names from personal names. However, such enrichment gains more power when these XML documents are processed afterwards -- you can pull categories out and add them to a general list of categories for the resource in question, or you can create links to specific content such as Wikipedia or the online Physicians' Desk Reference (the PDR). In other words, combining enrichment with a repository of content can both create an alternative mechanism for navigation and make the given page easier to find with a more localized search engine.
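
As a sketch of that post-processing step, an XQuery expression along these lines could pull the distinct categories and matched terms back out of an enriched article (again assuming the hypothetical <entity> markup shown above):

    (: Collect each entity type and its distinct matched terms
       from an enriched article. :)
    for $type in distinct-values(doc("enriched-story.xml")//entity/@type)
    return
       <category name="{$type}">
       {
          for $term in distinct-values(
                          doc("enriched-story.xml")//entity[@type = $type])
          return <term>{$term}</term>
       }
       </category>

The resulting category list can then be attached to the story record itself, feeding the same categorization machinery described earlier.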

Three Types of Formats

There are currently three types of formats used for document enrichment. The first is essentially a proprietary or ad hoc standard -- the DE vendor provides a formal taxonomy system and a method for embedding the formats within the initial text sample. The next approach (and one that is actually diminishing in use) is that of microformats: using an agreed-upon standard taxonomy for certain domains, such as document publishing (Dublin Core), friendship relationships (Friend of a Friend, or FOAF), address books (vCard), geocoding information (geo) and so forth. The problem with microformats is that they don't always work well in combination, and there's no way of encoding deeper relational information via most microformats.
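
For instance, the hCard microformat embeds vCard fields in ordinary HTML class attributes -- a standard pattern, shown here with made-up data:

    <div class="vcard">
       <span class="fn">Jane Doe</span>,
       <span class="org">Example Sports Daily</span>,
       <span class="adr"><span class="locality">Chicago</span></span>
    </div>

Each microformat covers its own domain well, but there is no shared mechanism for saying how a vCard, a FOAF relationship and a geo coordinate in the same page relate to one another.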

This latter issue lies at the heart of the Resource Description Framework in attributes, or RDFa, which makes it possible to encode relational information about different resources and resource links. RDFa is actually a variant of the much older W3C RDF language, first formulated in 1999 and then significantly updated in 2004. With RDFa, you can define multiple distinct categories (also known as name-spaces), each with its own terms. You can also establish relationships between different parts of a document by using RDF terminology -- for instance, indicating that a given introductory paragraph provides a good abstract summary "about" the document (or portion of a document) in question. There's even a specialized language called GRDDL that can take an RDFa-encoded document and generate a corresponding RDF document. While comparatively few document enrichment companies have RDFa products on the market, many are moving in that direction, with organizations such as the BBC, NBC News, Time Inc. and the Huffington Post among many others now exploring RDFa as a means of encoding such categorization information in the stories that are posted online.
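
A brief sketch of what RDFa markup looks like in practice, using the Dublin Core vocabulary (the vocabulary URL is real; the story content and record address are invented for illustration):

    <div xmlns:dc="http://purl.org/dc/terms/"
         about="/stories/20100612-0042">
       <h1 property="dc:title">Blackhawks Win the Stanley Cup</h1>
       <p>By <span property="dc:creator">Jane Doe</span>,
          <span property="dc:date">2010-06-12</span></p>
       <p property="dc:abstract">An introductory paragraph that doubles
          as the machine-readable summary of the story ...</p>
    </div>

An RDFa-aware processor can walk this markup and extract a set of RDF statements -- "this resource has the title ...", "this resource was created by ..." -- without any separate metadata file.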

RDF and RDFa are two of a set of related standards that form what is becoming known as the Semantic Web. The idea behind the Semantic Web was first proposed in an article by Tim Berners-Lee for Scientific American in 2001, and the W3C took up the idea as a formal activity by 2005. A related language, the Web Ontology Language (abbreviated as OWL, for some bizarre reason), describes more complex "business objects" and the relationships between these objects, to the extent that many people who work heavily in the Semantic Web tend to talk about RDF/OWL as a single concept. OWL, however, is focused more on defining the rules governing relationships than on the relationships themselves.

Tim Berners-Lee has also been championing a related concept called Linked Data, which he brought to wide attention in 2009. The fundamental idea behind Linked Data is that data object stores (such as a site's news content, or the contents of Wikipedia or a similar super-encyclopedia) can be represented as RDF entities, and then, via a specialized query language called SPARQL, queries can be made against these repositories as if they were databases (which in most cases they are). Keep in mind that what's being retrieved is the categorization and abstract information about this content; from there the content itself can be retrieved for additional processing (possibly in conjunction with a language like XQuery, which is better suited for working with the metadata within a document itself).

Thus, if you were to create a repository of news and biographical information, each item encoded with RDFa via a document enrichment facility, then a specialized processor could convert the RDFa into RDF (each statement of which retains a link back to the initial resource), and a SPARQL engine could perform queries upon that RDF to retrieve categorization information -- articles about the president of the United States or the prime minister of Canada, articles about financial services written in 2009 that focused on credit default swaps, articles focusing on how well rookie draft choices did in the NHL in 2010, and so forth.
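
One of those queries might look like the following SPARQL sketch. The dc: prefix is the standard Dublin Core vocabulary; the ex: properties and terms are invented here, since each repository defines its own:

    PREFIX dc:  <http://purl.org/dc/terms/>
    PREFIX ex:  <http://example.com/ns#>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

    # Find 2009 financial-services articles about credit default swaps.
    SELECT ?article ?title
    WHERE {
       ?article dc:title ?title ;
                ex:topic ex:CreditDefaultSwaps ;
                ex:industry ex:FinancialServices ;
                dc:date ?date .
       FILTER (?date >= "2009-01-01"^^xsd:date &&
               ?date <  "2010-01-01"^^xsd:date)
    }

The result is not the articles themselves but a table of links and titles -- exactly the kind of reference collection that a feed is built from.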

What's more, because all of these repositories use the same underlying mechanism for storing and manipulating such information, it becomes possible to work across multiple repositories -- such as Wikipedia, Data.gov and the National Archives -- with the same basic query, then pass these link sets back to some other process (such as XQuery) for individual processing. As such, Linked Data can effectively treat the entire web (or at least that part of the web currently using Linked Data methodologies) as a database.

This may become a powerful adjunct (or potentially even a replacement) to a text indexing service such as Google or Bing -- to the extent that Google and Microsoft are both investing heavily in Semantic Web research. It has significant applicability to advertising -- being able to more intelligently determine what a given article or resource is about makes it far easier to target advertising at the consumer of that resource -- as well as for the analyst. For instance, if you have something like a financial report written using the eXtensible Business Reporting Language (XBRL), an RDFa encoding of this information, mapped into a Linked Data repository, makes it possible to group such financial reports or SEC filings by industry, market cap or employment size (among other possible measures) without having to know in advance which reports (and by extension which companies) fall within these groups, making apples-to-apples comparisons considerably easier.
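
A sketch of what such a grouping might look like, using SPARQL 1.1 aggregates and an invented XBRL-style vocabulary (real XBRL-to-RDF mappings define their own property names):

    PREFIX ex: <http://example.com/xbrl-ns#>

    # Count SEC filings per industry sector.
    SELECT ?industry (COUNT(?filing) AS ?filings)
    WHERE {
       ?filing a ex:SECFiling ;
               ex:industrySector ?industry .
    }
    GROUP BY ?industry

No prior knowledge of which companies belong to which sector is needed; the grouping falls out of the data itself.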

These are progressive technologies. XML technologies are now about a decade old, and XQuery and XML database tools are only now becoming mainstream. Semantic Web technologies are beginning to emerge, but widespread adoption is likely still two to three years out. However, publishing and journalism are definitely at the forefront of that curve, because these areas in particular are most sensitive to the need both to provide enjoyable news content and to make such stories manipulable and discoverable within the ever-increasing sophistication and scope of the web itself. The narrative thread has become a rich, interwoven tapestry, illuminated by brilliant strands of meaning, semantics and abstraction, turning our writings into conversations, and from there into dialogs. It's a good time to be a journalist.


Kurt Cagle is the managing editor for XMLToday.org and a contributing editor for O'Reilly Media. He is currently working on a book about XBRL. Follow him on Twitter at twitter.com/kurt_cagle.