The art of the narrative is one of the strongest threads running through our society and culture, and is in many respects one of the defining traits of humanity. “The story” is more than just a recitation of facts or assertions (whether real or otherwise). A good story is experiential. It puts each of us as listeners into the narrator’s world and frame of mind, and lets us live, vicariously, through the experiences that the narrator had or conceived. In many cases we identify with the protagonist, whether the story is an epic fantasy journey through lost worlds, a sports article talking about the clash between two rival football teams, or the reportage of a major political event. We read meaning into these narratives at many levels, from the bald statement of fact to the subtle interplay of analysis, implication, innuendo and metaphor, and it is the richness of these metaphors that gives meaning to the work.
Librarians, historians and archivists have spent centuries engaged in an increasingly challenging task. A story, once written, forms a strand of a larger cultural narrative … but only if that story is placed into an overall context and is preserved within that context. The job of a librarian or archivist is traditionally to assign such a story a place in a classification system, which in turn makes it locatable within an archive, as well as to abstract from that story enough metadata to describe what the story is about. Anyone who has spent days poring through newspaper collections trying to pull together enough information to write a report or make a recommendation can tell you the importance of a good classification system (and good abstractions), or better yet, of multiple classification systems that make it easier to triangulate on that information from multiple directions at once.
So long as the information itself was located in books, newspapers, magazines, annual reports and other publications, the archivist’s job was manageable, albeit barely. A card catalog is, when you think about it, a rather impressive achievement, given the inherent difficulties in setting up sufficiently comprehensive classification taxonomies and then assigning a given resource to that taxonomy. Yet a great deal of information still disappeared through the cracks: materials not classified properly or completely, resources that changed over time, resources moved from one location to another without this being noted. It’s perhaps no wonder that librarians are stressed out all the time.
Once such narratives made the jump to the web, however, the whole house of library cards promptly collapsed. At first, the best key to finding a resource on the web was to create a memorable URL, which in turn set off a massive land rush for domain names. The thinking here was that a domain name was, in essence, a brand, and if you had a noteworthy brand, that would translate into lots of web traffic. In the early days of the web, that was probably true, but as the number of such domains grew and as more and more interesting stuff started appearing “off-label”, this particular classification system became very secondary.
The next system was to build directories — after all, this created the same kind of taxonomy system that most people were used to for everything from yellow pages to libraries. Here again, however, the volume of new material proved this form of classification’s undoing — there was no clean mechanism for asserting that a particular document is in fact in a given classification term (indeed it may be in several simultaneously — or not fit cleanly into any bucket in the taxonomy). Moreover, the decision to put a site into such directories typically required human intervention, which meant that after a while (after the novelty of such directories began to pale) such directories became ever further out of date and represented a smaller and smaller percentage of the total content.
Search engines, such as AltaVista, Yahoo, and ultimately Google and Bing, represented the next stage of classification, employing a process whereby links within web content were routinely spidered in order to build indexes. Indexing is a powerful tool, because while there are potentially huge variations in search terms, once you create an index phrase you can effectively store multiple links to that same phrase, ordered by frequency of reference and other relevancy measures. Then when another person enters the same phrase, there’s already an extant cache of saved entries that can be served up very quickly, rather than the search having to be performed over the entire set of billions of entries. The search phrases become the classification terms.
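The indexing idea described above can be sketched in a few lines of Python. This is a deliberately tiny, word-level inverted index, not a description of how any real search engine works: the page URLs and text are invented, there is no relevancy ranking, and the function names `build_index` and `search` are assumptions made for illustration.

```python
from collections import defaultdict

PUNCTUATION = ".,;:!?\"'"


def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for raw in text.lower().split():
            word = raw.strip(PUNCTUATION)
            if word:
                index[word].add(url)
    return index


def search(index, phrase):
    """Return the URLs that contain every word of the phrase, sorted."""
    words = [w.strip(PUNCTUATION) for w in phrase.lower().split()]
    words = [w for w in words if w]
    if not words:
        return []
    # Look up each word once in the prebuilt index instead of rescanning pages.
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


# Hypothetical pages standing in for spidered web content.
pages = {
    "http://example.org/a": "Rival football teams clash in the final.",
    "http://example.org/b": "The football season opens next week.",
    "http://example.org/c": "A major political event unfolds.",
}

index = build_index(pages)
print(search(index, "football"))
print(search(index, "football final"))
```

The point of the sketch is that the expensive work happens once, when the index is built; each later search is only a handful of dictionary lookups.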
Challenges From Many Directions
Even this system is being sorely challenged, and the challenges are coming from a variety of directions. Indexing that much content is extraordinarily processor intensive, even when spread across multiple processes, and needs to be accomplished frequently in order to remain even relatively current. This translates into a massive investment in computer systems (Google, for instance, had 450,000 server systems in 2006, and the number may be close to one million as of early 2010), along with corresponding issues of energy and heat management (current estimates place Google’s total power demand at 35-40MW, which is the power draw of a moderate sized city).
However, another problem is more troublesome. The web as originally conceived was largely static: web content, once posted, usually didn’t change significantly. By 2010, however, the vast majority of content that is developed on the web falls more properly into the realm of messages rather than documents: Facebook and Twitter notifications, resources generated from rapidly changing databases, documents in which narrative content is embedded within larger data structures, RSS and Atom feeds, KML (ironically, Google Earth and Google Maps) documents and so forth. Thus, a URL no longer contains a static narrative; it contains a constantly changing message. This means that by the time that URL content is indexed, the new content from the URL may have no bearing on the original content.
This is further exacerbated by a growing trend towards the deployment of data as a service. In this case data objects (what have traditionally been called business objects) are represented as structured data in XML or JSON, but have many of the same characteristics as more traditional web content: a distinct URL, a clear representation in at least one and potentially many different forms, access via HTTP methods and so forth. They may have long term persistence, but may just as readily be ephemeral, and as such they take on many of the properties of messages as well.
If you look at the history of the build-out of the web, one thing that becomes apparent is that as a new modality of communication comes online, people invariably try to find a real world analog, then use this analog to create a metaphor for interfaces used to interact with that modality. In time, however, the metaphor leaks as the differences between the real and virtual world representations force differences in thinking about such data objects (or, perhaps to illustrate the point, thinking about document objects). By then, the next generation of people working with these document objects understand the distinctions between the virtual objects and the real world analog, and no longer need the metaphor — it in fact becomes a crutch.
A good case in point here was the confusion concerning web page design when web content first emerged in the mid-1990s. The earliest web designers tended to treat such pages as research papers, because the earliest users were academic. At various points web pages were designed as if they were newspaper articles, magazine articles, TV screens and computer programs, but by the early-’00s a formal “methodology” of web design had emerged that recognized that a web page was a web page, and had its own rules of design.
Similarly, in the later part of that decade, XML use as a messaging format, a document format, a feed format, a record format and a data format has begun to blur the distinction between these, because each of these is simply a metaphor that we pull in from the “real” world for the entities that we’ve created. In practice there’s very little actual difference between them; they are just different facets of the same basic underlying representation.
So what does this have to do with journalism? Ultimately, it may in fact determine what journalism is in the twenty-first century. In a sense, much of the publishing function of most newspapers and magazines has been taken over by content management systems, and in the last half decade such CMSs are increasingly web facing “blogging” type platforms. While such a shift from stand-alone word processing documents (such as those of Microsoft Word or OpenOffice Writer) to using a web page for story entry may seem like a step backwards in terms of functionality, in practice such a shift represents the increasing move towards the document/data merger.
Most web content management systems store their stories as records rather than formal documents, and as such are able to store, in addition to the text content of the articles themselves, both attribute driven metadata (who wrote it, when it was written, who the publisher was and so forth) and categorization information. Typically in the latter case the user chooses a category term from a drop down list or related interface widget, though in some “open” taxonomies they can also add one or more terms. Such systems can also maintain multiple sets of categories: one, say, for the type of story (general news, bio, event coverage, op-ed and so forth), a second devoted to a particular topic (baseball, basketball, the Olympics, etc.), a third perhaps focusing on the location of the event.
Note that there is a certain degree of blurring that goes on between what constitutes an attribute vs. what constitutes a category term. In general a good working definition of a category term is a combination of a category “name-space” and a corresponding sequence of words that collectively can partition a set of entries into a distinct group. A street address isn’t a good category term, but city and state (or city and province) may very well be, and “story type” almost certainly is.
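To make the distinction between attributes and category terms concrete, here is a minimal Python sketch of a story record. It is an illustration only, not the schema of any particular content management system: the class names `Story` and `CategoryTerm`, the helper `in_category`, the field names and the sample stories are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CategoryTerm:
    """A category name-space plus the words that identify one group in it."""
    namespace: str  # e.g. "story-type", "topic", "location"
    term: str


@dataclass
class Story:
    # Attribute-style metadata: facts about this one record.
    url: str
    author: str
    published: str  # ISO 8601 date
    body: str
    # Category terms: values that partition the archive into groups.
    categories: set = field(default_factory=set)


def in_category(stories, namespace, term):
    """Return the stories tagged with the given category term."""
    wanted = CategoryTerm(namespace, term)
    return [story for story in stories if wanted in story.categories]


# Hypothetical records for illustration.
stories = [
    Story(
        url="http://example.org/stories/1",
        author="A. Writer",
        published="2010-02-14",
        body="Story text goes here.",
        categories={
            CategoryTerm("story-type", "event coverage"),
            CategoryTerm("topic", "the Olympics"),
        },
    ),
    Story(
        url="http://example.org/stories/2",
        author="B. Columnist",
        published="2010-04-05",
        body="Story text goes here.",
        categories={
            CategoryTerm("story-type", "op-ed"),
            CategoryTerm("topic", "baseball"),
        },
    ),
]

print([story.url for story in in_category(stories, "topic", "baseball")])
```

Because each term carries its name-space, the same record can sit in several independent classification schemes at once, which is exactly what a single shelf position in a card catalog cannot do.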
This categorization is critical because it makes it easier to organize resources, especially in multiple dimensions. Once you have a system of organization in place, it also makes it easier to locate them (or, put another way, you are enhancing discovery). Moreover, one upshot of creating web-based records is that each record ends up with its own address, its own URL. Put another way, categorization makes it possible to create “feeds” of article content that are all semantically similar to one another in that they share the categorization; they are related articles.
An RSS or Atom “news” feed traditionally provides one such categorization feed: a list of resources (with associated metadata content and links for each entry) usually associated with a given blog author or publishing site, sorted to present the most recently published entry first and then other entries in reverse chronological order. However, while in practice this “category” is hard-wired, the next generation of distributed web applications is shifting to a mode where you can filter such feeds by category (or search term, which is just another type of category) and even change the sort order. Languages such as XQuery, a query language specifically designed for working with XML, may very well supplant more traditional web application languages such as Ruby or PHP for generating such specialized feeds.
For instance, consider the sports journalism feeds again. In this particular case, a particular publisher may provide a feed consisting of all stories in “blog” order. However, increasingly, it is becoming possible to customize such feeds — provide only articles that deal with a particular sport, a particular team, or a particular player, either by passing a category term (a query) or by binding that query to a specific URL. Keep in mind that such feeds consist of collections of references (a link plus enough metadata to describe that link) though there’s nothing stopping a particular feed provider from giving the entire content of the article as part of that metadata.
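A filtered feed of this kind is easy to sketch. The Python below is only an illustration of the idea, assuming each feed entry is a dictionary holding a link, a title, a publication date and a set of (name-space, term) pairs; the entries themselves and the function name `feed` are invented.

```python
from datetime import date

# Hypothetical feed entries: a link plus enough metadata to describe it.
entries = [
    {
        "link": "http://example.org/stories/1",
        "title": "Opening day preview",
        "published": date(2010, 4, 4),
        "categories": {("topic", "baseball")},
    },
    {
        "link": "http://example.org/stories/2",
        "title": "Playoff recap",
        "published": date(2010, 4, 20),
        "categories": {("topic", "basketball")},
    },
    {
        "link": "http://example.org/stories/3",
        "title": "Rookie pitcher debuts",
        "published": date(2010, 4, 12),
        "categories": {("topic", "baseball"), ("story-type", "bio")},
    },
]


def feed(entries, category=None, newest_first=True):
    """Return entries, optionally limited to one category, sorted by date."""
    selected = [
        entry for entry in entries
        if category is None or category in entry["categories"]
    ]
    return sorted(selected, key=lambda e: e["published"], reverse=newest_first)


# With no category this is the classic "blog" order; with one, a custom feed.
for entry in feed(entries, category=("topic", "baseball")):
    print(entry["published"].isoformat(), entry["link"])
```

Binding such a query to a URL then amounts to mapping the category pair onto a path or query string that calls the same function.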
Categorization can be accomplished manually — the writer of an article could add in each of these properties from drop-down menus or folksonomy lists or similar mechanisms — but as the history of the Internet illustrates, once categorization moves beyond a few distinct categories, the likelihood that you’ll have all of the participants in the system tag for you at a deep level remains very limited. For this reason, the process of document enrichment should be viewed especially closely.
With Document Enrichment (DE), the narrative content of a document is run through a semantic analyser which attempts to identify personal names, institutions, famous events, locations and other more specialized terms (such as legal terms for a legal semantic parser or drug names for a medical parser) in the document, then wraps XML content around these terms. OpenCalais (Reuters) is perhaps the largest such effort, but document enrichment has also spawned a lively industry of start-ups and services.
These matched terms are, of course, category terms, with the wrappers identifying the category. This does more than just keyword matching – most such systems actually attempt to build some formal context for the words and then map to the closest category in the semantic engine. Thus, such engines are capable of distinguishing between Paris, France, Paris the Trojan, Paris Hilton and Plaster of Paris.
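As a rough illustration of context-sensitive matching, the toy Python sketch below picks a sense for “Paris” by counting cue words that appear elsewhere in the text, then wraps the term in markup naming the category. Real semantic engines use far richer models than this; the cue-word lists, the `span` markup and the function names `pick_sense` and `enrich` are all assumptions made for the example.

```python
import re

# Hypothetical sense inventory: each sense has a category, a label and cue words.
SENSES = {
    "Paris": [
        ("place", "Paris, France", {"france", "seine", "eiffel", "louvre"}),
        ("person", "Paris Hilton", {"hilton", "heiress", "celebrity"}),
        ("myth", "Paris of Troy", {"troy", "helen", "trojan"}),
    ],
}


def pick_sense(term, text):
    """Return the (category, label) whose cue words best match the text, or None."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    best, best_score = None, 0
    for category, label, cues in SENSES[term]:
        score = len(cues & words)
        if score > best_score:
            best, best_score = (category, label), score
    return best


def enrich(text):
    """Wrap each recognised term in a span that records its category."""
    enriched = text
    for term in SENSES:
        sense = pick_sense(term, text)
        if sense is None:
            continue  # no supporting context, so leave the term untagged
        category, label = sense
        markup = f'<span class="{category}" title="{label}">{term}</span>'
        pattern = rf"\b{re.escape(term)}\b"
        # A function replacement avoids backslash processing in the markup.
        enriched = re.sub(pattern, lambda match: markup, enriched)
    return enriched


print(enrich("The mayor of Paris spoke beside the Seine."))
print(enrich("Helen left with Paris for Troy."))
print(enrich("Plaster of Paris sets quickly."))
```

In this sketch the third sentence is left untouched because none of the cue words appear, which is the conservative behaviour you would want from a tagger that is unsure.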
Document enrichment by itself is of only moderate utility, since you are simply adding attributes to HTML elements to identify the category of a given word. With CSS, for instance, you could highlight the matched terms by category, visually distinguishing place names from personal names. However, such enrichment gains more power when these XML documents are processed afterwards: you can pull categories out and add them to a general list of categories for the resource in question, or you could create links to specific content such as Wikipedia or the online Physicians’ Desk Reference (the PDR). In other words, combining enrichment with a repository of content can both create an alternative mechanism for navigation and make the given page easier to find with a more localized search engine.
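Pulling the categories back out of an enriched document might look something like the following Python sketch, which uses the standard library HTML parser. It assumes, purely for illustration, that the enrichment step marked each term with a `span` element whose `class` attribute names the category; real enrichment services use their own markup, and nested spans are not handled here.

```python
from html.parser import HTMLParser


class CategoryCollector(HTMLParser):
    """Collect (category, term) pairs from span elements carrying a class."""

    def __init__(self):
        super().__init__()
        self.found = []        # pairs in document order
        self._current = None   # category of the span being read, if any
        self._buffer = []      # text seen inside that span

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            category = dict(attrs).get("class")
            if category:
                self._current = category
                self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self.found.append((self._current, "".join(self._buffer)))
            self._current = None


def extract_categories(markup):
    """Return the list of (category, term) pairs found in the markup."""
    parser = CategoryCollector()
    parser.feed(markup)
    parser.close()
    return parser.found


# Hypothetical enriched fragment.
sample = (
    '<p>The <span class="team">Canucks</span> play in '
    '<span class="place">Vancouver</span> tonight.</p>'
)
print(extract_categories(sample))
```

The resulting pairs are exactly what a record needs in order to be added to a category list, linked to a reference source, or indexed by a local search engine.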
Three Types of Formats
There are currently three types of formats used for document enrichment. The first is essentially a proprietary or ad hoc standard, in which the DE vendor provides a formal taxonomy system and a method for embedding the formats within the initial text sample. The next approach (and one that is actually diminishing in use) is that of microformats: using an agreed upon standard taxonomy for certain domains, such as document publishing (Dublin Core), friendship relationships (Friend of a Friend, or FOAF), address books (vCard), geocoding information (geo) and so forth. The problem with microformats is that they don’t always work well in conjunction, and there’s no way of encoding deeper relational information via most microformats.
This latter issue lies at the heart of the Resource Description Framework in Attributes, or RDFa, which makes it possible to encode relational information about different resources and resource links. RDFa is actually a variant of the much older W3C RDF language first formulated in 1999, then significantly updated in 2004. With RDFa, you can define multiple distinct categories (also known as name-spaces) with terms in each category. You can also establish relationships between different parts of a document by using RDF terminology, for instance indicating that a given introductory paragraph provides a good abstract summary “about” the document (or portion of a document) in question. There’s even a specialized language called GRDDL that can take an RDFa encoded document and generate a corresponding RDF document. While comparatively few document enrichment companies have RDFa products on the market, many are moving in that direction, with organizations such as the BBC, NBC News, Time Inc. and Huffington Post among many others now exploring RDFa as a means of encoding such categorization information in the stories that are posted online.
RDF and RDFa are two of a set of related standards that form what is becoming known as the Semantic Web. The idea behind the Semantic Web was first proposed in an article by Tim Berners-Lee for Scientific American in 2001, and the W3C took up the idea as a formal activity by 2005. A related language, the Web Ontology Language (abbreviated as OWL, for some bizarre reason) describes more complex “business objects” and relationships between these objects, to the extent that many people who work heavily in the Semantic Web tend to talk about RDF/OWL as a single concept. The latter, however, seems to be more focused on designing rules of relationships rather than the relationships themselves.
In 2009, Tim Berners-Lee also introduced a new concept called Linked Data. The fundamental idea behind Linked Data is the concept that data object stores (such as a site’s news content or the contents of Wikipedia or similar super encyclopedia) could be represented as RDF entities, and then, via a specialized query language called SPARQL, queries could be made on these repositories as if they were databases (which in most cases they are). Keep in mind here that what’s being retrieved is the categorization and abstract information about this content, and from there the content itself can be retrieved for additional processing (possibly in conjunction with a language like XQuery, which is better for working with the metadata within a document itself).
Thus, if you were to create a repository of news and biographical information, each encoded with RDFa via a document enrichment facility, then a specialized SPARQL query engine could convert the RDFa into RDF (each of which has a link to the initial resource) and perform queries upon that RDF to retrieve categorization information – articles about the president of the United States or the prime minister of Canada, articles about financial services written in 2009 that focused on credit default swaps, articles focusing on how well rookie draft choices did in the NHL in 2010 and so forth.
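The flavor of such a query can be suggested without a full SPARQL engine. The Python sketch below is a stand-in, not SPARQL and not any real RDF library: it stores statements as plain (subject, predicate, object) tuples and answers one question by matching patterns and intersecting the results. The story URLs, the `dc:subject` and `dc:date` predicate labels, and the function names `match` and `about_in_year` are assumptions for illustration.

```python
# Hypothetical statements of the kind that might be extracted from RDFa.
triples = [
    ("http://example.org/story/1", "dc:subject", "credit default swaps"),
    ("http://example.org/story/1", "dc:date", "2009-03-14"),
    ("http://example.org/story/2", "dc:subject", "credit default swaps"),
    ("http://example.org/story/2", "dc:date", "2010-01-05"),
    ("http://example.org/story/3", "dc:subject", "NHL draft"),
    ("http://example.org/story/3", "dc:date", "2010-06-26"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every part of the pattern given."""
    return [
        (s, p, o) for s, p, o in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


def about_in_year(triples, topic, year):
    """Return story URLs tagged with the topic and dated in the given year."""
    on_topic = {s for s, _, _ in match(triples, predicate="dc:subject", obj=topic)}
    in_year = {
        s for s, _, o in match(triples, predicate="dc:date")
        if o.startswith(f"{year}-")
    }
    # Joining two patterns on the shared subject is the heart of such a query.
    return sorted(on_topic & in_year)


print(about_in_year(triples, "credit default swaps", 2009))
```

Note that the query returns links rather than articles; fetching and processing the article content itself is a separate step.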
What’s more, because all of these use the same underlying mechanism for storing and manipulating such information, it becomes possible to work across multiple repositories (such as Wikipedia, Data.gov and the National Archives) with the same basic query, then pass these link sets back to some other process (such as XQuery) for individual processing. As such, Linked Data can effectively treat the entire web (or at least that part of the web that is currently using Linked Data methodologies) as a database.
This may become a powerful adjunct (or potentially even a replacement) to a text indexing service such as Google or Bing (to the extent that Google and Microsoft both are investing heavily in Semantic Web research). It has significant applicability to advertising (being able to determine more intelligently what a given article or resource is about makes it far easier to target advertising for the consumer of that resource) as well as having applicability for the analyst. For instance, if you have something like a financial report written using the eXtensible Business Reporting Language (or XBRL), an RDFa encoding of this information, mapped into a Linked Data repository, makes it possible to group such financial reports or SEC filings by industry, market cap or employment size (among other possible measures) without having to specifically know initially which reports (and by extension which companies) are already within these groups, making apples to apples comparisons considerably easier.
These are progressive technologies: XML technologies are now about a decade old, while XQuery and XML database tools are just now really becoming mainstream. Semantic Web technologies are beginning to emerge, but widespread adoption is likely to still be two to three years out. However, publishing and journalism are definitely at the forefront of that curve, because these areas in particular are most sensitive to the need both to provide enjoyable news content and to make such stories manipulable and discoverable within the ever increasing sophistication and scope of the web itself. The narrative thread has become a rich, interwoven tapestry, illuminated by brilliant strands of meaning, semantics and abstraction, turning our writings into conversations, and from there into dialogs. It’s a good time to be a journalist.