

The Art of Narrative and the Semantic Web

As the Internet continues to evolve, Semantic Web technologies are beginning to emerge, but widespread adoption is likely to still be two to three years out.

The art of the narrative is one of the strongest threads running through our society and culture, and is in many respects one of the defining traits of humanity. "The story" is more than just a recitation of facts or assertions (whether real or otherwise). A good story is experiential. It puts each of us as listeners into the narrator's world and frame of mind, and lets us live, vicariously, through the experiences that the narrator had or conceived. In many cases we identify with the protagonist, whether the story is an epic fantasy journey through lost worlds, a sports article about the clash between two rival football teams, or the reportage of a major political event. We read meaning into these narratives at many levels, from the bald statement of fact to the subtle interplay of analysis, implication, innuendo and metaphor, and it is the richness of these metaphors that gives meaning to the work.

Librarians, historians and archivists have spent centuries engaged in an increasingly challenging task. A story, once written, forms a strand of a larger cultural narrative -- but only if that story is placed into an overall context and is preserved within that context. The job of a librarian or archivist is traditionally to assign such a story to a classification system -- which in turn makes it locatable within an archive -- as well as to abstract from that story enough metadata to describe what the story is about. Anyone who has spent days poring through newspaper collections trying to pull together enough information to write a report or make a recommendation can tell you the importance of a good classification system (and good abstractions), or better yet, of multiple classification systems that make it easier to triangulate on that information from multiple directions at once.

So long as the information itself was located in books, newspapers, magazines, annual reports and other publications, the archivist's job was manageable, albeit barely. A card catalog is, when you think about it, a rather impressive achievement, given the inherent difficulties in setting up sufficiently comprehensive classification taxonomies and then assigning a given resource to that taxonomy. Yet a great deal of information still disappeared through the cracks: materials not classified properly or completely, resources that changed over time, resources moved from one location to another without the change being noted. It's perhaps no wonder that librarians are stressed out all the time.

Once such narratives made the jump to the web, however, the whole house of library cards promptly collapsed. At first, the best key to finding a resource on the web was to create a memorable URL, which in turn set off a massive land rush for domain names. The thinking was that a domain name was, in essence, a brand, and that a noteworthy brand would translate into lots of web traffic. In the early days of the web that was probably true, but as the number of such domains grew and as more and more interesting material started appearing "off-label", this particular classification system became decidedly secondary.

The next system was to build directories -- after all, this recreated the kind of taxonomy that most people were used to in everything from yellow pages to libraries. Here again, however, the volume of new material proved this form of classification's undoing: there was no clean mechanism for asserting that a particular document belongs under a given classification term (indeed, it may belong under several simultaneously -- or not fit cleanly into any bucket in the taxonomy). Moreover, the decision to put a site into such directories typically required human intervention, which meant that once the novelty of such directories began to pall, they fell ever further out of date and represented a smaller and smaller percentage of the total content.

Search engines such as AltaVista, Yahoo, and ultimately Google and Bing represented the next stage of classification, employing a process whereby links within web content were routinely spidered in order to build indexes. Indexing is a powerful tool because, while there are potentially huge variations in search terms, once you create an index phrase you can effectively store multiple links to that same phrase, ordered by frequency of reference and other relevancy measures. When another person enters the same phrase, there is already an extant cache of saved entries that can be served up very quickly, rather than the search having to be performed over the entire set of billions of entries. The search phrases become the classification terms.
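The mechanism described above can be sketched as a toy inverted index with a query cache. Everything here is illustrative: the document IDs, the contents, and the scoring rule (raw term frequency) are assumptions made for clarity, not a description of how any real search engine ranks results.

```python
from collections import defaultdict

class TinyIndex:
    """A minimal sketch of an inverted index with a query-result cache."""

    def __init__(self):
        self.postings = defaultdict(list)  # term -> [(doc_id, freq), ...]
        self.cache = {}                    # query phrase -> ranked doc ids

    def add_document(self, doc_id, text):
        counts = defaultdict(int)
        for term in text.lower().split():
            counts[term] += 1
        for term, freq in counts.items():
            self.postings[term].append((doc_id, freq))
        # New content invalidates previously cached results.
        self.cache.clear()

    def search(self, phrase):
        phrase = phrase.lower()
        if phrase in self.cache:
            # Serve the extant cache of saved entries.
            return self.cache[phrase]
        scores = defaultdict(int)
        for term in phrase.split():
            for doc_id, freq in self.postings.get(term, []):
                scores[doc_id] += freq
        ranked = sorted(scores, key=scores.get, reverse=True)
        self.cache[phrase] = ranked
        return ranked

idx = TinyIndex()
idx.add_document("a", "semantic web technologies for the web")
idx.add_document("b", "card catalog classification")
results = idx.search("semantic web")  # first lookup builds the cache entry
```

The key property is the one the article identifies: the expensive scan over postings happens once per phrase, after which the phrase itself acts as a classification term keyed to a stored result.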

Challenges From Many Directions

Even this system is being sorely challenged, and the challenges are coming from a variety of directions. Indexing that much content is extraordinarily processor intensive, even when spread across multiple processes, and needs to be repeated frequently in order to remain even relatively current. This translates into a massive investment in computer systems -- Google, for instance, had roughly 450,000 servers in 2006, and the number may be close to one million as of early 2010 -- and into corresponding issues of energy and heat management (current estimates place Google's total power demand at 35-40 MW, the power draw of a moderate-sized city).

However, another problem is more troublesome. The web as originally conceived was largely static -- web content, once posted, usually didn't change significantly. By 2010, however, the vast majority of content developed on the web falls more properly into the realm of messages rather than documents: Facebook and Twitter notifications, resources generated from rapidly changing databases, documents in which narrative content is embedded within larger data structures, RSS and Atom feeds, KML documents (used, ironically, by Google Earth and Google Maps) and so forth. Thus, a URL no longer points to a static narrative -- it points to a constantly changing message. By the time that URL's content is indexed, the new content at the URL may have no bearing on the original.

This problem is exacerbated by a growing trend toward the deployment of data as a service. In this case data objects (what have traditionally been called business objects) are represented as structured data in XML or JSON, but have many of the same characteristics as more traditional web content: a distinct URL, a clear representation in at least one and potentially many different forms, access via HTTP methods and so forth. They may have long-term persistence, but may just as readily be ephemeral, and as such they take on many of the properties of messages as well.
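To make the "one object, many representations" idea concrete, here is a minimal sketch of a single data object rendered as both JSON and XML. The URL, field names, and structure are hypothetical examples, not taken from any real service.

```python
import json
import xml.etree.ElementTree as ET

# A hypothetical business object with a distinct URL as its identity.
order = {
    "id": "http://example.com/orders/1021",
    "status": "shipped",
    "total": 49.95,
}

# JSON representation (e.g. served with Content-Type: application/json).
as_json = json.dumps(order)

# XML representation of the same underlying object.
root = ET.Element("order", id=order["id"])
ET.SubElement(root, "status").text = order["status"]
ET.SubElement(root, "total").text = str(order["total"])
as_xml = ET.tostring(root, encoding="unicode")
```

The point is that neither serialization is "the" object; both are facets of the same resource, addressable at the same URL, which is precisely what makes such objects behave like web content and like messages at once.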

If you look at the history of the build-out of the web, one thing that becomes apparent is that as a new modality of communication comes online, people invariably try to find a real world analog, then use this analog to create a metaphor for interfaces used to interact with that modality. In time, however, the metaphor leaks as the differences between the real and virtual world representations force differences in thinking about such data objects (or, perhaps to illustrate the point, thinking about document objects). By then, the next generation of people working with these document objects understand the distinctions between the virtual objects and the real world analog, and no longer need the metaphor -- it in fact becomes a crutch.

A good case in point is the confusion concerning web page design when web content first emerged in the mid-1990s. The earliest web designers tended to treat such pages as research papers, because the earliest users were academics. At various points web pages were designed as if they were newspaper articles, magazine articles, TV screens and computer programs, but by the early 2000s a formal "methodology" of web design had emerged that recognized that a web page was a web page, with its own rules of design.

Similarly, in the latter part of that decade, XML's use as a messaging format, a document format, a feed format, a record format and a data format began to blur the distinctions between these, because each of them is simply a metaphor that we pull in from the "real" world for the entities we've created. In practice there's very little actual difference between them -- they are just different facets of the same basic underlying representation.
