Build Knowledge Through a Taxonomy of Classifications : Page 4

Learn the natural progression of classification systems and their associated taxonomies to help users find information and manage resources.

Syndication as Classification
There is another form of classification, one based not on vocabularies but on time. When users come to your site, chances are pretty good that what they are seeking is novelty. This doesn't mean that they're looking for whoopee cushions and joy buzzers (unless it's a site about gag gifts). Rather, the information that people look for is generally biased heavily toward whatever is new or different.

One of the things that many companies and organizations discovered very quickly in the early years of the web was that people didn't go to a site if the content never changed, no matter how tasteful (or tasteless, for that matter) the content was. If your content didn't visibly change with every visit, people would absorb the content once, maybe note that nothing's changed the second time they visited, and then would never return.

This behavior is one of the major reasons why syndication feeds are becoming the dominant way that people get news and why visits to web sites not driven by syndication are drying up. An RSS or Atom feed is the ultimate expression of organizing information by time. It is a bundle of content that gives enough of a synopsis of each recent article (that is, an abstract) to convey a sense of the content, perhaps with an associated set of keywords tied in through some other taxonomy, together with links to the full articles for a more comprehensive "view." You can think of such organizational schemes as syndication taxonomies.
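
As a minimal sketch of that idea, the Python below reads a small Atom feed and lists its entries newest first, each with its abstract, keywords, and link. The sample feed, its URLs, and the function name read_entries are invented for illustration; a production reader would also need to handle missing elements and other date formats.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Namespace of the Atom Syndication Format.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical feed invented for this example.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Feed</title>
  <entry>
    <title>Older article</title>
    <link href="http://example.org/older"/>
    <updated>2009-01-10T12:00:00+00:00</updated>
    <summary>Abstract of the older article.</summary>
    <category term="xquery"/>
  </entry>
  <entry>
    <title>Newer article</title>
    <link href="http://example.org/newer"/>
    <updated>2009-02-01T08:30:00+00:00</updated>
    <summary>Abstract of the newer article.</summary>
    <category term="semantic web"/>
  </entry>
</feed>"""


def read_entries(feed_xml):
    """Return the feed's entries as dicts, sorted newest first."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM):
        entries.append({
            "title": entry.findtext("atom:title", default="", namespaces=ATOM),
            "summary": entry.findtext("atom:summary", default="", namespaces=ATOM),
            # Assumes every entry carries a link and an updated timestamp.
            "link": entry.find("atom:link", ATOM).get("href"),
            "updated": datetime.fromisoformat(
                entry.findtext("atom:updated", namespaces=ATOM)),
            "keywords": [c.get("term")
                         for c in entry.findall("atom:category", ATOM)],
        })
    # Time is the organizing principle: most recent entries come first.
    return sorted(entries, key=lambda e: e["updated"], reverse=True)


for item in read_entries(SAMPLE_FEED):
    print(item["updated"].date(), item["title"], item["keywords"], item["link"])
```

The category terms are where a feed can connect to a vocabulary-based taxonomy, while the updated timestamps supply the ordering by time.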

Syndication taxonomies bring an interesting added dimension to the organization of your site for a few important reasons. First, you can "aggregate" other news feeds so that they all fall within a single taxonomic term. For instance, blogs from a number of different writers who all write about the semantic web could be aggregated into a single Semantic Web category.

Beyond that, it is possible to take syndicated content from many sites and apply filters to extract links only to articles that fit a certain category. For instance, on the XML News Network site, you'll find a section where employment listings that include such keywords as "semantic web," "XForms," "XQuery," and so forth are filtered from several large job boards. While clearly indicating the origin of these resources is important, this filtering makes it possible to provide a single point where people can look for jobs in a specific category. The site can also provide an Atom feed of this same content so that others can see the same lists within their own RSS news readers (or incorporate them into their own web sites).
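
The aggregation and filtering described here could be sketched in Python roughly as follows. The feed names, entries, and the aggregate function are hypothetical, and the entries are assumed to be already parsed into dictionaries; this is not how the XML News Network site is actually implemented. Note that each match keeps a "source" field so the origin of the listing stays visible.

```python
from datetime import date

# Hypothetical job-board feeds, already parsed into plain dicts.
FEEDS = {
    "Job Board A": [
        {"title": "XQuery developer", "summary": "Build XQuery services.",
         "link": "http://a.example/1", "updated": date(2009, 2, 1)},
        {"title": "Office manager", "summary": "Run the front desk.",
         "link": "http://a.example/2", "updated": date(2009, 2, 3)},
    ],
    "Job Board B": [
        {"title": "Ontologist", "summary": "Semantic Web modeling with RDF.",
         "link": "http://b.example/7", "updated": date(2009, 2, 2)},
    ],
}

KEYWORDS = ["semantic web", "XForms", "XQuery"]


def aggregate(feeds, keywords):
    """Merge several feeds, keeping only entries that mention a keyword."""
    wanted = [k.lower() for k in keywords]
    matches = []
    for source, entries in feeds.items():
        for entry in entries:
            # Case-insensitive substring match on title and abstract.
            text = (entry["title"] + " " + entry["summary"]).lower()
            if any(k in text for k in wanted):
                # Record where the listing came from.
                matches.append({**entry, "source": source})
    return sorted(matches, key=lambda e: e["updated"], reverse=True)


for job in aggregate(FEEDS, KEYWORDS):
    print(job["updated"], job["title"], "from", job["source"])
```

The merged list could then be written back out as a feed of its own, which is what lets other sites and news readers pick up the same category.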

The Responsibility of the Semantic Web
The semantic web has garnered a great deal of press recently, and to a great extent much of this attention focuses on one of the principal goals that Tim Berners-Lee set a number of years ago: to make it possible for machines to understand one another without existing in islands of proprietary formats and standards.

This goal is both noble and necessary, but all too often the practitioners of semantic web technologies have lost sight of the fact that understanding is still, fundamentally, an aspect of human cognition. Classification and search engines illustrate this problem clearly. The role of an indexer in a traditional search engine is to create what amounts to a computer-generated abstraction layer that attempts to provide relevance based on the lexical content of the article.

The problem with this approach comes from the fact that such lexical content may not mention the topic under discussion, or it may mention it in such a way that the relevant terms occur comparatively seldom. Moreover, lexical analysis can be an expensive proposition, as the number of synonyms and relationships tends to increase geometrically with the number of terms described, and not all terms are limited to a single word. This problem is well known to makers of speech-to-text systems, who have to turn tokens associated with probabilistic interpretations of waveforms into text content.

A second problem that emerges in search engines is that they are geared fundamentally toward HTML documents. If you have an XML document such as an invoice or résumé online, the fact that it is an invoice or résumé is not at all obvious if you are looking only at text content. Typically, such XML doesn't contain the relevant metatags for generating metadata (in no small part because such metadata should be determinable from the element namespace and subordinate XML element names). This determination will become more important as the number of XML documents on the web increases.
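
To illustrate how a document type might be determined from the element namespace and element name rather than from text content, here is a small Python sketch. The namespace URIs, the DOCUMENT_TYPES table, and the classify function are assumptions made up for this example; real invoice or résumé vocabularies define their own namespaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical (namespace, root element) pairs mapped to a document type.
DOCUMENT_TYPES = {
    ("http://example.org/ns/invoice", "invoice"): "invoice",
    ("http://example.org/ns/resume", "resume"): "resume",
}


def classify(xml_text):
    """Classify an XML document by its root element's namespace and name."""
    root = ET.fromstring(xml_text)
    # ElementTree writes qualified names as "{namespace}localname".
    if root.tag.startswith("{"):
        namespace, local_name = root.tag[1:].split("}", 1)
    else:
        namespace, local_name = "", root.tag
    return DOCUMENT_TYPES.get((namespace, local_name), "unknown")


doc = ('<invoice xmlns="http://example.org/ns/invoice">'
       '<total>42.00</total></invoice>')
print(classify(doc))
```

Nothing in the text content of the sample says "invoice"; the classification comes entirely from the markup.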

A further, and more vexing, aspect of any metadata system is that the metadata associated with a given site can often prove to be larger or more complex than the system itself. This issue is one reason why creating RDF documents for describing web pages tends to be a nonstarter: the time to create and maintain such documents can very quickly exceed the time available to work on a given site and consume an inordinate amount of resources. (Google's energy demands have grown so large that they are now seriously impacting the energy available to the rest of the San Francisco area.) For this reason, the best systems for maintaining any form of categorization should be those where the documents themselves contain just enough metadata to be self-describing. This approach is, and should be, the semantic web's next major challenge.

Semantics—meaning—is tied fundamentally into classification, and yet one of the more relevant points about most taxonomic systems is that once you have identified a term, you typically also have an action that needs to be performed—for example, clicking a taxonomy term will bring you to a page showing all documents relevant to that term, or encountering a term in a process shuttles the content being described into processor B instead of processor A. In other words, many taxonomies are also intentional systems. Such intent, however, cannot be necessarily inferred from the semantics, which is part of the reason why it is so difficult to make true natural-language computer systems.
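
One way to picture a taxonomy as an intentional system is a lookup table that binds each term to an action. The Python sketch below is only an illustration under that assumption; the term "invoice" and the two processor functions are hypothetical stand-ins for processors A and B.

```python
def route_to_processor_a(document):
    """Default handling for content whose term has no special action."""
    return f"processor A handled {document}"


def route_to_processor_b(document):
    """Special handling triggered by particular taxonomy terms."""
    return f"processor B handled {document}"


# Hypothetical binding of taxonomy terms to actions. The intent lives in
# this table, not in the meaning of the terms themselves.
ACTIONS = {
    "invoice": route_to_processor_b,
}


def dispatch(term, document):
    """Run the action bound to a term, falling back to processor A."""
    action = ACTIONS.get(term, route_to_processor_a)
    return action(document)


print(dispatch("invoice", "doc-1"))
print(dispatch("memo", "doc-2"))
```

The table has to be written by someone who knows what should happen for each term, which is the sense in which intent cannot simply be inferred from the semantics.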

Languages such as RDFa (RDF in attributes) may be able to help solve this particular conundrum. RDFa works by using attributes within a given document (HTML or XML) that are clearly tied to the underlying metadata of the content. RDFa is an outgrowth of a movement to create microformats within HTML documents to provide descriptions of certain types of content, and many of the more sophisticated tagging concepts used by sites such as Flickr, Digg, Technorati, Del.icio.us, and others ultimately depend on semiautomated systems that introduce such tags into web content as attribute values. Unfortunately, microformats are limited to a small set of taxonomies that generally favor social networking systems.
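
To make the attribute-based approach concrete, the Python sketch below pulls property/value pairs out of a fragment that uses RDFa-style "property" and "content" attributes. It is a deliberately simplified illustration, not a conforming RDFa processor: it ignores other RDFa attributes such as "about" and "typeof", does not resolve prefixes, and assumes that elements carrying a property contain no nested markup. The sample fragment and the names PropertyCollector and extract_properties are invented for this example.

```python
from html.parser import HTMLParser

# Hypothetical fragment using Dublin Core terms as property names.
SAMPLE = """<div xmlns:dc="http://purl.org/dc/elements/1.1/">
  <h1 property="dc:title">Example Article</h1>
  <span property="dc:creator">A. Author</span>
  <meta property="dc:subject" content="semantic web" />
</div>"""


class PropertyCollector(HTMLParser):
    """Collect (property, value) pairs from RDFa-style attributes."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._pending = None  # property whose element text is being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        prop = attributes.get("property")
        if prop is None:
            return
        if "content" in attributes:
            # An explicit content attribute supplies the value directly.
            self.pairs.append((prop, attributes["content"]))
        else:
            # Otherwise the element's text is the value.
            self._pending = prop
            self._text = []

    def handle_data(self, data):
        if self._pending is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._pending is not None:
            self.pairs.append((self._pending, "".join(self._text).strip()))
            self._pending = None


def extract_properties(html):
    collector = PropertyCollector()
    collector.feed(html)
    collector.close()
    return collector.pairs


for name, value in extract_properties(SAMPLE):
    print(name, "=", value)
```

The point of the sketch is that the metadata travels inside the document itself, so any RDFa-aware program can read it without a separate RDF file.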

The approach that RDFa has taken is to provide a framework that makes it possible for people to use these taxonomies if they want to, but doesn't limit them just to these taxonomies. And yet RDFa still makes it possible for an RDFa-aware program to read and process the document itself. It is likely that the recent W3C HTML 5 activity will incorporate RDFa as a core part of HTML 5, such that it becomes possible to link not just to a document but to a particular section or phrase in the document and assign to that a classification that doesn't necessarily have to depend on the vagaries of a given search engine processor.

Not only will this solution serve to make the metadata about a document (and its parts) far more fine-grained and referenceable, but it also will make it easier to turn such classification into a relevant navigational system, regardless of the type of the document.

More research is needed to turn RDF relationships into navigable paths of their own. In a world of static documents and static links, navigation can be defined straightforwardly as the process of replacing the old content with new content, but there are more relationships than simply, "I am a link...follow me." Relationships shape the classification space (especially in dynamically generated documents where the primary mechanism of navigation is search). As such, the relationship matrix introduced by the semantic web will necessarily require that we think about the user interfaces that expose these relationships and ask to what degree we can or should introduce intent into programmatic semantics.

Classification ultimately affects anyone who works with the web, whether the person is a producer of content, a web developer, a webmaster, or a user trying to find relevant information. Understanding the natural progression of such classification systems and their associated taxonomies can go a long way toward both finding the information that you need and managing the resources that you have. Ultimately, we need those nice, new, semantically enabled web systems to be able to answer not, "What's it all about, really?" but, "How can I help you?"

Kurt Cagle is the managing editor for XMLToday.org and a contributing editor for O'Reilly Media. He is currently working on a book about XBRL. Follow him on Twitter at twitter.com/kurt_cagle.