The dilemma is common. Your web site starts out by design with a few pages of content (a front page, a products page, perhaps something about the company or person putting up the site), and you can generally manage the various links on the site by hand. However, as the site needs to do more (and be more to its audience), its organization begins to unravel. You’re soon spending all your time trying, rather desperately in the end, to stay on top of all the links being added and changed, and your nice, clean navigation system needs a higher level of organization. Single links become primary menus and subordinate submenus. Your user base begins to complain about how difficult it is to navigate around your site, and the number of bad links grows alarmingly.
You need a content management system (CMS), right? Well, yes, to a point. A CMS makes it a little easier to add resources and does a certain level of autoclassification, but at some point you realize that you’re reaching a critical point where people aren’t visiting vital places on your site, and you’re spending all your time in triage mode.
As if that scenario isn’t enough, your company’s marketing director wants you to start developing a community for the site that will be able to post new resources: blogs, pictures, and video links. Soon you find yourself spending your time vetting images and trying to place them in the right folders, while you and your team struggle to keep up with the work of maintaining the site’s navigation. You haven’t seen your wife in far too long, and your child, who was taking baby steps last week, has just graduated from college.
You’re suffering from classification overload. This very common ailment strikes webmasters in particular, because they are frequently the ones tasked with building site navigation. Yet sometimes it’s worth asking what exactly the users are navigating. It isn’t really the site itself, except perhaps at the most mechanical layer. Instead, when users come to your site, what they are doing is navigating the information space that you have created. They’re seeking categories and topics that are germane to their own interests, and in great part the degree to which they can move from one topic to a related one (for some arbitrary definition of related) determines how difficult it is for them to navigate the site.
Figure 1. Linear Taxonomy: The Rational Link associates an action with a given term in the taxonomy.
Descending the Pyramid: Hierarchical Classifications
Classification is an indispensable part of the process of building knowledge. As such, classification systems have long been associated with areas such as library science or biology, but ultimately classification is a critical part of the way that we think about most things. To understand something new and to be able to extract both the similarities and the differences, you have to have some way of comparing it with things that are like it. If you have no basis for comparison, then making the leap to understanding is much harder, if not impossible. Such comparison is why writers create exemplars or why science instruction teaches the solutions to the easy problems first. Without the ability to recognize that a new problem is similar to an old problem (species, economic model, and so on), the problem becomes intractable.
The problem is that classification is, intrinsically, metadata; it is information about a given domain, rather than information that is inherently within that domain. This characteristic means, as one consequence, that there is no one definitive classification system (or taxonomy) that is exclusive to any one piece of information.
An object may have a single label in a linear taxonomy (see Figure 1); for example, calling a small, carnivorous quadruped with fur, a long tail, and a distressing tendency to leave birds on your doorstep a cat is one form of classification. Such a creature can also be called a member of the family Felidae, of the order Carnivora, of the class Mammalia, of the phylum Chordata, and of the kingdom Animalia. These membership labels describe groups of increasing size, of which cats are members, and in this particular case each classification is also a member of the next, higher classification. This classification is thus referred to (or is classified) as a hierarchical taxonomy (see Figure 2).
Figure 2. Hierarchical Taxonomy: Parent-child relationships fold the taxonomy, implying scales of categorization. In some taxonomies the containment relationship migrates other actions; in others, it’s possible for parents to maintain their own distinct actions in addition to being containers.
Hierarchies serve to assign common properties to the abstraction at the cost of detailed information from any given term in that taxonomy. Telling you that a cat is a chordate, for instance, informs you that it has a spinal cord that generally runs above or behind the primary organs (in opposition to gravity, generally). Because Phylum Chordata is also embedded within Kingdom Animalia, a cat also inherits those things common to animals: they are unable to generate energy from the sun and as a consequence must derive energy indirectly through the consumption of other living organisms, they are built using organic carbon compounds, and generally they are motile rather than sessile.
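A small sketch may make the inheritance idea concrete. The `PARENT` and `PROPERTIES` tables below are hypothetical and heavily abbreviated; they only illustrate how a term picks up the properties of every rank above it.

```python
# Hypothetical parent links and per-rank properties, for illustration only.
PARENT = {
    "cat": "Felidae",
    "Felidae": "Carnivora",
    "Carnivora": "Mammalia",
    "Mammalia": "Chordata",
    "Chordata": "Animalia",
}

PROPERTIES = {
    "Chordata": {"has a spinal cord"},
    "Animalia": {"consumes other organisms for energy", "generally motile"},
}


def lineage(term):
    """Return the term followed by each of its ancestors, nearest first."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def inherited_properties(term):
    """Collect the properties of a term and of every ancestor above it."""
    props = set()
    for rank in lineage(term):
        props |= PROPERTIES.get(rank, set())
    return props


print(lineage("cat"))
print(sorted(inherited_properties("cat")))
```

Anything else placed under Chordata in this table would return exactly the same property set as "cat", because the sketch stores nothing at the lower ranks. The abstraction supplies shared properties but no detail about any one member.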
However, knowing that a cat is a chordate doesn’t tell you anything to differentiate a cat from a dog or a dinosaur or a carp, which illustrates one of the flaws of hierarchies from a navigational standpoint. The criterion for membership within that classification must be clearly articulated, and should a given definition be established as erroneous, the potential number of members so affected by the classification can be huge (see the sidebar, “Evolving Biological Classifications”).
In the design and development of web sites, the shift to a hierarchy usually starts innocently enough. One of a linear list of pages starts to become a table of contents for other pages in that particular category, and then another linear page gains children, and so on. At some point, this implicit folding gets incorporated into the navigation system. Yet, as the number of items increases, the level of folding within the categorization structure itself increases as well, and what had been a fairly simple, two-tier system begins to become bushy. Note that in pure folder/file structures the categorization typically mirrors the physical arrangement of files in the file system; though, as more of the web is being generated through server processes, that system is beginning to be overtaken by more sophisticated taxonomic schemes.
The third form of classification is network taxonomy (see Figure 3). In this form of classification terms in the taxonomy are defined by a set of keywords. Two objects that have the same term are connected on that term. In the simplest case where each object can have only one term, this classification degenerates into a linear taxonomy. On the other hand, if an object can have multiple terms associated with it, the object becomes a node in a network of taxonomic terms, creating ad hoc “definitions” where clusters of objects have related terms. For instance, a cat may have the taxonomic terms (or keywords) “furry,” “carnivore,” “quadruped,” “tailed,” and so forth; a dog may have the same keywords and, within the extent of the taxonomy, may be considered as part of the same group. However, the clustering breaks down if the terms “says woof” and “says meow” are added to the taxonomy. (It’s worth noting here that a term is not necessarily a single word, as this example illustrates.)
Figure 3. Network (Cloud) Taxonomy: In a network (cloud) taxonomy the links among terms are defined less by containment or categorization and more by metaphoric similarity, though this similarity may be defined by frequency of association rather than synonym-based similarity. The boundaries between cloud taxonomies and search consequently can become very amorphous and fuzzy.
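As a rough sketch of how objects become connected through shared terms, consider the following. The `TERMS` table and both helper functions are hypothetical; the keyword sets simply echo the cat and dog example.

```python
from itertools import combinations

# Hypothetical objects and their keyword sets, for illustration only.
TERMS = {
    "cat": {"furry", "carnivore", "quadruped", "tailed"},
    "dog": {"furry", "carnivore", "quadruped", "tailed"},
    "parrot": {"feathered", "tailed"},
}


def shared_terms(terms):
    """Map each pair of objects to the set of terms they have in common."""
    links = {}
    for a, b in combinations(sorted(terms), 2):
        common = terms[a] & terms[b]
        if common:
            links[(a, b)] = common
    return links


def same_group(terms, a, b):
    """Two objects are indistinguishable when their term sets are identical."""
    return terms[a] == terms[b]


for pair, common in shared_terms(TERMS).items():
    print(pair, sorted(common))
```

With these four keywords, `same_group(TERMS, "cat", "dog")` is true. Adding “says meow” to one set and “says woof” to the other makes it false, even though the two objects remain linked on every other term.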
Of course, note also that such keyword taxonomies can be turned into hierarchies if a hierarchical name is defined as being the name of all objects that contain a set of keywords from the total networked space of keywords. Thus, an animal might be considered a member of “cat” if it has the keywords “mammal,” “meat-eater,” “quadrupedal,” “purrs,” and “fondness for dropping birds on doorsteps.” From a site organization standpoint, your “cat” section would then contain content where all articles have these keywords in common, while your “dog” section may be in a related part of the site (the one that has “mammal,” “meat-eater,” and “quadrupedal” as a higher-level organization), but one that is differentiated by having “barks” and “chases sticks.” Those articles that have neither (an article about a parrot, for instance) would then fall into the catch-all “other” categorization.
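One way to picture this keyword-to-section mapping is a short rule table. The sketch below is only an assumption about how such rules might be written; the `SECTIONS` list and the `section_for` function are hypothetical, and the keyword sets are shortened versions of the ones above.

```python
# Hypothetical section definitions: an article belongs to a section when it
# carries every keyword in that section's required set.
SECTIONS = [
    ("cat", {"mammal", "meat-eater", "quadrupedal", "purrs"}),
    ("dog", {"mammal", "meat-eater", "quadrupedal", "barks", "chases sticks"}),
]


def section_for(keywords):
    """Return the first section whose required keywords are all present."""
    keywords = set(keywords)
    for name, required in SECTIONS:
        if required <= keywords:  # subset test
            return name
    return "other"  # catch-all for articles that match no section


print(section_for({"mammal", "meat-eater", "quadrupedal", "purrs", "naps"}))
print(section_for({"bird", "talks", "tailed"}))
```

The subset test means an article may carry extra keywords and still land in a section; only a missing required keyword pushes it into “other.”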
Within networked taxonomies there are two important subdivisions (notice that even taxonomies can be categorized): controlled vocabularies and free vocabularies. Controlled vocabularies assume a fixed set of terms to choose from, usually set up by the taxonomist. This set ensures that the number of clusters of taxonomic terms remains comparatively low, making it possible to create hierarchical labels for such property sets. Many business vocabularies, such as the Universal Business Language (UBL), the Extensible Business Reporting Language (XBRL), Health Level 7 (HL7), and others, work on this principle of organization, not only enabling clustering but also allowing the cluster names themselves to remain within the overall taxonomy.
Free Vocabularies and Search
Free vocabularies, on the other hand, are so named because there is no fixed vocabulary of terms for organization. Rather, members of a community can introduce their own keywords on an ad hoc basis, and as such the organizational taxonomies become replaced with a taxonomic cloud of related terms. In some cases a taxonomist can assign synonyms that indicate two terms refer to the same meaning, which can help coalesce some meaning out of the cloud. Algorithms can be used to perform this task as well (finding the times where two terms are found together or are correlated with a third, for instance), but this option generally means that navigation shifts out of the hands of the site designer altogether and becomes a synthetic process performed through the actions of hundreds or thousands of community members.
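The algorithmic option mentioned above (counting how often two terms are found together) can be sketched in a few lines. This is a deliberately naive illustration; the function names, the `min_count` threshold, and the sample tags are all assumptions, and a production system would use far more robust statistics.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(tagged_items):
    """Count how often each pair of terms is applied to the same item."""
    counts = Counter()
    for tags in tagged_items:
        for pair in combinations(sorted(set(tags)), 2):
            counts[pair] += 1
    return counts


def related_terms(counts, term, min_count=2):
    """Return terms seen with `term` at least `min_count` times, most frequent first."""
    related = []
    for (a, b), n in counts.items():
        if n >= min_count and term in (a, b):
            related.append((b if a == term else a, n))
    return [t for t, _ in sorted(related, key=lambda item: (-item[1], item[0]))]


# Hypothetical tags supplied by community members for four images.
items = [
    ["cat", "kitten", "pet"],
    ["cat", "kitten"],
    ["cat", "pet"],
    ["dog", "pet"],
]
print(related_terms(cooccurrence(items), "cat"))
```

Nobody on the site team declared that “kitten” is related to “cat”; the relationship emerges from how members happened to tag their content, which is exactly the shift of control the paragraph above describes.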
Many social or community networking sites use free vocabularies extensively. Flickr, one of the first sites to use this concept effectively, managed to create a navigational/categorization system built both by searching for specific keywords and by building clouds of related keywords that users could click as links to other such clusters. The site requires a certain degree of contributor participation—users need to spend some time categorizing their images and media resources—but this effort can be rewarded dramatically as it increases the degree of network connectivity that the image has and consequently increases the exposure of that image as people follow that clustering.
Note that free vocabulary sites bear a certain surface similarity to search engines such as Google. The primary difference between them is that most search engines index on words within the content of a given article, while free vocabulary sites index on the editorialization of that same content by the user base. This search type can frequently provide more highly targeted contextual links than straight, brute-force text searches at a considerably smaller processing cost, especially if someone periodically develops synonymous relationships between terms to improve clustering.
Search is a fundamental operation on the web. For most people, search itself seems simple: you type a word, press a button, and a list of the most salient results is returned. Of course, like all good, simple interfaces, the reality behind the button is considerably more complex. It is first of all fundamentally asynchronous; the categorization operations take place long before the button itself gets pressed. Typically, the cycle of classification runs something like this:
- A spider process located on the indexing company’s system is pointed to a particular page and retrieves the content through some form of HTTP reader (such as the Open Source cURL application and libcurl libraries).
- Once loaded, the page is scanned first for links with associated metadata, which are added to a database of additional links to search.
- The second scan, a lexical scan, reads the page’s meta elements to determine both a description of the web page and any keywords that the web page may reference. The keywords are critical in determining categorization and as such tend to have considerably more weight in search engines than inline text content.
- The third scan removes the HTML tags (and usually what are called “stop-words,” such as “a,” “and,” “the,” and so forth), and each word is indexed in a large database with a link to the document in question.
- The next page on the link stack is popped, and the spider starts on it.
- Additionally, for the page itself, a metadata record indicating both the age and last update status of the page is noted in the database.
- A separate process walks through the collection of indexed page links and performs additional scoring on the page itself that is associated with each of the relevant classification terms. The exact mechanisms used vary greatly based on the searching agency (and are typically kept confidential to keep people from unduly gaming the rankings).
- When a user requests a search phrase, the phrase is parsed, stop-words are removed, command flags are enabled, and then the work of bringing up the results begins. The top-scoring items for the first term are set up in a queue, and all items not satisfying the second term are filtered out (and the weightings are recalculated), and so on for each term.
- Finally, the results are processed to generate the relevant HTML listing pages (determined from some composite of the rankings for each of the individual terms).
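The link scan and the lexical scan of meta elements described in the steps above can be sketched with Python’s standard `html.parser` module. This is a minimal illustration, not how any particular search engine works; the `PageScanner` class and the sample page are hypothetical, and a real spider would also fetch the page over HTTP and resolve relative links.

```python
from html.parser import HTMLParser


class PageScanner(HTMLParser):
    """Collect outgoing links plus the description and keywords meta values."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.description = None
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # First scan: remember links so they can be crawled later.
            self.links.append(attrs["href"])
        elif tag == "meta":
            # Second scan: read the description and keywords metadata.
            name = (attrs.get("name") or "").lower()
            content = attrs.get("content") or ""
            if name == "description":
                self.description = content
            elif name == "keywords":
                self.keywords = [k.strip() for k in content.split(",") if k.strip()]


# Hypothetical page used only to exercise the scanner.
PAGE = """<html><head>
<meta name="description" content="Notes on keeping cats">
<meta name="keywords" content="cat, feline, pets">
</head><body><p>Read about <a href="/dogs.html">dogs</a> too.</p></body></html>"""

scanner = PageScanner()
scanner.feed(PAGE)
print(scanner.links, scanner.description, scanner.keywords)
```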
This process, like free text taxonomies, has the advantage of being easy to automate, but it replaces clearly defined navigation with a somewhat more probabilistic approach. The user essentially has to guess what the relevant terms are, and as such this process usually works best as an adjunct navigational system.
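The indexing and query steps of that cycle can be reduced to a toy inverted index. Everything here is an assumption made for illustration: the stop-word list is tiny, the tag stripping is crude, and the score is a plain word count standing in for the confidential ranking methods real engines use.

```python
import re
from collections import defaultdict

# A tiny, hypothetical stop-word list.
STOP_WORDS = {"a", "an", "and", "the", "of", "on", "in"}


def tokenize(text):
    """Strip tags crudely, lowercase, and drop stop-words."""
    text = re.sub(r"<[^>]+>", " ", text)
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


def build_index(pages):
    """Map each word to the pages containing it, with a per-page count."""
    index = defaultdict(lambda: defaultdict(int))
    for url, html in pages.items():
        for word in tokenize(html):
            index[word][url] += 1
    return index


def search(index, phrase):
    """Keep only pages matching every term, ranked by total word count."""
    terms = tokenize(phrase)
    if not terms:
        return []
    candidates = set(index.get(terms[0], {}))
    for term in terms[1:]:
        candidates &= set(index.get(term, {}))  # filter on each later term

    def score(url):
        return sum(index[t][url] for t in terms)

    return sorted(candidates, key=lambda url: (-score(url), url))


# Hypothetical pages keyed by URL.
pages = {
    "/cats": "<p>The cat and the bird</p><p>Cat naps</p>",
    "/dogs": "<p>A dog chases the cat</p>",
    "/birds": "<p>Bird song</p>",
}
index = build_index(pages)
print(search(index, "the cat"))
print(search(index, "cat bird"))
```

Note that the index is built once, ahead of time, while `search` only consults it. That split is the asynchrony described earlier.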
Syndication as Classification
There is another form of classification, though it’s not based on vocabularies but instead is based on time. When a user comes to your site, chances are pretty good that what they are seeking is novelty. This doesn’t mean that they’re looking for whoopee cushions and joy buzzers (unless it’s a site about gag gifts). Rather, the information that people are looking for is generally highly biased toward that which is new or different.
One of the things that many companies and organizations discovered very quickly in the early years of the web was that people didn’t go to a site if the content never changed, no matter how tasteful (or tasteless, for that matter) the content was. If your content didn’t visibly change with every visit, people would absorb the content once, maybe note that nothing’s changed the second time they visited, and then would never return.
This behavior is one of the major reasons why syndication feeds are becoming more dominant as the way that people get news and why visits to web sites not driven by syndication are drying up. An RSS or Atom feed is the ultimate expression of organization of information by time. It represents a bundle of content that provides enough of a synopsis of recent articles to provide some sense of content (that is, an abstract), perhaps with an associated keywords set tied in through some other taxonomy, and together with links to that content to provide a better, more comprehensive “view” of the article. You can think of such organizational schemes as syndication taxonomies.
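To show what “organization by time” looks like in practice, here is a minimal sketch that reads an Atom document with Python’s standard `xml.etree.ElementTree` and lists entries newest first. The sample feed, its URLs, and the `read_entries` function are hypothetical; the sort assumes every `updated` value uses the same UTC timestamp format, so that string order matches time order.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace in ElementTree notation

# Hypothetical feed, for illustration only.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <entry>
    <title>Older article</title>
    <link href="http://example.org/older"/>
    <updated>2008-01-05T09:00:00Z</updated>
    <summary>An abstract of the older article.</summary>
    <category term="taxonomy"/>
  </entry>
  <entry>
    <title>Newer article</title>
    <link href="http://example.org/newer"/>
    <updated>2008-01-12T09:00:00Z</updated>
    <summary>An abstract of the newer article.</summary>
    <category term="semantic web"/>
  </entry>
</feed>"""


def read_entries(feed_xml):
    """Return feed entries as dictionaries, most recently updated first."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall(ATOM + "entry"):
        entries.append({
            "title": entry.findtext(ATOM + "title"),
            "link": entry.find(ATOM + "link").get("href"),
            "updated": entry.findtext(ATOM + "updated"),
            "summary": entry.findtext(ATOM + "summary"),
            "terms": [c.get("term") for c in entry.findall(ATOM + "category")],
        })
    return sorted(entries, key=lambda e: e["updated"], reverse=True)


for item in read_entries(SAMPLE_FEED):
    print(item["updated"], item["title"], item["terms"])
```

Each entry carries the pieces described above: a synopsis (`summary`), optional keywords (`category`), and a link to the full content.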
Syndication taxonomies bring an interesting, added dimension to the organization of your site for a few important reasons. First, you can “aggregate” other news feeds so that they all fall within a single taxonomic term. For instance, blogs from a number of different writers who all write on the semantic web could be aggregated together into a single Semantic Web category.
Beyond that, it is possible to take syndicated content from many sites and apply filters to extract links only to articles that fit a certain category. For instance, on the XML News Network site, you’ll find a section that draws employment listings from several large job boards, keeping only those listings that include such keywords as “semantic web,” “XForms,” “XQuery,” and so forth. While clearly indicating the origin of these resources is important, at the same time this filtering makes it possible to provide a single point from which people can come and look for jobs in a specific category, and it can also provide an Atom feed of this same content so that others can see these same lists within their own RSS news readers (or incorporate them into their own web sites).
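The filtering described here can be sketched as follows, assuming the feeds have already been parsed into simple dictionaries. The job boards, listings, and the `filter_feeds` function are hypothetical and do not reflect how any actual site implements its filters.

```python
def filter_feeds(feeds, keywords):
    """Merge entries from several feeds, keeping those that mention a keyword.

    `feeds` maps a source name to a list of entries; each entry is a dict
    with "title", "summary", and "updated" keys.
    """
    wanted = [k.lower() for k in keywords]
    matches = []
    for source, entries in feeds.items():
        for entry in entries:
            text = (entry["title"] + " " + entry["summary"]).lower()
            if any(k in text for k in wanted):
                # Record where the entry came from so the origin stays visible.
                matches.append({**entry, "source": source})
    return sorted(matches, key=lambda e: e["updated"], reverse=True)


# Hypothetical job-board feeds.
feeds = {
    "Board A": [
        {"title": "XQuery developer", "summary": "Build XML pipelines",
         "updated": "2008-02-01T00:00:00Z"},
        {"title": "Java developer", "summary": "Enterprise applications",
         "updated": "2008-02-03T00:00:00Z"},
    ],
    "Board B": [
        {"title": "Ontologist", "summary": "Semantic Web modeling",
         "updated": "2008-02-02T00:00:00Z"},
    ],
}

for job in filter_feeds(feeds, ["semantic web", "XForms", "XQuery"]):
    print(job["updated"], job["source"], job["title"])
```

The merged list could then be written back out as a feed of its own, which is how one filtered category becomes a resource for other sites.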
The Responsibility of the Semantic Web
The semantic web has garnered a great deal of press recently, and to a great extent much of this attention focuses on one of the principal goals that Tim Berners-Lee set a number of years ago: to make it possible for machines to understand one another without existing in islands of proprietary formats and standards.
This goal is both noble and necessary, but all too often the practitioners of semantic web technologies have lost sight of the fact that understanding is still, fundamentally, an aspect of human cognition. Classification and search engines illustrate this problem clearly. The role of an indexer in a traditional search engine is to create what amounts to a computer-generated abstraction layer that attempts to provide relevance based on the lexical content of the article.
The problem with this approach comes from the fact that such lexical content may not mention the topic under discussion, or it may mention it in such a way that the relevant term(s) occur comparatively seldom. Moreover, lexical analysis can be an expensive proposition, as the number of synonyms and relationships tends to increase geometrically with the number of terms described, and not all terms are limited to a single word. This problem is well known to makers of speech-to-text systems, who have to turn tokens associated with probabilistic interpretations of waveforms into text content.
A second problem that emerges in search engines is that they are geared fundamentally toward HTML documents. If you have an XML document such as an invoice or résumé online, the fact that it is an invoice or résumé is not at all obvious if you are looking only at text content. Typically, such XML doesn’t contain the relevant metatags for generating metadata (in no small part because such metadata should be determinable from the element namespace and subordinate XML element names). This determination will become more important as the number of XML documents on the web increases.
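As a sketch of how a document type might be determined from the namespace and root element name rather than from text content, consider the following. The namespace URIs and the `KNOWN_TYPES` table are invented for illustration; a real indexer would need a registry of the vocabularies it understands.

```python
import xml.etree.ElementTree as ET

# Hypothetical (namespace, root element) pairs and the category each implies.
KNOWN_TYPES = {
    ("http://example.org/ns/invoice", "invoice"): "invoice",
    ("http://example.org/ns/resume", "resume"): "résumé",
}


def classify_xml(document):
    """Classify an XML document by its root element's namespace and name."""
    root = ET.fromstring(document)
    # ElementTree writes qualified names as "{namespace}local".
    if root.tag.startswith("{"):
        namespace, local = root.tag[1:].split("}", 1)
    else:
        namespace, local = "", root.tag
    return KNOWN_TYPES.get((namespace, local), "unknown")


doc = '<invoice xmlns="http://example.org/ns/invoice"><total>10.00</total></invoice>'
print(classify_xml(doc))
```

Nothing in the text content of this document (“10.00”) says “invoice”; the classification comes entirely from the markup.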
Additionally, there is one of the more vexing aspects of any metadata system: the degree of metadata associated with a given site can often prove to be larger or more complex than the system itself. This issue is one reason why creating RDF documents for describing web pages tends to be a nonstarter: the time to create and maintain such documents can very quickly exceed the time available to work on a given site and consume an inordinate amount of resources. (Google’s energy demands have grown so large that they are now seriously impacting the energy available to the rest of the San Francisco area.) For this reason, the best systems for maintaining any form of categorization should be those where the documents themselves contain just enough metadata to be self-describing. This approach is, and should be, the semantic web’s next major challenge.
Semantics—meaning—is tied fundamentally into classification, and yet one of the more relevant points about most taxonomic systems is that once you have identified a term, you typically also have an action that needs to be performed—for example, clicking a taxonomy term will bring you to a page showing all documents relevant to that term, or encountering a term in a process shuttles the content being described into processor B instead of processor A. In other words, many taxonomies are also intentional systems. Such intent, however, cannot be necessarily inferred from the semantics, which is part of the reason why it is so difficult to make true natural-language computer systems.
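The pairing of a term with an action can be shown with a simple dispatch table. The processors and the `ACTIONS` mapping below are hypothetical; the point is only that the choice of processor B for one term is a decision someone recorded, not something derivable from the term itself.

```python
def processor_a(content):
    """Default handling for content whose term has no special action."""
    return f"A handled {content}"


def processor_b(content):
    """Special handling reserved for particular terms."""
    return f"B handled {content}"


# Hypothetical term-to-action table: the intent lives here, not in the term.
ACTIONS = {"invoice": processor_b}


def route(term, content):
    """Send content to the processor associated with its taxonomy term."""
    return ACTIONS.get(term, processor_a)(content)


print(route("invoice", "document 17"))
print(route("blog post", "document 18"))
```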
Languages such as RDFa (RDF in attributes) may be able to help solve this particular conundrum. RDFa works by using attributes within a given document (HTML or XML) that are clearly tied into the underlying metadata of that content. RDFa is an outgrowth of a movement to create microformats within HTML documents to provide descriptions of certain types of content, and many of the more sophisticated tagging concepts used by sites such as Flickr, Digg, Technorati, Del.icio.us, and others ultimately depend on semiautomated systems that introduce such tags into web content as attribute values. Unfortunately, microformats are limited to a small set of taxonomies that generally favor social networking systems.
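To give a feel for metadata carried in attributes, the sketch below pulls `property` values out of a small fragment using Python’s standard `html.parser`. It is not a conforming RDFa processor: it ignores prefix resolution and most of the RDFa attribute set, assumes property elements contain only text, and never resets the subject. The fragment, its URL, and the `RDFaProperties` class are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical fragment carrying Dublin Core properties as attributes.
FRAGMENT = """<div xmlns:dc="http://purl.org/dc/elements/1.1/"
     about="http://example.org/articles/taxonomy">
  <h1 property="dc:title">Classifying the Web</h1>
  <span property="dc:creator">A. Writer</span>
  <meta property="dc:date" content="2008-01-12"/>
</div>"""


class RDFaProperties(HTMLParser):
    """Collect (subject, property, value) triples from property attributes."""

    def __init__(self):
        super().__init__()
        self.subject = None
        self.statements = []
        self._open = None  # (property name, text parts) while inside an element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "about" in attrs:
            self.subject = attrs["about"]
        prop = attrs.get("property")
        if prop is None:
            return
        if attrs.get("content") is not None:
            # An explicit content attribute supplies the value directly.
            self.statements.append((self.subject, prop, attrs["content"]))
        else:
            self._open = (prop, [])

    def handle_data(self, data):
        if self._open:
            self._open[1].append(data)

    def handle_endtag(self, tag):
        if self._open:
            prop, parts = self._open
            self.statements.append((self.subject, prop, "".join(parts).strip()))
            self._open = None


parser = RDFaProperties()
parser.feed(FRAGMENT)
for statement in parser.statements:
    print(statement)
```

The same markup that a browser renders as a heading and a byline also yields machine-readable statements about the document, with no separate metadata file to maintain.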
The approach that RDFa has taken is to provide a framework that makes it possible for people to use these taxonomies if they want to, but doesn’t limit them just to these taxonomies. And yet RDFa still makes it possible for an RDFa-aware program to read and process the document itself. It is likely that the recent W3C HTML 5 activity will incorporate RDFa as a core part of HTML 5, such that it becomes possible to link not just to a document but to a particular section or phrase in the document and assign to that a classification that doesn’t necessarily have to depend on the vagaries of a given search engine processor.
Not only will this solution serve to make the metadata about a document (and its parts) far more tightly granulated and referenceable, but it also will make it easier to turn such classification into a relevant navigational system, regardless of the type of the document.
More research is needed to turn RDF relationships into navigable paths of their own. In a world of static documents and static links, navigation can be defined straightforwardly as the process of replacing the old content with new content, but there are more relationships than simply, “I am a link…follow me.” Relationships shape the classification space (especially in dynamically generated documents where the primary mechanism of navigation is search), and as such the relationship matrix introduced by the semantic web will out of necessity require that we think about the user interfaces that describe these relationships and ask the degree to which we can or should introduce intent into programmatic semantics.
Classification ultimately affects anyone who works with the web, whether the person is a producer of content, a web developer, a webmaster, or a user trying to find relevant information. Understanding the natural progression of such classification systems and their associated taxonomies can go a long way toward both finding the information that you need and managing the resources that you have. Ultimately, we need those nice, new, semantically enabled web systems to be able to not answer, “What’s it all about, really?” but to answer, “How can I help you?”