Build Knowledge Through a Taxonomy of Classifications : Page 3
Learn the natural progression of classification systems and their associated taxonomies to help users find information and manage resources.
by Kurt Cagle
Nov 1, 2007
Page 3 of 4
Free Vocabularies and Search
Free vocabularies, on the other hand, are so named because there is no fixed vocabulary of terms for organization. Rather, members of a community can introduce their own keywords on an ad hoc basis, and as such the organizational taxonomies are replaced by a taxonomic cloud of related terms. In some cases a taxonomist can assign synonyms that indicate two terms share the same meaning, which can help coalesce some meaning out of the cloud. Algorithms can perform this task as well (counting how often two terms appear together or are correlated with a third, for instance), but this option generally means that navigation shifts out of the hands of the site designer altogether and becomes a synthetic process performed through the actions of hundreds or thousands of community members.
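To make the algorithmic option concrete, here is a minimal sketch of co-occurrence counting in Python. It assumes each resource carries a simple list of user-supplied tags; the function names, the sample tags, and the min_count threshold are illustrative choices, not anything a particular site is known to use.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(tagged_items):
    """Count how often each unordered pair of tags appears on the same item."""
    counts = Counter()
    for tags in tagged_items:
        # Normalize case and drop duplicates so a pair counts once per item.
        unique = sorted({t.strip().lower() for t in tags if t.strip()})
        counts.update(combinations(unique, 2))
    return counts


def related_tags(tag, counts, min_count=2):
    """Return tags seen together with `tag` at least `min_count` times."""
    tag = tag.strip().lower()
    related = Counter()
    for (a, b), n in counts.items():
        if n < min_count:
            continue
        if a == tag:
            related[b] = n
        elif b == tag:
            related[a] = n
    return [t for t, _ in related.most_common()]


# Hypothetical tag lists, one per uploaded image.
items = [
    ["sunset", "beach", "ocean"],
    ["sunset", "ocean", "sky"],
    ["beach", "ocean", "sand"],
    ["Sunset", "sky"],
]
counts = cooccurrence_counts(items)
print(related_tags("sunset", counts))
```

With this sample data, "ocean" and "sky" each appear alongside "sunset" twice, so they are the two tags reported as related. A production system would likely normalize by overall tag frequency so that very common tags do not dominate every cluster.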
Many social or community networking sites use free vocabularies extensively. Flickr, one of the first sites to use this concept effectively, built a navigation and categorization system that combines searching for specific keywords with clouds of related keywords that users can click as links to other such clusters. The site requires a certain degree of contributor participation (users need to spend some time categorizing their images and media resources), but this effort can be rewarded dramatically: it increases the degree of network connectivity that the image has and consequently increases the exposure of that image as people follow that clustering.
Note that free vocabulary sites bear a certain surface similarity to search engines such as Google. The primary difference between them is that most search engines index on words within the content of a given article, while free vocabulary sites index on the terms that the user base has assigned to that same content. This search type can frequently provide more highly targeted contextual links than straight, brute-force text searches at a considerably smaller processing cost, especially if someone periodically defines synonym relationships between terms to improve clustering.
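The difference can be sketched as a small index keyed on user-assigned tags rather than on page text, with a synonym table that folds equivalent terms together. This is an illustrative Python sketch only; the TagIndex class, its method names, and the sample resources are all hypothetical.

```python
from collections import defaultdict


class TagIndex:
    """Maps user-assigned tags (not page text) to resource identifiers."""

    def __init__(self):
        self._resources = defaultdict(set)  # canonical tag -> resource ids
        self._canonical = {}                # synonym -> canonical tag

    def _normalize(self, tag):
        tag = tag.strip().lower()
        # Follow synonym links until a canonical tag is reached.
        while tag in self._canonical:
            tag = self._canonical[tag]
        return tag

    def add_synonym(self, synonym, canonical):
        """Declare that `synonym` means the same thing as `canonical`."""
        synonym = synonym.strip().lower()
        canonical = self._normalize(canonical)
        if synonym == canonical:
            return
        self._canonical[synonym] = canonical
        # Fold anything already filed under the synonym into the canonical tag.
        if synonym in self._resources:
            self._resources[canonical] |= self._resources.pop(synonym)

    def tag(self, resource_id, *tags):
        for t in tags:
            self._resources[self._normalize(t)].add(resource_id)

    def lookup(self, tag):
        return set(self._resources.get(self._normalize(tag), set()))


idx = TagIndex()
idx.tag("photo-1", "NYC", "skyline")
idx.tag("photo-2", "new york")
idx.add_synonym("nyc", "new york")
print(sorted(idx.lookup("NYC")))  # both photos now share one cluster
```

Because a lookup is a single dictionary access on a handful of tags per resource, it is far cheaper than scanning or indexing every word of the content, which is the processing-cost advantage described above.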
Search is a fundamental operation on the web. For most people, search itself seems simple: you type a word, press a button, and a list of the most salient results is returned. Of course, as with all good, simple interfaces, the reality behind the button is considerably more complex. It is first of all fundamentally asynchronous; the categorization operations take place long before the button itself gets pressed. Typically, the cycle of classification runs something like this:
1. A spider process located on the indexing company's system is pointed to a particular page and retrieves the content through some form of HTTP reader (such as the open source cURL application and its libcurl library).
2. Once loaded, the page is scanned first for links with associated metadata, which are added to a database of additional links to search.
3. The second scan, a lexical scan, reads <META> elements to determine both a description of the web page and any keywords that the web page may reference. Here's the description:
   <META name="description" content="This story is about classification"/>
   Keywords are declared in a <META> element in the same way. They are critical in determining categorization and as such tend to have considerably more weight in search engines than inline text content.
4. The third scan removes the HTML tags (and usually what are called "stop-words," such as "a," "and," "the," and so forth), and each remaining word is indexed in a large database with a link to the document in question.
5. For the page itself, a metadata record noting both the age and the last-update status of the page is stored in the database.
6. The next page on the link stack is popped, and the spider starts on it.
7. A separate process walks through the collection of indexed page links and performs additional scoring on each page for each of the relevant classification terms. The exact mechanisms used vary greatly based on the searching agency (and are typically kept confidential to keep people from unduly gaming the rankings).
8. When a user submits a search phrase, the phrase is parsed, stop-words are removed, command flags are enabled, and then the work of bringing up the results begins. The top-scoring items for the first term are set up in a queue, all items not satisfying the second term are filtered out (and the weightings are recalculated), and so on for each term.
9. Finally, the results are processed to generate the relevant HTML listing pages (ordered by some composite of the rankings for each of the individual terms).
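The cycle above can be compressed into a toy indexer. The following Python sketch uses only the standard library and works on HTML strings rather than fetching pages over HTTP. The class names, the stop-word subset, the 5x/2x/1x weighting, and the sample pages (including the keywords tag) are all assumptions made for illustration; real search engines keep their scoring confidential, as noted.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

STOP_WORDS = {"a", "an", "and", "the", "of", "is", "to", "in"}  # small subset


class PageScanner(HTMLParser):
    """Collects links, <meta> description/keywords, and visible text."""

    def __init__(self):
        super().__init__()
        self.links, self.meta, self.text = [], {}, []
        self._skip = 0  # depth inside <script>/<style>, which is not content

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "meta" and name in ("description", "keywords"):
            self.meta[name] = attrs.get("content") or ""
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.text.append(data)


def tokenize(text):
    """Lowercase, split into words, and drop stop-words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


class Index:
    def __init__(self):
        self.postings = defaultdict(dict)  # word -> {url: score}
        self.frontier = []                 # links still to be crawled

    def add_page(self, url, html):
        scanner = PageScanner()
        scanner.feed(html)
        scanner.close()
        self.frontier.extend(scanner.links)
        # Assumed weighting: keywords 5x, description 2x, body text 1x.
        weighted = [(scanner.meta.get("keywords", ""), 5),
                    (scanner.meta.get("description", ""), 2),
                    (" ".join(scanner.text), 1)]
        for text, weight in weighted:
            for word in tokenize(text):
                entry = self.postings[word]
                entry[url] = entry.get(url, 0) + weight

    def search(self, phrase):
        terms = tokenize(phrase)
        if not terms:
            return []
        # Start with pages matching the first term, then filter by the rest.
        scores = dict(self.postings.get(terms[0], {}))
        for term in terms[1:]:
            hits = self.postings.get(term, {})
            scores = {u: s + hits[u] for u, s in scores.items() if u in hits}
        return sorted(scores, key=lambda u: (-scores[u], u))


# Made-up sample pages; the URLs and the keywords tag are hypothetical.
index = Index()
index.add_page(
    "http://example.com/a",
    '<html><head>'
    '<meta name="description" content="This story is about classification">'
    '<meta name="keywords" content="taxonomy, classification"></head>'
    '<body><p>The taxonomy of search.</p><a href="/b">next</a></body></html>')
index.add_page(
    "http://example.com/b",
    "<html><body><p>A classification of search engines.</p></body></html>")

print(index.search("the classification of search"))
```

Both sample pages match "classification" and "search", but the first page ranks higher because its keyword and description hits carry extra weight. The separate scoring pass, page-age metadata, and command flags from the list are left out to keep the sketch short.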
This process, like free text taxonomies, has the advantage of being easy to automate, but it replaces clearly defined navigation with a somewhat more probabilistic approach. The user essentially has to guess what the relevant terms are, and as such this process usually works best as an adjunct navigational system.