Content searching has moved into the spotlight over the past decade. A technology that was once the province of academics and researchers has become mainstream business via successes by Alta Vista, Yahoo, Google, and MSN. But sadly, the race to provide relevant content in response to search queries written by the general public has eclipsed the importance of verbal precision. As a result, carefully worded Boolean search queries ("Enter a string, using parentheses, AND to refine your query") have been superseded by either the disingenuously simplistic "Ask us a question" (for example, Ask Jeeves' site www.ask.com, pictured in Figure 1); or by poorly conceived attempts to dumb down the process of defining a multi-level search (for example, Lycos' search engine depicted in Figure 2, or Microsoft's search panel, shown in Figure 3).
Figure 1. Ask Jeeves. The figure shows the dangerously simplistic Ask Jeeves interface.
Figure 2. The Lycos Search Engine: A poorly conceived attempt to dumb down the process of defining a multi-level search.
The declining popularity of Boolean queries and the consequent growth of these convoluted interfaces may have occurred because most users are disinclined to refine their primitive search queries sufficiently to remain relevant against the Internet's increasingly voluminous document base. Consequently, search engines have concentrated on implementing sophisticated strategies such as neural networks to return appropriate content. It's also possible that Boolean searching, with its attendant operators and rules of precedence, is too overtly technical to find favor among the general public. But whatever the reason, there are times when there's simply no substitute for an unambiguous set of inclusion (AND), exclusion (NOT) and equivalence (OR) rules. Neural nets work very well when generalized content matching is your aim, but when you simply need to locate a particular document among a collection of related documents, Boolean searching is both more straightforward and more rigorous.
Figure 3. Microsoft's Search Interface: Another convoluted multi-level search interface.
For example, while a query like the following might not seem very complicated, the search engines we've mentioned would be incapable of encoding it unambiguously using their publicly supplied interfaces:
hobbits AND (dwarves OR (wizards AND elves))
This is because, though they may support the concepts of inclusion, exclusion and equivalence at a high level, they don't allow you to define explicit hierarchical relationships among the various tokens making up your query.
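To make the idea of explicit hierarchy concrete, here is a minimal Python sketch of the example query held as a tree of nested nodes, so that each pair of parentheses becomes one level of nesting. It is an illustration only, not the framework this article goes on to build: the class names (Term, And, Or, Not), the functions matches and search, and the four sample documents are all hypothetical, and a document is treated as nothing more than a set of lowercase words.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Union


@dataclass(frozen=True)
class Term:
    """A single search token, such as "hobbits"."""
    word: str


@dataclass(frozen=True)
class And:
    """Inclusion: both sub-expressions must match."""
    left: "Node"
    right: "Node"


@dataclass(frozen=True)
class Or:
    """Either sub-expression may match."""
    left: "Node"
    right: "Node"


@dataclass(frozen=True)
class Not:
    """Exclusion: the sub-expression must not match."""
    operand: "Node"


Node = Union[Term, And, Or, Not]


def matches(node: Node, words: Set[str]) -> bool:
    """Evaluate the query tree recursively against one document's word set."""
    if isinstance(node, Term):
        return node.word in words
    if isinstance(node, And):
        return matches(node.left, words) and matches(node.right, words)
    if isinstance(node, Or):
        return matches(node.left, words) or matches(node.right, words)
    if isinstance(node, Not):
        return not matches(node.operand, words)
    raise TypeError(f"unknown query node: {node!r}")


def search(query: Node, documents: Dict[str, str]) -> List[str]:
    """Return the names of the documents that satisfy the query, in input order."""
    return [
        name
        for name, text in documents.items()
        if matches(query, set(text.lower().split()))
    ]


# hobbits AND (dwarves OR (wizards AND elves))
QUERY = And(
    Term("hobbits"),
    Or(Term("dwarves"), And(Term("wizards"), Term("elves"))),
)

# Hypothetical sample documents, invented purely for this illustration.
DOCS = {
    "a": "hobbits and dwarves share a meal",
    "b": "hobbits meet wizards and elves",
    "c": "wizards and elves hold council",
    "d": "hobbits travel with wizards",
}

print(search(QUERY, DOCS))  # prints ['a', 'b']
```

Because the grouping lives in the shape of the tree rather than in a flat list of keywords, there is only one way to read the query. Document "d" is rejected because it mentions wizards without elves, and document "c" is rejected because it never mentions hobbits.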
In this article you'll see the basic principles underlying Boolean search engines and develop a reusable framework that you can use to run Boolean searches against a largely arbitrary document base.