Add Boolean Searches to Your .NET Applications

ontent searching has moved into the spotlight over the past decade. A technology that was once the province of academics and researchers has become mainstream business via successes by Alta Vista, Yahoo, Google, and MSN. But sadly, the race to provide relevant content in response to search queries written by the general public has eclipsed the importance of verbal precision. As a result, carefully worded Boolean search queries (“Enter a string, using parentheses, AND, OR or NOT to refine your query”) have been superseded by either the disingenuously simplistic “Ask us a question” (for example, Ask Jeeves’ site www.ask.com, pictured in Figure 1); or by poorly conceived attempts to dumb down the process of defining a multi-level search (for example, Lycos’ search engine depicted in Figure 2, or Microsoft’s search panel, shown in Figure 3).

?
Figure 1. Ask Jeeves. The figure shows the dangerously simplistic Ask Jeeves interface.
?
Figure 2. The Lycos Search Engine: A poorly conceived attempt to dumb down the process of defining a multi-level search.

The popularity of Boolean queries and the consequent growth of these convoluted interfaces may have occurred because most users are disinclined to refine their primitive search queries sufficiently to remain relevant against the Internet’s increasingly voluminous document base. Consequently, search engines have concentrated on implementing sophisticated strategies such as neural networks to return appropriate content. It’s also possible that Boolean searching, with its attendant operators and rules of precedence, is too overtly technical to find favor among the general public. But whatever the reason, there are times when there’s simply no substitute for an unambiguous set of inclusion (AND), exclusion (NOT) and equivalence (OR) rules. Neural nets work very well when generalized content matching is your aim, but when you simply need to locate a particular document among a collection of related documents, Boolean searching is both more straightforward and more rigorous.

?
Figure 3. Microsoft’s Search Interface: Another convoluted multi-level search interface.

For example, while a query like the following might not seem very complicated, the search engines we’ve mentioned would be incapable of encoding it unambiguously using their publicly supplied interfaces.

   hobbits AND (dwarves OR (wizards AND elves))

This is because, though they may support the concepts of inclusion, exclusion and equivalence at a high level, they don’t allow you to define explicit hierarchical relationships among the various tokens making up your query.

In this article you’ll see the basic principles underlying Boolean search engines, and develop a reusable framework which you can use to make Boolean searches against a largely arbitrary document base.

Validating Boolean Search Queries
Boolean searching requires three major steps. First, you need to validate the query string on which the search will be based for syntactic sense. The validation process ensures that operators and tokens occur in the proper order, and that parentheses are correctly matched. Second, you need to transform the query into a search tree; and finally, you need to traverse the search tree once for each candidate document to determine if the candidate fulfills the specified search criteria. Following these tasks, the attached sample project divides the workload among a set of interrelated classes.

The QueryBuilder class validates a search query and builds an associated search tree. The code within its Validate method isn’t complicated, and a single-pass validation with last-token memory (see Sidebar 1) was sufficient to identify and prohibit the following cases:

  1. Consecutive operators (e.g. “&|”)
  2. Misplaced operators (e.g. “&)”)
  3. Mismatched parentheses (e.g. “(()”)
  4. Incorrect ordering of parentheses (e.g. “)(“)
  5. Text token followed by open parenthesis (e.g. “hobbits (“)
  6. Mismatched inverted commas around string literals
  7. Negation of compound term (e.g. “!(X & Y)”)

If Validate isn’t happy with the query string, it returns false, and things end there; otherwise, you can go ahead and build a tree. In fact, while Validate gets called automatically whenever you call BuildTree, it’s also exposed as a standalone method to give you the chance to reject queries before they get to that stage.

You may wonder why I opted to prohibit the negation of compound terms (rule 7). The straight answer is, I couldn’t decide which of its two possible interpretations made the most intuitive sense. In other words, faced with a query like “!(X | Y)”, should we return a match if one or both of X and Y don’t occur within some particular document” (for example, “!X | !Y“), or return a match only if we find neither X nor Y within some particular document (simply negate the result of “X | Y“) Both approaches seemed reasonable; accordingly, to avoid any ambiguity, I decided attempts to use negation distributively should be flagged as syntactic errors at the outset, forcing you to be explicit about what you mean.

Constructing a Search Tree
The code within the BuildTree method is more interesting (see Listing 1). There may be more canonical solutions to the problem of decomposing a Boolean search string, but I decided just to go with intuition. Plainly, because some particular query strings may be satisfied in multiple ways, a tree structure of some kind is an appropriate encoding, where any descendant path from root to leaf constitutes a solution. Within such a tree, it therefore made sense that “AND” operations should be represented by a child, and “OR” operations by a sibling. Finally, opening and closing parentheses would move the current insertion point down or up the tree respectively.

It seemed elegant, and, in fact, it works?some of the time. The problem is, the solution as stated assumes that all children are leaves at the time they’re introduced into the tree. But because parentheses affect the tree’s insertion point, this isn’t necessarily the case. For example, consider the following query:

   (hobbits | elves) & ambidextrous

You could represent this with the following tree:

      Depth 1       Depth 2       Depth 3         ambidextrous                    (                                  hobbits                                  elves                    )

However, if you parse it according to the rules shown so far, you will actually end up with a different tree:

      Depth 1       Depth 2       Depth 3         (                    hobbits                    elves      )      ambidextrous

This tree may look semantically identical to its predecessor, but it has an implementation problem which becomes clear when you try to read the two trees from top to bottom as well as left to right. The first tree correctly indicates that to navigate the tree successfully you must locate the token “ambidextrous” and one of that token’s children must be either “hobbits” or “elves.” The second tree suggests you can navigate from root to leaf if you can locate either “hobbits” or “elves”, or you can find the token “ambidextrous” anywhere in the document.

The problem occurs because when the parser encounters the pair of tokens “& ambidextrous” it wants to add an additional level to the tree (because children represent “AND” operators and siblings represent “OR” operators). But if it just inserted another node at depth 1, that node would have no direct parent/child relationship with either “hobbits” or “elves”; accordingly it wouldn’t correctly represent the search query. The solution is to insert the new node at depth 0 rather than depth 1, so that it becomes a parent of the node at depth 1. Because this, in turn, is a parent of the nodes at depth 2, everything will now work correctly. To cover this case, I amended the “AND” insertion rule as follows: “Insert the current item as a new child if there are no existing children; otherwise insert it as a parent.”

Doing things this way does mean that tokens aren’t necessarily inserted into the tree at the same level they’re encountered when parsing the query string, but because the point is to construct a tree where any particular path through the tree satisfies the search conditions, this isn’t a problem. It also means that you can build the tree in a single pass, which makes for faster, simpler code.

Indexing Documents
Having built a search tree, the next task is to search through the document base for candidates that satisfy its constraints. The interface class IDocument provides an abstract description of a searchable object. Classes that implement IDocument must expose a Name property for identification purposes, and must be able to search their content for a specified token. Accordingly, the IDocument interface consists of the following pair of methods:

      string Name()      bool Find(string str)

IDocument introduces an additional level of abstraction, for several reasons:

  • It allows the search engine to remain loosely coupled with the search content, meaning that content does not need to be text-based; it simply needs to be searchable.
  • It facilitates simultaneous searching of disparate media?for example, Word documents, RSS news feeds, SOAP packets or Windows help files.

For demonstration purposes, I’ve provided only two concrete implementations of IDocument. These are the lightweight FileDocument and XMLDocument classes, both of which lazily load their respective documents on first access, and simply call System.String’s IndexOf method to search for string-based tokens on request.

Determining whether a document matches the search criteria means walking the search tree, looking for an unbroken path of descent from root to leaf. For pattern matching, you’re interested only in nodes that contain string token data, so the search code silently skips over tree nodes that represent opening and closing parentheses, and instead moves straight on to their children.

Putting it all together
The sample source code is relatively self-explanatory. Figure 4 shows the final class hierarchy, although the demo class TestSearch (within the same namespace) is certainly the easiest way to get a feel for how the various pieces fit together.

?
Figure 4. Final Class Hierarchy: The figure shows the final class hierarchy for the sample search project.

TestSearch is a vanilla console application; it indexes any XML files present in the project’s source directory (I’ve supplied a few test files to get you started), and lets you enter arbitrarily complex Boolean search strings at the command line along the lines of:

   hobbits & (dwarves | (wizards & !elves))

If you break the search engine’s syntax rules, it tells you so. If you adhere to them, it tells you how many documents match your search. Here’s the basic process to search a set of documents:

      string search_query = ....;               QueryBuilder builder = new          QueryBuilder(search_query);                     if (builder.Validate())      {         // Query is valid, so build a tree         QueryTree tree = builder.BuildTree();            // Build list of resources to index         IDocument[] docs = ...            // Retrieve matching documents         IDocument[] matches = tree.GetMatches(docs);            // Do things with document matches      }

I hope you’ll find it easy to incorporate this search capability into your own projects?Boolean searching is something I find hard to live without.

Share the Post:
Share on facebook
Share on twitter
Share on linkedin

Overview

Recent Articles: