Browse DevX
Sign up for e-mail newsletters from DevX


Add Boolean Searches to Your .NET Applications : Page 2

Writing your own search engine isn't as difficult as you might think. If you customarily wrestle with Explorer's seemingly unconfigurable "Search for Files and Folders" or squint helplessly for light at the end of the dark tunnel that is Visual Studio.NET's integrated help system, then you'll probably understand where the impetus for this article originated.




Building the Right Environment to Support AI, Machine Learning and Deep Learning

Validating Boolean Search Queries
Boolean searching requires three major steps. First, you need to validate the query string on which the search will be based for syntactic sense. The validation process ensures that operators and tokens occur in the proper order, and that parentheses are correctly matched. Second, you need to transform the query into a search tree; and finally, you need to traverse the search tree once for each candidate document to determine if the candidate fulfills the specified search criteria. Following these tasks, the attached sample project divides the workload among a set of interrelated classes.

The QueryBuilder class validates a search query and builds an associated search tree. The code within its Validate method isn't complicated, and a single-pass validation with last-token memory (see Sidebar 1) was sufficient to identify and prohibit the following cases:

  1. Consecutive operators (e.g. "&|")
  2. Misplaced operators (e.g. "&)")
  3. Mismatched parentheses (e.g. "(()")
  4. Incorrect ordering of parentheses (e.g. ")(")
  5. Text token followed by open parenthesis (e.g. "hobbits (")
  6. Mismatched inverted commas around string literals
  7. Negation of compound term (e.g. "!(X & Y)")
If Validate isn't happy with the query string, it returns false, and things end there; otherwise, you can go ahead and build a tree. In fact, while Validate gets called automatically whenever you call BuildTree, it's also exposed as a standalone method to give you the chance to reject queries before they get to that stage.

You may wonder why I opted to prohibit the negation of compound terms (rule 7). The straight answer is, I couldn't decide which of its two possible interpretations made the most intuitive sense. In other words, faced with a query like "!(X | Y)", should we return a match if one or both of X and Y don't occur within some particular document" (for example, "!X | !Y"), or return a match only if we find neither X nor Y within some particular document (simply negate the result of "X | Y") Both approaches seemed reasonable; accordingly, to avoid any ambiguity, I decided attempts to use negation distributively should be flagged as syntactic errors at the outset, forcing you to be explicit about what you mean.

Constructing a Search Tree
The code within the BuildTree method is more interesting (see Listing 1). There may be more canonical solutions to the problem of decomposing a Boolean search string, but I decided just to go with intuition. Plainly, because some particular query strings may be satisfied in multiple ways, a tree structure of some kind is an appropriate encoding, where any descendant path from root to leaf constitutes a solution. Within such a tree, it therefore made sense that "AND" operations should be represented by a child, and "OR" operations by a sibling. Finally, opening and closing parentheses would move the current insertion point down or up the tree respectively.

It seemed elegant, and, in fact, it works—some of the time. The problem is, the solution as stated assumes that all children are leaves at the time they're introduced into the tree. But because parentheses affect the tree's insertion point, this isn't necessarily the case. For example, consider the following query:

(hobbits | elves) & ambidextrous

You could represent this with the following tree:

Depth 1 Depth 2 Depth 3 ambidextrous ( hobbits elves )

However, if you parse it according to the rules shown so far, you will actually end up with a different tree:

Depth 1 Depth 2 Depth 3 ( hobbits elves ) ambidextrous

This tree may look semantically identical to its predecessor, but it has an implementation problem which becomes clear when you try to read the two trees from top to bottom as well as left to right. The first tree correctly indicates that to navigate the tree successfully you must locate the token "ambidextrous" and one of that token's children must be either "hobbits" or "elves." The second tree suggests you can navigate from root to leaf if you can locate either "hobbits" or "elves", or you can find the token "ambidextrous" anywhere in the document.

The problem occurs because when the parser encounters the pair of tokens "& ambidextrous" it wants to add an additional level to the tree (because children represent "AND" operators and siblings represent "OR" operators). But if it just inserted another node at depth 1, that node would have no direct parent/child relationship with either "hobbits" or "elves"; accordingly it wouldn't correctly represent the search query. The solution is to insert the new node at depth 0 rather than depth 1, so that it becomes a parent of the node at depth 1. Because this, in turn, is a parent of the nodes at depth 2, everything will now work correctly. To cover this case, I amended the "AND" insertion rule as follows: "Insert the current item as a new child if there are no existing children; otherwise insert it as a parent."

Doing things this way does mean that tokens aren't necessarily inserted into the tree at the same level they're encountered when parsing the query string, but because the point is to construct a tree where any particular path through the tree satisfies the search conditions, this isn't a problem. It also means that you can build the tree in a single pass, which makes for faster, simpler code.

Comment and Contribute






(Maximum characters: 1200). You have 1200 characters left.



Thanks for your registration, follow us on our social networks to keep up-to-date