

Use Semantic Language Tools to Better Understand User Intentions: Page 3

Leverage the power of WordNet to create applications that can more meaningfully interpret English language input.





Finding Lexically and Phonetically Similar Words
So far we have assumed that user searches have been well formed, even if they do not directly match inventoried items. But as with any free-form input method, user searches in the storefront application might be misspelled. A misspelling could be due to an accidental transposing of characters or could result from a user trying to phonetically spell a word by sounding it out. Next you will see how to enhance the storefront application to be forgiving in such cases of user error.

To do so you will need to familiarize yourself with another Java WordNet tool, Jawbone. JWNL is well suited for precise searches and for navigating the resulting data structures. Unfortunately, JWNL is not well suited to inexact searches, and that is where Jawbone comes into the picture. Jawbone searches are filter-based, and the algorithm for filtering results is pluggable and highly extensible. This makes Jawbone good at searching for misspelled words, but it's also less efficient than JWNL. Both tools have their own strengths and weaknesses, as highlighted by the various scenarios portrayed in this article.

To search for lexically similar words in Jawbone you will use the SimilarFilter. The SimilarFilter uses the Levenshtein distance between two words to gauge their lexical similarity. Without going into too much detail, it is sufficient to summarize the Levenshtein distance as the minimum number of single-character edits required to transform one word into another. This algorithm picks up the transposition, addition, or omission of characters in a word.
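To make the idea concrete, here is a minimal sketch of the Levenshtein distance using the standard dynamic-programming approach. The class and method names are my own for illustration; they are not part of Jawbone, which applies this calculation internally via SimilarFilter.

```java
public class EditDistance {

    // Returns the minimum number of single-character insertions,
    // deletions, and substitutions needed to turn a into b.
    public static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(
                        d[i - 1][j] + 1,          // deletion
                        d[i][j - 1] + 1),         // insertion
                        d[i - 1][j - 1] + cost);  // substitution
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // "pbnts" needs one substitution (b -> a) to become "pants".
        System.out.println(levenshtein("pbnts", "pants")); // 1
    }
}
```

Note that a transposed pair of characters counts as two edits under plain Levenshtein distance, which is why a filter usually allows a small distance threshold rather than demanding an exact match.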

Similar to searches for lexically similar words in Jawbone, phonetic searches are also performed via a filter. The filter for phonetic searches is the SoundFilter, and it functions by assigning each word an index that approximates the sounds produced by the consonants in the word. The algorithm for computing this index is known as the Soundex algorithm.
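To show why "trouzer" and "trouser" match phonetically, here is a simplified sketch of the classic Soundex encoding (it omits some edge-case rules of the full algorithm, such as the special handling of 'h' and 'w' between same-coded consonants). The names are my own; Jawbone's SoundFilter performs an equivalent encoding internally.

```java
public class Soundex {

    // Digit for each letter a..z: vowels and h/w/y map to '0' (ignored),
    // consonants with similar sounds share a digit (b/f/p/v -> 1, etc.).
    private static final String CODES = "01230120022455012623010202";

    public static String encode(String word) {
        String w = word.toUpperCase();
        StringBuilder sb = new StringBuilder();
        sb.append(w.charAt(0));              // keep the first letter as-is
        char prev = code(w.charAt(0));
        for (int i = 1; i < w.length() && sb.length() < 4; i++) {
            char c = code(w.charAt(i));
            if (c != '0' && c != prev) {     // skip vowels and repeated codes
                sb.append(c);
            }
            prev = c;
        }
        while (sb.length() < 4) sb.append('0');  // pad to letter + 3 digits
        return sb.toString();
    }

    private static char code(char c) {
        return (c >= 'A' && c <= 'Z') ? CODES.charAt(c - 'A') : '0';
    }

    public static void main(String[] args) {
        // Both spellings collapse to the same code because s and z
        // share a Soundex digit.
        System.out.println(Soundex.encode("trouser")); // T626
        System.out.println(Soundex.encode("trouzer")); // T626
    }
}
```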

Test cases for defining the functioning of lexically and phonetically similar searches in the storefront application are shown below:

// src/test/java/com/devx/storefront/StorefrontTest.java
...
@Test
public void testSearch_ExactNotFound_LexicallySimiliarFound() {
    store.addItem(new Item("pants"));
    Set<Item> matchingItems = store.search("pbnts");
    assertEquals(1, matchingItems.size());
    assertTrue(matchingItems.contains(new Item("pants")));
}

@Test
public void testSearch_ExactNotFound_PhoneticallySimiliarFound() {
    store.addItem(new Item("trouser"));
    Set<Item> matchingItems = store.search("trouzer");
    assertEquals(1, matchingItems.size());
    assertTrue(matchingItems.contains(new Item("trouser")));
}
...

As in the previous examples, the Storefront class delegates the more interesting lexical operations to a Jawbone-based dictionary. The pertinent methods are shown below:

// src/main/java/com/devx/storefront/JawboneDictionary.java
...
public Set<String> lookupLexicallySimiliarWords(String lexicalForm) {
    return searchTermsAndPackageTerms(
            new SimilarFilter(lexicalForm, true, 2), lexicalForm);
}

public Set<String> lookupPhoneticallySimiliarWords(String lexicalForm) {
    return searchTermsAndPackageTerms(
            new SoundFilter(lexicalForm, true), lexicalForm);
}

private Set<String> searchTermsAndPackageTerms(
        TermFilter filter, String lexicalForm) {
    Set<String> words = new HashSet<String>();
    Iterator<IndexTerm> termIterator =
            dictionary.getIndexTermIterator(100, filter);
    while (termIterator.hasNext()) {
        IndexTerm term = termIterator.next();
        if (!lexicalForm.equals(term.getLemma())) {
            words.add(term.getLemma());
        }
    }
    return words;
}
...

As you can see in the preceding code, lexical and phonetic filtering are similar operations in Jawbone. In fact, one of the strengths of Jawbone is the ease with which you can substitute additional filtering strategies. But the trade-off is that these filter-based searches are significantly slower than the index-based searches of JWNL.
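The pluggable-filter idea, and its cost, can be sketched without depending on Jawbone's exact API. In the hypothetical example below, the WordFilter interface and search method are my own stand-ins, not Jawbone types: any filtering strategy can be plugged in, but every candidate term must be examined, which is why a filter-based search is a linear scan rather than an index lookup.

```java
import java.util.ArrayList;
import java.util.List;

public class FilterScan {

    // Hypothetical stand-in for a pluggable term filter.
    interface WordFilter {
        boolean accept(String term);
    }

    // Every term is visited once -- O(n) -- regardless of the strategy
    // plugged in, unlike an index lookup which jumps straight to a match.
    public static List<String> search(List<String> terms, WordFilter filter) {
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (filter.accept(term)) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> inventory = List.of("pants", "shirt", "trouser");
        // A prefix filter plugged in as one possible strategy.
        System.out.println(search(inventory, t -> t.startsWith("p")));
    }
}
```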

Figure 1 earlier in the article demonstrated lexical searches performed in the Storefront application.

The last few examples illustrate how to create a much more sophisticated search engine that's more forgiving in response to user searches. But shouldn't a search for "pant" also return the inventoried item "pants"? Of course, and to accomplish this you will need to familiarize yourself with the concept of morphology. As the word implies, it has to do with the forms that words can take. For example, a noun can have a plural form and a verb can have a past-tense form. Morphology processors differ in their level of sophistication. The simple Storefront application uses JWNL's default morphology processor to convert words into their root form before performing the search.

With JWNL this requires only a simple extension to the earlier examples, as the following code fragment shows:

// src/main/java/com/devx/storefront/JwnlDictionary.java
...
public Set<String> lookupMorphologicallySimilarLexicalForms(String lexicalForm) {
    Set<String> forms = new HashSet<String>();
    List baseForms = dictionary.getMorphologicalProcessor()
            .lookupAllBaseForms(POS.NOUN, lexicalForm);
    for (Object baseForm : baseForms) {
        forms.add(baseForm.toString());
    }
    return forms;
}
...

Notice the calls to retrieve the morphologicalProcessor from the dictionary and to lookupAllBaseForms to identify all possible root forms of the search term.

Throughout this article you have been exposed to several methods and tools for performing lexical functions on free-form user input. I hope that this has demonstrated how easy it is to take advantage of the conceptualization of the English language provided by WordNet and has given you some ideas for implementing these tools in your own applications. A lexical understanding of input is of course only the tip of the iceberg in terms of the types of analysis that can be performed within the broader category of natural language processing, but as you have seen, just a few simple techniques can dramatically improve the ability of your applications to interpret user input.

Rod Coffin is an agile technologist at Semantra, helping to develop an innovative natural language ad hoc reporting platform. He has many years of experience mentoring teams on enterprise Java development and agile practices and has written several articles on a range of topics from Aspect-Oriented Programming to EJB 3.0. Rod is a frequent speaker at user groups and technology conferences and can be contacted via his home page.