Implement Document Storage and Search on Google Java App Engine : Page 3

Learn how to use the Java version of the Google App Engine by implementing search and document storage on a Java web application running on the platform.



The IndexToken Model Class

The IndexToken class implements search on top of JDO. This class works in two modes: index whole words or index both whole words and prefixes. You set the mode with a constant at the top of the source file:

package com.kbsportal.model;

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;

import javax.jdo.PersistenceManager;
import javax.jdo.annotations.IdGeneratorStrategy;
import javax.jdo.annotations.IdentityType;
import javax.jdo.annotations.Index;
import javax.jdo.annotations.PersistenceCapable;
import javax.jdo.annotations.Persistent;
import javax.jdo.annotations.PrimaryKey;

import com.kbsportal.persistence.PMF;
import com.kbsportal.util.NoiseWords;
import com.kbsportal.util.Pair;
import com.kbsportal.util.SearchResult;

@PersistenceCapable(identityType = IdentityType.APPLICATION)
public class IndexToken {

    static boolean MATCH_PARTIAL_WORDS = true; // package visibility

Setting this flag to true enables matching of word prefixes, which provides functionality that is close to automated spelling correction for search terms.

This is a good time to look at how to build index tokens (optionally including word prefix tokens) and how to assign a relevance factor for each index token. Here is the code (from the bottom of the IndexToken.java source file, implemented as a separate local class to make it easier to reuse in other projects):



class StringPrefix {
    public List<Pair> getPrefixes(String str) {
        List<Pair> ret = new ArrayList<Pair>();
        String[] toks = str.toLowerCase().split("[\\ \\.\\,\\:\\;\\(\\)\\-\\[\\]!]");
        for (String s : toks) {
            if (!(NoiseWords.checkFor(s))) {
                if (!IndexToken.MATCH_PARTIAL_WORDS) {
                    // exact words only
                    ret.add(new Pair(s, 1f));
                } else {
                    // or, also match word prefixes
                    int len = s.length();
                    if (len > 2) {
                        ret.add(new Pair(s, 1f));
                        if (len > 3) {
                            int start_index = 1 + (len / 2);
                            for (int i = start_index; i < len; i++) {
                                ret.add(new Pair(s.substring(0, i),
                                                 (0.25f * (float) i) / (float) len));
                            }
                        }
                    }
                }
            }
        }
        return ret;
    }
}

The method getPrefixes converts a text string to lowercase and tokenizes it. It checks that each token is not a noise word (or "stop word") and adds the non-noise words to a return list. Each token added to the return list is assigned a relevance factor: 1.0 for whole-word tokens. If you are adding prefix tokens, each prefix's relevance factor is the fraction of the word's characters contained in the prefix, scaled down to 25 percent (multiplied by 0.25). The effect of this technique is to make matches on partial prefix tokens much less relevant to search results than whole-word matches.
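As a concrete illustration, here is what getPrefixes produces for the single word "searching" when MATCH_PARTIAL_WORDS is true (a sketch; the values follow directly from the formula in the code above):

List<Pair> pairs = new StringPrefix().getPrefixes("searching");
// Expected (token, relevance factor) pairs:
//   ("searching", 1.0)                     the whole word
//   ("searc",     0.25f * 5 / 9 ≈ 0.139)   prefixes start at index 1 + len/2
//   ("search",    0.25f * 6 / 9 ≈ 0.167)
//   ("searchi",   0.25f * 7 / 9 ≈ 0.194)
//   ("searchin",  0.25f * 8 / 9 ≈ 0.222)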

Application Idea
You could implement more complete spelling correction using Peter Norvig's spelling correction algorithm: generate misspelling permutations and store them as IndexToken instances with relatively low relevance factors. I have a Java implementation of Norvig's algorithm in Chapter 9 of my book "Practical Artificial Intelligence Programming in Java" (PDF).
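As a rough sketch of that idea (this is not part of the sample application, and the method name is illustrative), you could generate edit-distance-1 variants of each indexed word and store each variant as a low-relevance IndexToken:

// Sketch only: generate all edit-distance-1 variants of a word, in the
// spirit of Norvig's spelling corrector. Each variant could be stored as
// an IndexToken with a low relevance factor.
static List<String> editDistanceOne(String word) {
    List<String> variants = new ArrayList<String>();
    String letters = "abcdefghijklmnopqrstuvwxyz";
    for (int i = 0; i < word.length(); i++) {
        // deletions: drop the character at position i
        variants.add(word.substring(0, i) + word.substring(i + 1));
        // substitutions: replace the character at position i
        for (int c = 0; c < letters.length(); c++) {
            variants.add(word.substring(0, i) + letters.charAt(c) + word.substring(i + 1));
        }
    }
    // insertions: add a character at every position
    for (int i = 0; i <= word.length(); i++) {
        for (int c = 0; c < letters.length(); c++) {
            variants.add(word.substring(0, i) + letters.charAt(c) + word.substring(i));
        }
    }
    // transpositions: swap adjacent characters
    for (int i = 0; i < word.length() - 1; i++) {
        variants.add(word.substring(0, i) + word.charAt(i + 1)
                     + word.charAt(i) + word.substring(i + 2));
    }
    return variants;
}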
Alternative Implementation Suggestion
I am using the code from this sample application in a larger application that requires popup word-completion hints, so the stored prefixes do "double duty." This article focuses on JDO for document storage and search, but you could simply use a JavaScript library such as Prototype, or GWT, to implement popup suggestion lists. Alternatively, you could store just word stems as instances of the IndexToken class, using a Java word stemmer.

The class Pair is implemented in the package com.kbsportal.util, which also contains two other utility classes: NoiseWords and SearchResult. The implementations of these classes are not of interest here; glance at the source files if you want to explore them further.
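For reference, the indexing code only relies on Pair being a simple holder for a token string and its relevance factor, so a minimal sketch would look something like this (the actual class in com.kbsportal.util may differ):

package com.kbsportal.util;

// Sketch of the Pair utility as the indexing code uses it: a token
// string plus its relevance factor, exposed as public fields.
public class Pair {
    public String str; // the token or prefix
    public float f;    // its relevance factor

    public Pair(String str, float f) {
        this.str = str;
        this.f = f;
    }
}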

To complete the IndexToken implementation and build the rest of the sample web application, you need JDO APIs, starting with the annotations on class properties:

@PrimaryKey
@Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
private Long id;

@Persistent
@Index
private String textToken;

@Persistent
private String documentUri;

@Persistent
private Float ranking;

The @Persistent annotations mark class fields to be saved to the datastore when an object is saved. The optional valueStrategy value specifies that you want the datastore to generate values for the id field for you. The @PrimaryKey annotation tells the DataNucleus tools that this field is the primary key for looking up objects of this class in the datastore.

Author's Note: You usually fetch objects by primary key. In this case, however, you will look up instances of the IndexToken class based on the value of the field textToken. You cannot use textToken as the primary key, because you will generally have many IndexToken instances with the same textToken value that reference different instances of the class Document in the datastore.
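The rest of the class body consists of the constructor and the accessors that the indexing and search code below relies on. A minimal sketch (the actual source may differ slightly):

// Sketch of the constructor and accessors used by indexString and search.
public IndexToken(String documentUri, String textToken, float ranking) {
    this.documentUri = documentUri;
    this.textToken = textToken;
    this.ranking = ranking;
}

public String getDocumentUri() { return documentUri; }
public String getTextToken()   { return textToken; }
public Float  getRanking()     { return ranking; }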

The following method takes a document ID (a URI for the document) and the text from a document, and creates instances of the class IndexToken that reference this document:

public static void indexString(String document_id, String text) {
    PersistenceManager pm = PMF.get().getPersistenceManager();
    List<Pair> lp = new StringPrefix().getPrefixes(text);
    for (Pair p : lp) {
        if (p.str.length() > 0 && !Character.isDigit(p.str.charAt(0))) {
            pm.makePersistent(new IndexToken(document_id, p.str, p.f));
        }
    }
}

This code uses the class StringPrefix. It also uses the utility class PMF (which you will learn more about shortly) to get an instance of the App Engine persistence manager, which plays a role similar to a JDBC connection object.
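Calling the method is straightforward; for example (the document URI and text here are hypothetical):

IndexToken.indexString("uri://docs/app-engine-overview",
    "Google App Engine stores application data in a schemaless datastore.");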

The last interesting thing to look at in the IndexToken implementation is the static method search:

public static List<SearchResult> search(String query) {
    List<SearchResult> ret = new ArrayList<SearchResult>();
    PersistenceManager pm = PMF.get().getPersistenceManager();
    String[] tokens = query.toLowerCase().split(" ");
    HashMap<String, Float> matches = new HashMap<String, Float>();

This method returns a list of instances of the SearchResult class. The query string is converted to lower case and tokenized. For each token, you will once again use the StringPrefix class to calculate prefix tokens (and the original word), which you will use to look up documents containing these tokens:

    for (String token : tokens) {
        List<Pair> lp = new StringPrefix().getPrefixes(token);
        for (Pair p : lp) {
            String q2 = "select from " + IndexToken.class.getName()
                      + " where textToken == '" + p.str + "'";
            @SuppressWarnings("unchecked")
            List<IndexToken> itoks = (List<IndexToken>) pm.newQuery(q2).execute();

This query string may look like standard SQL, but it's not; it's JDO Query Language (JDOQL). Instead of selecting from a database table name, as in SQL, you select from the name of a Java class persisted in the datastore. Here, textToken is a persisted field of the class IndexToken, so this JDOQL query asks for all IndexToken instances in the datastore whose textToken field equals the token (full word or prefix) you are currently processing. (For a more complete introduction to JDOQL, see the section "Introducing JDOQL" in the Java App Engine documentation.)
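Note that the sample builds the JDOQL string by concatenation. As a sketch, the same lookup could also be written with a declared parameter using the standard javax.jdo.Query API, which avoids quoting problems with the token value (variable names here are illustrative):

// Same lookup using a declared JDOQL parameter instead of string concatenation.
javax.jdo.Query q = pm.newQuery(IndexToken.class);
q.setFilter("textToken == tokenParam");
q.declareParameters("String tokenParam");
@SuppressWarnings("unchecked")
List<IndexToken> itoks = (List<IndexToken>) q.execute(p.str);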

The rest of the implementation for the search method is fairly simple. You accumulate all document matches, along with the summed ranking values of their matching text tokens, in the map matches:

            for (IndexToken it : itoks) {
                Float f = matches.get(it.getDocumentUri());
                if (f == null) f = 0f;
                f += it.getRanking();
                matches.put(it.getDocumentUri(), f);
            }
        }
    }

Now that you have matched the tokenized search terms (optionally including prefix tokens) against the index, you have a set of document URIs (the keys of the map matches) and a summed ranking value for each matched document. All that is left to do is to use the datastore to retrieve the matched documents (because you want their titles to display in the search results) and to sort the matched documents in decreasing ranking order:

    for (String s : matches.keySet()) {
        String q2 = "select from " + Document.class.getName()
                  + " where uri == '" + s + "'";
        @SuppressWarnings("unchecked")
        List<Document> itoks = (List<Document>) pm.newQuery(q2).execute();
        if (!itoks.isEmpty()) {
            int num_words = itoks.get(0).getNumWords();
            ret.add(new SearchResult(s, matches.get(s) / (float) num_words,
                                     itoks.get(0).getTitle()));
        }
    }
    Collections.sort(ret, new ValueComparator());
    return ret;
}

The class ValueComparator is defined locally in the source file IndexToken.java and is used simply to sort the results:

static class ValueComparator implements Comparator<SearchResult> {
    public int compare(SearchResult o1, SearchResult o2) {
        return (int) ((o2.score - o1.score) * 100);
    }
}

Dealing with the Persistent Datastore: Class PMF

The code in the PMF class is copied from Google's documentation. This class creates a single private instance of PersistenceManagerFactory and reuses it:

package com.kbsportal.persistence;

import javax.jdo.JDOHelper;
import javax.jdo.PersistenceManagerFactory;

public final class PMF {
    private static final PersistenceManagerFactory pmfInstance =
        JDOHelper.getPersistenceManagerFactory("transactions-optional");

    private PMF() {}

    public static PersistenceManagerFactory get() {
        return pmfInstance;
    }
}


