Implement Document Storage and Search on Google Java App Engine

Why should a Java web developer accept Google’s recent invitation to use the Java version of its App Engine platform? A few of the most compelling reasons are:

  • Google provides free hosting to web applications that receive fewer than 50 million page visits per month. You can convert App Engine into a paid-for service when you go over the free use quota.
  • Google provides scalability with no additional development effort when you work within the limitations of the App Engine platform (both the Java and Python versions).
  • The Admin web interface allows you to inspect runtime errors in the error logs, browse your application’s datastores, check the application’s performance (for example, the statistics on request-processing times), and generally monitor your deployed applications. Google’s Admin web application compares favorably with Amazon’s excellent EC2 Admin web console.
  • You can move your applications to your own server and run them there using the App Engine SDK. You will lose the scalability of a hosted application, however.
  • Because you use standard APIs (with limitations) to develop Java App Engine applications, you can move to other deployment platforms with relatively little work. Unfortunately, the opposite is not as easy: if you code to a large subset of J2EE APIs, rely on a relational database, etc., then porting to the App Engine may require a lot of effort.

Those who have written web applications with the J2EE software stack might find the limitations of App Engine off-putting at first, but the reward is reduced server costs. If you cannot live within these constraints and need greater scalability, then you might want to consider Amazon’s EC2 services (I use both App Engine and EC2).

This article teaches Java developers how to use Google App Engine. It demonstrates how to implement search and document storage on a Java web application running on the platform. Along the way, the article explores useful techniques and application ideas that augment what is already in the Java App Engine documentation.

What You Need

  • Eclipse or IntelliJ IDEA
  • If you don’t have an App Engine account, then sign up here.*
  • Download and install the App Engine SDK for local development.
  • Install the Eclipse plugin or IntelliJ IDEA plugin for Java App Engine.

*You can experiment with App Engine for a while without an account and just run your experiments on your laptop using the App Engine SDK.

Figure 1. Roadmap of Files in the Sample Project: Here are the sample project files as an Eclipse Java App Engine project.

Roadmap for the Sample Web Application

Many Java developers use Lucene (or some framework based on Lucene) to implement search. However, using a Lucene in-memory index model with App Engine doesn’t make sense in a production environment. The sample application in this article provides an alternative search implementation.

App Engine’s persistent datastore is very efficient, but it does not use a relational model, and it will not work with Object Relational Mapping (ORM) frameworks such as Hibernate. However, App Engine does provide support for some standard persistence APIs such as Java Data Objects (JDO), Java Persistence API (JPA), and JCache. The sample application uses JDO to implement data persistence.

The application is deployed and running here. Anyone using this deployed sample can delete all the persistent data and start over, so information that you add may not be available the next time you look at it.

Author’s Note: This application demonstrates the use of JDO and one approach to implementing search on top of JDO. It is not a complete web application with support for multiple users, user accounts, etc.

Figure 1 shows the sample project files as an Eclipse Java App Engine project. The sections to follow go into some detail about the model classes in the package com.kbsportal.model and the persistence utility class PMF in the package com.kbsportal.persistence. Because the Java utility classes in the package com.kbsportal.util are simple Java classes with nothing specific to the Java App Engine, they will not be discussed at all. You can take a quick look at the source code to learn more about them and about the JSP files (in the directory war/WEB-INF). A few of the embedded Java code snippets in the JSP files will be discussed.

Using JDO for Data Persistence

JDO is an older Java API for persisting Java objects. Originally, JDO required developers to write and maintain XML files that specifically mapped data attributes in Java classes to persistent storage. Google uses the DataNucleus tools to automate this process. You need only supply annotations in your Java model classes and the DataNucleus tools automatically maintain the correct mappings for you. Using either the Eclipse or the IntelliJ IDEA App Engine plugin support, the DataNucleus tools automatically run in the background for you whenever you edit a persisted model class.

Warning: When you modify your model classes, JDO on App Engine can run into compatibility problems with data that already exists in the persistent datastore. When you develop locally with Eclipse, simply delete the file WEBAPP/war/WEB-INF/appengine-generated/local_db.bin. If you have a deployed web application and change the model classes, then you will sometimes need to delete existing indexes for the application using the App Engine console for the deployed application.

The following sections describe implementing two persisted classes and discuss JDO-specific code and techniques as needed.

The Document Model Class

The combination of the Eclipse or IntelliJ IDEA App Engine plugin, JDO, and the DataNucleus tools is very easy to operate. You should be able to design and implement your model files and add the required annotations without any problems. Just watch for any error messages when the DataNucleus tools run in the background.

Before beginning the implementation of the persisted classes, take a look at the model class below, which represents documents in the system. The class definition imports the required JDO classes (in practice, you can start coding and let your IDE fill in these import statements for you). The first annotation added to the Document model class states that the class is persistent. The identity type is declared as APPLICATION to allow you to assign IDs for objects yourself as the objects are created and persisted. If you wanted to have the datastore assign object IDs, then you would specify the identity type as DATASTORE.

    package com.kbsportal.model;

    import javax.jdo.annotations.IdentityType;
    import javax.jdo.annotations.PersistenceCapable;
    import javax.jdo.annotations.Persistent;
    import javax.jdo.annotations.PrimaryKey;

    @PersistenceCapable(identityType = IdentityType.APPLICATION)
    public class Document {

The code declares the member variable uri as the primary key for finding Document objects in the datastore. This JDO retrieval primary key is set to the URI for the document. The sample document store developed in this article uses this primary key in a class IndexToken (which the next section discusses further). The code also specifies that the member variables title, content, and numWords should be saved in persistent storage:

      @PrimaryKey private String uri;
      @Persistent private String title;
      @Persistent private String content;
      @Persistent private int numWords;

The rest of the class definition contains no JDO-specific annotations:

      public Document(String uri, String title, String content) {
        super();
        setContent(content);
        this.title = title;
        this.uri = uri;  // uri is the primary-key field declared above
      }
      public String getUri() { return uri; }
      public String getTitle() { return title; }
      public void setTitle(String title) { this.title = title; }
      public String getContent() { return content; }
      public void setContent(String content) {
        this.content = content;
        // Split on spaces and common punctuation to approximate a word count
        this.numWords = content.split("[ .,:;!]").length;
        System.out.println("** numWords = " + numWords + " content: " + content);
      }
      public int getNumWords() { return numWords; }
    }

Keep the size limit on the content string in mind: the Google datastore limits String properties to no more than 500 characters. (Use com.google.appengine.api.datastore.Text for unlimited-size text properties.)
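
The word count stored in numWords is later used to normalize search scores, so it is worth seeing how String.split behaves with the delimiter set used in setContent. The following is a minimal sketch, not part of the sample application; the class and method names are hypothetical. It compares splitting on a single delimiter character with splitting on runs of delimiters:

```java
import java.util.Arrays;

class WordCountSketch {
    // Same delimiter set as setContent: every single delimiter character splits.
    static int countSingleDelimiter(String content) {
        return content.split("[ .,:;!]").length;
    }

    // Variant (an assumption of this sketch): a run of delimiters counts as
    // one separator, and empty tokens are ignored.
    static int countDelimiterRuns(String content) {
        String trimmed = content.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        int count = 0;
        for (String tok : trimmed.split("[ .,:;!]+")) {
            if (!tok.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Hello, world. Bye";
        // A comma followed by a space yields an empty token between them.
        System.out.println(Arrays.toString(text.split("[ .,:;!]")));
        System.out.println(countSingleDelimiter(text));
        System.out.println(countDelimiterRuns(text));
    }
}
```

For the text "Hello, world. Bye", the single-delimiter version counts 5 tokens (two of them empty) while the run-based version counts 3. Either approach gives a usable approximation; the difference only matters if you rely on numWords as an exact word count.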

The IndexToken Model Class

The IndexToken class implements search on top of JDO. This class works in two modes: index whole words or index both whole words and prefixes. You set the mode with a constant at the top of the source file:

    package com.kbsportal.model;

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;

    import javax.jdo.PersistenceManager;
    import javax.jdo.annotations.IdGeneratorStrategy;
    import javax.jdo.annotations.IdentityType;
    import javax.jdo.annotations.Index;
    import javax.jdo.annotations.PersistenceCapable;
    import javax.jdo.annotations.Persistent;
    import javax.jdo.annotations.PrimaryKey;

    import com.kbsportal.persistence.PMF;
    import com.kbsportal.util.NoiseWords;
    import com.kbsportal.util.Pair;
    import com.kbsportal.util.SearchResult;

    @PersistenceCapable(identityType = IdentityType.APPLICATION)
    public class IndexToken {
      static boolean MATCH_PARTIAL_WORDS = true;  // package visibility

Setting this flag to true enables matching of word prefixes, which provides functionality that is close to automated spelling correction for search terms.

This is a good time to look at how to build index tokens (optionally including word prefix tokens) and how to assign a relevance factor for each index token. Here is the code (from the bottom of the IndexToken.java source file, implemented as a separate local class to make it easier to reuse in other projects):

    class StringPrefix {
      public List<Pair> getPrefixes(String str) {
        List<Pair> ret = new ArrayList<Pair>();
        // Split on spaces and common punctuation
        String[] toks = str.toLowerCase().split("[ .,:;()\\-\\[\\]!]");
        for (String s : toks) {
          if (!(NoiseWords.checkFor(s))) {
            if (!IndexToken.MATCH_PARTIAL_WORDS) { // exact words only
              ret.add(new Pair(s, 1f));
            } else { // or, also match word prefixes
              int len = s.length();
              if (len > 2) {
                ret.add(new Pair(s, 1f));
                if (len > 3) {
                  int start_index = 1 + (len / 2);
                  for (int i = start_index; i < len; i++) {
                    ret.add(new Pair(s.substring(0, i), (0.25f * (float) i) / (float) len));
                  }
                }
              }
            }
          }
        }
        return ret;
      }
    }

The method getPrefixes converts a text string to lowercase and tokenizes it. It checks that each token is not a noise word (or "stop word") and adds the remaining tokens to a return list. Each token added to the return list is assigned a relevance factor: 1.0 for whole-word tokens. When prefix matching is enabled, words shorter than three characters are skipped, and only prefixes longer than half the word are generated. The relevance factor of a prefix token is the fraction of the word's characters that the prefix contains, scaled to 25 percent of that value. The effect of this technique is to make matches on prefix tokens much less relevant to search results than whole-word matches.
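
To make the relevance arithmetic concrete, here is a small stand-alone sketch that applies the same prefix rule to a single word. It is an illustration only: the class name PrefixWeightSketch and the use of a map instead of the article's Pair class are assumptions of this sketch, and noise-word filtering is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PrefixWeightSketch {
    // Returns token -> relevance for one lowercase word, following the
    // prefix-matching branch of getPrefixes.
    static Map<String, Float> weightsFor(String word) {
        Map<String, Float> out = new LinkedHashMap<String, Float>();
        int len = word.length();
        if (len > 2) {
            out.put(word, 1f); // whole word: full relevance
            if (len > 3) {
                // Only prefixes longer than half the word are generated.
                for (int i = 1 + (len / 2); i < len; i++) {
                    out.put(word.substring(0, i), (0.25f * (float) i) / (float) len);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(weightsFor("search"));
        System.out.println(weightsFor("cats"));
        System.out.println(weightsFor("to"));
    }
}
```

For the word "cats", the sketch produces the whole word with relevance 1.0 and the single prefix "cat" with relevance 0.25 * 3 / 4 = 0.1875. For "search", it produces the prefixes "sear" and "searc" with relevance of roughly 0.17 and 0.21. A two-letter word such as "to" produces no tokens at all.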

Application Idea
You could implement more complete spelling correction using Peter Norvig's spelling correction algorithm. You could generate misspelling permutations and instances of IndexToken with relatively low relevance factors. I have a Java implementation of Norvig's algorithm in Chapter 9 of my book "Practical Artificial Intelligence Programming in Java" (PDF).
Alternative Implementation Suggestion
I am using the code from this sample application in a larger application that requires popup word-completion hints; the stored prefixes do "double duty." This article focuses on JDO for document storage and search, but you could simply use a JavaScript library like Prototype or GWT to implement popup suggestion lists. Alternatively, you could store just word stems as instances of the IndexToken class. Click here for a Java word stemmer.

The class Pair is implemented in the package com.kbsportal.util, which also implements two other utility classes: NoiseWords and SearchResult. The implementation of these classes is not of interest here. Glance at the source files to explore them further.

To complete the IndexToken implementation and build the rest of the sample web application, you need JDO APIs, starting with the annotations on class properties:

      @PrimaryKey
      @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
      private Long id;
      @Persistent @Index private String textToken;
      @Persistent private String documentUri;
      @Persistent private Float ranking;

The @Persistent annotations mark the fields to be saved to the datastore when an object is saved. The optional valueStrategy setting specifies that you want the datastore to generate values for the id field for you. The @PrimaryKey annotation lets the DataNucleus tools know that this field is the primary key for looking up objects of this class in the datastore.

Author's Note: You usually fetch objects by primary key. However, in this case, you will look up instances of the IndexToken class based on the value of the field textToken. You cannot use textToken as the primary key, because you will generally have many instances of IndexToken that have the same value but reference different instances of the class Document in the datastore.

The following method takes a document ID (a URI for the document) and the text from a document, and creates instances of the class IndexToken that reference this document:

      public static void indexString(String document_id, String text) {
        PersistenceManager pm = PMF.get().getPersistenceManager();
        try {
          List<Pair> lp = new StringPrefix().getPrefixes(text);
          for (Pair p : lp) {
            // Skip empty tokens and tokens that start with a digit
            if (p.str.length() > 0 && !Character.isDigit(p.str.charAt(0))) {
              pm.makePersistent(new IndexToken(document_id, p.str, p.f));
            }
          }
        } finally {
          pm.close();  // release the persistence manager when indexing is done
        }
      }

This code uses the class StringPrefix. It also uses the utility class PMF (which you will learn more about shortly) to get an instance of the App Engine persistence manager. This is similar to a JDBC connection object.
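
The shape of the stored index data can be sketched without the datastore. The following in-memory stand-in is an assumption-laden illustration, not the sample application's code: the class InMemoryIndexSketch and its Row type are hypothetical, and every token gets relevance 1.0 for simplicity. It applies the same token filter as indexString and shows why one text token maps to many rows:

```java
import java.util.ArrayList;
import java.util.List;

class InMemoryIndexSketch {
    // Hypothetical stand-in for one persisted IndexToken instance.
    static final class Row {
        final String documentUri;
        final String textToken;
        final float ranking;

        Row(String documentUri, String textToken, float ranking) {
            this.documentUri = documentUri;
            this.textToken = textToken;
            this.ranking = ranking;
        }
    }

    private final List<Row> rows = new ArrayList<Row>();

    // Same filter as indexString: skip empty tokens and tokens starting with a digit.
    void index(String documentUri, String[] tokens) {
        for (String t : tokens) {
            if (t.length() > 0 && !Character.isDigit(t.charAt(0))) {
                rows.add(new Row(documentUri, t, 1f));
            }
        }
    }

    // Plays the role of a lookup on the textToken field.
    List<Row> lookup(String token) {
        List<Row> found = new ArrayList<Row>();
        for (Row r : rows) {
            if (r.textToken.equals(token)) {
                found.add(r);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        InMemoryIndexSketch idx = new InMemoryIndexSketch();
        idx.index("http://example.com/a", new String[] {"dogs", "2009", "", "cats"});
        idx.index("http://example.com/b", new String[] {"dogs"});
        for (Row r : idx.lookup("dogs")) {
            System.out.println(r.textToken + " -> " + r.documentUri);
        }
    }
}
```

Looking up "dogs" returns two rows, one per document, which is why the text token cannot serve as a unique key.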

The last interesting thing to look at in the IndexToken implementation is the static method search:

      public static List<SearchResult> search(String query) {
        List<SearchResult> ret = new ArrayList<SearchResult>();
        PersistenceManager pm = PMF.get().getPersistenceManager();
        String[] tokens = query.toLowerCase().split(" ");
        HashMap<String, Float> matches = new HashMap<String, Float>();

This method returns a list of instances of the SearchResult class. The query string is converted to lower case and tokenized. For each token, you will once again use the StringPrefix class to calculate prefix tokens (and the original word), which you will use to look up documents containing these tokens:

        for (String token : tokens) {
          List<Pair> lp = new StringPrefix().getPrefixes(token);
          for (Pair p : lp) {
            String q2 = "select from " + IndexToken.class.getName() + " where textToken == '" + p.str + "'";
            @SuppressWarnings("unchecked")
            List<IndexToken> itoks = (List<IndexToken>) pm.newQuery(q2).execute();

This query string may look like standard SQL, but it's not; it's JDO Query Language (JDOQL). Instead of selecting from a database table name, as in SQL, you select from a Java class name persisted in the datastore. The field textToken is a persisted field of class IndexToken. This JDOQL query asks for all IndexToken instances in the datastore that have a textToken value equal to the token (full word or prefix token) you are currently processing. (For a more complete introduction to JDOQL, see the section "Introducing JDOQL" in the Java App Engine documentation.)
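
Because the query is assembled by string concatenation, the token ends up inside single quotes in the query text, and the tokenizer does not split on apostrophes. The sketch below shows the resulting query string and one defensive option. The helper tokenQuery and its quote check are assumptions of this sketch, not part of the sample application; JDO's declared query parameters are another way to avoid embedding values in the query text.

```java
class JdoqlStringSketch {
    // Builds the same single-string JDOQL form used in the search method.
    // Rejecting quote characters is an assumption of this sketch.
    static String tokenQuery(String className, String token) {
        if (token.indexOf('\'') >= 0 || token.indexOf('"') >= 0) {
            throw new IllegalArgumentException(
                "token must not contain quote characters: " + token);
        }
        return "select from " + className + " where textToken == '" + token + "'";
    }

    public static void main(String[] args) {
        System.out.println(tokenQuery("com.kbsportal.model.IndexToken", "dog"));
    }
}
```

With the class name com.kbsportal.model.IndexToken and the token dog, the helper returns the string select from com.kbsportal.model.IndexToken where textToken == 'dog'.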

The rest of the implementation for the search method is fairly simple. You store all document matches, with the ranking value for the text token in the map matches:

            for (IndexToken it : itoks) {
              Float f = matches.get(it.getDocumentUri());
              if (f == null) f = 0f;
              f += it.getRanking();
              matches.put(it.getDocumentUri(), f);
            }
          }
        }

Now that you have calculated matches between the tokenized search terms (optionally including prefix tokens), you have a set of document URIs (the keys to the map matches) and the corresponding ranking value summed for each matched document. All that is left to do is to use the datastore to retrieve the matched documents (because you want their titles to display in the search results), and to sort the matched documents in decreasing ranking order:

        for (String s : matches.keySet()) {
          String q2 = "select from " + Document.class.getName() + " where uri == '" + s + "'";
          @SuppressWarnings("unchecked")
          List<Document> docs = (List<Document>) pm.newQuery(q2).execute();
          if (!docs.isEmpty()) {
            // Normalize the summed ranking by document length
            int num_words = docs.get(0).getNumWords();
            ret.add(new SearchResult(s, matches.get(s) / (float) num_words, docs.get(0).getTitle()));
          }
        }
        Collections.sort(ret, new ValueComparator());
        return ret;
      }
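
The scoring steps (sum the rankings per document, divide by the document's word count, sort in decreasing order) can be tried out without the datastore. This is a minimal sketch under stated assumptions: the class RankingSketch, its Hit type, and the parallel-array input are hypothetical, and the word counts are passed in as a map instead of being read from Document objects.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RankingSketch {
    // Hypothetical result holder; the article's SearchResult also carries a title.
    static final class Hit {
        final String uri;
        final float score;

        Hit(String uri, float score) {
            this.uri = uri;
            this.score = score;
        }
    }

    // uris[i] and rankings[i] describe one matched index token.
    // numWords maps each document URI to its stored word count.
    static List<Hit> rank(String[] uris, float[] rankings, Map<String, Integer> numWords) {
        Map<String, Float> matches = new HashMap<String, Float>();
        for (int i = 0; i < uris.length; i++) {
            Float sum = matches.get(uris[i]);
            matches.put(uris[i], (sum == null ? 0f : sum) + rankings[i]);
        }
        List<Hit> hits = new ArrayList<Hit>();
        for (Map.Entry<String, Float> e : matches.entrySet()) {
            Integer words = numWords.get(e.getKey());
            if (words != null && words > 0) {
                hits.add(new Hit(e.getKey(), e.getValue() / words));
            }
        }
        // Highest score first.
        Collections.sort(hits, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                return Float.compare(b.score, a.score);
            }
        });
        return hits;
    }

    public static void main(String[] args) {
        Map<String, Integer> numWords = new HashMap<String, Integer>();
        numWords.put("doc-a", 10);
        numWords.put("doc-b", 4);
        List<Hit> hits = rank(new String[] {"doc-a", "doc-a", "doc-b"},
                              new float[] {1f, 1f, 1f}, numWords);
        for (Hit h : hits) {
            System.out.println(h.uri + " " + h.score);
        }
    }
}
```

In this example doc-a matches two tokens but has 10 words (score 0.2), while doc-b matches one token in 4 words (score 0.25), so doc-b is listed first. Dividing by word count favors short documents; whether that is desirable depends on the documents you index.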

The class ValueComparator is defined locally in the source file IndexToken.java and is used simply to sort the results:

      static class ValueComparator implements Comparator<SearchResult> {
        public int compare(SearchResult o1, SearchResult o2) {
          // Descending by score; Float.compare keeps small score differences
          return Float.compare(o2.score, o1.score);
        }
      }
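
When sorting by a float score, there are two common ways to write the comparison: cast a scaled difference to int, or call Float.compare. The sketch below (with hypothetical names, comparing bare Float values instead of SearchResult objects) shows how the two differ for scores that are closer together than 0.01, which is likely once scores are divided by a document's word count:

```java
import java.util.Comparator;

class ScoreComparatorSketch {
    // Descending order via a scaled difference cast to int.
    // Differences smaller than 0.01 truncate to 0 and compare as equal.
    static final Comparator<Float> CAST_BASED = new Comparator<Float>() {
        public int compare(Float a, Float b) {
            return (int) ((b - a) * 100);
        }
    };

    // Descending order via Float.compare, which distinguishes any two
    // different float values.
    static final Comparator<Float> COMPARE_BASED = new Comparator<Float>() {
        public int compare(Float a, Float b) {
            return Float.compare(b, a);
        }
    };

    public static void main(String[] args) {
        System.out.println(CAST_BASED.compare(0.201f, 0.205f));
        System.out.println(COMPARE_BASED.compare(0.201f, 0.205f));
    }
}
```

For the scores 0.201 and 0.205, the cast-based comparator returns 0 (treated as a tie), while the Float.compare version returns a positive value and places 0.205 first.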

Dealing with the Persistent Datastore: Class PMF

The code in the PMF class is copied from Google's documentation. This class creates a single private instance of PersistenceManagerFactory and reuses it:

    package com.kbsportal.persistence;

    import javax.jdo.JDOHelper;
    import javax.jdo.PersistenceManagerFactory;

    public final class PMF {
        private static final PersistenceManagerFactory pmfInstance =
            JDOHelper.getPersistenceManagerFactory("transactions-optional");

        private PMF() {}

        public static PersistenceManagerFactory get() {
            return pmfInstance;
        }
    }
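
The pattern behind PMF (one shared, eagerly created instance behind a static accessor) can be demonstrated without JDO. In this sketch, CostlyFactory is a hypothetical stand-in for an object that is expensive to create, and the counter exists only to show that construction happens once:

```java
import java.util.concurrent.atomic.AtomicInteger;

class SharedFactorySketch {
    // Hypothetical stand-in for an expensive-to-create factory object.
    static final class CostlyFactory {
        static final AtomicInteger created = new AtomicInteger();

        CostlyFactory() {
            created.incrementAndGet();
        }
    }

    // Same shape as PMF: a private static final instance and a static getter.
    static final class Holder {
        private static final CostlyFactory INSTANCE = new CostlyFactory();

        private Holder() {}

        static CostlyFactory get() {
            return INSTANCE;
        }
    }

    public static void main(String[] args) {
        CostlyFactory first = Holder.get();
        CostlyFactory second = Holder.get();
        System.out.println(first == second);
        System.out.println(CostlyFactory.created.get());
    }
}
```

Every call to Holder.get() returns the same object, and the constructor runs a single time when the Holder class is initialized.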

JSPs for the Sample Web Application

Usually, when I start writing a new application using JSPs, I embed snippets of Java code in the JSPs and eventually factor common code snippets out into custom JSP tag libraries, add extra behavior to model classes, and so on. I have not performed this cleanup process for the sample web application.

The first JSP file you will look at, index.jsp, lists all documents in the system. It also contains some optional debug code (that I usually leave commented out) to list all instances of the IndexToken class (see Figure 2). The first part of index.jsp includes the required classes, defines the HTML headers, and includes menu.jsp—a single include file for page navigation:

    <%@ page import="javax.jdo.*, java.util.*, com.kbsportal.model.*, com.kbsportal.persistence.PMF" %>
    <%@ page language="java" contentType="text/html; charset=ISO-8859-1"
        pageEncoding="ISO-8859-1"%>
    KBSportal Java App Engine Search Demo
    <%@ include file="menu.jsp" %>
Figure 2. List All Documents: Debug code lists all instances of the IndexToken class and shows some index tokens.

You have already seen sample JDOQL queries in the implementation of the IndexToken class. Here, the query returns all instances of the Document class:

All documents:

    <%
      PersistenceManager pm = PMF.get().getPersistenceManager();
      Query query = pm.newQuery(Document.class);
      try {
        List<Document> results = (List<Document>) query.execute();
        if (results.iterator().hasNext()) {
          for (Document d : results) {
            System.out.println("key: " + d.getUri() + ", title: " + d.getTitle());
    %>

    <%=d.getTitle()%>

    <%=d.getContent()%>

    <%
          }
        }
      } finally {
        query.closeAll();
      }
    %>

This sample uses a query object instead of a string JDOQL query so as not to restrict query results for this JSP. However, if you wanted to return only documents with a specific title, you could filter query results using this code:

    String title_to_find = "Dogs and Cats";
    // The string value must be quoted inside the filter expression
    query.setFilter("title == '" + title_to_find + "'");

The bottom part of the file index.jsp contains some debug code that you may want to enable when you experiment with this sample web application. This debug code is almost identical to the above code snippet, except that it queries all instances of the IndexToken class:

    query = pm.newQuery(IndexToken.class);
    try {
      List<IndexToken> results = (List<IndexToken>) query.execute();
      if (results.iterator().hasNext()) {
        for (IndexToken indexToken : results) {
Figure 3. Form for Adding a Document to the Datastore: A JSP implements an HTML input form for entering new "documents" into the system.

The file new_document.jsp implements an HTML input form for entering new "documents" into the system (see Figure 3). The following code snippet (from new_document.jsp) checks to see if form data is present in the page request. If so, it adds an instance of class Document to the datastore:

    <%
      String url = request.getParameter("url");
      String title = request.getParameter("title");
      String text = request.getParameter("text");
      if (url != null && title != null && text != null) {
        PersistenceManager pm = PMF.get().getPersistenceManager();
        try {
          Document doc = new Document(url, title, text);
          pm.makePersistent(doc);
          IndexToken.indexString(doc.getUri(), doc.getTitle() + " " + doc.getContent());
        } finally {
          pm.close();
        }
      }
    %>

The makePersistent method is called directly to save the document to the datastore. The static method IndexToken.indexString adds each search token created from the document's title and contents to the datastore.

Figure 4. Remove All Documents and Index Tokens from the Datastore: The sample application calls for a simple way to clear out all test "documents" added to the datastore.

Because this sample application is publicly hosted on Google's hosting service, it calls for a simple way to clear out all test "documents" added to the datastore. The JSP file delete_all.jsp deletes all documents and index tokens from the datastore (see Figure 4).

    PersistenceManager pm = PMF.get().getPersistenceManager();
    Query query = pm.newQuery(Document.class);
    try {
      List<Document> results = (List<Document>) query.execute();
      if (results.iterator().hasNext()) {
        for (Document d : results) {
          pm.deletePersistent(d);
        }
      }
    } finally {
      query.closeAll();
    }
    query = pm.newQuery(IndexToken.class);
    try {
      List<IndexToken> results = (List<IndexToken>) query.execute();
      if (results.iterator().hasNext()) {
        for (IndexToken indexToken : results) {
          pm.deletePersistent(indexToken);
        }
      }
    } finally {
      query.closeAll();
    }

The JSP file search.jsp contains an HTML form for entering a search query (see Figure 5). Here is the code that performs the search operation:

    String query = "";
    String results = "Results:<br/>";
    Object obj = request.getParameter("search");
    if (obj != null) {
      query = "" + obj;
      List<SearchResult> hits = IndexToken.search(query);
      for (SearchResult hit : hits) {
        results += "<p>" + hit + "</p>";
      }
    }
Figure 5. Search Results: The JSP file search.jsp contains an HTML form for entering a search query.

The toString method added to the SearchResult class formats the search results:

    public String toString() {
      return url + " - " + score + ": " + title;
    }

A Low-Cost Deployment Option

The Java App Engine platform offers a no-cost (or low cost for busy web sites) deployment option for Java developers. While the App Engine may not be the best deployment platform for some web applications, it is well worth the time to experiment with it and add another deployment option to your developer toolbox.
