Bad at Grammar? Cheat with Java Linguistics Tools

Many of us have found ways to forget about grammar since finishing school and entering the “real world” of business and commerce. Unfortunately, the division between academia and the business world isn’t quite so cut and dried. The way we use language and words in day-to-day life greatly influences how people perceive and treat us. The same is true of software applications; people have increasingly higher expectations for computer software. More and more, they expect human-computer interactions to be steeped in the fluid language and thought processes of humans rather than the linear, rigid procedures of computers. For example, people now expect search engines to understand English words, resolve spelling errors, and deal with plurality. They expect word processors to catch and even correct their grammar errors, and call-center systems to understand their speech.

This article introduces several types of linguistic processing techniques and tools that you can use in your applications to help bridge the gap between the literal world of computers and the fluid logic of humans. The landscape of linguistic processing is so broad that one article can’t possibly cover everything, but this article may prompt you to notice areas in your own applications where applying a little linguistic expertise can have a big impact on users’ experience.

These techniques stem from an area of research called computational linguistics, which seeks to apply computational (primarily statistical) processing techniques to natural language. Computational linguistics is a broad field with many subfields that can benefit business-oriented applications today, such as part-of-speech tagging, parsing, sentence detection, phrase chunking, and pluralization. The remainder of this article shows you how to build three examples that illustrate interesting things you can do with three Java language technologies:

  • Text classification (using LingPipe)
  • Sentence identification (using OpenNLP)
  • Pluralization (using Inflector)

Classifying Text Automatically
Text classification techniques aim to classify arbitrary text passages into appropriate predetermined categories. For example, you can use text classification to programmatically identify the language in which a text is written, its topic, or its sentiment (e.g., inflammatory or reasoned), or to identify a possible author. Most text classification techniques involve applying statistical methods to a training corpus (a set of known texts used for training systems) to develop a model for determining the most likely category of future text passages.

To understand how these techniques can be useful in practical ways, consider the support email address (support@example.com, say) that many companies make available for customers to contact them with support requests. When companies receive these requests, an operator must analyze them and route the messages to the appropriate department. That’s a lot of work. Now, suppose an automated agent could monitor the mailbox, determine the most likely topic of each email, and route it to the right people. Further, suppose that agent could continually learn from its experience classifying and routing emails, providing ever-higher levels of accuracy. It is easy to see how such an agent could drive down costs while increasing customer satisfaction.

As an example, this article describes how to build a Java application that classifies email messages sent to mailing lists for two open-source projects: Maven and Grails. Messages extracted from both lists pass through an email classifier application like the one described here that automatically classifies each message by its linguistic similarities to previous messages. The classifier’s goal is to determine which mailing list each email came from automatically.

The techniques and math behind text classification are fairly complicated and involve a bunch of Greek letters, but luckily, there are Java libraries that abstract this complexity and provide simple interfaces to their capabilities. For this example, you’ll see how to use LingPipe to classify email messages. If you’re interested in the algorithms and math behind text classification, consult the Related Resources in the left column of this article for links where you can learn more.

Twenty messages each from the Maven and Grails mailing lists serve as sample data. Ten messages from each list are used to train the email classifier, while the remaining twenty are used to test whether the classifier can correctly classify the messages as being related to Maven or Grails. The files are named according to their function and stored for use by the downloadable sample email classification application that accompanies this article, as shown below:

   src/test/resources
      /grails
         /test
            grails-test-1.txt
            ...
            grails-test-10.txt
         /train
            grails-train-1.txt
            ...
            grails-train-10.txt
      /maven
         /test
            maven-test-1.txt
            ...
            maven-test-10.txt
         /train
            maven-train-1.txt
            ...
            maven-train-10.txt

Now consider the following unit test that validates the expected behavior of the email classifier that you’ll develop shortly:

   src/test/java/com/devx/language/classification/EmailClassifierTest.java

   public class EmailClassifierTest
   {
      private EmailClassifier classifier;

      @Before
      public void setUp() throws IOException
      {
         classifier = new EmailClassifier("grails", "maven");

         for (int i = 1; i <= 10; i++)
         {
            classifier.trainCategory("grails",
               this.getClass().getResourceAsStream(
               "/grails/train/grails-train-" + i + ".txt"));
         }

         for (int i = 1; i <= 10; i++)
         {
            classifier.trainCategory("maven",
               this.getClass().getResourceAsStream(
               "/maven/train/maven-train-" + i + ".txt"));
         }
      }

      @Test
      public void testClassify() throws IOException
      {
         assertEquals(10, countRightsForCategory(classifier, "grails"));
         assertEquals(10, countRightsForCategory(classifier, "maven"));
      }

      private int countRightsForCategory(
         EmailClassifier classifier, String category)
         throws IOException
      {
         int rights = 0;
         for (int i = 1; i <= 10; i++)
         {
            String classification = classifier.classify(this.getClass().
               getResourceAsStream("/" + category + "/test/" +
               category + "-test-" + i + ".txt"));
            if (classification.equals(category))
            {
               rights++;
            }
         }
         return rights;
      }
   }

Notice how the setUp method of the test class first trains the EmailClassifier object by repeatedly invoking its trainCategory method and passing in each training email along with its expected classification. Then, the testClassify method invokes the classify method on the classifier with each of the test emails, storing the total number of correctly classified emails in the rights variable. In this case, all 20 test documents are correctly categorized as belonging to their respective mailing lists! That is particularly impressive when you consider how little content some of the test messages contain. For example, the grails-test-5.txt message contains only the following question:

   "If I want to find the first or last row, how will the syntax look?"

Of course, it's unreasonable to expect 100% classification accuracy with any probability-based tool, but with an adequately thorough training corpus, tools such as LingPipe can do very well. The code for the EmailClassifier object is very straightforward because the LingPipe DynamicLMClassifier object does much of the heavy lifting as shown below:

   src/main/java/com/devx/language/classification/EmailClassifier.java

   /**
    * Classifies email by the most likely category. This class
    * uses a probabilistic approach comparing the email to
    * a training corpus.
    */
   public class EmailClassifier {
      private DynamicLMClassifier classifier;

      public EmailClassifier(String... categories) {
         classifier = DynamicLMClassifier.createNGramProcess(categories, 6);
      }

      /**
       * Trains the classifier with the specified category and
       * training text. This can be called repeatedly with the
       * same category to strengthen the classifier's accuracy.
       */
      public void trainCategory(String category,
         InputStream trainingStream) throws IOException {
         classifier.train(category, readStreamIntoString(trainingStream));
      }

      /**
       * Returns the most likely category for the email based
       * on the training text.
       */
      public String classify(InputStream emailStream) throws IOException {
         Classification classification = classifier.classify(
            readStreamIntoString(emailStream));
         return classification.bestCategory();
      }

      private CharSequence readStreamIntoString(InputStream stream)
         throws IOException {
         char[] characterArray = Streams.toCharArray(stream, "UTF-8");
         return String.valueOf(characterArray);
      }
   }
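
To tie this back to the support-mailbox scenario described earlier, an automated routing agent could wrap this same class. The following is a hypothetical sketch; fetchNextEmail and routeTo are stand-in methods for whatever mail-retrieval and dispatch infrastructure you actually use:

   import java.io.IOException;
   import java.io.InputStream;

   // Hypothetical routing agent built on the EmailClassifier above.
   // fetchNextEmail() and routeTo() are stand-ins for your own
   // mail-retrieval and dispatch code.
   public class SupportMailRouter {
      private final EmailClassifier classifier;

      public SupportMailRouter(EmailClassifier classifier) {
         this.classifier = classifier;
      }

      public void run() throws IOException {
         while (true) {
            InputStream email = fetchNextEmail();
            // Pick the department whose training emails this
            // message most closely resembles.
            String department = classifier.classify(email);
            routeTo(department, email);
         }
      }

      private InputStream fetchNextEmail() {
         throw new UnsupportedOperationException("hook up your mail store here");
      }

      private void routeTo(String department, InputStream email) {
         throw new UnsupportedOperationException("hook up your dispatch here");
      }
   }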

LingPipe can apply several different types of classifiers, but this example uses a language-model-based classifier built on n-grams. Other classifiers offered by LingPipe include k-nearest-neighbor, naive Bayes, and kernel-based perceptron classifiers. Of course, the algorithms behind these classifiers are all very technical in nature and beyond the scope of this article.
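
The term "n-gram" simply means a sliding window of n consecutive characters. To give a feel for what the classifier's language models observe, here is a tiny, self-contained sketch (an illustration only, not LingPipe code) that extracts the character 6-grams from a string:

   import java.util.ArrayList;
   import java.util.List;

   // Illustration only: the character 6-grams that a language-model
   // classifier such as DynamicLMClassifier observes when estimating
   // per-category probabilities. This is not LingPipe code.
   public class NGramDemo
   {
      static List<String> charNGrams(String text, int n)
      {
         List<String> grams = new ArrayList<String>();
         for (int i = 0; i + n <= text.length(); i++)
         {
            grams.add(text.substring(i, i + n));
         }
         return grams;
      }

      public static void main(String[] args)
      {
         // Prints [create, reate-, eate-a, ate-ap, te-app]
         System.out.println(charNGrams("create-app", 6));
      }
   }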

LingPipe can do many other types of linguistic processing. You can visit the LingPipe web site for more information.

Identifying Sentences and Comparing Texts
For some types of linguistic analysis, it's helpful to break a passage of text down into its individual sentences. In other words, if you have a bunch of text strung together in a paragraph, how can you identify the individual sentences within it?

As an example, suppose you want to write a program capable of identifying plagiarism, or similarity between two texts, measured as the number of sentences shared by those two texts. To do this, you first need to be able to identify all the sentences contained in each text. Consider the following test case:

   public class SimpleSentenceIdentifierTest
   {
      @Test
      public void identifySimpleSentencesInText()
      {
         String text = "Joe is a sales person. " +
            "This company sells software. " +
            "You are ready to sell software for this company.";
         String[] sentences = SimpleSentenceIdentifier.getSentences(text);
         assertNotNull(sentences);
         assertEquals(3, sentences.length);
         assertEquals("Joe is a sales person.", sentences[0]);
         assertEquals("This company sells software.", sentences[1]);
         assertEquals("You are ready to sell software for this company.",
            sentences[2]);
      }
   }

With that test written, here's some code to make it pass:

   public class SimpleSentenceIdentifier
   {
      public static String[] getSentences(String text)
      {
         // split the string at the periods (escaped for the regex)
         String[] results = text.split("\\.");

         for (int x = 0; x < results.length; x++)
         {
            // add back the period for each sentence
            results[x] = results[x].trim() + ".";
         }

         return results;
      }
   }

Wow, that was easy! Why would you ever need anything more than that for identifying sentences? You could easily add support for question marks and exclamation points! Slow down; you must consider the nuances of the language. The preceding approach works only at the most superficial level of sentence identification, but if that's all you need, it might be satisfactory.

But what happens when you want to go a little bit deeper, when you want pretty good accuracy in more than just the simplest cases? Consider these simple extensions to the previous test:

   @Test
   public void identifySentencesWithEllipses()
   {
      String text =
         "But what happens when you want to go a little bit deeper...";
      String[] sentences = SimpleSentenceIdentifier.getSentences(text);
      assertNotNull(sentences);
      assertEquals(1, sentences.length);
      assertEquals(
         "But what happens when you want to go a little bit deeper...",
         sentences[0]);
   } // Fails! The extra periods in the ellipsis are discarded.

   @Test
   public void identifySentencesWithMrTitle()
   {
      String text =
         "What happens when you want pretty good accuracy in more " +
         "than just the simplest cases, Mr. Programmer?";
      String[] sentences = SimpleSentenceIdentifier.getSentences(text);
      assertNotNull(sentences);
      assertEquals(1, sentences.length);
      assertEquals("What happens when you want pretty good accuracy " +
         "in more than just the simplest cases, Mr. Programmer?",
         sentences[0]);
   } // Fails! The period in "Mr." is treated as the end of a sentence.

As you can see, these two tests would fail.

Identifying a sentence becomes easier with an understanding of language and punctuation rules. Your brain looks for familiar patterns in text as you read it. What you know about a language, and about its rules and exceptions, helps you interpret what is written.

You could enhance the SimpleSentenceIdentifier class shown earlier to understand and apply more of those rules and patterns when it parses text, but you would have to code them all, case by case. Recognizing that the period in "Mr." is not a sentence boundary, that three periods in a row represent an ellipsis, or that the initials "E" and "B" in "E.B. Huntington" are not sentences themselves are all enhancements you could make by improving the algorithm. But don't take that tack. If you find yourself needing to identify sentences at this level, there are alternatives to coding rules for every case.

One of those alternatives is the SentenceDetector included in OpenNLP Tools. OpenNLP is an umbrella project that includes several projects related to linguistics. The SentenceDetector included with OpenNLP Tools uses a maximum entropy algorithm (see the sidebar "Maximum Entropy" for more explanation) trained on a corpus of text extracted from the Wall Street Journal. A corpus is a body of text, often annotated with linguistic information.
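
Here's a minimal sketch of putting OpenNLP's sentence detector to work. The API has changed across OpenNLP versions; this example uses the newer opennlp.tools.sentdetect classes and assumes you have downloaded a pretrained English sentence model (distributed by the project as en-sent.bin):

   import java.io.FileInputStream;
   import java.io.InputStream;
   import opennlp.tools.sentdetect.SentenceDetectorME;
   import opennlp.tools.sentdetect.SentenceModel;

   public class OpenNlpSentenceSplitter
   {
      public static void main(String[] args) throws Exception
      {
         // Load the pretrained English sentence model from disk.
         InputStream modelIn = new FileInputStream("en-sent.bin");
         try
         {
            SentenceModel model = new SentenceModel(modelIn);
            SentenceDetectorME detector = new SentenceDetectorME(model);

            // The trained model handles titles and initials that
            // defeat the naive period-splitting approach.
            String[] sentences = detector.sentDetect(
               "Mr. Smith wrote to E.B. Huntington. The letter arrived late.");
            for (String sentence : sentences)
            {
               System.out.println(sentence);
            }
         }
         finally
         {
            modelIn.close();
         }
      }
   }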

How could you apply maximum entropy to sentence splitting? Try to think of some characteristics that you can use to identify when a period truly denotes the end of a sentence and when it's being used for some other purpose. (Hint: As one example, you can look at the capitalization of the letter immediately preceding the period, as in "H.G. Wells".) The great part is that you don't have to worry about how much each characteristic affects the outcome, just that it could have some effect. Leave the weighting of the characteristics to the maximum entropy algorithm!
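
To make "characteristics" concrete, here's an illustrative sketch (not OpenNLP's actual feature set) of contextual features you might generate for each candidate period and hand to a maximum entropy trainer:

   import java.util.ArrayList;
   import java.util.List;

   // Illustrative only: contextual features a maximum entropy model
   // could weigh when deciding whether the period at index i in text
   // ends a sentence. Real detectors use richer feature sets.
   public class BoundaryFeatures
   {
      public static List<String> featuresAt(String text, int i)
      {
         List<String> features = new ArrayList<String>();

         // A capital letter immediately before the period hints at an
         // initial, as in "H.G. Wells".
         features.add("prevCharIsUpper=" +
            (i > 0 && Character.isUpperCase(text.charAt(i - 1))));

         // Find the next non-space character; lowercase suggests the
         // sentence continues past this period.
         int j = i + 1;
         while (j < text.length() && text.charAt(j) == ' ')
         {
            j++;
         }
         features.add("nextCharIsLower=" +
            (j < text.length() && Character.isLowerCase(text.charAt(j))));

         return features;
      }
   }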

Given this background, here's an example of how you might put it to work in a plagiarism-detection application. Listing 1 contains some tests that express the expectations, while Listing 2 shows one possible implementation to satisfy the behavior defined in the unit tests in Listing 1.
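
The heart of the comparison in those listings is counting shared sentences, as described earlier. Here's a minimal sketch of that step (a simplified stand-in, not the article's actual Listing 2), assuming each text has already been split into sentences by a detector like the one above:

   import java.util.HashSet;
   import java.util.Set;

   // Minimal sketch of the comparison step: counts sentences that
   // appear verbatim in both texts. Assumes each text has already
   // been split into sentences.
   public class SentenceOverlap
   {
      public static int sharedSentenceCount(String[] first, String[] second)
      {
         Set<String> seen = new HashSet<String>();
         for (String sentence : first)
         {
            seen.add(sentence.trim());
         }

         int shared = 0;
         for (String sentence : second)
         {
            if (seen.contains(sentence.trim()))
            {
               shared++;
            }
         }
         return shared;
      }
   }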

Now that you have some working code, you can graft on a user interface that can compare freely available online book texts (from Project Gutenberg, for example) for similarity. Figure 1 shows a simple command-line UI in action. You can get the code for this user interface in the downloadable source attached to this article.

 
Figure 1. The Plagiarism Detector in Action: The plagiarism detector correctly identifies identical sentences that occur in two different works by Mark Twain.

As you can see, the devil is in the details when it comes to sentence identification. Identifying places where periods, question marks, or exclamation points exist is not always enough to confidently slice a chunk of text up into sentences. Fortunately, many projects out there can help when you're looking to use language more realistically in your programming projects.

Pluralizing English Nouns
Another interesting and increasingly common need for linguistics in software applications is pluralization. The popularization of pluralization has been partly driven by the success of domain-specific languages (DSLs) and Ruby on Rails. Rails uses pluralization to help programs read more naturally and more closely represent the domains they model. For example, the following code shows how you might model the relationship between firms and clients using Rails' Active Record object-relational mapping (ORM) technology:

   class Firm < ActiveRecord::Base
     has_many :clients
   end

   class Client < ActiveRecord::Base
     belongs_to :firm
   end

The conventions of Rails allow the framework to deduce that because the Firm class has many clients, there is a one-to-many relationship between the Firm class and the Client class. One Java tool that can perform similar pluralization is the java.net Inflector project.

Pluralization of English words is one of those problems where the 80 percent case is easy, but the remaining 20 percent is far more expensive to handle because of English's many irregularities. You can pluralize most English nouns by adding an "s": for example, chair --> chairs, program --> programs, etc. Unfortunately, it's not as easy to create a rule for such irregular pluralizations as child --> children. The Inflector project is largely based on the pluralization algorithm published in the paper "An Algorithmic Approach to English Pluralization" by Damian Conway. This paper identifies three categories of English plurals: universal, suffix-based, and exceptional.

Universal plurals cover the common case of simply adding an "s" to the base noun form of a word to form its plural. The following unit tests illustrate Inflector properly pluralizing nouns that fall into the universal plural category:

   @Test
   public void testPluralize_Universal_Book()
   {
      assertEquals("books", Noun.pluralOf("book"));
   }

   @Test
   public void testPluralize_Universal_Article()
   {
      assertEquals("articles", Noun.pluralOf("article"));
   }

Suffix-based pluralizations are still regular, in the sense that they are predictable, but they specify different plural endings based on the suffix of the base word. For example, English words ending in "ch" are made plural by adding an "es" to the end, such as batch --> batches. The following unit tests show Inflector pluralizing nouns using suffix-based pluralization:

   @Test
   public void testPluralize_SuffixBased_Woman()
   {
      assertEquals("women", Noun.pluralOf("woman"));
   }

   @Test
   public void testPluralize_SuffixBased_Box()
   {
      assertEquals("boxes", Noun.pluralOf("box"));
   }

Lastly, exceptional pluralization involves working with those irregular English words that simply don't follow any general rule. Out of the box, Inflector doesn't handle 100 percent of English irregular words correctly, but it does provide a framework for handling user-specified pluralizations. The following unit tests show some irregular nouns that Inflector handles correctly, and some that it doesn't:

   @Test
   public void testPluralize_Irregular_Correct_Datum()
   {
      assertEquals("data", Noun.pluralOf("datum"));
   }

   @Test
   public void testPluralize_Irregular_Correct_Man()
   {
      assertEquals("men", Noun.pluralOf("man"));
   }

   @Test
   public void testPluralize_Irregular_NotCorrect_Matrix()
   {
      assertNotEquals("matrices", Noun.pluralOf("matrix"));
   }

   @Test
   public void testPluralize_Irregular_NotCorrect_Thief()
   {
      assertNotEquals("thieves", Noun.pluralOf("thief"));
   }
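
To see how Conway's three rule categories compose, here is a minimal hand-rolled sketch (an illustration, not Inflector's implementation): exceptions are checked first, then suffix rules, and the universal "add an s" rule serves as the fallback:

   import java.util.HashMap;
   import java.util.Map;

   // Minimal hand-rolled sketch of Conway's three rule categories.
   // This is an illustration, not Inflector's implementation:
   // exceptions win over suffix rules, which win over the universal
   // "add an s" fallback.
   public class TinyPluralizer
   {
      private static final Map<String, String> EXCEPTIONS =
         new HashMap<String, String>();
      static
      {
         EXCEPTIONS.put("child", "children");
         EXCEPTIONS.put("datum", "data");
         EXCEPTIONS.put("man", "men");
      }

      public static String pluralOf(String noun)
      {
         // Exceptional: irregular words with no productive rule.
         if (EXCEPTIONS.containsKey(noun))
         {
            return EXCEPTIONS.get(noun);
         }

         // Suffix-based: endings that select a different plural suffix.
         if (noun.endsWith("ch") || noun.endsWith("sh") ||
             noun.endsWith("x") || noun.endsWith("s"))
         {
            return noun + "es";
         }
         if (noun.endsWith("y") && !noun.matches(".*[aeiou]y"))
         {
            return noun.substring(0, noun.length() - 1) + "ies";
         }

         // Universal: just add "s".
         return noun + "s";
      }
   }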

This article showed several simple applications of linguistic technology: automatically classifying text, identifying sentences, and pluralizing words. Hopefully it has inspired you with some ideas about how you might apply linguistic technology in your own applications.
