

Bad at Grammar? Cheat with Java Linguistics Tools : Page 2

Use these linguistics tools to make Java applications your English teacher would be proud of.

Classifying Text Automatically
Text classification techniques assign arbitrary text passages to one of a set of predetermined categories. For example, you can use text classification to programmatically identify the language in which a text is written, its topic, its sentiment (e.g., inflammatory or reasoned), or a possible author. Most text classification techniques apply statistical methods to a training corpus (a set of known texts used for training systems) to develop a model for determining the most likely category of future text passages.

To understand how these techniques can be useful in practical ways, consider the common support@mycompany.com email address that many companies make available for customers to contact them with support requests. When companies receive these requests, an operator must analyze them and route the messages to the appropriate department. That's a lot of work. Now, suppose an automated agent could monitor the mailbox, determine the most likely topic of the email, and route it to the right people. Further, suppose that agent could continually learn from its experience classifying and routing emails, providing ever-higher levels of accuracy. It is easy to see how such an agent could drive down costs while increasing customer satisfaction.

As an example, this article describes how to build a Java application that classifies email messages sent to the mailing lists of two open-source projects: Maven and Grails. Messages extracted from both lists pass through an email classifier that categorizes each message by its linguistic similarity to previously seen messages. The classifier's goal is to determine automatically which mailing list each email came from.

The techniques and math behind text classification are fairly complicated and involve a bunch of Greek letters, but luckily, there are Java libraries that abstract this complexity and provide simple interfaces to their capabilities. For this example you'll see how to use LingPipe to classify email messages. If you are interested in the algorithms and math behind text classification, consult the Related Resources in the left column of this article for links where you can learn more.

Twenty messages from each of the Maven and Grails mailing lists serve as sample data, for forty messages in total. Ten messages from each list are used to train the email classifier, while the remaining twenty messages (ten per list) are used to test whether the classifier can correctly identify each message as being related to Maven or Grails. The files are named according to their function and stored for use by the downloadable sample email classification application that accompanies this article, as shown below:

```
src/test/resources
    /grails
        /test
            grails-test-1.txt
            ...
            grails-test-10.txt
        /train
            grails-train-1.txt
            ...
            grails-train-10.txt
    /maven
        /test
            maven-test-1.txt
            ...
            maven-test-10.txt
        /train
            maven-train-1.txt
            ...
            maven-train-10.txt
```
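The naming convention in the layout above can be captured in a small helper. This is only a sketch: the class name `CorpusPaths` and the method `resourcePath` are hypothetical and are not part of the downloadable sample application.

```java
/**
 * Hypothetical helper (not part of the article's sample code) that builds
 * classpath resource names following the corpus layout shown above.
 */
final class CorpusPaths {

    private CorpusPaths() {
        // Static utility class; not meant to be instantiated.
    }

    /**
     * @param category the mailing list name, "grails" or "maven"
     * @param phase    "train" or "test"
     * @param index    1-based message number (1 through 10 in the sample data)
     * @return a resource name such as "/grails/train/grails-train-3.txt"
     */
    static String resourcePath(String category, String phase, int index) {
        return "/" + category + "/" + phase + "/"
                + category + "-" + phase + "-" + index + ".txt";
    }
}
```

A name built this way can be passed to `getClass().getResourceAsStream(...)`, which is how the article's unit test loads each message.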

Now consider the following unit test that validates the expected behavior of the email classifier that you'll develop shortly:

src/test/java/com/devx/language/classification/EmailClassifierTest.java

```java
package com.devx.language.classification;

import static org.junit.Assert.assertEquals;

import java.io.IOException;

import org.junit.Before;
import org.junit.Test;

public class EmailClassifierTest {

    private EmailClassifier classifier;

    // Train the classifier with ten known messages from each mailing list.
    @Before
    public void setUp() throws IOException {
        classifier = new EmailClassifier("grails", "maven");
        for (int i = 1; i <= 10; i++) {
            classifier.trainCategory("grails",
                this.getClass().getResourceAsStream(
                    "/grails/train/grails-train-" + i + ".txt"));
        }
        for (int i = 1; i <= 10; i++) {
            classifier.trainCategory("maven",
                this.getClass().getResourceAsStream(
                    "/maven/train/maven-train-" + i + ".txt"));
        }
    }

    // All ten test messages per list should be assigned to the right list.
    @Test
    public void testClassify() throws IOException {
        assertEquals(10, countRightsForCategory(classifier, "grails"));
        assertEquals(10, countRightsForCategory(classifier, "maven"));
    }

    // Counts how many of the ten test messages for a category are
    // classified as belonging to that category.
    private int countRightsForCategory(
            EmailClassifier classifier, String category) throws IOException {
        int rights = 0;
        for (int i = 1; i <= 10; i++) {
            String classification = classifier.classify(
                this.getClass().getResourceAsStream(
                    "/" + category + "/test/" + category + "-test-" + i + ".txt"));
            if (classification.equals(category)) {
                rights++;
            }
        }
        return rights;
    }
}
```

Notice how the setUp method of the test class first trains the EmailClassifier object by repeatedly invoking its trainCategory method, passing in each training email along with its expected classification. Then, the testClassify method calls the countRightsForCategory helper for each mailing list. That helper invokes the classify method on the classifier with each of the ten test emails for the category and tallies the correctly classified ones in the rights variable. In this case, all 20 test documents are correctly categorized as being related to their respective mailing lists! This is particularly impressive if you consider how little content some of the test messages contain. For example, the grails-test-5.txt message contains only the following question:

"If I want to find the first or last row, how will the syntax look?"

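The per-category counts asserted in the unit test can also be expressed as a single accuracy fraction. The following is a minimal sketch in plain Java; the class name `AccuracySketch` and its `accuracy` method are hypothetical and do not appear in the sample application or in LingPipe.

```java
import java.util.List;

/**
 * Hypothetical helper (not part of the article's sample code) that turns
 * expected and predicted category labels into an accuracy fraction.
 */
final class AccuracySketch {

    private AccuracySketch() {
        // Static utility class; not meant to be instantiated.
    }

    /**
     * Returns the fraction of positions at which the predicted label
     * equals the expected label, as a value between 0.0 and 1.0.
     */
    static double accuracy(List<String> expected, List<String> predicted) {
        if (expected.size() != predicted.size()) {
            throw new IllegalArgumentException("Lists must have the same length");
        }
        if (expected.isEmpty()) {
            throw new IllegalArgumentException("At least one label is required");
        }
        int correct = 0;
        for (int i = 0; i < expected.size(); i++) {
            if (expected.get(i).equals(predicted.get(i))) {
                correct++;
            }
        }
        return (double) correct / expected.size();
    }
}
```

For the unit test above, 20 correct classifications out of 20 test messages correspond to an accuracy of 1.0.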
Of course, it's unreasonable to expect 100% classification accuracy with any probability-based tool, but with an adequately thorough training corpus, tools such as LingPipe can do very well. The code for the EmailClassifier object is very straightforward because the LingPipe DynamicLMClassifier object does much of the heavy lifting as shown below:

src/main/java/com/devx/language/classification/EmailClassifier.java

```java
package com.devx.language.classification;

import java.io.IOException;
import java.io.InputStream;

import com.aliasi.classify.Classification;
import com.aliasi.classify.DynamicLMClassifier;
import com.aliasi.lm.NGramProcessLM;
import com.aliasi.util.Streams;

/**
 * Classifies email by the most likely category. This class
 * uses a probabilistic approach comparing the email to
 * a training corpus.
 */
public class EmailClassifier {

    private DynamicLMClassifier<NGramProcessLM> classifier;

    public EmailClassifier(String... categories) {
        // The second argument is the maximum character n-gram length.
        classifier = DynamicLMClassifier.createNGramProcess(categories, 6);
    }

    /**
     * Trains the classifier with the specified category and
     * training text. This can be called repeatedly with the
     * same category to strengthen the classifier's accuracy.
     */
    public void trainCategory(String category, InputStream trainingStream)
            throws IOException {
        classifier.train(category, readStreamIntoString(trainingStream));
    }

    /**
     * Returns the most likely category for the email based
     * on the training text.
     */
    public String classify(InputStream emailStream) throws IOException {
        Classification classification = classifier.classify(
            readStreamIntoString(emailStream));
        return classification.bestCategory();
    }

    // Reads the whole stream as UTF-8 text.
    private CharSequence readStreamIntoString(InputStream stream)
            throws IOException {
        char[] characterArray = Streams.toCharArray(stream, "UTF-8");
        return String.valueOf(characterArray);
    }
}
```

LingPipe can apply several different types of classifiers, but this example uses a language-model-based classifier built on character N-grams (here, sequences of up to six characters). Other classifiers offered by LingPipe include a k-nearest-neighbor classifier, a naive Bayes classifier, and a kernel-based perceptron classifier. The algorithms behind these classifiers are quite technical and beyond the scope of this article.
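To give a rough feel for the idea without the math, here is a toy character N-gram classifier in plain Java. It is a simplified sketch and not LingPipe's algorithm: the class `ToyNGramClassifier`, its fixed N-gram length, and its crude add-one smoothing are all assumptions made for illustration. It counts the N-grams seen in each category's training text and picks the category under which the N-grams of a new text are most probable.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Toy character n-gram classifier for illustration only.
 * This is NOT how LingPipe's DynamicLMClassifier is implemented.
 */
final class ToyNGramClassifier {

    private final int n;
    // category -> (n-gram -> number of times seen in training text)
    private final Map<String, Map<String, Integer>> counts = new LinkedHashMap<>();
    // category -> total number of n-grams seen in training text
    private final Map<String, Integer> totals = new HashMap<>();

    ToyNGramClassifier(int n, String... categories) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        this.n = n;
        for (String category : categories) {
            counts.put(category, new HashMap<>());
            totals.put(category, 0);
        }
    }

    /** Adds the n-grams of the text to the counts for the category. */
    void train(String category, CharSequence text) {
        Map<String, Integer> grams = counts.get(category);
        if (grams == null) {
            throw new IllegalArgumentException("Unknown category: " + category);
        }
        String s = text.toString();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.merge(s.substring(i, i + n), 1, Integer::sum);
            totals.merge(category, 1, Integer::sum);
        }
    }

    /**
     * Returns the category with the highest smoothed log-likelihood.
     * Ties (including texts shorter than n) go to the first category.
     */
    String classify(CharSequence text) {
        String s = text.toString();
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : counts.keySet()) {
            double score = logLikelihood(category, s);
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    // Sums log probabilities of each n-gram, using add-one smoothing so
    // that n-grams never seen in training do not yield a zero probability.
    private double logLikelihood(String category, String s) {
        Map<String, Integer> grams = counts.get(category);
        double total = totals.get(category);
        double vocabulary = grams.size() + 1; // +1 bucket for unseen n-grams
        double score = 0.0;
        for (int i = 0; i + n <= s.length(); i++) {
            int count = grams.getOrDefault(s.substring(i, i + n), 0);
            score += Math.log((count + 1.0) / (total + vocabulary));
        }
        return score;
    }
}
```

Usage mirrors the EmailClassifier in this article: construct it with the category names, call train once per training message, then call classify on new text. A real language-model classifier handles smoothing and N-gram lengths far more carefully, which is why a library such as LingPipe is the better choice in practice.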

LingPipe can do many other types of linguistic processing. Visit the LingPipe web site for more information.
