Bad at Grammar? Cheat with Java Linguistics Tools : Page 3

Use these linguistics tools to make Java applications your English teacher would be proud of.

Identifying Sentences and Comparing Texts
For some types of linguistic analysis, it's helpful to split a block of text into its individual sentences. In other words, if you have a bunch of text strung together in a paragraph, how can you identify the individual sentences in it?

As an example, suppose you want to write a program that can identify plagiarism, or similarity between two texts, measured as the number of sentences the two texts share. To do this, you first need to split each text into its sentences. Consider the following test case:

// JUnit 4 imports are assumed here; the original listing omits them.
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNotNull;

import org.junit.Test;

public class SimpleSentenceIdentifierTest {

    @Test
    public void identifySimpleSentencesInText() {
        String text = "Joe is a sales person. "
                + "This company sells software. "
                + "You are ready to sell software for this company.";

        String[] sentences = SimpleSentenceIdentifier.getSentences(text);

        assertNotNull(sentences);
        assertEquals(3, sentences.length);
        assertEquals("Joe is a sales person.", sentences[0]);
        assertEquals("This company sells software.", sentences[1]);
        assertEquals("You are ready to sell software for this company.", sentences[2]);
    }
}

With that test written, here's some code to make it pass:

public class SimpleSentenceIdentifier {

    public static String[] getSentences(String text) {
        // split the string at the periods
        String[] results = text.split("\\.");
        for (int x = 0; x < results.length; x++) {
            // add back the period for each sentence
            results[x] = results[x].trim() + ".";
        }
        return results;
    }
}

Wow, that was easy! Why would you ever need anything more than that for identifying sentences? You could easily add support for question marks and exclamation points! Slow down. You must consider the nuances of the language. The preceding approach will work only at the most superficial level of sentence identification in text—but if that's all you need, it might be satisfactory.
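To make the "question marks and exclamation points" idea concrete, here is a minimal sketch of such an extension. The class name PunctuationSentenceSplitter is hypothetical and is not part of the article's downloadable source. It splits wherever a period, question mark, or exclamation point is followed by whitespace, and it still knows nothing about abbreviations or initials.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the article's source download.
class PunctuationSentenceSplitter {

    // Splits wherever '.', '?' or '!' is followed by whitespace,
    // keeping the punctuation attached to the sentence it ends.
    static String[] getSentences(String text) {
        List<String> sentences = new ArrayList<>();
        // The lookbehind matches the whitespace that follows a terminator
        // without consuming the terminator itself.
        for (String part : text.trim().split("(?<=[.?!])\\s+")) {
            if (!part.isEmpty()) {
                sentences.add(part);
            }
        }
        return sentences.toArray(new String[0]);
    }
}
```

Because it treats every period followed by a space as a boundary, a sentence such as "What about you, Mr. Programmer?" still comes back as two pieces.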

But what happens when you want to go a little bit deeper, when you want pretty good accuracy in more than just the simplest cases? Consider these simple extensions to the previous test:

// Fails: the additional periods in the ellipsis are discarded
@Test
public void identifySentencesWithEllipses() {
    String text = "But what happens when you want to go a little bit deeper...";

    String[] sentences = SimpleSentenceIdentifier.getSentences(text);

    assertNotNull(sentences);
    assertEquals(1, sentences.length);
    assertEquals(
            "But what happens when you want to go a little bit deeper...",
            sentences[0]); // FAILS!
}

// Fails: the period in "Mr." is considered the end of a sentence
@Test
public void identifySentencesWithMrTitle() {
    String text = "What happens when you want pretty good accuracy in more "
            + "than just the simplest cases, Mr. Programmer?";

    String[] sentences = SimpleSentenceIdentifier.getSentences(text);

    assertNotNull(sentences);
    assertEquals(1, sentences.length); // FAILS! two "sentences" are returned
    assertEquals("What happens when you want pretty good accuracy "
            + "in more than just the simplest cases, Mr. Programmer?",
            sentences[0]);
}

As you can see, these two tests would fail.

Identifying a sentence becomes easier with an understanding of language and punctuation rules. Your brain looks for familiar patterns in text as you read it. What you know about a language, including its rules and exceptions, helps you interpret what is written.

You could enhance the SimpleSentenceIdentifier class shown earlier to understand and apply more of those rules and patterns when it parses text, but you would have to code each rule and each exception by hand. Recognizing that the period in "Mr." is not a sentence boundary, that three periods in a row represent an ellipsis, or even that the initials "E" and "B" in "E.B. Huntington" are not sentences themselves are all enhancements you could make by improving the algorithm. But don't take that tack. If you find yourself needing to discover valid sentences at this level, there are alternatives to coding rules for all the cases.

One of those alternatives is the SentenceDetector included in OpenNLP Tools. OpenNLP is an umbrella project that includes several projects related to linguistics. The SentenceDetector uses a maximum entropy algorithm (see the sidebar "Maximum Entropy" for more explanation) that is trained on a corpus of text extracted from the Wall Street Journal. A corpus is a collection of text, in this case one that has been annotated with language information.

How could you apply maximum entropy to sentence splitting? Try to think of some characteristics that you can use to identify when a period truly denotes the end of a sentence and when it's being used for some other purpose. (Hint: As one example, you can look at the capitalization of the letter immediately preceding the period, as in "H.G. Wells".) The great part is that you don't have to worry about how much each characteristic affects the outcome, just that it could have some effect. Leave the weighting of the characteristics to the maximum entropy algorithm!
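To illustrate what such characteristics might look like in code, here is a rough sketch that describes the context around a single period as a list of named features. Everything in it is an assumption made for illustration: the class name PeriodFeatures, the feature names, and the tiny abbreviation list are hypothetical and are not the features OpenNLP actually uses. The sketch only collects characteristics; deciding how much each one matters would be left to the trained model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical feature extractor for illustration only; the feature names
// and the abbreviation list are assumptions, not OpenNLP's actual features.
class PeriodFeatures {

    private static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("Mr", "Mrs", "Ms", "Dr"));

    // Describes the context around the period at periodIndex as a list of
    // characteristics that a trained model could weight.
    static List<String> featuresAt(String text, int periodIndex) {
        List<String> features = new ArrayList<>();

        // Collect the run of letters immediately before the period.
        int start = periodIndex;
        while (start > 0 && Character.isLetter(text.charAt(start - 1))) {
            start--;
        }
        String previous = text.substring(start, periodIndex);

        if (previous.length() == 1 && Character.isUpperCase(previous.charAt(0))) {
            features.add("prevIsSingleCapital"); // e.g. the "H" in "H.G. Wells"
        }
        if (ABBREVIATIONS.contains(previous)) {
            features.add("prevIsKnownAbbreviation");
        }

        // Look at what follows the period.
        int next = periodIndex + 1;
        if (next >= text.length()) {
            features.add("endOfText");
        } else if (text.charAt(next) == '.') {
            features.add("nextIsPeriod"); // possibly part of an ellipsis
        } else {
            while (next < text.length() && Character.isWhitespace(text.charAt(next))) {
                next++;
            }
            if (next < text.length() && Character.isUpperCase(text.charAt(next))) {
                features.add("nextStartsUpperCase");
            }
        }
        return features;
    }
}
```

For the first period in "H.G. Wells wrote it." this reports prevIsSingleCapital and nextStartsUpperCase, while the final period reports only endOfText. Neither feature settles the question alone, which is exactly why the weighting is left to the algorithm.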

Given this background, here's an example of how you might put it to work in a plagiarism-detection application. Listing 1 contains some tests that express the expectations, while Listing 2 shows one possible implementation to satisfy the behavior defined in the unit tests in Listing 1.
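As a rough idea of the comparison step, the following sketch counts the distinct sentences that two texts share once each text has already been split into sentences. It is not the article's Listing 2: the class name SharedSentenceCounter and the choice to ignore case and surrounding whitespace are assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch for illustration; not the article's Listing 2.
class SharedSentenceCounter {

    // Counts the distinct sentences that appear in both arrays,
    // ignoring case and surrounding whitespace.
    static int countShared(String[] first, String[] second) {
        Set<String> seen = new HashSet<>();
        for (String sentence : first) {
            seen.add(normalize(sentence));
        }

        Set<String> shared = new HashSet<>();
        for (String sentence : second) {
            String candidate = normalize(sentence);
            if (seen.contains(candidate)) {
                shared.add(candidate);
            }
        }
        return shared.size();
    }

    private static String normalize(String sentence) {
        return sentence.trim().toLowerCase(Locale.ROOT);
    }
}
```

The sentence arrays could come from any splitter, whether the simple one shown earlier or a trained sentence detector. The quality of the similarity score depends directly on how well that splitter finds sentence boundaries.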

Now that you have some working code, you can graft on a user interface that compares freely available online book texts (from Project Gutenberg, for example) for similarity. Figure 1 shows a simple command-line UI in action. You can get the code for this user interface in the downloadable source attached to this article.

Figure 1. The Plagiarism Detector in Action: The plagiarism detector correctly identifies identical sentences that occur in two different works by Mark Twain.
As you can see, the devil is in the details when it comes to sentence identification. Identifying places where periods, question marks, or exclamation points exist is not always enough to confidently slice a chunk of text up into sentences. Fortunately, many projects out there can help when you're looking to use language more realistically in your programming projects.
