Parsing text is a common, complex operation. For example, your application might need to allow users to enter text and then break the text into separate words or sentences for processing. On the surface, this task seems easy. In the case of sentence parsing, for instance, it may appear that you can separate sentences simply by searching for the period (.) character. One problem with this approach is that characters other than the period can be used to end a sentence, such as question marks (?) or exclamation marks (!). In addition, periods have other uses, such as representing decimal points. To make matters worse, a different language may use an entirely different set of characters for sentence termination, or could use these same characters in a different way. Fortunately, the java.text.BreakIterator class provides some powerful parsing capabilities in a language-independent manner. This sample code illustrates how you can use the BreakIterator to parse a string on a per-sentence basis:
import java.text.*;

public class parseit {

    public static void main(String[] args) {
        String sentence;
        String text = "John Smith stopped by earlier " +
                      "to say 'Happy birthday!' Aren't " +
                      "you and he the same age? He and " +
                      "his wife have 2.5 children.";
        BreakIterator bi = BreakIterator.getSentenceInstance();
        bi.setText(text);
        int index = 0;
        while (bi.next() != BreakIterator.DONE) {
            sentence = text.substring(index, bi.current());
            System.out.println("Sentence: " + sentence);
            index = bi.current();
        } // while (bi.next() != BreakIterator.DONE)
    } // public static void main()

} // public class parseit
Running this program produces this output:
Sentence: John Smith stopped by earlier to say 'Happy birthday!'
Sentence: Aren't you and he the same age?
Sentence: He and his wife have 2.5 children.
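For contrast, here is a minimal sketch of the naive approach described earlier, which simply splits on the period character. The class name NaiveSplitSketch and the helper splitOnPeriod are hypothetical names used only for illustration. Run against the same text, it should miss the sentences that end in "!" and "?" and cut "2.5" in half:

```java
public class NaiveSplitSketch {

    // Hypothetical helper: treats every period as a sentence terminator.
    static String[] splitOnPeriod(String text) {
        // split() takes a regular expression, so the period must be escaped.
        return text.split("\\.");
    }

    public static void main(String[] args) {
        String text = "John Smith stopped by earlier " +
                      "to say 'Happy birthday!' Aren't " +
                      "you and he the same age? He and " +
                      "his wife have 2.5 children.";
        for (String piece : splitOnPeriod(text)) {
            System.out.println("Piece: " + piece.trim());
        }
    }
}
```

The only split point this version finds is the decimal point in "2.5", which is exactly the kind of mistake the sentence-level BreakIterator avoids.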
The BreakIterator class also provides static getCharacterInstance(), getWordInstance(), and getLineInstance() methods. These methods return BreakIterator instances that allow you to parse at the character, word, and line level, respectively.
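As a rough sketch of word-level parsing, the example below uses getWordInstance() to pull the words out of a string. The class name WordSplitSketch and the helper words() are hypothetical, not part of the original sample. One assumption worth stating: a word iterator reports boundaries around whitespace and punctuation as well as around words, so this sketch keeps only the pieces that begin with a letter or digit.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordSplitSketch {

    // Hypothetical helper: returns the word-like pieces of the text.
    static List<String> words(String text, Locale locale) {
        BreakIterator bi = BreakIterator.getWordInstance(locale);
        bi.setText(text);
        List<String> result = new ArrayList<>();
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            String piece = text.substring(start, end);
            // Skip the pieces that are only spaces or punctuation.
            if (Character.isLetterOrDigit(piece.codePointAt(0))) {
                result.add(piece);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String word : words("He and his wife have 2.5 children.", Locale.US)) {
            System.out.println("Word: " + word);
        }
    }
}
```

Passing an explicit Locale is optional here; the no-argument getWordInstance() uses the default locale, just as getSentenceInstance() does in the sample above.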