
Use BreakIterator to Parse Text


Parsing text is a common but deceptively complex operation. For example, your application might need to let users enter text and then break that text into separate words or sentences for processing. On the surface, this task seems easy. In the case of sentence parsing, for instance, it may appear that you can separate sentences simply by searching for the period (.) character. One problem with this approach is that characters other than the period, such as question marks (?) and exclamation marks (!), can also end a sentence. In addition, periods have other uses, such as representing decimal points. To make matters worse, a different language may use an entirely different set of characters for sentence termination, or may use these same characters in a different way.

Fortunately, the java.text.BreakIterator class provides powerful, locale-aware parsing capabilities, so your code does not have to hard-code the rules of any one language. This sample code illustrates how you can use BreakIterator to parse a string on a per-sentence basis:

    import java.text.*;

    public class parseit {
        public static void main(String[] args) {
            String sentence;
            String text = "John Smith stopped by earlier " +
                          "to say 'Happy birthday!' Aren't " +
                          "you and he the same age? He and " +
                          "his wife have 2.5 children.";
            BreakIterator bi = BreakIterator.getSentenceInstance();
            bi.setText(text);
            int index = 0;
            while (bi.next() != BreakIterator.DONE) {
                sentence = text.substring(index, bi.current());
                System.out.println("Sentence: " + sentence);
                index = bi.current();
            }  //  while (bi.next() != BreakIterator.DONE)
        }  //  public static void main()
    }  //  public class parseit

Running this program produces this output:

    Sentence: John Smith stopped by earlier to say 'Happy birthday!'
    Sentence: Aren't you and he the same age?
    Sentence: He and his wife have 2.5 children.
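The loop above uses the default locale, and each substring it extracts keeps the whitespace that follows the sentence. As a minimal sketch of one way to handle both points, the variant below passes an explicit Locale to getSentenceInstance() and trims each sentence before collecting it. The SentenceSplitter class and its sentences() helper are hypothetical names introduced only for this illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceSplitter {

    // Hypothetical helper: returns the sentences found in text,
    // each trimmed of leading and trailing whitespace.
    static List<String> sentences(String text, Locale locale) {
        List<String> result = new ArrayList<>();
        BreakIterator bi = BreakIterator.getSentenceInstance(locale);
        bi.setText(text);

        // Walk consecutive boundary pairs: [start, end) is one sentence.
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String s : sentences("Is it late? Yes. Go home!", Locale.ENGLISH)) {
            System.out.println("Sentence: " + s);
        }
    }
}
```

Returning a list rather than printing inside the loop is a design choice made here so the caller can decide what to do with each sentence.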

The BreakIterator class also provides static getCharacterInstance(), getWordInstance(), and getLineInstance() methods. These methods return BreakIterator instances that allow you to parse at the character, word, and line level, respectively.
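As a rough sketch of word-level parsing, the example below uses getWordInstance() to collect the words in a string. The word iterator reports every boundary, so runs of spaces and punctuation come back as segments too. This sketch assumes you only want segments that begin with a letter or digit, which is one possible filter, not the only one. The WordSplitter class and its words() helper are hypothetical names used for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

class WordSplitter {

    // Hypothetical helper: returns the segments of text that start
    // with a letter or digit, skipping whitespace and punctuation.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        BreakIterator bi = BreakIterator.getWordInstance();
        bi.setText(text);

        // Each [start, end) pair is one segment between word boundaries.
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            if (Character.isLetterOrDigit(text.codePointAt(start))) {
                result.add(text.substring(start, end));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String w : words("Hello, world!")) {
            System.out.println("Word: " + w);
        }
    }
}
```

The same boundary-pair loop should work with the iterators returned by getCharacterInstance() and getLineInstance(); only the factory method changes.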
