Bad at Grammar? Cheat with Java Linguistics Tools

Use these linguistics tools to make Java applications your English teacher would be proud of.

Pluralizing English Nouns
Another interesting and increasingly common use of linguistics in software applications is pluralization. Its popularity has been driven partly by the success of domain-specific languages (DSLs) and Ruby on Rails. Rails uses pluralization to help programs read more naturally and more closely represent the domains they model. For example, the following code shows how you might model the relationship between firms and clients using Rails' Active Record object-relational mapping (ORM) technology:

class Firm < ActiveRecord::Base
  has_many :clients
end

class Client < ActiveRecord::Base
  belongs_to :firm
end

The conventions of Rails allow the framework to deduce that, because a Firm has many clients, there is a one-to-many relationship between the Firm class and the Client class. One Java tool that can perform similar pluralization is the java.net Inflector project.
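To make the naming convention concrete in Java terms, the following is a minimal sketch of how a framework could map between a plural association name and a singular class name. It is an illustration only: ConventionNames and its methods are hypothetical, they apply only the naive "add or strip an s" rule, and they do not reflect how Rails or Inflector are implemented.

```java
/** Hypothetical helper illustrating convention-based naming; not Rails or Inflector code. */
final class ConventionNames {

    /** "Client" -> "clients": class name to plural association name (naive "add s" rule). */
    static String associationNameFor(String className) {
        String lowerFirst = Character.toLowerCase(className.charAt(0)) + className.substring(1);
        return lowerFirst + "s";
    }

    /** "clients" -> "Client": plural association name back to the class name (naive "strip s" rule). */
    static String classNameFor(String associationName) {
        String singular = associationName.endsWith("s")
                ? associationName.substring(0, associationName.length() - 1)
                : associationName;
        return Character.toUpperCase(singular.charAt(0)) + singular.substring(1);
    }

    public static void main(String[] args) {
        System.out.println(associationNameFor("Client")); // prints "clients"
        System.out.println(classNameFor("clients"));      // prints "Client"
    }
}
```

A real implementation would need a proper pluralizer in place of the "s" rule, since a class such as Child would otherwise map to "childs".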

Pluralization of English words is one of those problems where the 80 percent case is easy but the remaining 20 percent is disproportionately expensive because of English's many irregularities. You can pluralize most English nouns by adding an "s": chair --> chairs, program --> programs, and so on. Unfortunately, it's not as easy to create a rule for irregular pluralizations such as child --> children. The Inflector project is largely based on the pluralization algorithm published in the paper "An Algorithmic Approach to English Pluralization" by Damian Conway. This paper identifies three categories of English plurals: universal, suffix-based, and exceptional.

Universal plurals cover the common case of simply adding an "s" to the base noun form of a word to form its plural. The following unit tests illustrate Inflector properly pluralizing nouns that fall into the universal plural category.

@Test
public void testPluralize_Universal_Book() {
    assertEquals("books", Noun.pluralOf("book"));
}

@Test
public void testPluralize_Universal_Article() {
    assertEquals("articles", Noun.pluralOf("article"));
}

Suffix-based pluralizations are still regular, in the sense that they are predictable, but they specify different plural endings based on the suffix of the base word. For example, English words ending in "ch" are made plural by adding an "es" to the end, such as batch --> batches. The following unit tests show Inflector pluralizing nouns using suffix-based pluralization:

@Test
public void testPluralize_SuffixBased_Woman() {
    assertEquals("women", Noun.pluralOf("woman"));
}

@Test
public void testPluralize_SuffixBased_Box() {
    assertEquals("boxes", Noun.pluralOf("box"));
}
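One way to picture suffix-based rules is as an ordered table of suffix replacements that is consulted before the universal "add s" fallback. The sketch below is an illustration only, not Inflector's implementation: the class SuffixPluralizer and its rule table are hypothetical, and the table is far smaller than a realistic rule set.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of suffix-based pluralization; not Inflector's actual code. */
final class SuffixPluralizer {

    // Ordered rules: the first suffix that matches wins, so every
    // suffix rule is tried before the universal fallback.
    private static final Map<String, String> SUFFIX_RULES = new LinkedHashMap<>();
    static {
        SUFFIX_RULES.put("man", "men");  // woman -> women
        SUFFIX_RULES.put("ch", "ches");  // batch -> batches
        SUFFIX_RULES.put("sh", "shes");  // dish  -> dishes
        SUFFIX_RULES.put("x", "xes");    // box   -> boxes
    }

    static String pluralOf(String noun) {
        for (Map.Entry<String, String> rule : SUFFIX_RULES.entrySet()) {
            String suffix = rule.getKey();
            if (noun.endsWith(suffix)) {
                // Replace the matched suffix with its plural form.
                String stem = noun.substring(0, noun.length() - suffix.length());
                return stem + rule.getValue();
            }
        }
        return noun + "s"; // universal rule: no suffix rule applied
    }

    public static void main(String[] args) {
        System.out.println(pluralOf("woman")); // prints "women"
        System.out.println(pluralOf("box"));   // prints "boxes"
        System.out.println(pluralOf("book"));  // prints "books"
    }
}
```

A table this naive also turns "human" into "humen", which is the kind of case the exceptional category exists to catch.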

Lastly, the exceptional processing of pluralization involves working with those irregular English words that just don't follow any general rule. Out of the box, Inflector doesn't handle 100 percent of English irregular words correctly but does provide a framework for handling user-specified pluralizations. The following unit tests show some irregular nouns that Inflector handles both correctly and incorrectly:

@Test
public void testPluralize_Irregular_Correct_Datum() {
    assertEquals("data", Noun.pluralOf("datum"));
}

@Test
public void testPluralize_Irregular_Correct_Man() {
    assertEquals("men", Noun.pluralOf("man"));
}

@Test
public void testPluralize_Irregular_NotCorrect_Matrix() {
    assertNotEquals("matrices", Noun.pluralOf("matrix"));
}

@Test
public void testPluralize_Irregular_NotCorrect_Thief() {
    assertNotEquals("thieves", Noun.pluralOf("thief"));
}
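The idea of user-specified pluralizations can be sketched as an exception table that is checked before any rule-based logic runs. The following is a minimal illustration under that assumption; OverridablePluralizer, register, and the fallback function are hypothetical names and do not reflect Inflector's actual extension API.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical wrapper that layers user-supplied exceptions over a rule-based pluralizer. */
final class OverridablePluralizer {

    private final Map<String, String> exceptions = new HashMap<>();
    private final UnaryOperator<String> fallback;

    /** @param fallback the rule-based pluralizer to use when no exception is registered */
    OverridablePluralizer(UnaryOperator<String> fallback) {
        this.fallback = fallback;
    }

    /** Registers an irregular plural; returns this so calls can be chained. */
    OverridablePluralizer register(String singular, String plural) {
        exceptions.put(singular.toLowerCase(Locale.ROOT), plural);
        return this;
    }

    String pluralOf(String noun) {
        // Exceptions take priority; everything else goes to the fallback rules.
        String override = exceptions.get(noun.toLowerCase(Locale.ROOT));
        return override != null ? override : fallback.apply(noun);
    }

    public static void main(String[] args) {
        // The fallback here is just the universal "add s" rule, for illustration.
        OverridablePluralizer pluralizer = new OverridablePluralizer(noun -> noun + "s")
                .register("matrix", "matrices")
                .register("thief", "thieves");

        System.out.println(pluralizer.pluralOf("matrix")); // prints "matrices"
        System.out.println(pluralizer.pluralOf("thief"));  // prints "thieves"
        System.out.println(pluralizer.pluralOf("chair"));  // prints "chairs"
    }
}
```

Checking the exception table first keeps the rule-based code unchanged: each newly discovered irregular noun becomes a one-line registration rather than a new special case in the rules.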

This article showed several simple applications of linguistic technology to automatically classify text, to identify sentences, and to pluralize words in hopes of inspiring you with some ideas on how you might apply linguistic technology to your applications.

Rod Coffin is an agile technologist at Semantra, helping to develop an innovative natural language ad hoc reporting platform. He has many years of experience mentoring teams on enterprise Java development and agile practices and has written several articles on a range of topics from Aspect-Oriented Programming to EJB 3.0. Rod is a frequent speaker at user groups and technology conferences and can be contacted via his home page.