

Create Intelligent E-mail Filters with JavaMail and Classifier4J : Page 3

Tired of the limitations and annoying false positives with commercial spam filters? Classifier4J is an open source Java library that will let you build custom applications that read e-mails and other types of text documents, separating the wheat from the chaff exactly the way you intend.

Simple Text Classification
Classifier4J includes several classifiers for text classification. The first that we will look at is the SimpleClassifier, which performs straightforward word matching. The code below shows how to use it to establish a probability score that an e-mail is spam. The e-mail is determined to be spam (or not) based simply on the presence of the word 'Belgium.' (If you are familiar with The Hitchhiker's Guide to the Galaxy, you will know why this word is appropriate. For a full explanation, you can visit this rude words guide).

public static double checkSpam(String strEmailBody) {
    double dClassification = 0.0;
    try {
        SimpleClassifier classifier = new SimpleClassifier();
        classifier.setSearchWord("Belgium");
        dClassification = classifier.classify(strEmailBody);
    } catch (Exception e) {
        e.printStackTrace();
    }
    return dClassification;
}

In the download, this function is called by the GetMail function: when an e-mail is downloaded and bundled into a string, it is passed to this function and the spam score is determined. Because this is a very simple case, the score will be either 0.0 or 1.0, with 0.0 being legitimate e-mail and 1.0 being the spam end of the continuum. In the program, anything with a spam score greater than 0.7 is considered spam. Note that the match is case-sensitive: an e-mail containing 'belgium' will score 0.0, and one containing 'Belgium' will score 1.0. To make the check case-insensitive, classify a case-converted copy of strEmailBody against a search word in the same case, i.e. check the lower-case version of the string for 'belgium,' or the upper-case version for 'BELGIUM.'
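The case-insensitive check and the 0.7 cutoff described above can be sketched in plain Java. This is only a stand-in that mimics the 0.0/1.0 behavior with ordinary string matching; it does not call Classifier4J, and the class name, method names, and SPAM_THRESHOLD constant are hypothetical.

```java
import java.util.Locale;

// Stand-in for the article's checkSpam: plain string matching, no Classifier4J.
class CaseInsensitiveSpamCheck {
    // Assumed cutoff, taken from the article's "greater than 0.7" rule.
    static final double SPAM_THRESHOLD = 0.7;

    // Returns 1.0 if the body contains the word in any letter case, else 0.0.
    static double checkSpamIgnoreCase(String emailBody, String searchWord) {
        if (emailBody == null || searchWord == null || searchWord.isEmpty()) {
            return 0.0;
        }
        // Convert both sides to the same case before comparing.
        String body = emailBody.toLowerCase(Locale.ROOT);
        String word = searchWord.toLowerCase(Locale.ROOT);
        return body.contains(word) ? 1.0 : 0.0;
    }

    // Applies the cutoff to a score.
    static boolean isSpam(double score) {
        return score > SPAM_THRESHOLD;
    }

    public static void main(String[] args) {
        double score = checkSpamIgnoreCase("Greetings from belgium!", "Belgium");
        System.out.println(score + " -> spam? " + isSpam(score));
    }
}
```

With the real library, the same idea would presumably mean lower-casing both the search word and the body before calling classify.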

Bayesian Classification
The simple classifier above is great for getting started, but once you need more detailed classification, you will want the BayesianClassifier. Thankfully, it is very simple to use, with all the complex statistical analysis done for you under the hood.

This is a very simple case of how a Bayesian filter can be used:

IWordsDataSource wds = new SimpleWordsDataSource();
wds.addMatch("Belgium");
wds.addMatch("Vogon");
wds.addMatch("Devx");
IClassifier classifier = new BayesianClassifier(wds);
dReturn = classifier.classify(strEmailBody);

The filter is initialized with a words data source, which in turn is set up with three words registered as matches. This sample, while simple, is of little practical use, because the Bayesian filter has very little on which to base its judgments. To use a Bayesian filter properly, you must train it with a large data set of words that match the context as well as words that don't. In the real world, many words appear in both contexts.

To make this a little clearer, consider the word 'the.' It appears in just about every e-mail, spam and non-spam alike. However, 'millionaire' is more likely to appear in spam. A full spam-filtering application is constantly trained on what is spam and what isn't (matching and non-matching examples, respectively), gaining intelligence as it goes. Thus, when it receives an incoming mail, it uses its experience with previous ones to determine whether the mail is spam or not.
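The difference between 'the' and 'millionaire' can be made concrete with a small calculation. The sketch below estimates a per-word spam probability from how often a word shows up in spam versus legitimate mail. The counts are invented for illustration, the class and method names are hypothetical, and this is one common textbook formulation, not necessarily what Classifier4J computes internally.

```java
// Toy per-word spam probability from made-up counts; not Classifier4J's internals.
class WordSpamProbability {
    // Compares the share of spam messages containing the word with the share
    // of legitimate messages containing it. 0.5 means the word tells us nothing.
    static double spamProbability(int spamHits, int spamTotal, int hamHits, int hamTotal) {
        if (spamTotal <= 0 || hamTotal <= 0) {
            throw new IllegalArgumentException("need at least one message of each kind");
        }
        double spamRate = (double) spamHits / spamTotal;
        double hamRate = (double) hamHits / hamTotal;
        if (spamRate + hamRate == 0.0) {
            return 0.5; // never-seen word: stay neutral
        }
        return spamRate / (spamRate + hamRate);
    }

    public static void main(String[] args) {
        // Hypothetical corpus: 100 spam and 100 legitimate messages.
        System.out.println("the: " + spamProbability(98, 100, 98, 100));
        System.out.println("millionaire: " + spamProbability(40, 100, 2, 100));
    }
}
```

With these invented counts, 'the' lands at a neutral 0.5 because it is equally common on both sides, while 'millionaire' scores roughly 0.95 and so pulls a message strongly toward the spam end.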

You can train a Bayesian filter in Classifier4J using the ITrainableClassifier interface. A full example demonstrating this is available in the Classifier4J optional distribution download, under the src/java/net/sf/classifier4J/demo path. The demonstration takes as input text files that have already been deemed valid or invalid and uses them to train the filter. The trained Bayesian filter then uses what it learned from these input files to determine the relevance of another file. It should be relatively straightforward to adapt the mail application used here to retrain the filter continually on incoming mail, increasing your chances of filtering out all your spam.
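To show what "training" means in practice, here is a minimal sketch of a trainable filter in plain Java. It is a toy naive Bayes classifier written for this illustration: it does not use the Classifier4J demo or the ITrainableClassifier interface, and every name in it (ToyTrainableFilter, trainSpam, trainLegitimate) is hypothetical. It assumes equal prior odds for spam and legitimate mail and uses add-one smoothing.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy trainable filter in plain Java; it does not use or reproduce Classifier4J.
class ToyTrainableFilter {
    private final Map<String, Integer> spamCounts = new HashMap<>();
    private final Map<String, Integer> hamCounts = new HashMap<>();
    private int spamMessages = 0;
    private int hamMessages = 0;

    // Splits text into its distinct lower-case words.
    private static Set<String> words(String text) {
        Set<String> result = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
            if (!token.isEmpty()) result.add(token);
        }
        return result;
    }

    // Records one message known to be spam.
    void trainSpam(String text) {
        spamMessages++;
        for (String w : words(text)) spamCounts.merge(w, 1, Integer::sum);
    }

    // Records one message known to be legitimate.
    void trainLegitimate(String text) {
        hamMessages++;
        for (String w : words(text)) hamCounts.merge(w, 1, Integer::sum);
    }

    // Returns a score between 0 and 1: near 1 looks like spam, near 0 legitimate.
    double classify(String text) {
        double logSpam = 0.0;
        double logHam = 0.0;
        for (String w : words(text)) {
            // Add-one smoothing keeps unseen words from zeroing the score.
            double pSpam = (spamCounts.getOrDefault(w, 0) + 1.0) / (spamMessages + 2.0);
            double pHam = (hamCounts.getOrDefault(w, 0) + 1.0) / (hamMessages + 2.0);
            logSpam += Math.log(pSpam);
            logHam += Math.log(pHam);
        }
        // Equal priors assumed; convert the log-odds back to a probability.
        return 1.0 / (1.0 + Math.exp(logHam - logSpam));
    }

    public static void main(String[] args) {
        ToyTrainableFilter filter = new ToyTrainableFilter();
        filter.trainSpam("become a millionaire today");
        filter.trainLegitimate("the meeting is at noon");
        System.out.println(filter.classify("millionaire offer inside"));
    }
}
```

Each call to trainSpam or trainLegitimate shifts the word counts, so a mail client could call one of them whenever the user confirms or corrects a verdict, which is the continual retraining idea described above.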

Auto-Summarizing with Classifier4J
In addition to classifying your incoming e-mail, this application can also summarize its contents. For example, you could expand the application to fish through your inbox, summarize the contents of each mail, and send the summary as a new e-mail somewhere else, perhaps to your cell phone or other mobile device.

Summarizing with Classifier4J couldn't be easier: you simply create an instance of a class that implements ISummariser (note the spelling used in the code), then pass it the string to be summarized and the number of sentences you want in the summary. It does the rest, returning the summary as a string.

The code below, which is available in the download, shows the getSummary method in action.

public static String getSummary(String strEmailBody, int nSentences) {
    ISummariser summ = new SimpleSummariser();
    String strSumm = summ.summarise(strEmailBody, nSentences);
    return strSumm;
}

In the application, getSummary is called with the e-mail body and a request for a three-sentence summary. Here is the code:

String strSumm = getSummary(strEmail,3);

To test and demonstrate this application, I looked up an old DevX article of mine, cut and pasted the entire first page into the body of an e-mail, and sent it to myself. When the Java application downloaded that e-mail it summarized it nicely. Here is the summary result:

The invention of database driver methodologies such as JDBC and ODBC led to applications being loosely coupled with their back end databases, allowing best-of-breed databases to be chosen—and swapped out when necessary—without any ill-effect on the user interface. Similarly, the decoupling of data and presentation in HTML—by using XML for the data and XSLT for the presentation of data—has led to much innovation and flexibility, not least of which is the ability to deliver a document as data in XML and deliver custom styling for that document with different XSLTs. A runtime engine would be present on the desktop, and servers would be able to deliver the GUI to the browser with an XML document.

This condenses exactly 1001 words of text into three sentences containing 115 words, while still keeping a pretty good handle on what the article was about. Very impressive indeed!
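For readers curious how a summary like this might be produced, here is a minimal sketch of one common extractive approach: count word frequencies, score each sentence by the words it contains, and keep the top-scoring sentences in their original order. This is an assumption-laden illustration in plain Java, not a description of how Classifier4J's SimpleSummariser works; the class name, the four-letter word cutoff, and the naive sentence splitting are all choices made for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy extractive summarizer; an illustration only, not Classifier4J's algorithm.
class ToySummarizer {
    // Lower-case words of four or more letters (a crude stop-word filter).
    private static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (t.length() >= 4) result.add(t);
        }
        return result;
    }

    static String summarize(String text, int maxSentences) {
        // Naive sentence split: '.', '!' or '?' followed by whitespace.
        String[] sentences = text.trim().split("(?<=[.!?])\\s+");

        // Count how often each word occurs in the whole text.
        Map<String, Integer> freq = new HashMap<>();
        for (String w : tokens(text)) freq.merge(w, 1, Integer::sum);

        // Score each sentence by the summed frequency of its words.
        int[] scores = new int[sentences.length];
        List<Integer> ranked = new ArrayList<>();
        for (int i = 0; i < sentences.length; i++) {
            ranked.add(i);
            for (String w : tokens(sentences[i])) scores[i] += freq.getOrDefault(w, 0);
        }
        ranked.sort((a, b) -> Integer.compare(scores[b], scores[a]));

        // Keep the top sentences, then restore their original order.
        int keep = Math.max(0, Math.min(maxSentences, sentences.length));
        List<Integer> chosen = new ArrayList<>(ranked.subList(0, keep));
        Collections.sort(chosen);

        StringBuilder summary = new StringBuilder();
        for (int index : chosen) {
            if (summary.length() > 0) summary.append(' ');
            summary.append(sentences[index]);
        }
        return summary.toString();
    }

    public static void main(String[] args) {
        String text = "Cats chase mice. Dogs bark. Cats and mice share the house with more cats.";
        System.out.println(summarize(text, 2));
    }
}
```

Summing frequencies favors long sentences, so a more careful version might normalize by sentence length or filter a proper stop-word list.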

With more and more information bombarding your inbox, your instant messenger, your telephone, your television, and every other media device every day, technology that can understand the context of such information is a massive area of potential growth. The obvious application is in spam filtering, but there are many other useful ways of leveraging this ability. How about an intelligent agent that monitors incoming news stories from a news feed, finds the ones that are most likely to interest you, summarizes them, and sends them to your mobile device? Or one that reads movie reviews, scanning them for characteristics that interest you, and e-mails you relevant plot summaries?

The options are endless, and the Classifier4J open source library is the toolkit that will allow you to start writing these applications. This article has merely scratched the surface of what can be done using Classifier4J with an e-mail interface; the rest is up to you!

Laurence Moroney is a freelance enterprise architect who specializes in designing and implementing service-oriented applications and environments using .NET, J2EE, or (preferably) both. He has authored books on .NET and Web services security, and more than 30 professional articles. A former Wall Street architect, and security analyst, he also dabbles in journalism, reporting for professional sports. You can find his blog at http://www.philotic.com/blog.