Create Intelligent E-mail Filters with JavaMail and Classifier4J

As more and more information fills our lives and clutters our inboxes, our ability to read, filter, and process it manually declines. There is only so much time that we can spend on it. The trend shows no signs of abating, despite the best efforts of many individuals and companies in the industry. By all accounts, things are going to get worse.

Enter intelligent filters, ones that not only look for certain keywords that don’t need to be reprinted here, but that also attempt to determine the sentiment of text. In other words, filters that can read an e-mail and statistically figure out what it is about and whether it interests you or not based on a set of parameters that you define. Many modern spam filters do this, training themselves on the mail that you specify is or isn’t spam. These tools are getting better by the day but they aren’t foolproof. For example, false positives are a frequent problem.

Classifier4J is an open source Java library designed for exactly this purpose: classifying text. (It is available from SourceForge at http://classifier4j.sourceforge.net.) It includes an implementation of a Bayesian classifier, a statistical method for calculating the probability that a given hypothesis is true (based on Bayes' theorem; see http://www.paulgraham.com/better.html for a good implementation outline). A Bayesian classifier is typically used to evaluate whether a piece of text concerns a given subject. The classic example is determining whether an e-mail is spam.

In this article I will build a simple POP3 client using the JavaMail API, which has lots of very cool features that allow you to build your own mail applications that use IMAP, POP3, and SMTP. Check the Sun documentation for in-depth details. This client will pull e-mails from your POP3 box and pass them through the Classifier4J libraries to classify their contents, determine their spam relevance, and even produce an automatic summary of their contents!

To get started, you first need to download the JavaMail API, which is available from Sun. (The source code in this article uses version 1.3.1.) You will also need the JavaBeans Activation Framework (JAF), which is a dependency of JavaMail.

Once you have downloaded and installed these packages, you are ready to build your first e-mail client. You will need a POP3 e-mail account, along with the username, password, and server name associated with that account.

Building Your First E-mail Client
This application will be a very simple console application; you can expand it later into something more complex. One cool idea is to build it into a servlet-based application that gives you a hosted e-mail client, such as Hotmail or Yahoo!, but with a built-in spam filter and classifier.

Here is the application:

package com.devx.jmail;

import javax.mail.*;
import javax.mail.internet.*;
import java.util.*;
import java.io.*;
import net.sf.classifier4J.*;

public class MailReader
{
  public static void main(String[] args)
  {
    try
    {
      // Replace these placeholders with the details of your POP3 account
      String popServer="yourpopserveraddress";
      String popUser="yourpopusername";
      String popPassword="yourpoppassword";
      GetMail(popServer, popUser, popPassword);
    }
    catch (Exception e)
    {
      e.printStackTrace();
    }
    System.exit(0);
  }

The values for popServer, popUser, and popPassword should be the correct values for your POP3 account or the application won’t work. As you can see this is a very simple console application that doesn’t do much (yet!), and the GetMail function is the workhorse.

Getting Mail
The JavaMail API is huge, and has way too much depth to go into detail here, so for the purposes of this example you’ll be doing the simplest thing possible: logging in, scanning the inbox for contents, and downloading a copy of those contents. You can view the full function GetMail in the download, but the snippets that handle the heavy lifting are shown here:

      store.connect(popServer, popUser, popPassword);
      folder = store.getDefaultFolder();
      if (folder == null) throw new Exception("No default folder");
      folder = folder.getFolder("INBOX");
      if (folder == null) throw new Exception("No POP3 INBOX");
      // READ_ONLY leaves the messages untouched on the server
      folder.open(Folder.READ_ONLY);
      Message[] msgs = folder.getMessages();
      for (int nMsg = 0; nMsg < msgs.length; nMsg++)
      {
        strEmail = buildMessage(msgs[nMsg]);
      }

After setting up a JavaMail store object, you connect to it using the popServer, popUser, and popPassword parameters. If this succeeds without throwing an exception (the code in Listing 2 should be wrapped in a try..catch clause), you can get the default folder associated with the store. If there is no default folder, the code throws an exception.

Every POP3 account has an 'INBOX' folder containing incoming mail. If that folder exists, it is opened, and an array of Message objects is read from it. This is the list of all mail that is currently in your inbox. Don't worry, reading the mail won't delete it from your inbox as the folder is opened as 'READ_ONLY.'

You then loop through all of these Message objects and build a string out of the Message using the 'buildMessage' function. This function is available in its entirety in the download, but the key parts of it are shown here:

  InputStream is = messagePart.getInputStream();
  BufferedReader reader=new BufferedReader(new InputStreamReader(is));
  String thisLine=reader.readLine();
  // Append each line of the body until the stream is exhausted
  while (thisLine!=null)
  {
    strReturn +=thisLine;
    thisLine=reader.readLine();
  }

A POP3 e-mail message is made up of a number of entities, including the sender name, sender address, subject, and body. The body can be made up of a number of parts and may include attachments. The buildMessage function gets all these entities and simply appends them all to a string that it returns to the caller.

The important part of the message for our example is the e-mail body. This can be numerous lines of text, so the messagePart object (which is built from the e-mail body, see the full function) exposes an InputStream that you can use to read it line by line. This is used to create a BufferedReader, which then reads in the e-mail body.
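One detail worth noting about the read loop: readLine() strips the line terminator, so appending lines directly fuses the last word of one line with the first word of the next, which can hide words from a classifier. The sketch below is one way to avoid that using only the standard library. The class and method names (BodyReader, readBody) are hypothetical and not part of the download, the code assumes a newer Java version than the one this article was written for, and it assumes UTF-8 text, whereas a real message part declares its own character set.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the article's download.
final class BodyReader {

    // Reads the stream line by line and re-inserts a '\n' after each line,
    // so words on adjacent lines stay separated.
    static String readBody(InputStream in) {
        StringBuilder body = new StringBuilder();
        // Assumption: the text is UTF-8 encoded.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(
                "Greetings from\nBelgium".getBytes(StandardCharsets.UTF_8));
        System.out.print(readBody(in));
    }
}
```

A StringBuilder also avoids the repeated string copying that += causes inside a loop, which matters for long messages.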

You now have a simple e-mail client that logs in to your POP3 box, gets the messages from your inbox, downloads them one by one, and converts each into a string that can be used for classification and summarization with Classifier4J.

Simple Text Classification
Classifier4J includes a number of classes for text classification. The first that we will look at is the SimpleClassifier, which performs straightforward word matching. The code below shows how to use it to establish a probability score that the e-mail is spam. The e-mail is judged to be spam (or not) based simply on the presence of the word 'Belgium.' (If you are familiar with The Hitchhiker's Guide to the Galaxy you will know why this word is appropriate. For a full explanation, you can visit this rude words guide.)

  public static double checkSpam(String strEmailBody)
  {
    double dClassification = 0.0;
    try
    {
      // Score the body on the presence of a single search word
      SimpleClassifier classifier = new SimpleClassifier();
      classifier.setSearchWord( "Belgium" );
      dClassification = classifier.classify(strEmailBody);
    }
    catch(Exception e)
    {
      e.printStackTrace();
    }
    return dClassification;
  }

In the download, this function is called by the GetMail function, so when an e-mail is downloaded and bundled into a string, it is passed to this function, and the spam score is determined. As this is a very simple case, the score will be either 0.0 or 1.0, with 0.0 being legitimate e-mail and 1.0 being the spam end of the continuum. In the program, anything with a spam score greater than 0.7 is considered spam. Note that the match is case-sensitive. An e-mail with the word 'belgium' will score 0.0, and one with the word 'Belgium' will score 1.0. To make it case-insensitive, you would have to check against a converted version of strEmailBody, i.e. check the lower-case version of that string for 'belgium,' or the upper-case version for 'BELGIUM.'
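The case-insensitive variant and the 0.7 threshold can be sketched without the library. The class below is a hypothetical stand-in (KeywordCheck is not a Classifier4J class); it only mimics the 0.0/1.0 scoring described above. It matches substrings, so it would also fire on a longer word that happens to contain the search word.

```java
import java.util.Locale;

// Hypothetical stand-in for a simple keyword classifier; not part of Classifier4J.
final class KeywordCheck {

    // The cut-off used in the article's program.
    static final double SPAM_THRESHOLD = 0.7;

    // Returns 1.0 if the body contains the word, ignoring case, otherwise 0.0.
    static double score(String body, String word) {
        String haystack = body.toLowerCase(Locale.ROOT);
        String needle = word.toLowerCase(Locale.ROOT);
        return haystack.contains(needle) ? 1.0 : 0.0;
    }

    // Applies the threshold to turn the score into a yes/no decision.
    static boolean isSpam(String body, String word) {
        return score(body, word) > SPAM_THRESHOLD;
    }

    public static void main(String[] args) {
        System.out.println(isSpam("Win a trip to BELGIUM today", "Belgium"));
        System.out.println(isSpam("Minutes of Tuesday's meeting", "Belgium"));
    }
}
```

With the real library, the same effect should come from lower-casing both the search word and strEmailBody before calling classify.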

Bayesian Classification
The simple classifier above is great for getting started, but once you want to get into some more detailed classification, you will need to use the Bayesian one. Thankfully, this is very simple to use, with all the complex statistical analysis done for you under the hood.

This is a very simple case of how a Bayesian filter can be used:

      // Seed the data source with three matching words
      IWordsDataSource wds = new SimpleWordsDataSource();
      wds.addMatch("Belgium");
      wds.addMatch("Vogon");
      wds.addMatch("Devx");
      IClassifier classifier = new BayesianClassifier(wds);
      dReturn = classifier.classify(strEmailBody);

The filter is initialized with a words data source, which in turn is set up with three words as a match. This sample, while simple, is ultimately useless as the Bayesian filter has very little on which to base its judgments. To properly use a Bayesian filter it has to be trained with a large data set of words that match the context as well as words that don't match the context. In the real world, a lot of words match both contexts.

To make this a little clearer, consider the word 'the.' It appears in just about every e-mail that is spam and non-spam alike. However 'millionaire' is more likely to appear in a spam. A full spam-filtering application is constantly trained by what is spam and what isn't (valid and invalid, respectively), gaining intelligence as it goes. Thus, when it receives an incoming mail it uses its experience with previous ones to determine whether the mail is spam or not.
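To make the 'the' versus 'millionaire' intuition concrete, here is a toy word-probability filter in plain Java. It is a sketch of the general naive Bayes idea outlined in the Paul Graham article, not a description of how Classifier4J is implemented. The class name TinyBayes, the add-one smoothing, and the 0.5 default for unseen words are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Toy illustration only; not the Classifier4J implementation.
final class TinyBayes {

    // word -> {times seen in spam, times seen in legitimate mail}
    private final Map<String, int[]> counts = new HashMap<>();

    // Records every word of the text as spam or non-spam evidence.
    void train(String text, boolean spam) {
        for (String w : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (w.isEmpty()) continue;
            int[] c = counts.computeIfAbsent(w, k -> new int[2]);
            c[spam ? 0 : 1]++;
        }
    }

    // Probability that a single word indicates spam.
    // Unseen words are treated as neutral (0.5).
    double wordProbability(String word) {
        int[] c = counts.get(word.toLowerCase(Locale.ROOT));
        if (c == null) return 0.5;
        // Add-one smoothing keeps the value strictly between 0 and 1.
        return (c[0] + 1.0) / (c[0] + c[1] + 2.0);
    }

    // Combines the per-word probabilities into one score for the text.
    double classify(String text) {
        double spam = 1.0;
        double ham = 1.0;
        for (String w : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (w.isEmpty()) continue;
            double p = wordProbability(w);
            spam *= p;
            ham *= 1.0 - p;
        }
        return spam / (spam + ham);
    }

    public static void main(String[] args) {
        TinyBayes filter = new TinyBayes();
        filter.train("the millionaire offer", true);
        filter.train("the meeting is at noon", false);
        System.out.println(filter.wordProbability("the"));
        System.out.println(filter.classify("the millionaire"));
    }
}
```

After this training, 'the' has been seen once in each kind of mail, so its probability is exactly 0.5 and it does not move the result either way, while 'millionaire' pushes the score above 0.5. Multiplying many small probabilities underflows on long texts, so a more careful version would work with logarithms or consider only the most telling words.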

You can train a Bayesian filter in Classifier4J using the ITrainableClassifier interface. A full example demonstrating this is included in the optional Classifier4J distribution download, in the src/java/net/sf/classifier4J/demo path. This demonstration trains the filter on input text files that have already been deemed valid or invalid. It then uses what it learned from these files to determine the relevance of another file. It should be relatively straightforward to adapt the mail application used here to constantly retrain the filter on incoming mail and to use that to increase your chances of filtering out all your spam.

Auto-Summarizing with Classifier4J
In addition to classifying your incoming e-mail, this application can also summarize the contents. For example you could expand the application to fish through your inbox, summarize the contents of the mail, and send the summary as a new e-mail somewhere else, perhaps to your cell phone or other mobility tool.

Summarizing with Classifier4J couldn't be easier: You simply create an instance of a class that implements ISummariser, then pass it the string to be summarized and the number of sentences you want in the summary. It does the rest, returning a string.

The code below, which is available in the download, shows the getSummary method in action.

  public static String getSummary(String strEmailBody, int nSentences)
  {
    ISummariser summ = new SimpleSummariser();
    String strSumm = summ.summarise(strEmailBody,nSentences);
    return strSumm;
  }

In the application, getSummary is called with the e-mail body and a request for a summary of the e-mail in three sentences. Here is the code:

String strSumm = getSummary(strEmail,3); 

To test and demonstrate this application, I looked up an old DevX article of mine, cut and pasted the entire first page into the body of an e-mail, and sent it to myself. When the Java application downloaded that e-mail it summarized it nicely. Here is the summary result:

The invention of database driver methodologies such as JDBC and ODBC led to applications being loosely coupled with their back end databases, allowing best-of-breed databases to be chosen (and swapped out when necessary) without any ill-effect on the user interface. Similarly, the decoupling of data and presentation in HTML, by using XML for the data and XSLT for the presentation of data, has led to much innovation and flexibility, not least of which is the ability to deliver a document as data in XML and deliver custom styling for that document with different XSLTs. A runtime engine would be present on the desktop, and servers would be able to deliver the GUI to the browser with an XML document.

This condenses exactly 1001 words of text into three sentences containing 115 words, while still keeping a pretty good handle on what the article was about. Very impressive indeed!
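For a feel of how an extractive summary like this can be produced, the sketch below scores each sentence by how often its words occur in the whole text and keeps the highest-scoring sentences in their original order. This is one common approach, offered as an assumption. It is not claimed to be what SimpleSummariser does internally, and the names (FrequencySummary, summarise, words) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative frequency-based summariser; not the Classifier4J implementation.
final class FrequencySummary {

    static String summarise(String text, int maxSentences) {
        // Split after sentence-ending punctuation followed by whitespace.
        String[] sentences = text.trim().split("(?<=[.!?])\\s+");

        // Count how often each word occurs in the whole text.
        Map<String, Integer> freq = new HashMap<>();
        for (String w : words(text)) {
            freq.merge(w, 1, Integer::sum);
        }

        // A sentence's score is the summed frequency of its words.
        double[] score = new double[sentences.length];
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < sentences.length; i++) {
            for (String w : words(sentences[i])) {
                score[i] += freq.getOrDefault(w, 0);
            }
            order.add(i);
        }

        // Rank by score, highest first; the sort is stable, so ties keep text order.
        order.sort((a, b) -> Double.compare(score[b], score[a]));
        boolean[] keep = new boolean[sentences.length];
        for (int k = 0; k < Math.min(maxSentences, sentences.length); k++) {
            keep[order.get(k)] = true;
        }

        // Emit the chosen sentences in their original order.
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < sentences.length; i++) {
            if (!keep[i]) continue;
            if (out.length() > 0) out.append(' ');
            out.append(sentences[i]);
        }
        return out.toString();
    }

    // Lower-cases and tokenises; words of three letters or fewer are
    // skipped as a crude stop-word filter.
    private static List<String> words(String s) {
        List<String> result = new ArrayList<>();
        for (String w : s.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (w.length() > 3) result.add(w);
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Cats chase mice. Cats sleep often. Dogs bark.";
        System.out.println(summarise(text, 2));
    }
}
```

Summing frequencies favors long sentences, so a more careful version might normalize by sentence length. The sentence splitter is also naive and will break on abbreviations such as 'e.g.'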

With more and more information bombarding your inbox, your instant messenger, your telephone, your television, and every other media device every day, technology that can understand the context of such information is a massive area of potential growth. The obvious application is in spam filtering, but there are many other useful ways of leveraging this ability. How about an intelligent agent that monitors incoming news stories from a news feed, finds the ones that are most likely to interest you, summarizes them, and sends them to your mobile device? Or one that reads movie reviews, scanning them for characteristics that interest you, and e-mails you relevant plot summaries?

The options are endless, and the Classifier4J open source library is the toolkit that will allow you to start writing these applications. This article has merely scratched the surface of what can be done using Classifier4J with an e-mail interface. The rest is up to you!
