Apache Mahout is an open source project from the Apache Software Foundation (ASF) with the primary goal of creating scalable machine learning algorithms.
Introduced by a group of developers from the Apache Lucene project, Apache Mahout aims to:
- Build and support a community of users and contributors so that access to the source code for the framework is not limited to a small group of developers.
- Focus on practical, real-world problems rather than unproven or purely theoretical techniques.
- Provide appropriate documentation.
Features of Apache Mahout
Apache Mahout offers a range of features that are especially useful for clustering and collaborative filtering. The most important ones are recommendation (collaborative filtering), clustering and classification, each of which is described in this article.
Setting up Apache Mahout
Setting up Apache Mahout is very simple and can be carried out in the following steps:
- Step 1 – In order to set up Apache Mahout, we should have the following installed:
- JDK 1.6 or higher
- Ant 1.7 or higher
- Maven 2.0.9 or higher – Only needed if we want to build Mahout from the source code
- Step 2 – Unzip the file sample.zip and copy its contents into a folder, say "apache-mahout-examples".
- Step 3 – Go to the folder – "apache-mahout-examples" and run the following:
The last step downloads the Wikipedia files and compiles the code.
A recommendation engine is a subclass of information filtering systems that predicts the rating or preference a user would give to an item. Mahout provides tools for building recommendation engines through the Taste library, with which we can build a fast and flexible collaborative filtering engine. Taste consists of the following five primary components that work with users, items and preferences:
- Data Model – This is used as a storage system for users, items and also preferences.
- User Similarity – This is an interface used to define the similarity between two users.
- Item Similarity – An interface that is used to define the similarity between two items.
- Recommender – An interface that is used to provide recommendations.
- User Neighborhood – An interface that is used to compute a neighborhood of similar users that can then be used by the Recommenders.
Using these components and their implementations, we can build a complex recommendation system. The engine can serve both real-time and offline recommendations. Real-time recommendations can handle up to a few thousand users, while offline recommendations can handle a much larger number of users.
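To make the roles of these components concrete, here is a minimal sketch in plain Python. It does not use Mahout or the Taste API; the ratings, the user names and the function names (`pearson`, `neighborhood`, `recommend`) are all made up for illustration, and Pearson correlation is only one of several possible user similarity measures.

```python
from math import sqrt

# Data model: user -> {item: preference}. All values are invented sample data.
ratings = {
    "alice": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob":   {"A": 4.0, "B": 2.0, "C": 5.0, "D": 4.0},
    "carol": {"A": 1.0, "B": 5.0, "D": 2.0},
    "dave":  {"A": 5.0, "B": 3.0, "C": 4.0, "D": 5.0},
}

def pearson(u, v):
    """User similarity: Pearson correlation over the items both users rated."""
    common = sorted(set(ratings[u]) & set(ratings[v]))
    if len(common) < 2:
        return 0.0
    xs = [ratings[u][i] for i in common]
    ys = [ratings[v][i] for i in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs)) * sqrt(sum((y - my) ** 2 for y in ys))
    return num / den if den else 0.0

def neighborhood(user, n):
    """User neighborhood: the n most similar users with positive similarity."""
    others = sorted(((pearson(user, o), o) for o in ratings if o != user), reverse=True)
    return [(sim, o) for sim, o in others[:n] if sim > 0]

def recommend(user, n_neighbors=2, how_many=3):
    """Recommender: similarity-weighted average of the neighbors' preferences
    for items the user has not rated yet."""
    scores, weights = {}, {}
    for sim, other in neighborhood(user, n_neighbors):
        for item, pref in ratings[other].items():
            if item in ratings[user]:
                continue
            scores[item] = scores.get(item, 0.0) + sim * pref
            weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(((scores[i] / weights[i], i) for i in scores), reverse=True)
    return [(item, estimate) for estimate, item in ranked[:how_many]]

print(recommend("alice"))
```

In this toy data set, "alice" has not rated item D, so it is the only candidate; its estimated preference is a weighted average of the ratings given by her two nearest neighbors.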
Mahout supports many clustering mechanisms. These algorithms are written in MapReduce, and each has its own set of goals and criteria. The important ones are listed below:
- Canopy – This is a fast clustering algorithm that is often used to create initial seeds for other clustering algorithms.
- k-Means or Fuzzy k-Means – This algorithm creates k clusters based on the distance of the items from the cluster centres computed in the previous iteration.
- Mean-Shift – This algorithm doesn't require any prior information about the number of clusters, and it can produce arbitrarily shaped clusters.
- Dirichlet – This algorithm creates clusters by mixing several probabilistic cluster models, so it does not have to commit to a single view of the clusters too early.
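As an illustration of how Canopy can produce seeds cheaply, the sketch below implements the basic single-pass idea in plain Python. It is not Mahout's MapReduce implementation; the function name `canopy`, the sample points and the two thresholds `t1` (loose) and `t2` (tight) are assumptions chosen for the example.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

def canopy(points, t1, t2):
    """Group points into canopies using a loose threshold t1 and a tight threshold t2.

    Each remaining point in turn becomes a canopy centre. Points closer than t1
    join the canopy; points closer than t2 are also removed from further
    consideration, so they cannot become centres themselves.
    """
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = dist(center, p)
            if d < t1:
                members.append(p)
            if d >= t2:
                still_remaining.append(p)
        remaining = still_remaining
        canopies.append((center, members))
    return canopies

# Invented sample data: two well-separated groups of 2D points.
points = [(1.0, 1.0), (1.5, 1.2), (0.8, 0.9), (8.0, 8.0), (8.3, 7.9), (7.9, 8.4)]
seeds = [center for center, _ in canopy(points, t1=3.0, t2=1.5)]
print(seeds)
```

The resulting canopy centres could then be handed to an algorithm such as k-Means as its initial centroids.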
Out of the four algorithms listed above, the most commonly used is the k-Means algorithm. With any clustering algorithm, we must follow these steps:
- Prepare the input. If required, convert the text into numeric representation.
- Execute the algorithm of your choice by using any of the Hadoop ready programs available in Mahout.
- Properly evaluate the results.
- Iterate these steps if required.
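The steps above can be sketched end to end on a toy example. The code below is plain Python rather than one of Mahout's Hadoop-ready programs: the four documents, the simple term-count vectors, the choice of seeds and the within-cluster sum of squares used for evaluation are all assumptions made for illustration.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

# Invented sample documents.
docs = [
    "hadoop mapreduce cluster",
    "mapreduce hadoop job",
    "recipe pasta tomato",
    "pasta tomato sauce",
]

# Step 1: prepare the input by converting text into term-count vectors.
vocab = sorted({word for d in docs for word in d.split()})
vectors = [[d.split().count(word) for word in vocab] for d in docs]

def mean(vs):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vs) for column in zip(*vs)]

def kmeans(vectors, seeds, max_iter=10):
    """Step 2: plain k-Means; returns (assignment, centroids)."""
    centroids = [list(s) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)), key=lambda c: dist(v, centroids[c]))
            for v in vectors
        ]
        if new_assignment == assignment:  # converged: nothing moved
            break
        assignment = new_assignment
        for c in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignment, centroids

def wcss(vectors, assignment, centroids):
    """Step 3: evaluate with the within-cluster sum of squared distances."""
    return sum(
        sum((x - y) ** 2 for x, y in zip(v, centroids[a]))
        for v, a in zip(vectors, assignment)
    )

# Seeds are picked by hand here; in practice they might come from Canopy.
assignment, centroids = kmeans(vectors, seeds=[vectors[0], vectors[2]])
print(assignment, wcss(vectors, assignment, centroids))
```

Step 4, iterating, would mean rerunning with a different k, different seeds or a different vector representation and comparing the evaluation scores.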
Apache Mahout supports the following two approaches to categorize or classify the content. These are mainly based on Bayesian statistics:
- The first approach is a straightforward MapReduce-enabled Naive Bayes classifier. Classifiers of this category are known to be fast and fairly accurate, despite their simplifying assumption that the features of the data are completely independent of each other. These classifiers tend to break down when the classes have very unbalanced numbers of training examples or when the data is strongly interdependent. Naive Bayes classification is a two-part process. The first part, known as training, keeps track of the features (or simply words) that are associated with each category and creates a model by looking at examples of already classified content. The second part, known as classification, uses the model created during training together with the content of a new, unseen document to assign that document to a category. Hence, in order to run Mahout's classifier, we first need to train the model and then use the model to classify new content.
- The second approach, which is also known as Complementary Naive Bayes, tries to rectify some of the issues with the Naive Bayes approach and still maintains the simplicity and speed offered by Naive Bayes.
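The train-then-classify flow of the first approach can be illustrated with a minimal multinomial Naive Bayes in plain Python. This is a sketch, not Mahout's classifier: the function names `train` and `classify`, the tiny training set and the use of add-one (Laplace) smoothing are assumptions made for the example.

```python
from collections import Counter, defaultdict
from math import log

def train(labeled_docs):
    """Training: count words per label from already classified examples."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    doc_counts = Counter()              # label -> number of documents
    vocab = set()
    for label, text in labeled_docs:
        words = text.lower().split()
        word_counts[label].update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}

def classify(model, text):
    """Classification: pick the label with the highest log-probability,
    treating words as independent given the label (the "naive" assumption)."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, float("-inf")
    for label in sorted(model["doc_counts"]):
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = log(model["doc_counts"][label] / total_docs)  # class prior
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += log((counts[word] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training examples.
training = [
    ("sports", "the team won the match"),
    ("sports", "great goal in the final match"),
    ("tech", "the new phone has a fast processor"),
    ("tech", "software update improves the processor speed"),
]
model = train(training)
print(classify(model, "the team scored a goal"))
print(classify(model, "fast software update"))
```

With this training set the first query is labelled "sports" and the second "tech".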
Running the Naive Bayes Classifier
To run the Naive Bayes Classifier, we execute the following Ant targets in order:
- ant prepare-docs – This prepares the set of documents that are required for training.
- ant prepare-test-docs – This prepares the set of documents that are required for testing.
- ant train – Once the training and test data are set, we need to run the TrainClassifier class using the target – "ant train".
- ant test – Once the above targets are executed successfully, we need to run this target that takes the sample input documents and tries to classify them based on the model that was created while training.
This article has explored how Apache Mahout applies machine learning algorithms to recommendation, clustering and text classification. The technology is still growing and can be used in many different types of applications.
About the Author
Kaushik Pal is a technical architect with 15 years of experience in enterprise application and product development. He has expertise in web technologies, architecture/design, Java/J2EE, open source and big data technologies. You can find more of his work at www.techalpine.com.