Integrating Spring Batch and MongoDB for ETL Over NoSQL

Step-by-step instructions for running an ETL batch job with Spring Batch and MongoDB.



In today's enterprises, I deal with applications that are either interactive or run in batch mode. Interactive applications, like Web applications, require user input. By contrast, applications that start once and end after completing their required jobs are called batch applications; they do not require frequent manual intervention. Batch applications process huge amounts of data, and they include any ETL tool that extracts, transforms and loads data through batch processes.

In this article, I showcase an ETL framework that leverages the advantages of Spring Batch and MongoDB, giving a flavor of batch loading over a NoSQL database. I provide a step-by-step demonstration of integrating Spring Batch with MongoDB.

Why Batch Processing?

The main advantage of batch applications is that they do not require any manual intervention. As a result, they can be scheduled to run at times when resources aren't being utilized. As an example, I'll look at an ETL tool which runs in batch mode to analyze weblogs. Several logs need to be parsed on a daily basis to extract the useful information. The input files are extracted and processed to obtain the required information, and the output data gets loaded into a database. This whole process is carried out in batch mode.



Batch processes mainly deal with huge amounts of data, where a series of programs runs to meet the required objective. These programs can run one after the other, or they can run in parallel to speed up execution, depending on requirements. Batch processing allows sharing of resources; these processes are executed primarily toward the end of the day, when costly resources would otherwise sit idle.

Why Spring Batch?

The Spring Batch framework is designed to cater to batch applications that run on a daily basis in enterprise organizations. It lets me leverage the benefits of the Spring framework along with its more advanced services. Spring Batch is mainly used to process huge volumes of data. It offers good performance and is highly scalable through its optimization and partitioning techniques. It also provides support for logging/tracing, transaction management, job processing statistics, job restart, skip, and resource management. By using the Spring programming model, I can write the business logic and let the framework take care of the infrastructure.

Spring Batch includes three components: the batch application, the batch core (the execution environment) and the batch infrastructure.

The Application component contains all the batch jobs and custom code written using Spring Batch.

The Core component contains the core runtime classes necessary to launch and control a batch job. It includes things such as a JobLauncher, Job, and Step implementations. Both Application and Core are built on top of a common infrastructure.

The Infrastructure contains readers, writers and services used by both the application and the core framework itself. It includes things like ItemReader and ItemWriter, as well as templates such as the MongoTemplate used later in this integration. To use the Spring Batch framework, I need only configure and customize the XML files. All existing core services should be easy to replace or extend without any impact to the infrastructure layer.
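To make these pieces concrete, here is a minimal sketch of how a job defined in XML might be launched through the JobLauncher provided by the core. It is not taken from the article's project: the configuration file name, bean lookups, the JobRunner class name and the inputFile parameter are all assumptions for illustration.

JobRunner.java

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class JobRunner {

    public static void main(String[] args) throws Exception {
        // Load the Spring Batch job definition from XML configuration
        // (the file name "batch-job.xml" is an assumption for this sketch).
        ApplicationContext context = new ClassPathXmlApplicationContext("batch-job.xml");

        // The core component supplies the JobLauncher and Job implementations.
        JobLauncher launcher = context.getBean(JobLauncher.class);
        Job job = context.getBean(Job.class);

        // Launch the job; each Step inside it drives an ItemReader/ItemWriter pair
        // from the infrastructure layer.
        JobExecution execution = launcher.run(job,
                new JobParametersBuilder()
                        .addString("inputFile", "employeePart0.txt")
                        .toJobParameters());

        System.out.println("Exit status: " + execution.getStatus());
    }
}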

Why MongoDB?

RDBMS has ruled the storage space for decades, so why do I suddenly need NoSQL?

In a certain set of industries, storing and managing such huge volumes of data became a challenge, and traditional RDBMSes could not cope with the need. That is where NoSQL databases came into the picture. NoSQL databases do not rely on SQL as their query language, and they do not enforce a fixed table schema. These databases generally store data as key-value pairs, big tables, document stores, graphs, etc. They are open source and distributed, and they scale out, unlike relational databases. They seamlessly take advantage of new nodes and were designed with low-cost hardware in mind. They provide high scalability, better performance, easy replication, and greater optimization in data querying and insertion.

MongoDB is one such NoSQL database: open source and document-oriented. Instead of storing data in tables, as in a relational database, MongoDB stores structured data as JSON-like documents with dynamic schemas. MongoDB supports ad-hoc queries, indexing, data replication and load balancing. It can also be used as a file system, and users can take advantage of its replication and load balancing to store files on multiple servers.
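As a quick illustration of the document model, the following sketch saves a plain Java object through Spring Data MongoDB's MongoTemplate. It is not part of the article's project: the Employee class, the database name, the collection name and the connection details are assumptions, and the exact MongoTemplate constructor and driver classes vary with the Spring Data MongoDB version in use.

MongoInsertExample.java

import com.mongodb.Mongo;
import org.springframework.data.mongodb.core.MongoTemplate;

public class MongoInsertExample {

    // A plain POJO; MongoDB stores it as a JSON-like document with a dynamic schema.
    public static class Employee {
        private String id;
        private String name;
        private String department;

        public Employee(String name, String department) {
            this.name = name;
            this.department = department;
        }
        // getters/setters omitted for brevity
    }

    public static void main(String[] args) throws Exception {
        // Connection and database details are assumptions for this sketch.
        MongoTemplate mongoTemplate =
                new MongoTemplate(new Mongo("localhost", 27017), "etlDemo");

        // save() inserts the object as a document into the "employee" collection.
        mongoTemplate.save(new Employee("John Doe", "Engineering"), "employee");
    }
}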

Spring Batch – MongoDB Integration

Now I'll demonstrate the integration of Spring Batch with MongoDB. First, I plan to upload a huge input data file into a MongoDB database.

This involves multiple steps.

Step 1: Splitting the data file

As the input data file is pretty huge, I split it before loading; loading the whole file sequentially would be very time consuming. The file can be split using any file-splitter logic, and the parts can be distributed to the different servers in the cluster so that they can be loaded in parallel for faster execution.

Here is sample code for the FileSplitter. It takes the path of the data file and the number of parts to create, and it requires you to designate the output folder where the split parts should be stored. Ideally, I assume that the number of parts will be the same as the number of servers in the cluster.

First, it creates that many file objects under the output folder path and stores their names in fileNamesList. Their corresponding BufferedWriter objects are created and stored in a vector. Then I read the input file line by line and distribute the lines across the split files in fileNamesList, writing through the corresponding BufferedWriter objects from the vector. After all the split files are created, I transfer them to the other servers, at the same location given as the output folder, using the FileTransfer class.

Here I have assumed that there are just two machines in the cluster and that two parts are created. One part remains on the server where I run the FileSplitter, and the other is transferred to the machine whose details I give in the FileTransfer class. For now, I have hard-coded the second server's details in the FileTransfer class, but I could configure them by reading them from a properties file (a sketch follows the listings below). For example, if the main file is employee.txt, the parts created in the output folder are named employeePart0.txt and employeePart1.txt.

FileSplitter.java

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

public class FileSplitter {

    public static void main(String[] args) throws Exception {
        int noParts = Integer.parseInt(args[0]);   // number of parts to create
        String inputFile = args[1];                // input file path
        String outputFolder = args[2];             // output folder path

        List<String> fileNamesList = new ArrayList<String>();
        BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(inputFile)));
        String strLine;
        Vector<BufferedWriter> vList = new Vector<BufferedWriter>();

        // Create one output file (and its writer) per part, e.g. employeePart0.txt.
        for (int i = 0; i < noParts; i++) {
            int lastIndex = inputFile.lastIndexOf("\\");
            int pointIndex = inputFile.indexOf(".");
            String justFileName = inputFile.substring(lastIndex + 1, pointIndex);
            String fileName = outputFolder + "\\" + justFileName + "Part" + i + ".txt";
            fileNamesList.add(fileName);
            vList.add(new BufferedWriter(new FileWriter(fileName)));
        }

        // Read the input file line by line, distributing lines round-robin
        // across the part files.
        int partCounter = noParts;
        int noOfPart = 0;
        while ((strLine = br.readLine()) != null) {
            if (noOfPart == partCounter)
                noOfPart = 0;
            if (noOfPart < partCounter) {
                vList.get(noOfPart).write(strLine);
                vList.get(noOfPart).newLine();
            }
            noOfPart++;
        }

        for (int j = 0; j < noParts; j++) {
            vList.get(j).close();
        }
        br.close();

        // Transfer every other part to the remote server and delete the local copy;
        // the remaining parts stay on this machine.
        FileTransfer ft = new FileTransfer();
        for (int j = 0; j < noParts; j++) {
            if (j % 2 == 0) {
                ft.transferFile(fileNamesList.get(j), outputFolder);
                new File(fileNamesList.get(j)).delete();
            }
        }
    }
}

FileTransfer.java

import com.jcraft.jsch.Channel;
import com.jcraft.jsch.ChannelSftp;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;

public class FileTransfer {

    // Copies a local file to the same output folder on the remote server over SFTP.
    public void transferFile(String fileName, String outputFolder) throws Exception {
        String username = "some_username";
        String host = "some_host_name_ip_address";
        String pwd = "some_pwd";

        JSch jsch = new JSch();
        Session session = jsch.getSession(username, host, 22);
        session.setPassword(pwd);

        java.util.Properties config = new java.util.Properties();
        config.put("StrictHostKeyChecking", "no");
        session.setConfig(config);
        session.connect();

        Channel channel = session.openChannel("sftp");
        channel.connect();
        ChannelSftp c = (ChannelSftp) channel;

        String fsrc = fileName, fdest = outputFolder;
        c.put(fsrc, fdest);

        c.disconnect();
        session.disconnect();
    }
}
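As noted above, the hard-coded server details in FileTransfer could instead be read from a properties file. Here is a minimal sketch of that idea; the ServerConfig class name, the server.properties file name and the sftp.* keys are all assumptions for illustration.

ServerConfig.java

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class ServerConfig {

    private final String username;
    private final String host;
    private final String password;

    // Loads SFTP connection details from a properties file
    // (the file name and keys are assumptions for this sketch).
    public ServerConfig(String propertiesFile) throws IOException {
        Properties props = new Properties();
        FileInputStream in = new FileInputStream(propertiesFile);
        try {
            props.load(in);
        } finally {
            in.close();
        }
        this.username = props.getProperty("sftp.username");
        this.host = props.getProperty("sftp.host");
        this.password = props.getProperty("sftp.password");
    }

    public String getUsername() { return username; }
    public String getHost()     { return host; }
    public String getPassword() { return password; }
}

FileTransfer could then take a ServerConfig in its constructor instead of embedding the credentials directly in the code.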

This is just a sample FileSplitter; several other splitting approaches would work as well. Now I move on to the actual integration of Spring Batch with MongoDB to carry out the load process.


