Integrating Spring Batch and MongoDB for ETL Over NoSQL

In today’s enterprises, I deal with applications that are either interactive or run in batch mode. Interactive applications, like Web applications, require user input. By contrast, applications that start once and end after completing their required jobs are called batch applications; they do not require frequent manual intervention. Batch applications process huge amounts of data, and they include any ETL tool that extracts, transforms and loads data through batch processes.

Through this article, I plan to showcase an ETL framework leveraging the advantages of Spring Batch and MongoDB, which demonstrates batch loading over a NoSQL database. Here I give a step-by-step demonstration of integrating Spring Batch with MongoDB.

Why Batch Processing?

The main advantage of batch applications is that they do not require any manual intervention. As a result, they can be scheduled to run at times when resources aren’t being utilized. As an example, I’ll look at an ETL tool which runs in batch mode to analyze weblogs. Several logs need to be parsed on a daily basis to fetch the required useful information. The input files are extracted and processed to obtain the required information, and the output data gets loaded to a database. This whole process is carried out in batch mode.

Batch processes mainly deal with huge amounts of data where a series of programs runs to meet the required objective. These programs can run one after the other, or they can run in parallel to speed up the execution, depending on requirements. Batch processing allows sharing of the resources; these processes are executed primarily towards the end of day when costly resources would otherwise sit idle.

Why Spring Batch?

The Spring Batch framework is designed to cater to batch applications that run on a daily basis in enterprise organizations. It leverages the benefits of the Spring Framework along with advanced services. Spring Batch is mainly used to process huge volumes of data. It offers better performance and is highly scalable through different optimization and partitioning techniques. It also provides services for logging/tracing, transaction management, job processing statistics, job restart, steps, and resource management. By using the Spring programming model, I can write the business logic and let the framework take care of the infrastructure.

Spring Batch includes three components: the batch application, the batch core (execution environment) and the batch infrastructure.

The Application component contains all the batch jobs and custom code written using Spring Batch.

The Core component contains the core runtime classes necessary to launch and control a batch job. It includes things such as a JobLauncher, Job, and Step implementations. Both Application and Core are built on top of a common infrastructure.

The Infrastructure contains readers, writers and services used by both the application and the core framework itself. They include things like ItemReader and ItemWriter (and, in this example, MongoTemplate). To use the Spring Batch framework, I need only configure and customize the XML files. All existing core services should be easy to replace or extend without any impact on the infrastructure layer.

Why MongoDB?

RDBMS has ruled the storage space for decades, so why do I suddenly need NoSQL?

In certain industries, storing and managing such huge volumes of data became a challenge that traditional RDBMSes could not cope with, and NoSQL databases came into the picture. As the name suggests, NoSQL databases forgo SQL as the query language, and they do not have a fixed table schema. These databases generally store data as key-value pairs, big tables, document stores, graphs, etc. They are typically open source and distributed, and they scale out, unlike relational databases. They seamlessly take advantage of new nodes and were designed with low-cost hardware in mind. They provide high scalability, better performance, easy replication, and greater optimization in data querying and insertion.

MongoDB is one such NoSQL database, open source and document-oriented. Instead of storing data in tables, as in a relational database, MongoDB stores structured data as JSON-like documents with dynamic schemas. MongoDB supports ad-hoc queries, indexing, data replication and load balancing. It can be used as a file system, and users can take advantage of its replication and load balancing to store files on multiple servers.

Spring Batch – MongoDB Integration

Now I’ll demonstrate integration of Spring Batch with MongoDB. First, I plan to upload a huge input data file to a MongoDB database.

For this there are multiple steps involved.

Step 1: Splitting the data file

As the input data file is pretty huge, I split it before loading; loading the huge file sequentially would be very time consuming. Therefore, I split the huge file into small parts. The file can be split using any file-splitter logic, and the parts can be distributed to different servers in the cluster so that they can be loaded in parallel for faster execution.

Here is sample code for the FileSplitter, which takes the path of the data file and the number of parts to create for that file. It also requires you to designate the output folder where you want to store the file’s split parts. Ideally, I assume that the number of parts will be the same as the number of servers present in the cluster.

First, it creates that many file objects under the output folder path and stores their names in fileNamesList, creating a corresponding BufferedWriter for each and storing those in a list. Then it reads the input file line by line and distributes the lines across the split files from fileNamesList using their corresponding BufferedWriter objects. After all the split files are created, it transfers them to the other servers using the FileTransfer class, placing them at the same output-folder location.

Here I have assumed that there are just two machines in the cluster and two parts are created. One part remains on the server where I run the FileSplitter, and the other is transferred to the machine whose details are given in the FileTransfer class. For now, I have hard coded the second server’s details in the FileTransfer class, but they could instead be read from a properties file. For example, if the main file is employee.txt, the parts created in the output folder are named employeePart1.txt and employeePart2.txt.

public class FileSplitter {
    public static void main(String[] args) throws IOException {
        int noParts = Integer.parseInt(args[0]);   // no. of parts
        String inputFile = args[1];                // input file path
        String outputFolder = args[2];             // output folder path

        // Derive part file names, e.g. employee.txt -> employeePart1.txt, employeePart2.txt
        String baseName = new File(inputFile).getName();
        int dot = baseName.lastIndexOf('.');
        String name = (dot > -1) ? baseName.substring(0, dot) : baseName;
        String ext = (dot > -1) ? baseName.substring(dot) : "";

        List<String> fileNamesList = new ArrayList<String>();
        List<BufferedWriter> writers = new ArrayList<BufferedWriter>();
        for (int i = 0; i < noParts; i++) {
            String partName = outputFolder + File.separator + name + "Part" + (i + 1) + ext;
            fileNamesList.add(partName);
            writers.add(new BufferedWriter(new FileWriter(partName)));
        }

        // Read the input file line by line and distribute the lines
        // across the split files in round-robin fashion
        BufferedReader br = new BufferedReader(new FileReader(inputFile));
        String strLine;
        int lineNo = 0;
        while ((strLine = br.readLine()) != null) {
            BufferedWriter writer = writers.get(lineNo % noParts);
            writer.write(strLine);
            writer.newLine();
            lineNo++;
        }
        br.close();
        for (BufferedWriter writer : writers) {
            writer.close();
        }

        // The first part stays on this server; transfer the rest
        // to the other server(s) via the FileTransfer class
        FileTransfer fileTransfer = new FileTransfer();
        for (int i = 1; i < noParts; i++) {
            try {
                fileTransfer.transferFile(fileNamesList.get(i), outputFolder);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}

import com.jcraft.jsch.Channel;
import com.jcraft.jsch.ChannelSftp;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.JSchException;
import com.jcraft.jsch.Session;
import com.jcraft.jsch.SftpException;

public class FileTransfer {
    public void transferFile(String fileName, String outputFolder)
            throws JSchException, SftpException {
        // Server details are hard coded here; they could instead
        // be read from a properties file
        String username = "some_username";
        String host = "some_host_name_ip_address";
        String pwd = "some_pwd";

        JSch jsch = new JSch();
        Session session = jsch.getSession(username, host, 22);
        session.setPassword(pwd);

        java.util.Properties config = new java.util.Properties();
        config.put("StrictHostKeyChecking", "no");
        session.setConfig(config);
        session.connect();

        Channel channel = session.openChannel("sftp");
        channel.connect();
        ChannelSftp c = (ChannelSftp) channel;

        // Copy the split file to the same output folder on the remote server
        c.put(fileName, outputFolder);

        c.disconnect();
        session.disconnect();
    }
}

This is just a sample FileSplitter; any other splitting logic could be used instead. Now I move on to the actual integration of Spring Batch with MongoDB to carry out the load process.

Step 2a: Configuring the job-repository.xml file

The Spring Batch framework requires a job repository to store details of the application and other information related to jobs and steps. This repository can either be created in a database or held in memory. I will use a memory-based job repository in this example.
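A minimal sketch of job-repository.xml for an in-memory repository, following the standard Spring Batch 2.x idiom of a MapJobRepositoryFactoryBean backed by a ResourcelessTransactionManager; the bean ids here are illustrative:

```xml
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
           http://www.springframework.org/schema/beans/spring-beans.xsd">

    <!-- Holds job and step metadata in memory rather than in a database -->
    <bean id="jobRepository"
          class="org.springframework.batch.core.repository.support.MapJobRepositoryFactoryBean">
        <property name="transactionManager" ref="transactionManager"/>
    </bean>

    <!-- No-op transaction manager suitable for the in-memory repository -->
    <bean id="transactionManager"
          class="org.springframework.batch.support.transaction.ResourcelessTransactionManager"/>

    <bean id="jobLauncher"
          class="org.springframework.batch.core.launch.support.SimpleJobLauncher">
        <property name="jobRepository" ref="jobRepository"/>
    </bean>
</beans>
```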



Step 2b: Configuring job.xml to load the data from a single file to MongoDB collection (table)

First, I define the job XML (FileToMongoTableJob.xml in my example). In this file, I specify the FlatFileItemReader, a class from the Spring Batch framework, and set its resource to the path of the input file. Here the resource value is file:d:/data/employee.csv, i.e., the location of the input file employee.csv. I also define the delimiter, in my case a comma, through the DelimitedLineTokenizer class. Then I define my own class EmployeeFieldSetMapper, which implements the Spring Batch framework’s FieldSetMapper interface; it binds the FieldSet values to the fields of the Employee object. If there is any calculation or processing involved, I can handle it through my EmployeeProcessor class, which implements the ItemProcessor interface of the Spring Batch framework.

After this, I specify the MongoDB details by mentioning the hostname where the database is installed along with the port number. I access the database through the MongoTemplate, which takes a reference to the database connection through its id (i.e., mongo) as one argument; the other argument is the name of the database to work with inside MongoDB, in this case “new.” Then I define my own class, MongoDBItemWriter, which implements the ItemWriter interface of Spring Batch and reads the MongoTemplate to get the details of the database.

Next, I specify the DynamicJobParameters class, which implements the JobParametersIncrementer from the Spring Batch. This works as the incrementer for the job.

Finally, I specify my batch job with the batch:step and batch:tasklet details. The batch job here is employeeProcessorJob, which contains a single step holding a tasklet whose task is to read a batch:chunk from the employeeFileItemReader. I also mention the processor and the item writer details.
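A sketch of what FileToMongoTableJob.xml could look like, wiring together the beans described above; the DefaultLineMapper nesting, the commit-interval value and the MongoFactoryBean class name are assumptions based on typical Spring Batch 2.x / Spring Data MongoDB configurations:

```xml
<bean id="employeeFileItemReader"
      class="org.springframework.batch.item.file.FlatFileItemReader">
    <property name="resource" value="file:d:/data/employee.csv"/>
    <property name="lineMapper">
        <bean class="org.springframework.batch.item.file.mapping.DefaultLineMapper">
            <property name="lineTokenizer">
                <bean class="org.springframework.batch.item.file.transform.DelimitedLineTokenizer">
                    <property name="delimiter" value=","/>
                    <property name="names"
                              value="id,name,city,designation,joiningYear,terminationYear"/>
                </bean>
            </property>
            <property name="fieldSetMapper">
                <bean class="com.infosys.springbatch.mongo.example.EmployeeFieldSetMapper"/>
            </property>
        </bean>
    </property>
</bean>

<bean id="employeeProcessor"
      class="com.infosys.springbatch.mongo.example.EmployeeProcessor"/>

<!-- MongoDB connection: hostname and port of the database server -->
<bean id="mongo" class="org.springframework.data.mongodb.core.MongoFactoryBean">
    <property name="host" value="localhost"/>
    <property name="port" value="27017"/>
</bean>

<!-- Template bound to the "new" database -->
<bean id="mongoTemplate" class="org.springframework.data.mongodb.core.MongoTemplate">
    <constructor-arg ref="mongo"/>
    <constructor-arg value="new"/>
</bean>

<bean id="mongoDBItemWriter"
      class="com.infosys.springbatch.mongo.example.MongoDBItemWriter">
    <property name="mongoTemplate" ref="mongoTemplate"/>
</bean>

<bean id="dynamicJobParameters"
      class="com.infosys.springbatch.mongo.example.DynamicJobParameters"/>

<batch:job id="employeeProcessorJob" incrementer="dynamicJobParameters">
    <batch:step id="step1">
        <batch:tasklet>
            <batch:chunk reader="employeeFileItemReader"
                         processor="employeeProcessor"
                         writer="mongoDBItemWriter"
                         commit-interval="100"/>
        </batch:tasklet>
    </batch:step>
</batch:job>
```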



The above job description is to read from a single file and insert to a Mongo table.

Step 2c: Configuring job.xml to load the data from multiple files to MongoDB collection (table)

Next, I’ll look at a job description where I read from multiple files and insert into a table through MultipleFileToMongoTableJob.xml. This job description remains the same as the one above, with just a few differences. When declaring the employeeFileItemReader, I also declare its scope as step. Because the FlatFileItemReader will run in multiple steps to read from multiple files, its resource is not a single fixed file; the resource value is given as #{stepExecutionContext[fileName]} so that it can be resolved at runtime. The employeeProcessor scope is likewise set to step.

Next, I define the details for the PartitionStep, a class in the Spring Batch framework, and give it the id step1:master. In the PartitionStep, I mention two properties: one is the reference to the jobRepository and the other is the stepExecutionSplitter, which refers to the SimpleStepExecutionSplitter class in the Spring Batch framework. This class again takes two references: the jobRepository and the step details.

Another argument that goes into this is the MultiResourcePartitioner class, which again is part of the Spring Batch framework. This class reads multiple files from the given resource. Here the resource value is file:d:/data/inputFiles/employeePart*.csv, which indicates that all the file parts (employeePart0.csv, employeePart1.csv, employeePart2.csv and so on) are read from the mentioned location.

Under the step1:master, I also define another property, partitionHandler, which refers to the class TaskExecutorPartitionHandler inside the Spring Batch framework. This class takes three properties: taskExecutor, step and the gridSize. Then I define the step details, which takes the details of the task in the form of tasklet. Inside the task I mention the reader, processor and writer details. Finally, I give the job description under file_partition_Job, where I give the reference of the step details.
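A sketch of the partitioned portion of MultipleFileToMongoTableJob.xml, here using the batch namespace’s partition element, which generates the PartitionStep / SimpleStepExecutionSplitter / TaskExecutorPartitionHandler wiring described above; the grid-size value and the SimpleAsyncTaskExecutor choice are assumptions:

```xml
<!-- Reader resolved per partition: each step execution gets one file name -->
<bean id="employeeFileItemReader" scope="step"
      class="org.springframework.batch.item.file.FlatFileItemReader">
    <property name="resource" value="#{stepExecutionContext[fileName]}"/>
    <!-- lineMapper wiring as in the single-file job -->
</bean>

<!-- Maps each employeePart*.csv file to one partition -->
<bean id="partitioner"
      class="org.springframework.batch.core.partition.support.MultiResourcePartitioner">
    <property name="resources" value="file:d:/data/inputFiles/employeePart*.csv"/>
</bean>

<!-- Runs the partitions on separate threads -->
<bean id="taskExecutor"
      class="org.springframework.core.task.SimpleAsyncTaskExecutor"/>

<batch:job id="file_partition_Job">
    <batch:step id="step1.master">
        <batch:partition step="step1" partitioner="partitioner">
            <batch:handler grid-size="2" task-executor="taskExecutor"/>
        </batch:partition>
    </batch:step>
</batch:job>

<batch:step id="step1">
    <batch:tasklet>
        <batch:chunk reader="employeeFileItemReader"
                     processor="employeeProcessor"
                     writer="mongoDBItemWriter"
                     commit-interval="100"/>
    </batch:tasklet>
</batch:step>
```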



Step 3: The class files used in defining the Jobs.xml

Below is the Employee POJO class, which holds the details/attributes of the employee with their corresponding getter/setter methods, which are not shown here.

package com.infosys.springbatch.mongo.example;

import java.io.Serializable;

public class Employee implements Serializable {
    private static final long serialVersionUID = 1L;

    private String id;
    private String name;
    private String city;
    private String designation;
    private int joiningYear;
    private int terminationYear;
    private int tenure;

    // Getter/setter methods for each attribute are omitted here.
}

The class below maps the FieldSet data to the employee attributes and creates an Employee object.

package com.infosys.springbatch.mongo.example;

import org.springframework.batch.item.file.mapping.FieldSetMapper;
import org.springframework.batch.item.file.transform.FieldSet;

public class EmployeeFieldSetMapper implements FieldSetMapper<Employee> {
    public Employee mapFieldSet(FieldSet fs) {
        if (fs == null) {
            return null;
        }
        Employee employee = new Employee();
        employee.setId(fs.readString("id"));
        employee.setName(fs.readString("name"));
        employee.setCity(fs.readString("city"));
        employee.setDesignation(fs.readString("designation"));
        employee.setJoiningYear(fs.readInt("joiningYear"));
        employee.setTerminationYear(fs.readInt("terminationYear"));
        return employee;
    }
}

The class below implements ItemProcessor and applies any processing logic involved to the employee object; here it computes the tenure.

package com.infosys.springbatch.mongo.example;

import org.springframework.batch.item.ItemProcessor;

public class EmployeeProcessor implements ItemProcessor<Employee, Employee> {
    public Employee process(Employee employee) throws Exception {
        if (employee == null) {
            return null;
        }
        // Derive the tenure from the termination and joining years
        employee.setTenure(employee.getTerminationYear() - employee.getJoiningYear());
        return employee;
    }
}

This class implements the ItemWriter, which actually writes the employee objects to the MongoDB table using the database details that have been defined in the MongoTemplate in the job XML file.

package com.infosys.springbatch.mongo.example;

import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.batch.item.ItemWriter;
import org.springframework.data.mongodb.core.MongoOperations;
import org.springframework.data.mongodb.core.MongoTemplate;

public class MongoDBItemWriter implements ItemWriter<Employee> {
    private static final Log log = LogFactory.getLog(MongoDBItemWriter.class);

    private MongoTemplate mongoTemplate;

    /**
     * @see ItemWriter#write(List)
     */
    public void write(List<? extends Employee> data) throws Exception {
        log.info("Writing " + data.size() + " employee records");
        MongoOperations operations = mongoTemplate;
        // Create the target collection on first use, then insert the chunk
        if (!operations.collectionExists("employee")) {
            operations.createCollection("employee");
        }
        operations.insertAll(data);
    }

    public void setMongoTemplate(MongoTemplate mongoTemplate) {
        this.mongoTemplate = mongoTemplate;
    }

    public MongoTemplate getMongoTemplate() {
        return mongoTemplate;
    }
}

The class below implements JobParametersIncrementer. It is used to increment the job run count.

package com.infosys.springbatch.mongo.example;

import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.JobParametersIncrementer;

public class DynamicJobParameters implements JobParametersIncrementer {
    public JobParameters getNext(JobParameters parameters) {
        if (parameters == null || parameters.isEmpty()) {
            return new JobParametersBuilder().addLong("", 1L).toJobParameters();
        }
        long id = parameters.getLong("", 1L) + 1;
        parameters = new JobParametersBuilder().addLong("", id).toJobParameters();
        return parameters;
    }
}

Step 4: Execution of the jobs mentioned in FileToMongoTableJob.xml and MultipleFileToMongoTableJob.xml

To run the jobs, I need to create a run configuration for each of them:

  • To load data from a single file to a MongoDB table, I create a run configuration whose main class is the Spring Batch job launcher and whose arguments are the XML definition file, the job id mentioned in the XML, and the job incrementer flag: FileToMongoTableJob.xml employeeProcessorJob
  • To load data from multiple files to a MongoDB table, I create a run configuration with the same main class and the corresponding arguments: MultipleFileToMongoTableJob.xml file_partition_Job
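Assuming the jobs are launched with Spring Batch’s standard CommandLineJobRunner as the main class (the classpath here is a placeholder), the two run configurations translate to commands along these lines:

```shell
# Hypothetical classpath; include Spring Batch, Spring Data MongoDB and the
# application classes. The -next flag asks the registered incrementer
# (DynamicJobParameters) for the next set of job parameters.

# Single-file load
java -cp "lib/*:classes" \
    org.springframework.batch.core.launch.support.CommandLineJobRunner \
    FileToMongoTableJob.xml employeeProcessorJob -next

# Partitioned multiple-file load
java -cp "lib/*:classes" \
    org.springframework.batch.core.launch.support.CommandLineJobRunner \
    MultipleFileToMongoTableJob.xml file_partition_Job -next
```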


The main idea behind this article is to show the end-to-end integration process between Spring Batch and MongoDB to leverage their benefits.


