J2EE Batch Processing Architecture
Once you identify the need for a batch solution, the next step is to design your system so that it can scale to meet your volume and throughput requirements. Batch processes are optimized to achieve the maximum throughput for a particular task, usually prioritizing performance over flexibility.
In the context of J2EE, this poses some interesting challenges. EJB containers are geared for session-oriented usage. They do not allow explicit thread creation and strictly control the number of threads and database connections that can be created through the use of thread and connection pools. Resources can be held only for so long before they time out. Your batch processes must work within the constraints imposed by the container.
Isn't it just simpler to build an entirely separate batch architecture outside the container? Not really. You'd have to build everything from the ground up, including your database and threading frameworks. It's an expensive exercise, and you won't be able to leverage your EJB investment. Business logic implemented inside your container as session or entity beans will have to be replicated or ported over.
Your J2EE batch framework should provide a generalized mechanism for the following:
- Basic job control (start and stop)
- Job partitioning
- Parallel processing and distribution
- Fine-grained transaction control
- Error handling
- Job monitoring
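One way to picture these responsibilities is as a handful of framework-level contracts. The sketch below is illustrative only; the names (Chunk, Chunker, BatchJobController) are hypothetical and not part of any J2EE API:

```java
import java.io.Serializable;
import java.util.List;

// Pure data: the parameters that bound one unit of work. Serializable so it
// can travel to a remote session bean.
interface Chunk extends Serializable {
}

// Partitions a job into independent, non-overlapping chunks. The partitioning
// logic itself is job-specific and plugged in by each batch job.
interface Chunker {
    List<Chunk> partition();
}

// Basic job control plus the stop flag used for a graceful halt.
interface BatchJobController {
    void start();
    void requestStop();
    boolean isStopRequested();
}
```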
The key to achieving parallelism in J2EE is the ability to partition your job into smaller units of work (chunks) that can be processed in parallel. Think of a chunk as an atomic thread-safe unit of work that you can execute in a single transaction and potentially in parallel with other chunks. A chunk can encompass many records. By partitioning prior to execution, you preempt any potential locking problems that thread contention causes. Processing code can assume efficient optimistic locking strategies. However, each batch job is different, and job partitioning must be considered on a case-by-case basis. For this reason, your framework should provide the mechanism but not constrain the actual partitioning logic.
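As a concrete illustration, a common partitioning scheme splits a numeric key range into fixed-size, non-overlapping chunks. This sketch assumes sequential numeric primary keys; the class names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// A chunk is a pure data holder: the key range one worker will query on.
class KeyRangeChunk {
    final long lowId;   // inclusive
    final long highId;  // inclusive
    KeyRangeChunk(long lowId, long highId) {
        this.lowId = lowId;
        this.highId = highId;
    }
}

class KeyRangeChunker {
    // Split [minId, maxId] into chunks of at most chunkSize keys each.
    // Ranges never overlap, so workers cannot contend for the same rows.
    static List<KeyRangeChunk> partition(long minId, long maxId, long chunkSize) {
        List<KeyRangeChunk> chunks = new ArrayList<KeyRangeChunk>();
        for (long low = minId; low <= maxId; low += chunkSize) {
            long high = Math.min(low + chunkSize - 1, maxId);
            chunks.add(new KeyRangeChunk(low, high));
        }
        return chunks;
    }
}
```

Because the ranges are disjoint and fixed before execution starts, no two workers can ever touch the same row, which is what makes optimistic locking safe.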
Once you've partitioned a job according to your business and resource constraints, you may safely farm out individual chunks to multiple threads and servers. You can scale the job by distributing load across threads on one or more servers running on different machines. Even farming out to multiple threads on a single server running on a multi-CPU machine can reap substantial performance benefits.
The simplest way to achieve parallelism in J2EE is to initiate and control your threads from a client-side thread pool outside your application servers. However, be sure to limit your thread pool to a reasonable number. If you spawn too many threads you could flood your application servers and consume resources too quickly for them to handle. Such tactics are likely to bring your servers to a halt.
Feed your chunks into the client thread pool, which will manage them (see Figure 1). Typically, you will have fewer threads than you do chunks. The thread pool will allocate free threads as they become available. Each client thread is responsible for executing a chunk of work. The thread passes the chunk into an application server via a stateless session bean that provides transaction semantics (i.e., deployed using TX_REQUIRED). This guarantees that the chunk is executed atomically inside a transactional context.
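The client-side controller might look something like the following sketch, which uses a java.util.concurrent fixed-size pool in place of a hand-rolled one; the remote session bean call is simulated with a counter:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class BatchClient {
    // Feed all chunks into a bounded pool. With far fewer threads than
    // chunks, the application servers are never flooded; each free thread
    // picks up the next chunk as it becomes available.
    static int runJob(List<long[]> chunks, int poolSize) throws InterruptedException {
        final AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (final long[] chunk : chunks) {
            pool.submit(new Runnable() {
                public void run() {
                    // Stand-in for the remote call to the transactional
                    // session bean that executes this chunk atomically.
                    processed.incrementAndGet();
                }
            });
        }
        pool.shutdown();                        // no new chunks accepted
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return processed.get();
    }
}
```

The key design point is the bounded pool size: it is the throttle that keeps the client from consuming server resources faster than the container can handle.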
The framework session bean should be business agnostic. It doesn't contain any business-specific logic; it just provides transaction demarcation. The actual work is delegated to a batch-job-specific worker class to execute. The session bean must either commit the chunk or, in the case of an error, roll back. Remember that the chunk is executed in one transaction, so any errors raised to the session bean will cause the entire chunk of work to be rolled back, not just the current row that caused the error!
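The delegation pattern can be sketched in plain Java as follows. In the real framework, commit and rollback come from the container via TX_REQUIRED; here a boolean return value stands in for the transaction outcome, and the names are illustrative:

```java
// The batch-job-specific worker: the only place business logic lives.
interface ChunkWorker {
    void process(long lowId, long highId) throws Exception;
}

// Stand-in for the framework session bean: it knows nothing about the
// business, only how to run a worker and commit or roll back the chunk.
class ChunkExecutor {
    // Returns true on commit, false when the chunk was rolled back.
    static boolean execute(ChunkWorker worker, long lowId, long highId) {
        try {
            worker.process(lowId, highId);  // all rows of the chunk
            return true;                    // container would commit here
        } catch (Exception e) {
            // Any error rolls back the WHOLE chunk, not just one row.
            return false;                   // container would roll back here
        }
    }
}
```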
A chunk is just a data transfer object; it doesn't contain any behavior or logic. The worker class that processes the chunk does the actual work. That way, partitioning semantics are clearly separate from processing logic. This facilitates reuse and allows batch jobs to plug in specific business processing logic.
A chunk contains query parameters (set by the chunker) that the worker uses to discriminate its data pickup queries. The worker acts only on the data described by the chunk, nothing else. This guarantees job partitioning across workers. Workers pick up the records to process using the chunk parameters to qualify their queries and then process the records to completion.
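For example, a worker's data pickup query might be qualified by the chunk's key range like this. The table and column names are invented for illustration; in real code you would bind the values through a JDBC PreparedStatement rather than concatenate them:

```java
class PickupQuery {
    // Build the data-pickup query for one chunk. Qualifying the WHERE clause
    // with the chunk's key range is what guarantees each worker sees only
    // its own partition of the data.
    static String forChunk(long lowId, long highId) {
        return "SELECT id, payload FROM pending_orders"
             + " WHERE id BETWEEN " + lowId + " AND " + highId
             + " ORDER BY id";
    }
}
```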
Once a batch job starts, it can sometimes be difficult to stop, especially if it's been farmed out over multiple servers and threads. Killing your servers manually to stop a rogue process isn't the best way. Your batch framework should provide a controlled business mechanism for halting your batch processes, but it mustn't corrupt your data. The safest approach is to set a stop flag in a central place, typically the database, although it can just be a global file. The various parts of your batch process can monitor the stop flag, so that when it gets set, they can wind processing down gracefully, preserving data integrity.
On the client side, your batch controller periodically polls the flag to check whether it is set. Once it detects a stop, it stops spawning threads. This ensures that no new processing is initiated. On the server side, workers should check the stop flag at key points in the code. On a stop, the worker should stop its current processing loop and just return. How quickly the process halts depends on how many workers are currently running and how frequently they check the stop flag.
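The worker-side half of this protocol can be sketched as follows, with an in-memory AtomicBoolean standing in for the database or file flag the real framework would poll:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Stand-in for the shared stop flag; in the framework this would be a row
// in the database (or a global file) polled by every participant.
class StopFlag {
    private static final AtomicBoolean stop = new AtomicBoolean(false);
    static void set()      { stop.set(true); }
    static boolean isSet() { return stop.get(); }
}

class PollingWorker {
    // Process up to maxRecords, checking the flag once per record; returns
    // the number actually processed before a stop was observed. The chunk's
    // transaction still commits or rolls back as a whole.
    static int processChunk(int maxRecords) {
        int done = 0;
        for (int i = 0; i < maxRecords; i++) {
            if (StopFlag.isSet()) break;  // wind down gracefully
            done++;                       // stand-in for one record's work
        }
        return done;
    }
}
```

Checking once per record is the finest-grained (and slowest) option; in practice a worker might check every N records to amortize the cost of reading a database-resident flag.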