o you thought batch programming had died and gone the way of the COBOL dinosaur. Think again. Despite the trends toward real-time solutions, companies still run essential business processes using batch technologies?and with good reason. Batch processing has its place, and it’s not just within legacy systems. Some problems just aren’t suited to the trendy on-line, real-time solutions. High-volume business processes, such as bulk data upload and downloads, data cleansing (ETL? Extract, Transform, Load), file import and export, billing and letter, statement and report generation, are typically best performed as batches.
Identifying Batch Scenarios
Batch processes are not typically associated with J2EE architectures. But this doesn’t mean that J2EE can’t cut it?quite the contrary. Rather, when to apply a batch solution and how to implement it in the context of J2EE are generally misunderstood.
To understand whether you have a potential batch scenario, first examine your business problem for a few tell-tail indicators. For example, does your business function fall into one or more of the following scenarios:
- Not required to be triggered or processed in real-time
- High volume, involving thousands, hundreds of thousands, or millions of data rows or transactions
- Computationally expensive, and you don’t want this cost to be part of your on-line application
- Unable to be triggered by a particular user action as the data is incomplete or unstable; Data stabilises after the fact or when some other business process occurs
- Triggered by a high-level, overarching business or time-based event
If it does, you have two choices:
- Batch processing: Use batch processes to process the work off-line.
- Real-time asynchronous processing: Use fine-grained timers or asynchronous event triggers (such as JMS) to pass responsibility from on-line to back-end functions.
J2EE Batch Processing Architecture
Once you identify the need for a batch solution, the next step is to design your system so that it can scale to meet your volume and throughput requirements. Batch processes are optimized to achieve the maximum throughput for a particular task, usually prioritizing performance over flexibility.
In the context of J2EE, this poses some interesting challenges. EJB containers are geared for session-oriented usage. They do not allow explicit thread creation and strictly control the number of threads and database connections that can be created through the use of thread and connection pools. Resources can be held only for so long before they timeout. Your batch processes must work within the constraints imposed by the container.
Isn’t it just simpler to build an entirely separate batch architecture outside the container? Not really. You’d have to build everything from the ground up, including your database and threading frameworks. It’s an expensive exercise, and you won’t be able to leverage your EJB investment. Business logic implemented inside your container as session or entity beans will have to be replicated or ported over.
Your J2EE batch framework should provide a generalized mechanism for the following:
- Basic job control (start and stop)
- Job partitioning
- Parallel processing and distribution
- Fine-grained transaction control
- Error handling
- Job monitoring
The key to achieving parallelism in J2EE is the ability to partition your job into smaller units of work (chunks) that can be processed in parallel. Think of a chunk as an atomic thread-safe unit of work that you can execute in a single transaction and potentially in parallel with other chunks. A chunk can encompass many records. By partitioning prior to execution, you preempt any potential locking problems that thread contention causes. Processing code can assume efficient optimistic locking strategies. However, each batch job is different, and job partitioning must be considered on a case-by-case basis. For this reason, your framework should provide the mechanism but not constrain the actual partitioning logic.
Once you’ve partitioned a job according to your business and resource constraints, you may safely farm out individual chunks to multiple threads and servers. You can scale the job by distributing load across threads on one or more servers running on different machines. Even farming out to multiple threads on a single server multi-CPU machine can reap substantial performance benefits.
The simplest way to achieve parallelism in J2EE is to initiate and control your threads from a client-side thread pool outside your application servers. However, be sure to limit your thread pool to a reasonable number. If you spawn too many threads you could flood your application servers and consume resources too quickly for them to handle. Such tactics are likely to bring your servers to a halt.
Feed your chunks into the client thread pool, which will manage them (see Figure 1). Typically, you will have fewer threads than you do chunks. The thread pool will allocate free threads as they become available. Each client thread is responsible for executing a chunk of work. The thread passes the chunk into an application server via a stateless session bean that provides transaction semantics (i.e., deployed using TX_REQUIRED). This guarantees that the chunk is executed atomically inside a transactional context.
|Figure 1: Batch Processing Architecture|
The framework session bean should be business agnostic. It doesn’t contain any business-specific logic; it just provides transaction demarcation. The actual work is delegated to a batch-job-specific worker class to execute. The session bean must either commit the chunk or, in the case of an error, roll back. Remember that the chunk is executed in one transaction so any errors raised to the session bean will cause the entire chunk of work to be rolled back?not just the current row that caused the error!
A chunk is just a data transfer object; it doesn’t contain any behavior or logic. The worker class that processes the chunk does the actual work. That way, partitioning semantics are clearly separate from processing logic. This facilitates reuse and allows batch jobs to plug in specific business processing logic.
A chunk contains query parameters (set by the chunker) that the worker uses to discriminate its data pickup queries. The worker acts only on the data described by the chunk, nothing else. This guarantees job partitioning across workers. Workers pick up the records to process using the chunk parameters to qualify their queries and then process the records to completion.
Once a batch job starts, it can sometimes be difficult to stop?especially if its been farmed out over multiple servers and threads. Killing your servers manually to stop a rogue process isn’t the best way. Your batch framework should provide a controlled business mechanism for halting your batch processes, but it mustn’t corrupt your data. The safest approach is to set a stop flag in a central place, typically the database although it can just be a global file. The various parts of your batch process can monitor the stop flag, so that when it gets set, they can wind processing down gracefully, preserving data integrity.
On the client side, your batch controller periodically polls the flag to check whether it is set. Once it detects a stop, it stops spawning threads. This ensures that no new processing initiates. On the server side, workers should check the stop flag at key points in the code. On a stop, the worker should stop its current processing loop and just return. How quickly the process halts depends on how many workers are currently running and how frequently they check the stop flag.
Real-time Asynchronous Processing vs. Batch Processing
Real-time asynchronous processing is applicable when the processing must be performed immediately or when the results must eventually be communicated back to an on-line user. Integration solutions built upon application servers typically leverage this approach using JMS for messaging and notification.
Fine-grained, real-time timers and events are suitable for when you need to handle the issues in real-time. However, they can be expensive. You spend CPU cycles raising, maintaining, and checking events. Event/timer processes may contend with your on-line processes for resources and for access to the same data, leading to locking problems. Using many small events to process large volumes is not as efficient as processing large data sets in one go. It’s hard to get efficiencies of scale.
Batch solutions are ideal for processing that is time or non-real-time event based and/or state based:
- Time-based: The business function executes on a periodic or recurring basis. The function is required to run only on a daily/weekly/monthly/yearly cycle or at a time denoted by the business.
- State-based: The business function operates over data that matches a given state where state can include simple, complex, and time-based criteria.
Batch processes are data-centric and can efficiently crunch through large volumes off-line without affecting your on-line systems.
Think of your batch process as a pipe. The goal is to push as much data down that pipe as possible, as fast as you can. Minimize the obstructions to your information flow by assembling all the data upfront and passing it downstream.
Aim for high resource (CPU/memory) utilization. Efficient batch processes will be CPU/memory bound not I/O bound. Adding hardware and/or software can easily scale such processes. I/O-bound functions are harder to scale. I/O problems can sometimes be fixed with simple DBMS or logging optimizations, but more often than not they require some sort of redesign.
If your batch process consumes resources aggressively then you shouldn’t run it at the same time as your on-line systems. It’s safer to run your batch processes during quiet times when resources are available and data is stable. In some cases, this may be a prerequisite of your job!
The keys to efficient batch processing over high volumes are effective partitioning and efficient processing code.
Job Partitioning and Chunking
The first step in developing a batch process is to logically partition or breakup your job into smaller chunks of work. A chunk of work is basically a set of N records to process in one thread and in one transaction. Decide whether you want your job divided up into parallel chunks, sequential chunks, or a combination of both.
Jobs that have business ordering or sequencing constraints cannot use concurrent processing. This is because multi-threading doesn’t guarantee order across threads. However, you should still consider breaking up your job into serial chunks to make it more reliable.
Running your job as one long-lived transaction isn’t a good idea. Long-lived transactions are more fragile than short-lived transactions. They are more susceptible to errors and DBMS locking problems because the locks are held longer. Smaller chunks are more reliable and less likely to exceed your connection and session timeouts.
Large chunks that process lots of records are more efficient than small chunks, but they consume more memory. Large chunks mean potentially big wins but also big losses. If an error occurs during processing then the entire chunk will roll back. This won’t affect other chunks but you may have to rerun your erroneous chunk again. Balance your reliability, memory, and CPU constraints against your performance targets.
Be sure to divide the work in such a way that independent chunks or work, executing in parallel, do not update or reference the same records. In addition, the partitioning should take account of any previous batch run. This is so records are not processed more than once (unless required by design). The safest way is to divide your chunks so that they describe distinct data sets. That way, you avoid any possibility of data contention.
Be aware of network costs, especially if running in a multi-server distributed or clustered environment. Don’t pass too much data across the wire between your chunkers and workers. Pass the bare minimum amount of data necessary to assist your worker input queries. Aim for coarse-grained chunking. Design your chunker to pick up roughly what you want. Its responsibility is to partition your work into thread-safe executable chunks, not to make fine-grained decisions over your input data. Your worker can make these distinctions when it reads in its chunk data or on the fly in the processing code.
Optimize your chunks to process as much data as they can within your data and resource constraints. Don’t make your job partitioning too fine-grained (by dividing 1,000 rows into 1,000 parallel chunks of 1 row each, for example). The setup and serialization costs will be high and your job may not run any faster. Chunk sensibly.
You can employ several approaches to partition up large amounts of work, including the following:
- Strategy 1: Query and exclude
This is the most lightweight, but least scalable approach. Query your business tables for a list of records to process. Exclude records that have already been processed (by using either an exception or an outer join to result tables or to the same table). To partition by the number of transactions, divide the total number of records by the number of records to be processed in one unit of work. This yields the number of units of work to be processed. For each unit of work, create a chunk and store the start transaction ID and end transactional ID for a record range inside the chunk (e.g., 1-100 and 101-200 and so on).
You can also partition work along business lines and by database stripes. This is a useful technique for tuning batch queries. Database striping is the partitioning of tables across physical media (disk). By marrying your batch partitioning strategy to your stripping, you can accelerate your query response times. Chunk data is guaranteed to be stored on a single contiguous data volume and cache hits for the batch queries will be correspondingly high.
- Strategy 2: Process and mark
Exceptions joins are easy to implement, and they don’t impact the business schema. However, not all databases support exception joins, and they don’t scale particularly well for very large volumes. The second approach requires records to be marked as they get processed. That way, chunk reads can exclude records that have already been processed. This is more efficient but requires a column to be set aside to store a completion flag and the workers to issue additional updates.
- Strategy 3: Queue, collect, and drain
This is the preferred and most efficient approach. Assemble work to be done in a work or queuing table. The holding table may be populated either as part of your normal on-line processing or when your batch job partitions. Ensure that your DBMS has enough capacity to contain your work table.
Partition the work table records into chunks. The simplest way is to simply mark each set of N records with a chunk ID. Inside, your workers use the chunk ID to read in the records to be processed. Once all records have been processed, either mark them as complete or remove them from the work table. This approach has the advantage of not impacting your business tables. You can easily monitor progress by keeping track of the records draining from the holding table.
Partitioning effectively is the first part of the batch equation. But no matter how good your partitioning is, your job won’t scale if the actual processing code is slow. Your processing components (workers) should be as streamlined as possible. Try not to obstruct your flow with redundant data lookups, looping constructs, unnecessary logic, or abstraction patterns. Your batch-processing code should be as streamlined as possible and, in most cases, procedural in nature: no unnecessary object creation, string manipulation, I/O, logging, or validation checks.
Reduce the number of session beans and remote calls, and use plain old Java objects (POJOs) instead. Avoid initiating new transactions using TX_NEW or explicit rollbacks. Remember that the framework provides a transactional context and controls database commits and rollbacks. You shouldn’t attempt to wrest transaction control from the framework because it can lead to data integrity problems. If one transaction fails and the other succeeds, your data will be partially committed (i.e., in an incomplete or unknown state).
Start Simple and Expand
Keep your business workers simple. Reduce them down to the minimum number of steps possible, and keep the DBMS updates to a minimum. Complex processes are hard to tune, and they can have a lot of dependencies. Start off with something basic then expand as required. Sometimes breaking complex business operations up into several batch processes is more efficient than putting them all into one. Use this approach when different parts of your business process have different partitioning constraints. Ensure that the outputs of your first batch process logically flow into the inputs of the next.
Reduce Data Lookups
Try not to perform redundant or excessive lookups. The aim is to assemble as much of your data up front (within your memory constraints). Lots of small I/O is less efficient than bulk I/O. Sometimes you can’t avoid the extra data lookups because it isn’t possible or logical to include them as part of your worker queries. If your process is executed on a quiescent or dedicated system, you can assume that nobody else is touching your data. In this situation, you can reduce your additional read overheads by replacing them with dedicated caches. You don’t have to worry about cache coherency, because your data is not being changed by anyone else but you.
There are two aspects to dealing with errors. The first is error notification, understanding when an error occurs and what caused it. The second is job management, controlling whether the chunk and the overall job stops or continues. Error notification can be as simple as logging diagnostics to a file, issuing an email or console alert, or marking the error in the database. Whether the chunk commits or rolls back is determined by the business problem. If the job is executed serially, the current chunk must be rolled back and the overall job stopped. This is because work cannot continue. However, if the job is executed in parallel, it is possible to roll back the current chunk and, provided the error isn’t fatal, leave the rest of the job to continue as normal.
To Batch or Not to Batch
Batch programs have their place regardless of the technology you use. Whether you write a batch program or a real-time, asynchronous component is an architectural decision that should be based on business drivers not technology trends. J2EE is uniquely positioned to support both. Its ability to manage transactions, and distribute and scale across multiple processors or servers makes it an ideal platform on which to build high-volume transaction processing components.