Real-time Asynchronous Processing vs. Batch Processing
Real-time asynchronous processing is applicable when the processing must be performed immediately or when the results must eventually be communicated back to an on-line user. Integration solutions built upon application servers typically leverage this approach using JMS for messaging and notification.
Fine-grained, real-time timers and events are suitable when you need to handle issues as they occur. However, they can be expensive. You spend CPU cycles raising, maintaining, and checking events. Event and timer processes may contend with your on-line processes for resources and for access to the same data, leading to locking problems. Processing large volumes as many small events is not as efficient as processing large data sets in one go; it's hard to achieve economies of scale.
Batch solutions are ideal for processing that is time-based and/or state-based, rather than driven by real-time events:
- Time-based: The business function executes on a periodic or recurring basis. The function is required to run only on a daily/weekly/monthly/yearly cycle or at a time denoted by the business.
- State-based: The business function operates over data that matches a given state where state can include simple, complex, and time-based criteria.
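As a sketch of a state-based criterion, the hypothetical example below selects records whose state matches both a simple condition (a status value) and a time-based condition (a minimum age). The `Order` record, the `APPROVED` status, and the 30-day cutoff are all illustrative assumptions, not taken from any particular system.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical state-based selection: a batch run picks up orders
// that are APPROVED and at least 30 days old (state + time criteria).
public class StateCriterion {
    record Order(long id, String status, LocalDate created) {}

    static List<Order> eligible(List<Order> orders, LocalDate today) {
        return orders.stream()
                .filter(o -> "APPROVED".equals(o.status()))          // simple state criterion
                .filter(o -> !o.created().isAfter(today.minusDays(30))) // time-based criterion
                .collect(Collectors.toList());
    }
}
```

In a real system this predicate would typically live in the WHERE clause of the batch input query rather than in application code, so the DBMS does the filtering.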
Batch processes are data-centric and can efficiently crunch through large volumes off-line without affecting your on-line systems.
Think of your batch process as a pipe. The goal is to push as much data down that pipe as possible, as fast as you can. Minimize the obstructions to your information flow by assembling all the data upfront and passing it downstream.
Aim for high resource (CPU/memory) utilization. Efficient batch processes will be CPU/memory bound not I/O bound. Adding hardware and/or software can easily scale such processes. I/O-bound functions are harder to scale. I/O problems can sometimes be fixed with simple DBMS or logging optimizations, but more often than not they require some sort of redesign.
If your batch process consumes resources aggressively then you shouldn't run it at the same time as your on-line systems. It's safer to run your batch processes during quiet times when resources are available and data is stable. In some cases, this may be a prerequisite of your job!
The keys to efficient batch processing over high volumes are effective partitioning and efficient processing code.
Job Partitioning and Chunking
The first step in developing a batch process is to logically partition, or break up, your job into smaller chunks of work. A chunk of work is basically a set of N records to process in one thread and in one transaction. Decide whether you want your job divided into parallel chunks, sequential chunks, or a combination of both.
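A minimal sketch of this step, assuming the job's input has already been reduced to a list of record keys: split the keys into chunks of at most N, where each chunk is destined for one thread and one transaction. The names here are illustrative, not from any particular framework.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal chunker sketch: split a list of record keys into chunks of
// at most chunkSize keys. Each chunk is later processed in its own
// thread and its own transaction.
public class Chunker {
    static <T> List<List<T>> chunk(List<T> keys, int chunkSize) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < keys.size(); i += chunkSize) {
            // subList gives a view over [i, i + chunkSize), clamped to the end
            chunks.add(keys.subList(i, Math.min(i + chunkSize, keys.size())));
        }
        return chunks;
    }
}
```

For example, 7 keys with a chunk size of 3 yield three chunks of sizes 3, 3, and 1.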
Jobs that have business ordering or sequencing constraints cannot use concurrent processing. This is because multi-threading doesn't guarantee order across threads. However, you should still consider breaking up your job into serial chunks to make it more reliable.
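For jobs without ordering constraints, the chunks can be handed to a bounded thread pool; a job with sequencing constraints would instead iterate the same chunks in order on a single thread. This is a sketch using the standard `java.util.concurrent` pool, with illustrative names.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch: run independent chunks concurrently on a fixed-size pool.
// A sequential job would simply loop over the chunks in order instead.
public class ParallelRunner {
    static void runParallel(List<Runnable> chunks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (Runnable chunk : chunks) pool.submit(chunk);
        pool.shutdown(); // no new work; queued chunks still run
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Note that submission order gives no guarantee about completion order across threads, which is exactly why ordered jobs cannot use this path.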
Running your job as one long-lived transaction isn't a good idea. Long-lived transactions are more fragile than short-lived transactions. They are more susceptible to errors and DBMS locking problems because the locks are held longer. Smaller chunks are more reliable and less likely to exceed your connection and session timeouts.
Large chunks that process lots of records are more efficient than small chunks, but they consume more memory. Large chunks mean potentially big wins but also big losses. If an error occurs during processing, the entire chunk rolls back. This won't affect other chunks, but you will have to rerun the failed chunk. Balance your reliability, memory, and CPU constraints against your performance targets.
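The one-transaction-per-chunk pattern can be sketched as follows. `TxRunner` is a hypothetical stand-in for whatever transaction API you use (a JTA `UserTransaction`, Spring's `TransactionTemplate`, etc.); the point is that a failure rolls back and is recorded for only the failing chunk, while other chunks commit independently.

```java
import java.util.List;
import java.util.function.Consumer;

// Sketch of one-transaction-per-chunk processing. An error rolls back
// only the failing chunk; the loop carries on with the next chunk.
public class ChunkTx {
    // Hypothetical transaction wrapper: runs the work, then commits,
    // or rolls back if the work throws.
    interface TxRunner { void inTransaction(Runnable work); }

    static <T> int processChunks(List<List<T>> chunks, TxRunner tx,
                                 Consumer<List<T>> worker) {
        int failed = 0;
        for (List<T> chunk : chunks) {
            try {
                tx.inTransaction(() -> worker.accept(chunk)); // commit per chunk
            } catch (RuntimeException e) {
                failed++; // this chunk rolled back; note it for a rerun
            }
        }
        return failed;
    }
}
```

A real job would also persist which chunks failed so that a rerun can target just those chunks.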
Be sure to divide the work in such a way that independent chunks of work, executing in parallel, do not update or reference the same records. In addition, the partitioning should take into account any previous batch run, so that records are not processed more than once (unless required by design). The safest way is to divide your chunks so that they describe distinct data sets. That way, you avoid any possibility of data contention.
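One simple way to guarantee distinct data sets, sketched below under the assumption that each record has a numeric key: assign every key to exactly one of N partitions by key modulo N, so no two parallel chunks can ever touch the same record.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of disjoint partitioning: each key lands in exactly one
// partition (key mod workers), so parallel chunks never overlap.
public class Partitioner {
    static List<List<Long>> partition(List<Long> keys, int workers) {
        List<List<Long>> parts = new ArrayList<>();
        for (int i = 0; i < workers; i++) parts.add(new ArrayList<>());
        for (long key : keys) {
            // floorMod keeps the index non-negative even for negative keys
            parts.get((int) Math.floorMod(key, workers)).add(key);
        }
        return parts;
    }
}
```

In practice the same idea is often expressed in SQL (e.g. `WHERE MOD(id, :workers) = :worker`), so each worker's input query reads a disjoint slice directly from the database.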
Be aware of network costs, especially if running in a multi-server distributed or clustered environment. Don't pass too much data across the wire between your chunkers and workers. Pass the bare minimum amount of data necessary to assist your worker input queries. Aim for coarse-grained chunking. Design your chunker to pick up roughly what you want. Its responsibility is to partition your work into thread-safe executable chunks, not to make fine-grained decisions over your input data. Your worker can make these distinctions when it reads in its chunk data or on the fly in the processing code.
Optimize your chunks to process as much data as they can within your data and resource constraints. Don't make your job partitioning too fine-grained (by dividing 1,000 rows into 1,000 parallel chunks of 1 row each, for example). The setup and serialization costs will be high and your job may not run any faster. Chunk sensibly.