You can employ several approaches to partition large amounts of work, including the following:
- Strategy 1: Query and exclude
This is the most lightweight but least scalable approach. Query your business tables for a list of records to process, excluding records that have already been processed (by using either an exception join or an outer join to result tables or to the same table). To partition by the number of transactions, divide the total number of records by the number of records to be processed in one unit of work; this yields the number of units of work to be processed. For each unit of work, create a chunk and store the start and end transaction IDs for the record range inside the chunk (e.g., 1-100, 101-200, and so on).
You can also partition work along business lines and by database stripes. This is a useful technique for tuning batch queries. Database striping is the partitioning of tables across physical media (disk). By marrying your batch partitioning strategy to your striping, you can accelerate your query response times: chunk data is guaranteed to be stored on a single contiguous data volume, and cache hits for the batch queries will be correspondingly high.
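The chunk-range arithmetic just described can be sketched as a small helper. This is an illustrative sketch only; `ChunkPartitioner` and its `Chunk` type are hypothetical names, not part of any framework:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: divide a contiguous range of record IDs into
// fixed-size chunks, storing the start and end transaction IDs per chunk.
public class ChunkPartitioner {

    // A chunk is just an inclusive [startId, endId] record range.
    public static final class Chunk {
        public final long startId;
        public final long endId;
        Chunk(long startId, long endId) {
            this.startId = startId;
            this.endId = endId;
        }
    }

    // Dividing totalRecords by recordsPerChunk (rounding up) yields the
    // number of units of work; each unit covers a contiguous ID range
    // such as 1-100, 101-200, and so on.
    public static List<Chunk> partition(long totalRecords, long recordsPerChunk) {
        List<Chunk> chunks = new ArrayList<>();
        for (long start = 1; start <= totalRecords; start += recordsPerChunk) {
            long end = Math.min(start + recordsPerChunk - 1, totalRecords);
            chunks.add(new Chunk(start, end));
        }
        return chunks;
    }
}
```

In a real job, each stored range would then drive the worker's bounded read query (`WHERE id BETWEEN startId AND endId`).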
- Strategy 2: Process and mark
Exception joins are easy to implement, and they don't impact the business schema. However, not all databases support exception joins, and they don't scale particularly well for very large volumes. The second approach requires records to be marked as they are processed. That way, chunk reads can exclude records that have already been processed. This is more efficient, but it requires a column to be set aside to store a completion flag, and the workers must issue additional updates.
- Strategy 3: Queue, collect, and drain
This is the preferred and most efficient approach. Assemble the work to be done in a work (queuing) table. The holding table may be populated either as part of your normal online processing or when your batch job partitions the work. Ensure that your DBMS has enough capacity to hold your work table.
Partition the work table records into chunks. The simplest way is to mark each set of N records with a chunk ID. Your workers then use the chunk ID to read in the records to be processed. Once all records have been processed, either mark them as complete or remove them from the work table. This approach has the advantage of not impacting your business tables, and you can easily monitor progress by keeping track of the records draining from the holding table.
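The queue-collect-and-drain mechanics can be modeled in miniature. This is a hedged, in-memory sketch of the idea — in a real job the chunk-ID assignment and the drain would be UPDATE and DELETE statements against the work table, and `WorkTable` is an illustrative name:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative in-memory model of a work (holding) table: records are
// tagged with a chunk ID in blocks of N, a worker drains one chunk at a
// time, and progress is visible as the count of records remaining.
public class WorkTable {
    private final Map<Long, Integer> chunkIdByRecord = new LinkedHashMap<>();
    private final int chunkSize;
    private long nextPosition = 0;

    public WorkTable(int chunkSize) {
        this.chunkSize = chunkSize;
    }

    // Populate the table; every block of chunkSize records shares a chunk ID.
    public int add(long recordId) {
        int chunkId = (int) (nextPosition++ / chunkSize);
        chunkIdByRecord.put(recordId, chunkId);
        return chunkId;
    }

    // A worker reads its chunk by ID, processes it, then removes the
    // records from the table (the "drain").
    public void drainChunk(int chunkId) {
        chunkIdByRecord.values().removeIf(id -> id == chunkId);
    }

    // Monitoring progress is simply counting what is left.
    public int remaining() {
        return chunkIdByRecord.size();
    }
}
```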
Partitioning effectively is the first part of the batch equation. But no matter how good your partitioning is, your job won't scale if the actual processing code is slow. Your processing components (workers) should be as streamlined as possible. Try not to obstruct your flow with redundant data lookups, looping constructs, unnecessary logic, or abstraction patterns. Your batch-processing code should, in most cases, be procedural in nature: no unnecessary object creation, string manipulation, I/O, logging, or validation checks.
Reduce the number of session beans and remote calls, and use plain old Java objects (POJOs) instead. Avoid initiating new transactions using TX_NEW or explicit rollbacks. Remember that the framework provides a transactional context and controls database commits and rollbacks. You shouldn't attempt to wrest transaction control from the framework because it can lead to data integrity problems. If one transaction fails and the other succeeds, your data will be partially committed (i.e., in an incomplete or unknown state).
Start Simple and Expand
Keep your business workers simple. Reduce them to the minimum number of steps possible, and keep the DBMS updates to a minimum. Complex processes are hard to tune, and they can have a lot of dependencies. Start with something basic, then expand as required. Sometimes breaking complex business operations into several batch processes is more efficient than putting them all into one. Use this approach when different parts of your business process have different partitioning constraints. Ensure that the outputs of your first batch process logically flow into the inputs of the next.
Reduce Data Lookups
Try not to perform redundant or excessive lookups. The aim is to assemble as much of your data as possible up front (within your memory constraints), because lots of small I/O operations are less efficient than bulk I/O. Sometimes you can't avoid extra data lookups because it isn't possible or logical to include them as part of your worker queries. If your process executes on a quiescent or dedicated system, you can assume that nobody else is touching your data. In this situation, you can reduce the additional read overhead by replacing repeated lookups with dedicated caches. You don't have to worry about cache coherency, because no one but you is changing the data.
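On a quiescent system, such a dedicated cache can be as simple as a read-through map, since entries never need invalidation. A minimal sketch, with `DedicatedCache` as an illustrative name and the loader function standing in for the database read:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative read-through cache for a quiescent or dedicated system:
// because nobody else is modifying the data, entries never go stale, so
// a plain HashMap suffices and no coherency protocol is needed.
public class DedicatedCache<K, V> {
    private final Map<K, V> entries = new HashMap<>();
    private final Function<K, V> loader; // stands in for the DB lookup
    private int loads = 0;

    public DedicatedCache(Function<K, V> loader) {
        this.loader = loader;
    }

    // First access pays the lookup cost; every later access is a hit.
    public V get(K key) {
        return entries.computeIfAbsent(key, k -> {
            loads++;
            return loader.apply(k);
        });
    }

    // Exposed so the I/O saving is observable: one load per distinct key.
    public int loadCount() {
        return loads;
    }
}
```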
Handle Errors
There are two aspects to dealing with errors. The first is error notification: understanding when an error occurs and what caused it. The second is job management: controlling whether the chunk and the overall job stop or continue. Error notification can be as simple as logging diagnostics to a file, issuing an email or console alert, or marking the error in the database. Whether the chunk commits or rolls back is determined by the business problem. If the job executes serially, the current chunk must be rolled back and the overall job stopped, because work cannot continue. However, if the job executes in parallel, you can roll back the current chunk and, provided the error isn't fatal, leave the rest of the job to continue as normal.
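The job-management rule above reduces to a small decision table. The following sketch captures it; `ErrorPolicy` and its enums are illustrative names, not part of any framework:

```java
// Illustrative sketch of the job-control rule: a chunk error always
// rolls back the current chunk; whether the rest of the job continues
// depends on the execution mode and the severity of the error.
public class ErrorPolicy {

    public enum Mode { SERIAL, PARALLEL }
    public enum Decision { STOP_JOB, CONTINUE_JOB }

    // Serial jobs must stop, because later work depends on the failed
    // chunk. Parallel jobs may continue unless the error is fatal.
    public static Decision onChunkError(Mode mode, boolean fatal) {
        if (mode == Mode.SERIAL || fatal) {
            return Decision.STOP_JOB;
        }
        return Decision.CONTINUE_JOB;
    }
}
```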
To Batch or Not to Batch
Batch programs have their place regardless of the technology you use. Whether you write a batch program or a real-time, asynchronous component is an architectural decision that should be based on business drivers, not technology trends. J2EE is uniquely positioned to support both: its ability to manage transactions and to distribute and scale across multiple processors or servers makes it an ideal platform on which to build high-volume transaction-processing components.