Concurrency needn't be so complicated that you avoid it completely. One of the easiest ways to gain performance increases on multi-core platforms is with the parallel_for algorithm. To get a sense of how real-world developers are using Intel Threading Building Blocks, we spoke with Vincent Tan, a programmer with Pongrass Australia Pty. Ltd. in Bondi Junction, New South Wales, Australia.
As described on the Intel Software Network, Tan created a multithreaded version of par2cmdline 0.4, a utility commonly used to repair corrupted Usenet postings via Reed-Solomon coding. By using the Intel Threading Building Blocks 2.0 library (specifically TBB's mutex, concurrent_hash_map, atomic, and parallel_for constructs), the program can process files concurrently instead of serially. As a result, dual-core machines can nearly double performance when creating or repairing data files.
How did you learn about parallel_for?
I read the Intel TBB tutorial and reference manuals. From there, I looked at the sample code.
Was it easy to add the algorithm to your application?
After studying the sample code, it was straightforward to convert the code. The harder part was finding all of the shared resources (such as member variables) and then ensuring that access to them was thread-safe.
Did you make any mistakes?
I originally specified a grain size, but I found that it did not really help (because TBB's default behavior was good enough for the code to which I tried to apply the grain size).
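For readers new to the term, a grain size is the threshold at or below which a range is no longer subdivided into smaller pieces of work. The following is a rough sketch of that idea only, not TBB's implementation: split_range and chunks_for are hypothetical names, and the sketch assumes a range is halved at its midpoint until each piece holds no more than the grain size.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Chunk = std::pair<std::size_t, std::size_t>;  // half-open [first, second)

// Hypothetical helper, not part of TBB: recursively halves [begin, end)
// until a piece holds at most `grain` elements, then records that piece.
void split_range(std::size_t begin, std::size_t end, std::size_t grain,
                 std::vector<Chunk>& chunks) {
    assert(grain > 0 && begin <= end);
    if (end - begin <= grain) {
        // Small enough: this piece becomes one unit of work.
        if (begin != end) chunks.emplace_back(begin, end);
        return;
    }
    const std::size_t middle = begin + (end - begin) / 2;
    split_range(begin, middle, grain, chunks);
    split_range(middle, end, grain, chunks);
}

// Convenience wrapper: the chunks produced for the range [0, n).
std::vector<Chunk> chunks_for(std::size_t n, std::size_t grain) {
    std::vector<Chunk> chunks;
    split_range(0, n, grain, chunks);
    return chunks;
}
```

Under these assumptions, a range of 100 elements with a grain size of 25 yields four chunks of 25, while a grain size of 100 leaves the range as a single chunk. A larger grain size means fewer, bigger pieces and less scheduling overhead; a smaller one means more pieces available to balance across cores.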
What would be the most interesting use for this algorithm?
To be honest, I view it as a tool to solve a particular problem. The obvious for loops in the project's code pretty much dictated the use of parallel_for. I'll put it another way: If you can process elements of a random-accessible array in parallel (i.e., the elements have no interdependencies) then parallel_for is the tool you probably want.
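The pattern Tan describes, a loop body applied to independent elements of an array, can be illustrated without TBB. The sketch below is a deliberately simplified stand-in built on std::thread, not TBB's parallel_for: simple_parallel_for and square_all are hypothetical names, and the fixed one-chunk-per-worker split is an assumption made for brevity (TBB schedules sub-ranges dynamically).

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a parallel loop (not TBB): splits [begin, end)
// into at most `workers` contiguous sub-ranges and calls body(lo, hi) on
// each sub-range in its own thread.
template <typename Body>
void simple_parallel_for(std::size_t begin, std::size_t end,
                         unsigned workers, const Body& body) {
    assert(workers > 0 && begin <= end);
    const std::size_t total = end - begin;
    const std::size_t step = (total + workers - 1) / workers;  // ceiling division
    std::vector<std::thread> pool;
    for (std::size_t lo = begin; lo < end; lo += step) {
        const std::size_t hi = (end - lo < step) ? end : lo + step;
        pool.emplace_back([&body, lo, hi] { body(lo, hi); });
    }
    // Wait for every sub-range to finish before returning.
    for (std::thread& t : pool) t.join();
}

// Example body: each element is squared in place. No element depends on
// any other, so the sub-ranges can safely run at the same time.
std::vector<int> square_all(std::vector<int> data, unsigned workers) {
    int* p = data.data();
    simple_parallel_for(0, data.size(), workers,
                        [p](std::size_t lo, std::size_t hi) {
                            for (std::size_t i = lo; i != hi; ++i) p[i] *= p[i];
                        });
    return data;
}
```

The body receives a sub-range rather than a single index, which mirrors the shape of the loop-body functor in the listing at the end of this article. If the elements did depend on one another (a running total, for example), this split would no longer be safe.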
What performance or productivity benefits did you gain?
CPU utilization on a dual-core machine went from ~40-45 percent to ~80-85 percent. Because I/O is still performed serially (non-overlapped), the code never achieves 100 percent utilization, but a doubling of performance is good enough for most users.
How should a developer get started with parallel_for?
Read the Intel TBB tutorial on the Documentation page of threadingbuildingblocks.org and study the sample code. The reference manual helps out with the nitty-gritty details but you'll probably only need it if you need to specify the grain size.
TBB Code Listing
Here's a snippet of parallel_for at work in the par2cmdline source code.
The For Loop:
Helper functions:
// par2creator.cpp::973
// New function to hold the original loop body
void ProcessData(u32 outputblk, u32 endindex, size_t blklength, u32 inputblk) {
    for( ; outputblk != endindex; ++outputblk ) {
        // Select the appropriate part of the output buffer
        void *outbuf = &((u8*)outputbuf)[chunksize * outputblk];
        // Process the data through the RS matrix
        rs.Process(blklength, inputblk, inputbuf, outputblk, outbuf);
    }
}
// Encapsulates the loop body
class ApplyRSProcess {
public:
    ApplyRSProcess(Par2Creator* obj, size_t blklength, u32 inputblk) :
        _obj(obj), _blklength(blklength), _inputblk(inputblk) {}
    // Called by TBB for each sub-range of output blocks
    void operator()(const tbb::blocked_range<u32>& r) const {
        _obj->ProcessData(r.begin(), r.end(), _blklength, _inputblk);
    }
private:
    Par2Creator* _obj;
    size_t _blklength;
    u32 _inputblk;
};