RSS Feed
Download our iPhone app
Browse DevX
Sign up for e-mail newsletters from DevX


Demystify Scalable Parallelism with Intel Threading Building Block’s Generic Parallel Algorithms : Page 3

Multithreaded applications are notoriously difficult to write, test, and debug, but for many client-server developers, this task will soon become necessary for some high-performance applications. If it's in your future, start here to get an introduction to the algorithms behind the scenes in Intel's C++ template library for multithreading and start demystifing the multicore future.

A Case Study: Finding Matching Substrings
This section presents a basic example that uses the parallel_for template in a substring matching program. For each position in a string, the program displays the length and location of the largest matching substring elsewhere in the string. Consider the string "babba" as an example. Starting at position 0, "ba" is the largest substring with a match elsewhere in the string (position 3).

The code below shows the serial code, where to_scan is the string to examine and the class Result stores the results of the matching:

00: Result RunSerialFinder(string to_scan) {
01:   Result r(to_scan.size());

02:  for ( size_t i = 0; i < to_scan.size(); ++i ) {
03:    size_t max_size = 0, max_pos = 0;
04:    for (size_t j = 0; j < to_scan.size(); ++j) {
05:      if (j != i) {
06:        size_t limit = to_scan.size()-( i > j ? i : j );
07:        for (size_t k = 0; k < limit; ++k) {
08:          if (to_scan[i + k] != to_scan[j + k]) break;
09:          if (k > max_size) {
10:            max_size = k;
11:            max_pos = j;
12:          }
13:        }
14:      }
15:    }
16:    r.max[i] = max_size;
17:    r.pos[i] = max_pos;
18:  }
19:  return r;
20: }
Before applying a parallel algorithm from Intel TBB, I need to determine if any of the loops can be executed concurrently.

I start with the i-loop. The only variables referenced in the i-loop body that are declared outside of the loop body are i, to_scan, and r. Only these variables have the potential to create dependencies across the iterations. Both i and to_scan are only read in the loop body, so are safe. The max and pos arrays in r are modified, but each iteration accesses distinct elements, so again there are no conflicts between iterations.

Because there are no dependencies to preclude concurrent execution of the i-loop, I can therefore parallelize this loop using tbb::parallel_for. While Intel TBB does require developers to identify independent tasks, tools are available to find any mistakes one might make. For example, Intel Thread Checker can determine if a parallel execution of a program violates any of its serial dependencies, pointing developers to the exact line and variable responsible for the violation.

The modified code for this example using Intel TBB to parallelize the i-loop is shown below:

01: class TBBFinderBody {
02: const string str;
03: size_t *max_array;
04: size_t *pos_array;
05: public:
06: void operator() ( const tbb::blocked_range& r ) const {
07:  for ( size_t i = r.begin(); i != r.end(); ++i ) {
08:   size_t max_size = 0, max_pos = 0;
09:   for (size_t j = 0; j < str.size(); ++j)
10:    if (j != i) {
11:     size_t limit = str.size()-( i > j ? i : j );
12:     for (size_t k = 0; k < limit; ++k) {
13:      if (str[i + k] != str[j + k]) break;
14:      if (k > max_size) {
15:       max_size = k;
16:       max_pos = j;
17:      }
18:     }
19:    }
20:    max_array[i] = max_size;
21:    pos_array[i] = max_pos;
22:   }
23:  }
24:  TBBFinderBody(string &s, size_t *m, size_t *p) :
25:   str(s), max_array(m), pos_array(p) { }
26: };

27: Result RunTBBFinder(string to_scan) {
28:   Result r(to_scan.size());

29:   tbb::parallel_for(tbb::blocked_range(0, to_scan.size(),   
30:      100), TBBFinderBody( to_scan, r.max, r.pos ) );
31:   return r;
32: }
In the above example, the code shown in blue represents new code that must be added to enable use of the template function tbb::parallel_for. The remainder of the code is unchanged from the original sequential implementation.

At line 29, a tbb::parallel_for is passed a blocked_range object representing the original iteration space with a grainsize of 100 and a TBBFinderBody object, which implements the loop nest. Look at the method opertaor()at line 06; it is clear that much of the original code is used without change. Only the bounds of the i-loop are modified, to now reference the bounds provided by the range object.

To evaluate the effectiveness of this implementation, I compared the performance of the Intel TBB implementation to the serial implementation on a server with two dual-core Intel Xeon Processor 5100 processors. I also ran an implementation that uses POSIX threads directly for the parallelization (see Figure 4). The complete code for each version is provided in Listing 1.

Figure 4. Intel Threading Building Blocks provides good scalability without burdening developers with the details of thread creation, scheduling and management required by a POSIX thread implementation.

Both the user threaded POSIX threads version and the tbb::parallel_for version of the application show good scalability, and yield a speedup of around 4.5 times when four processors are used. The superlinear speedups seen by both versions can be attributed to the larger aggregate cache size available when using multiple processors.

With Intel TBB, the application has been parallelized without the need to deal with the low-level details of threading. Instead, scheduling, load balancing and cache locality are left for the library to manage. And even with this easier transition to parallelism, the resulting application performance is comparable to that of hand-threaded code.

Making the Transition to Multithreaded
Multithreaded applications are notoriously difficult to write, test, and debug. However, to take full advantage of the added performance potential of multicore desktop and laptop systems, developers now face the challenging task of threading their applications. Leveraging the generic parallel algorithms provided by the Intel Threading Building Blocks library, which are tested and tuned for current and future multicore processors, can dramatically ease the burden of this transition.

In addition, Intel Threading Building Blocks is compatible with Intel Thread Checker and Intel Thread Profiler. These tools provide correctness and performance analysis for programs that use Intel TBB, the OpenMP API and native threading libraries such as WinThreads or POSIX Threads. Using these tools, developers can quickly identify correctness or performance issues in their threaded code.

For more information on using or obtaining Intel Threading Building Blocks, visit the product web page at http://www.intel.com/software/products/tbb . For more information on the other Intel Threading Analysis Tools, Intel Thread Checker and Intel Thread Profiler, visit http://www.intel.com/software/products/threading.

Michael Voss is a Senior Staff Software Engineer at Intel Corp. Voss received his Ph.D. in Electrical Engineering from Purdue University in 2001. He is currently a Senior Staff Software Engineer in the Performance Analysis and Threading Lab at Intel and an adjunct Professor in the Edward S. Rogers Sr. Department of Electrical and Computer Engineering at the University of Toronto. His interests include parallel programming and optimizing compilers.
Email AuthorEmail Author
Close Icon
Thanks for your registration, follow us on our social networks to keep up-to-date