Demystify Scalable Parallelism with Intel Threading Building Blocks' Generic Parallel Algorithms: Page 2

Multithreaded applications are notoriously difficult to write, test, and debug, but for many client-server developers this task will soon become necessary for high-performance applications. If it's in your future, start here for an introduction to the algorithms behind the scenes in Intel's C++ template library for multithreading and start demystifying the multicore future.

Initializing and Terminating the Library
As mentioned previously, the engine underneath the parallel algorithms in Intel TBB is the task scheduler. The task scheduler creates enough threads to make efficient use of the available processors without oversubscribing the system, and dynamically assigns user tasks to these threads to balance load and exploit the natural cache locality available in most applications.

Before using any of the generic parallel algorithms, an application must first initialize the task scheduler by creating a tbb::task_scheduler_init object:

#include "tbb/task_scheduler_init.h"

int main() {
    tbb::task_scheduler_init init;
    // use of an algorithm
    return 0;
}
The task scheduler is initialized at the creation of the task_scheduler_init object and is shut down in the destructor. As demonstrated in the code above, all of the Intel TBB classes and methods are within the namespace tbb.

The generic parallel algorithms provided by the library are safe to mix with user threads, so they can be deployed in an already threaded application. However, each thread that invokes an Intel TBB algorithm must first create a task_scheduler_init object so that it is registered with the single shared task scheduler. The task_scheduler_init objects behave like reference-counted handles to the scheduler. The first handle constructed creates the scheduler and the last handle destroyed destroys the scheduler.

Parallelizing Simple Loops: parallel_for and parallel_reduce
In the data decomposition example earlier, I showed you a simple loop that calls a single function. To flesh out this example slightly, suppose you want to apply a function Foo to each element of an array, and it is safe to process each element concurrently. Here is the sequential code to do this:

void SerialApplyFoo( float a[], size_t n ) {
    for( size_t i=0; i<n; ++i )
        Foo(a[i]);
}
I can use the template function tbb::parallel_for<Range,Body> to break the iteration space of this loop into independent tasks, each task being a subset of the iterations:

00: #include "tbb/blocked_range.h"

01: class ApplyFoo {
02:    float *const my_a;
03: public:
04:     void operator()( const blocked_range<size_t>& r ) const {
05:         float *a = my_a;
06:         for( size_t i=r.begin(); i!=r.end(); ++i ) {
07:             Foo(a[i]);
08:         }
09:     }
10:     ApplyFoo( float a[] ) :
11:         my_a(a) {}
12: };

13: void ParallelApplyFoo( float a[], size_t n ) {
14:     parallel_for(blocked_range<size_t>(0,n,IdealGrainSize), 
15:                  ApplyFoo(a) );
16: }
The first step in parallelizing this loop is to convert the loop body into a form that operates on a chunk of iterations. The class ApplyFoo, defined at line 01, is a C++ Standard Template Library (STL)-style function object. It defines a method operator() that contains a modified version of the original loop. The method operator() receives a blocked_range<size_t> object r that describes the subset of the original iteration space on which the function should operate.

The call to the template function tbb::parallel_for is found at line 14. It receives two arguments: a Range and a Body.

In this example, the Intel TBB-provided class blocked_range<T> is used to define the range. It describes a one-dimensional iteration space over type T. The template function parallel_for works with any iteration space that matches the range concept expected by the library. For example, the library also provides blocked_range2d for two-dimensional spaces, and you can define your own spaces as well.

At line 14, a blocked_range<size_t> is defined that represents the STL-style half-open interval [0, n). The IdealGrainSize argument specifies the number of iterations to be lumped together when handing out chunks of iterations as tasks. So for example, a value of 100 would mean that each task generated by the Intel TBB library would consist of about 100 iterations.

While template function parallel_for is sufficient for loops that have no dependencies between iterations, it is very common for loops to contain reductions. Reductions are expressions of the form x = x + y, where + is associative.

Reductions can be parallelized by exploiting their mathematical properties; it is legal to generate local reduction results on each thread for its part of the iteration space, and then combine the local results into a final global result.

The loop below performs a simple reduction, a summation:

float SerialSumFoo( const float a[], size_t n ) {
    float sum = 0;
    for( size_t i=0; i<n; ++i )
        sum += Foo(a[i]);
    return sum;
}
If the iterations in the above loop are otherwise independent, you can parallelize it using the template function tbb::parallel_reduce<Range,Body>. The Body's method operator() must now execute the loop on the provided range and also compute the local sum for that range. The Body must also define a method join() that combines two local results.

Using these methods, Intel TBB will create and execute tasks to generate the local sums and combine their partial results into a final global result.

So far, all of the loops I have examined have had iteration spaces that were known before the loop commenced. But not all loops are like that. The end of the iteration space may not be known in advance, or the loop body may add more iterations before the loop exits. You can deal with both situations using the template class tbb::parallel_while.

An example is a linked list traversal:

void SerialApplyFooToList( Item* root ) {
    for( Item* ptr=root; ptr!=NULL; ptr=ptr->next )
        Foo(ptr);
}
Unlike the templates described earlier, tbb::parallel_while is a class, not a function, and requires two user-defined objects. The first object defines the stream of items and the second object defines the body of the loop.

So my linked list example becomes:

void ParallelApplyFooToList( Item* root ) {
    parallel_while<ApplyFoo> w;
    ItemStream stream(root);
    ApplyFoo body;
    w.run( stream, body );
}
The ItemStream object must define a method pop_if_present such that when pop_if_present(v) is invoked, it sets v to the next iteration value if there is one and returns true. If there are no more iterations, it returns false.

class ItemStream {
    Item* my_ptr;
public:
    bool pop_if_present( Item*& item ) {
        if( my_ptr ) {
            item = my_ptr;
            my_ptr = my_ptr->next;
            return true;
        } else {
            return false;
        }
    }
    ItemStream( Item* root ) : my_ptr(root) {}
};
The loop body object is similar to a C++ function object from the C++ standard header <functional>. It must define a const operator() and a member type argument_type.

class ApplyFoo {
public:
    void operator()( Item* item ) const {
        Foo(item);
    }
    typedef Item* argument_type;
};
The template class parallel_while pops elements from the stream and assigns them as tasks for threads to operate on using the function object. While the sequential fetching of the work can limit concurrency, useful speedup can often be obtained when the tasks are of sufficient size or if the body of the loop adds more work to the stream.

The last algorithm that this article will introduce is tbb::pipeline. Pipelining is a common parallel pattern that mimics a traditional manufacturing assembly line. Data flows through a series of pipeline stages, and each stage processes the data in some way. Given an incoming stream of data, some of these stages can operate in parallel and others cannot. For example, in video processing, some operations on frames do not depend on other frames, and so can be done on multiple frames at the same time. On the other hand, some operations on frames require processing prior frames first.

The Intel TBB classes pipeline and filter implement the pipeline pattern. A pipeline is represented by Intel TBB as a series of filters. Filters can be configured to execute concurrently on distinct data packets or to only process a single packet at a time.

A simple text processing problem can be used to demonstrate the usage of pipeline and filter. The problem is to read a text file, capitalize the first letter of each word, and write the modified text to a new file. Figure 3 is a picture of the pipeline.

Figure 3. A simple text processing application that capitalizes the first letter of each word can be expressed as a pipeline.

I shall assume that the file I/O is sequential. However, the capitalization stage can be done in parallel. That is, if I can serially read n chunks very quickly, I am free to capitalize each of the n chunks in parallel, as long as they are written in the proper order to the output file. To decide whether to capitalize a letter, I inspect whether the previous character is a blank. For the first letter in each chunk, I need to inspect the last letter of the previous chunk. But doing so would introduce a complicating dependence in the middle stage. The solution is to have each chunk also store the last character of the previous chunk. The chunks overlap by one character.

Using this approach, a pipeline can be created with a serial input filter, a parallel capitalize filter, and a serial output filter. As with all pipelines, there are dependencies between the stages, but if the input and output stages do not dominate the overall computation time, useful concurrency can be achieved in the middle stage.
