
Demystify Scalable Parallelism with Intel Threading Building Blocks’ Generic Parallel Algorithms

To exploit the performance potential of multicore processors, applications must be threaded for performance. While performance-oriented threading is new for many developers of desktop and laptop systems, there is a long history of threading in the scientific and high-performance computing domains that can be leveraged on the desktop, at least theoretically. Unfortunately, the languages and libraries developed in those domains do not match the unique demands of mainstream client and server applications, and therefore are difficult to apply directly.

This article introduces the generic parallel algorithms provided by Intel Threading Building Blocks (Intel TBB). This C++ template library includes several common algorithms such as parallel_for, parallel_reduce, and pipeline, which are known through experience to be useful in a wide range of application domains. In Intel TBB, these algorithms have been specifically tuned to provide scalable performance for applications targeting current and future generation multicore and multithreaded systems.

Functional versus Data Decomposition
There are two main methods for decomposing a sequential program into a parallel program: (1) functional decomposition and (2) data decomposition. With functional decomposition, independent tasks that are doing different types of work are identified. These functionally distinct tasks are then executed concurrently. With data decomposition, a single task performed on a large amount of data is split into independent tasks, each task processing a subset of the data.
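For illustration, the two approaches might look like the following sketches (func1, func2, and func are placeholder functions of my own, standing in for the code shown in Figures 1 and 2):

// Functional decomposition: two distinct activities, func1 and func2,
// have no dependencies between them and can execute concurrently.
void DoIndependentWork() {
    func1();    // task 1
    func2();    // task 2, independent of task 1
}

// Data decomposition: a single operation applied to many elements;
// each iteration can become an independent task.
void ProcessArray( float a[], size_t n ) {
    for( size_t i = 0; i < n; ++i )
        func( a[i] );   // iterations do not depend on one another
}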

Of course, in both cases the developer must ensure that any tasks executed concurrently do not have data dependencies between them. A data dependency occurs when two instructions access the same memory location. Some dependencies, known as false dependencies, arise as an artifact of reusing variables for different purposes, and can often be broken if the code is rewritten. On the other hand, true dependencies are a normal part of the communication between instructions in a program and constrain their legal ordering. When creating a parallel program from a sequential program, it is important that any true data dependencies in the sequential ordering of instructions are preserved in the parallel program.
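For example (a sketch of my own; a, b, c, and d are assumed to be previously defined floats):

// False dependency: t is merely reused; renaming its second use (say, t2)
// makes the two statement pairs independent of one another.
float t = a + b;
float x = 2 * t;
t = c + d;          // reuses t for an unrelated purpose
float y = 2 * t;

// True dependency: u consumes the result stored in s, so the sequential
// ordering of these two statements must be preserved.
float s = a + b;
float u = 2 * s;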

Returning to the discussion of program decomposition, Figures 1 and 2 show examples of functional and data decomposition. While both examples provide adequate concurrency for a dual-core system, the functional decomposition example identifies only two pieces of independent work, the methods func1 and func2. If this code is later ported to a quad-core system, no additional improvement can be expected for this region; to further improve performance, the developer would need to look for additional concurrency elsewhere.


Figure 1. Functional decomposition provides a small number of independent tasks and therefore only limited scalability.
Figure 2. Data decomposition provides a large pool of independent tasks and therefore scales as more cores become available.

On the other hand, the data decomposition approach is highly scalable. A loop with independent iterations provides a large pool of tasks. If the data decomposition example code were ported to a quad-core system, the loop iterations could be distributed across four processors instead of two. It is generally the case that data decomposition allows for better scaling as processors are added.

Because Intel TBB is designed to support scalable parallelism for current and future multicore architectures, its generic parallel algorithms emphasize data parallel approaches.

Mapping Tasks to Threads and Processors
One of the main difficulties in developing multithreaded applications is managing the many low-level details associated with creating and managing threads and mapping them efficiently to the target platform. With Intel TBB, developers specify tasks instead of threads, and the library itself manages and distributes these tasks across threads and processors.

Intel TBB uses an internal task scheduler to assign tasks to threads. The task scheduler is designed to:

  • Automatically balance the load across processors
  • Schedule tasks to exploit the natural cache locality of applications
  • Avoid the over-subscription of resources that often comes when composing applications from threaded components

The details of the task scheduler are beyond the scope of this article, but interested readers can learn more by visiting the product website at http://www.intel.com/software/products/tbb.

In the remainder of this article, I introduce the Intel TBB library, explain how to initialize the task scheduler, and then provide brief overviews of several generic parallel algorithms. I then walk through a case study that demonstrates how the parallel_for algorithm can be deployed to add concurrency to an existing application.

Editor’s Note: Michael Voss is a Senior Staff Software Engineer at Intel Corporation, which is the owner-developer of the TBB technology discussed herein. This article has been selected for publication because we believe it to have objective technical merit. No specific endorsement of Intel technologies by the editors of DevX is implied.

Initializing and Terminating the Library
As mentioned previously, the engine underneath the parallel algorithms in Intel TBB is the task scheduler. The task scheduler creates enough threads to make efficient use of the available processors without oversubscribing the system, and dynamically assigns user tasks to these threads to balance load and exploit the natural cache locality available in most applications.

Before using any of the generic parallel algorithms, an application must first initialize the task scheduler by creating a tbb::task_scheduler_init object:

#include "tbb/task_scheduler_init.h"

int main() {
    tbb::task_scheduler_init init;
    // use of an algorithm
}

The task scheduler is initialized at the creation of the task_scheduler_init object and is shut down in the destructor. As demonstrated in the code above, all of the Intel TBB classes and methods are within the namespace tbb.

The generic parallel algorithms provided by the library are safe to mix with user threads, so they can be deployed in an already threaded application. However, each thread that invokes an Intel TBB algorithm must first create a task_scheduler_init object so that it is registered with the single shared task scheduler. The task_scheduler_init objects behave like reference-counted handles to the scheduler. The first handle constructed creates the scheduler and the last handle destroyed destroys the scheduler.
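For example, a user-created thread might register itself like this (a minimal sketch; thread_entry is a hypothetical thread function of my own, not part of the library):

#include "tbb/task_scheduler_init.h"

void* thread_entry( void* ) {
    // Construct a handle before invoking any Intel TBB algorithm from this
    // thread; the destructor releases the handle when the thread exits.
    tbb::task_scheduler_init init;

    // ... calls to parallel_for, parallel_reduce, and so on go here ...

    return NULL;
}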

Parallelizing Simple Loops: parallel_for and parallel_reduce
In the data decomposition example earlier, I showed you a simple loop that calls a single function. To flesh out this example slightly, suppose you want to apply a function Foo to each element of an array, and it is safe to process each element concurrently. Here is the sequential code to do this:

void SerialApplyFoo( float a[], size_t n ) {
    for( size_t i=0; i<n; ++i )
        Foo(a[i]);
}

I can use the template function tbb::parallel_for to break the iteration space of this loop into independent tasks, each task being a subset of the iterations:

00: #include "tbb/parallel_for.h"
00: #include "tbb/blocked_range.h"
01: class ApplyFoo {
02:     float *const my_a;
03: public:
04:     void operator()( const blocked_range<size_t>& r ) const {
05:         float *a = my_a;
06:         for( size_t i=r.begin(); i!=r.end(); ++i ) {
07:             Foo(a[i]);
08:         }
09:     }
10:     ApplyFoo( float a[] ) :
11:         my_a(a) {}
12: };
13: void ParallelApplyFoo( float a[], size_t n ) {
14:     parallel_for(blocked_range<size_t>(0,n,IdealGrainSize),
15:                  ApplyFoo(a) );
16: }

The first step in parallelizing this loop is to convert the loop body into a form that operates on a chunk of iterations. The class ApplyFoo, defined at line 01, is a C++ Standard Template Library (STL) style function object. It defines a method operator() that contains a modified version of the original loop. The method operator() receives a blocked_range object r that describes the subset of the original iteration space on which the function should operate.

The call to the template function tbb::parallel_for is found at line 14. It receives two arguments, a Range and a Body.

In this example, the Intel TBB-provided class blocked_range<T> is used to define the range. It describes a one-dimensional iteration space over type T. The template function parallel_for works with any iteration space that matches the range concept expected by the library. For example, the library also provides blocked_range2d for two-dimensional spaces, and you can define your own spaces as well.

At line 14, a blocked_range<size_t> is defined that represents the STL-style half-open interval [0, n). The IdealGrainSize argument specifies the number of iterations to be lumped together when handing out chunks of iterations as tasks. For example, a value of 100 would mean that each task generated by the Intel TBB library would consist of approximately 100 iterations.

While template function parallel_for is sufficient for loops that have no dependencies between iterations, it is very common for loops to contain reductions. Reductions are expressions of the form x = x + y, where + is associative.

Reductions can be parallelized by exploiting their mathematical properties; it is legal to generate local reduction results on each thread for its part of the iteration space, and then combine the local results into a final global result.

The loop below performs a simple reduction, a summation:

float SerialSumFoo( const float a[], size_t n ) {
    float sum = 0;
    for( size_t i=0; i<n; ++i )
        sum += Foo(a[i]);
    return sum;
}

If the iterations in the above loop are otherwise independent, you can parallelize this loop using the template function tbb::parallel_reduce. The Body's method operator() must now execute the loop on the provided range and also compute the local sum for that range. It must also define a method join() that can combine two local results.

Using these methods, Intel TBB will create and execute tasks to generate the local sums and combine their partial results into a final global result.
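For example, a Body for the summation loop shown above might look like the following sketch (it assumes Foo and IdealGrainSize are defined as in the earlier examples):

#include "tbb/parallel_reduce.h"
#include "tbb/blocked_range.h"

class SumFoo {
    const float* my_a;
public:
    float sum;
    void operator()( const tbb::blocked_range<size_t>& r ) {
        const float* a = my_a;
        for( size_t i=r.begin(); i!=r.end(); ++i )
            sum += Foo(a[i]);
    }
    // Splitting constructor: creates a body for a subrange with a zeroed local sum.
    SumFoo( SumFoo& x, tbb::split ) : my_a(x.my_a), sum(0) {}
    // Combines the local result of another body into this one.
    void join( const SumFoo& y ) { sum += y.sum; }
    SumFoo( const float a[] ) : my_a(a), sum(0) {}
};

float ParallelSumFoo( const float a[], size_t n ) {
    SumFoo sf(a);
    tbb::parallel_reduce( tbb::blocked_range<size_t>(0,n,IdealGrainSize), sf );
    return sf.sum;
}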

parallel_while
So far, all of the loops I have examined have had iteration spaces that are known before the loop commences. But not all loops are like that. The end of the iteration space may not be known in advance, or the loop body may add more iterations before the loop exits. You can deal with both situations using the template class tbb::parallel_while.

An example is a linked list traversal:

void SerialApplyFooToList( Item* root ) {
    for( Item* ptr=root; ptr!=NULL; ptr=ptr->next )
        Foo(ptr->data);
}

Unlike the templates described earlier, tbb::parallel_while is a class, not a function, and requires two user-defined objects. The first object defines the stream of items and the second object defines the body of the loop.

So my linked list example becomes:

void ParallelApplyFooToList( Item* root ) {
    parallel_while<ApplyFoo> w;
    ItemStream stream(root);
    ApplyFoo body;
    w.run( stream, body );
}

The ItemStream object must define a method pop_if_present such that when pop_if_present(v) is invoked, it sets v to the next iteration value if there is one and returns true. If there are no more iterations, it returns false.

class ItemStream {
    Item* my_ptr;
public:
    bool pop_if_present( Item*& item ) {
        if( my_ptr ) {
            item = my_ptr;
            my_ptr = my_ptr->next;
            return true;
        } else {
            return false;
        }
    }
    ItemStream( Item* root ) : my_ptr(root) {}
};

The loop body object is similar to a C++ function object from the C++ standard header <functional>. It must define a const operator() and a member type argument_type.

class ApplyFoo {
public:
    void operator()( Item* item ) const {
        Foo(item->data);
    }
    typedef Item* argument_type;
};

The template class parallel_while pops elements from the stream and assigns them as tasks for threads to operate on using the function object. While the sequential fetching of the work can limit concurrency, useful speedup can often be obtained when the tasks are of sufficient size or if the body of the loop adds more work to the stream.

pipeline
The last algorithm that this article will introduce is tbb::pipeline. Pipelining is a common parallel pattern that mimics a traditional manufacturing assembly line. Data flows through a series of pipeline stages, and each stage processes the data in some way. Given an incoming stream of data, some of these stages can operate in parallel and others cannot. For example, in video processing, some operations on frames do not depend on other frames and so can be done on multiple frames at the same time. On the other hand, some operations on frames require processing prior frames first.

The Intel TBB classes pipeline and filter implement the pipeline pattern. A pipeline is represented by Intel TBB as a series of filters. Filters can be configured to execute concurrently on distinct data packets or to only process a single packet at a time.

A simple text processing problem can be used to demonstrate the usage of pipeline and filter. The problem is to read a text file, capitalize the first letter of each word, and write the modified text to a new file. Figure 3 is a picture of the pipeline.

Figure 3. A simple text processing application that capitalizes the first letter of each word can be expressed as a pipeline.

I shall assume that the file I/O is sequential. However, the capitalization stage can be done in parallel. That is, if I can serially read n chunks very quickly, I am free to capitalize each of the n chunks in parallel, as long as they are written in the proper order to the output file. To decide whether to capitalize a letter, I inspect whether the previous character is a blank. For the first letter in each chunk, I need to inspect the last letter of the previous chunk. But doing so would introduce a complicating dependence in the middle stage. The solution is to have each chunk also store the last character of the previous chunk. The chunks overlap by one character.

Using this approach, a pipeline could be created, with a serial input filter, a parallel capitalize filter and a serial output filter. As with all pipelines there are dependencies between the stages, but if the input and output stages do not dominate the overall computation time, useful concurrency can be achieved in the middle stage.
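The following sketch shows how such a pipeline might be assembled. It assumes the circa-2007 Intel TBB filter interface, in which a constructor flag selects serial or parallel execution; the TextChunk type, chunk size, buffer pool, and class names are illustrative choices of mine rather than code from the article:

#include "tbb/pipeline.h"
#include <cctype>
#include <cstdio>

static const size_t ChunkSize = 4096;

struct TextChunk {
    char last_char_of_previous_chunk;   // the one-character overlap
    size_t size;
    char text[ChunkSize];
};

class InputFilter : public tbb::filter {
    FILE* input;
    char last_char;                       // carried across chunks
    static const size_t n_chunks = 8;     // small pool of reusable buffers
    TextChunk pool[n_chunks];
    size_t next;
public:
    InputFilter( FILE* f ) :
        tbb::filter(/*is_serial=*/true), input(f), last_char(' '), next(0) {}
    void* operator()( void* ) {
        TextChunk& c = pool[next];
        next = (next + 1) % n_chunks;
        c.size = fread(c.text, 1, ChunkSize, input);
        if( c.size == 0 ) return NULL;    // end of input stops the pipeline
        c.last_char_of_previous_chunk = last_char;
        last_char = c.text[c.size - 1];
        return &c;
    }
};

class CapitalizeFilter : public tbb::filter {
public:
    CapitalizeFilter() : tbb::filter(/*is_serial=*/false) {}  // parallel stage
    void* operator()( void* item ) {
        TextChunk& c = *static_cast<TextChunk*>(item);
        char prev = c.last_char_of_previous_chunk;
        for( size_t i = 0; i < c.size; ++i ) {
            // Capitalize a letter whenever the previous character is a blank.
            if( isspace((unsigned char)prev) )
                c.text[i] = (char)toupper((unsigned char)c.text[i]);
            prev = c.text[i];
        }
        return item;
    }
};

class OutputFilter : public tbb::filter {
    FILE* output;
public:
    OutputFilter( FILE* f ) : tbb::filter(/*is_serial=*/true), output(f) {}
    void* operator()( void* item ) {
        TextChunk& c = *static_cast<TextChunk*>(item);
        fwrite(c.text, 1, c.size, output);
        return NULL;
    }
};

void RunCapitalizePipeline( FILE* in, FILE* out ) {
    tbb::pipeline p;
    InputFilter f1(in);
    CapitalizeFilter f2;
    OutputFilter f3(out);
    p.add_filter(f1);
    p.add_filter(f2);
    p.add_filter(f3);
    p.run( 4 );   // at most four chunks in flight at once
    p.clear();    // remove the filters before they are destroyed
}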

A Case Study: Finding Matching Substrings
This section presents a basic example that uses the parallel_for template in a substring matching program. For each position in a string, the program displays the length and location of the largest matching substring elsewhere in the string. Consider the string "babba" as an example. Starting at position 0, "ba" is the largest substring with a match elsewhere in the string (position 3).

The code below shows the serial code, where to_scan is the string to examine and the class Result stores the results of the matching:

Result RunSerialFinder(string to_scan) {
    Result r(to_scan.size());
    for ( size_t i = 0; i < to_scan.size(); ++i ) {
        size_t max_size = 0, max_pos = 0;
        for (size_t j = 0; j < to_scan.size(); ++j) {
            if (j != i) {
                size_t limit = to_scan.size()-( i > j ? i : j );
                for (size_t k = 0; k < limit; ++k) {
                    if (to_scan[i + k] != to_scan[j + k]) break;
                    if (k > max_size) {
                        max_size = k;
                        max_pos = j;
                    }
                }
            }
        }
        r.max[i] = max_size;
        r.pos[i] = max_pos;
    }
    return r;
}

Before applying a parallel algorithm from Intel TBB, I need to determine if any of the loops can be executed concurrently.

I start with the i-loop. The only variables referenced in the i-loop body that are declared outside of the loop body are i, to_scan, and r. Only these variables have the potential to create dependencies across the iterations. Both i and to_scan are only read in the loop body, so are safe. The max and pos arrays in r are modified, but each iteration accesses distinct elements, so again there are no conflicts between iterations.

Because there are no dependencies to preclude concurrent execution of the i-loop, I can therefore parallelize this loop using tbb::parallel_for. While Intel TBB does require developers to identify independent tasks, tools are available to find any mistakes one might make. For example, Intel Thread Checker can determine if a parallel execution of a program violates any of its serial dependencies, pointing developers to the exact line and variable responsible for the violation.

The modified code for this example using Intel TBB to parallelize the i-loop is shown below:

00: #include "tbb/parallel_for.h"
00: #include "tbb/blocked_range.h"
01: class TBBFinderBody {
02:     const string str;
03:     size_t *max_array;
04:     size_t *pos_array;
05: public:
06:     void operator() ( const tbb::blocked_range<size_t>& r ) const {
07:         for ( size_t i = r.begin(); i != r.end(); ++i ) {
08:             size_t max_size = 0, max_pos = 0;
09:             for (size_t j = 0; j < str.size(); ++j)
10:                 if (j != i) {
11:                     size_t limit = str.size()-( i > j ? i : j );
12:                     for (size_t k = 0; k < limit; ++k) {
13:                         if (str[i + k] != str[j + k]) break;
14:                         if (k > max_size) {
15:                             max_size = k;
16:                             max_pos = j;
17:                         }
18:                     }
19:                 }
20:             max_array[i] = max_size;
21:             pos_array[i] = max_pos;
22:         }
23:     }
24:     TBBFinderBody(string &s, size_t *m, size_t *p) :
25:         str(s), max_array(m), pos_array(p) { }
26: };
27: Result RunTBBFinder(string to_scan) {
28:     Result r(to_scan.size());
29:     tbb::parallel_for(tbb::blocked_range<size_t>(0, to_scan.size(),
30:         100), TBBFinderBody( to_scan, r.max, r.pos ) );
31:     return r;
32: }

In the above example, the new code added to enable use of the template function tbb::parallel_for consists of the surrounding class definition, its constructor, and the call to tbb::parallel_for; the loop nest inside operator() is unchanged from the original sequential implementation.

At line 29, tbb::parallel_for is passed a blocked_range<size_t> object representing the original iteration space with a grainsize of 100, and a TBBFinderBody object, which implements the loop nest. Looking at the method operator() at line 06, it is clear that much of the original code is used without change. Only the bounds of the i-loop are modified, to now reference the bounds provided by the range object.

To evaluate the effectiveness of this implementation, I compared the performance of the Intel TBB implementation to the serial implementation on a server with two dual-core Intel Xeon 5100 series processors. I also ran an implementation that uses POSIX threads directly for the parallelization (see Figure 4). The complete code for each version is provided in Listing 1.

Figure 4. Intel Threading Building Blocks provides good scalability without burdening developers with the details of thread creation, scheduling and management required by a POSIX thread implementation.

Both the hand-coded POSIX threads version and the tbb::parallel_for version of the application show good scalability, yielding a speedup of around 4.5 times when four processors are used. The superlinear speedups seen by both versions can be attributed to the larger aggregate cache size available when using multiple processors.

With Intel TBB, the application has been parallelized without the need to deal with the low-level details of threading. Instead, scheduling, load balancing and cache locality are left for the library to manage. And even with this easier transition to parallelism, the resulting application performance is comparable to that of hand-threaded code.

Making the Transition to Multithreading
Multithreaded applications are notoriously difficult to write, test, and debug. However, to take full advantage of the added performance potential of multicore desktop and laptop systems, developers now face the challenging task of threading their applications. Leveraging the generic parallel algorithms provided by the Intel Threading Building Blocks library, which are tested and tuned for current and future multicore processors, can dramatically ease the burden of this transition.

In addition, Intel Threading Building Blocks is compatible with Intel Thread Checker and Intel Thread Profiler. These tools provide correctness and performance analysis for programs that use Intel TBB, the OpenMP API and native threading libraries such as WinThreads or POSIX Threads. Using these tools, developers can quickly identify correctness or performance issues in their threaded code.

For more information on using or obtaining Intel Threading Building Blocks, visit the product web page at http://www.intel.com/software/products/tbb. For more information on the other Intel Threading Analysis Tools, Intel Thread Checker and Intel Thread Profiler, visit http://www.intel.com/software/products/threading.
