o exploit the performance potential of multicore processors, applications must be threaded for performance. While performance-oriented threading is new for many developers of desktop and laptop systems, there is a long history of threading in the scientific and high-performance computing domains that can be leveraged on the desktopat least theoretically. Unfortunately, the languages and libraries developed in those domains do not match the unique demands of mainstream client and server applications, and therefore are difficult to apply directly.
This article introduces the generic parallel algorithms provided by Intel Threading Building Blocks (Intel TBB). This C++ template library includes several common algorithms such as parallel_for, parallel_reduce, and pipeline, which are known through experience to be useful in a wide range of application domains. In Intel TBB, these algorithms have been specifically tuned to provide scalable performance for applications targeting current and future generation multicore and multithreaded systems.
Functional versus Data Decomposition
There are two main methods for decomposing a sequential program into a parallel program: (1) functional decomposition and (2) data decomposition. With functional decomposition, independent tasks that are doing different types of work are identified. These functionally distinct tasks are then executed concurrently. With data decomposition, a single task performed on a large amount of data is split into independent tasks, each task processing a subset of the data.
Of course, in both cases the developer must ensure that any tasks executed concurrently do not have data dependencies between them. A data dependency occurs when two instructions access the same memory location. Some dependencies, known as false dependencies, arise as an artifact of reusing variables for different purposes, and can often be broken if the code is rewritten. On the other hand, true dependencies are a normal part of the communication between instructions in a program and constrain their legal ordering. When creating a parallel program from a sequential program, it is important that any true data dependencies in the sequential ordering of instructions are preserved in the parallel program.
Returning to the discussion of program decomposition, Figures 1 and 2 show examples of functional and data decomposition. While both examples provide adequate concurrency for a dual-core system, the example of functional decomposition only identifies two pieces of independent work, methods func1 and func2. If this code is later ported to a quad-core system, no additional improvement can be expected for this region, and so to further improve performance the developer would need to look for additional concurrency elsewhere.
Figure 1. Functional decomposition provides a small number of independent tasks and therefore only limited scalability.
Figure 2. Data decomposition provides a large pool of independent tasks and therefore scales as more cores become available.
On the other hand, the data decomposition approach is highly scalable. A loop with independent iterations provides a large pool of tasks. If the data decomposition example code were ported to a quad-core system, the loop iterations could be distributed across four processors instead of two. It is generally the case that data decomposition allows for better scaling as processors are added.
Because Intel TBB is designed to support scalable parallelism for current and future multi-core architectures, its generic parallel algorithms emphasize data parallel approaches.
Mapping Tasks to Threads and Processors
One of the main difficulties in developing multithreaded applications is managing the many low-level details associated with creating and managing threads and mapping them efficiently to the target platform. With Intel TBB, developers specify tasks instead of threads, and the library itself manages and distributes these tasks across threads and processors.
TBB's library uses an internal task scheduler to assign tasks to threads. The task scheduler in Intel TBB is designed to:
- Automatically balance the load across processors
- Schedule tasks to exploit the natural cache locality of applications
- Avoid the over-subscription of resources that often comes when composing applications from threaded components.
The details of the task scheduler are beyond the scope of this article, but interested readers can learn more by visiting the product website at http://www.intel.com/software/products/tbb.
In the remainder of this article, I introduce the Intel TBB library, explain how to initialize the task scheduler, and then provide brief overviews of several generic parallel algorithms. I then walk through a case study that demonstrates how the parallel_for algorithm can be deployed to add concurrency to an existing application.
| Editor's Note: Michael Voss is a Senior Staff Software Engineer at Intel Corporation, which is the owner-developer of the TBB technology discussed herein. This article has been selected for publication because we believe it to have objective technical merit. No specific endorsement of Intel technologies by the editors of DevX is implied.