Enable Safe, Scalable Parallelism with Intel Threading Building Block’s Concurrent Containers

ultithreaded applications are notoriously difficult to write, test, and debug. However, to take full advantage of the added performance potential of multicore desktop and laptop systems, developers now face the challenging task of threading their applications. While there is no panacea for the difficulties that arise in multithreaded application development, leveraging existing libraries and tools can dramatically ease the burden of this transition.

In this article, I’ll examine one of the leading sources of correctness and performance issues encountered when threading C++ applications: the use of thread-unsafe container classes. After providing some examples of why this issue arises, I’ll describe the concurrent container classes provided by the Intel Threading Building Blocks (Intel TBB) library, a C++ template library specifically designed to aid in developing multithreaded applications. The concurrent container classes in TBB can be leveraged to safely add scalable parallelism to applications.

Are Your Containers Thread-Safe?
Many developers rely on hand-written container classes or those provided by implementations of the C++ Standard Template Library (STL). Unfortunately, these libraries are often not thread-safe. In particular, the STL specification makes no mention of threads or the behavior required of container classes when used in multithreaded code. It is therefore commonly the case that the implementations of these STL container classes are not thread safe.

For example, consider the use of an STL map values:

http://assets.devx.com/articlefigs/17803.gif
Editor’s Note: Michael Voss is a Senior Staff Software Engineer at Intel Corporation, which is the owner-developer of the TBB technology discussed herein. This article has been selected for publication because we believe it to have objective technical merit. No specific endorsement of Intel technologies by the editors of DevX is implied.

Even though two distinct values associated with two distinct keys are being modified in the above code, most STL implementations provide no guarantee of correct behavior. Performing these operations concurrently without synchronization may corrupt the map. With no requirements specified for thread-safety, it?s even possible that accessing two distinct maps may lead to data corruption.

Of course, it’s possible to implement the STL template class map in such a way to make the above code thread safe. Unfortunately, some common map operation sequences cannot be implemented in a thread friendly way. While each operation alone may be made thread safe, sequences commonly used in serial code can lead to unexpected results. For example, what if two threads operate on the same element in the map using the code below:

http://assets.devx.com/articlefigs/17804.gif

The code executed by Thread 0 performs two operations. First it invokes operator [] to retrieve a reference to the object associated with “Key1”. If this key is not in the map, operator [] allocates space to hold an object of type MyClass to associate with this key. Next operator = is invoked to copy the temporary instance of MyClass to the object pointed to by the retrieved reference.

The desired outcome is that either “Key1” does not appear in the map, or it is paired with an instance of MyClass(). But without user-inserted synchronization, other outcomes are possible, even when each operator itself is thread-safe. The method erase invoked by Thread 1 might occur between the call to operator [] and the call to operator = by Thread 0. In that case, Thread 0 will attempt to invoke operator = on a deleted object, resulting in incorrect behavior. This common type of multithreading bug is known as a race condition; the behavior (unintentionally) depends on which thread performs its operation first.

One particularly difficult aspect of a race, as shown by this example, is that the behavior is non-deterministic. During every run of this code during testing, it’s possible that the call to erase from Thread 1 never falls between the fetch and update on Thread 0. Such a bug can therefore evade testing, and lie dormant in your validated and shipped code, potentially failing at any time on a customer’s system.

To avoid these bugs and to ensure correctness when using thread-unfriendly container classes, developers are relegated to wrapping locks around all uses of each container, allowing only a single thread to access the container at a time. This coarse-grain approach to synchronization limits the concurrency available in the application and adds code to each access point, increasing complexity. However, it’s a price that must be paid to make use of these existing libraries.

An Alternative: Intel TBB Container Classes
There is an alternative to using thread-unsafe containers wrapped by coarse-grain synchronization. The Intel TBB library is a runtime-based parallel programming model for C++ that uses threads. In addition to the scalable parallel algorithms provided by the library, Intel TBB provides safe, scalable implementations of map, queue, and vector container classes. These templates can be used directly in user-threaded code or in conjunction with the parallel algorithms included with the library.

As discussed above, the standard interfaces for some container classes are inherently thread-unfriendly. Therefore the Intel TBB concurrent containers cannot be simple drop-in replacements for their STL counterparts. Instead, Intel TBB follows the spirit of STL but provides modified interfaces where necessary to ensure thread-safety.

All of the concurrent containers in the Intel TBB library are implemented using fine-grain locking. Whenever a method is invoked on a container, only the portion of the data structure touched by the method is locked, allowing multiple threads to concurrently access other parts of the structure.

In the next few sections, I’ll introduce the template classes concurrent_hash_map, concurrent_queue, and concurrent_vector, and then provide a case study that demonstrates how to detect and replace uses of unsafe containers.

Class: concurrent_hash_map< Key, T, HashCompare >
Intel TBB’s template class concurrent_hash_map is similar to the associative containers of STL, but permits concurrent accesses to its elements. It’s a hash table that maps from keys of type Key to values of type T. The traits type HashCompare defines the hash function to use in mapping, as well as a function that evaluates the equality of two keys. The functions travel together because they must agree that any two equal keys hash to the same hash code.

An example of a traits class is given below for keys of type string:

struct my_hash_compare {    static size_t hash( const string& x ) {        size_t h = 0;        for( const char* s = x.c_str(); *s; ++s )            h = (h*17)^*s;        return h;    }    //! True if strings are equal    static bool equal( const string& x, const string& y ) {        return x==y;    };

A concurrent_hash_map acts as a container of elements of type std::pair. Typically, when accessing a container element, you are interested in either updating it or reading it. Template class concurrent_hash_map supports these two purposes respectively with the classes accessor and const_accessor. These act as smart pointers and enable the atomic access to elements that was lacking in my previous STL map example.

An accessor represents update (write) access. As long as it points to an element, all other attempts to look up that key in the table will block until the accessor is done. A const_accessor is similar, except that is represents read-only access. Multiple const_accessors are allowed to point to the same element at the same time, which can greatly improve concurrency in situations where elements are frequently read and infrequently updated.

You primarily access concurrent_hash_map elements through its methods insert, find, and erase. Methods find and insert take an accessor or const_accessor as an argument. Which one you choose tells concurrent_hash_map whether you are asking for update or read-only access. Once the method returns, the access lasts until the accessor or const_acessor is destroyed. The method remove implicitly requests write access; hence before removing the key, it will wait for any other extant accesses to finish.

The code below shows an example of inserting a new element into a concurrent_hash_map:

concurrent_hash_map string_table;void insert_into_string_table ( string &key, MyClass &m ) {        // create an accessor that will act as a smart pointer     // for write access    concurrent_hash_map::accessor a;    // call insert to create a new element, or return an existing     // element if one exists.    // accessor a locks this element for exclusive use by this thread    string_table.insert( a, key );    // modify the value held by the pair    a->second = m;    // the accessor "a" releases the lock on the element when it is     // destroyed at the end of the scope}

concurrent_queue< Key, T, HashCompare >
Intel TBB’s template class concurrent_queue implements a concurrent queue with values of type T. Multiple threads may simultaneously push and pop elements from the queue. By default, the queue is unbounded, but can be bounded by setting the maximum capacity.

In general, queues are first-in first-out data structures, and in a single-threaded program a queue can be designed to support this strict ordering. However, if multiple threads are pushing and popping concurrently, the definition of “first” is not so well defined. The Intel TBB template class concurrent_queue instead guarantees that if a thread pushes two values, and another thread pops those two values, they will be popped in the same order that they were pushed. The interleaving of values pushed by different threads is not restricted.

A concurrent queue is often used in producer-consumer applications, where data produced by one thread is consumed by another. To flexibly support these types of applications, Intel TBB offers both a blocking and non-blocking version of pop. Method pop_if_present is non-blocking. It attempts to pop a value, but if the queue is empty, it immediately returns. Method pop on the other hand, blocks until an item is available and then pops that item from the queue.

The example below uses the blocking method pop for communication between two threads using a concurrent_queue. Thread 1 prints each value pushed into the queue by Thread 0 as it becomes available:

http://assets.devx.com/articlefigs/17805.gif

Unlike most STL containers, concurrent_queue::size_type is a signed integral type, not unsigned. This is because concurrent_queue::size() is defined as the number of push operations started minus the number of pop operations started. If pops outnumber pushes, size() becomes negative. For instance, if a concurrent_queue is empty, and there are n pending pop operations, size() returns ?n. This provides an easy way for producers to know how many consumers are waiting on the queue.

concurrent_vector< Key, T, HashCompare >
The final concurrent container defined by TBB is the template class concurrent_vector. A concurrent_vector is a dynamically growable array that allows access to elements in the vector while it is concurrently being grown.

Class concurrent_vector defines two methods for safe concurrent growth: grow_by and grow_to_at_least. Method grow_by(n) lets you safely append n consecutive elements to a vector and returns the index of the first appended element. Each element is initialized with T(). Method grow_to_at_least(n)grows a vector to size n if it is currently smaller than n elements.

The following routine safely appends a C string to a shared vector:

void Append ( concurrent_vector& vector, const char* string) {    size_t n = strlen(string) + 1;    memcpy( &vector[vector_grow_by(n)], string, n+1 );}

Method size() returns the number of elements in the vector, which may include elements that are still undergoing concurrent construction by methods grow_by and grow_to_at_least. Consequently, it is safe to use the iterators while the concurrent_vector is being grown, as long as the iterators never go past the current value of end(). However, the iterator may reference an element undergoing concurrent construction. When using an Intel TBB concurrent_vector, synchronizing construction and access is the responsibility of the developer.

How to Find and Replace Uses of Unsafe Containers
As mentioned in the previous sections, the TBB containers can enable concurrency where other libraries such the STL are either unsafe or must be wrapped in coarse-grain synchronization to ensure safety. However, it must be stressed that these concurrent containers have more overhead than their thread-unsafe counterparts. So it is important to use these thread-safe variants only when required to ensure thread safety or to enable increased concurrency.

So the first question then is, “How do we find thread-unsafe or concurrency-limiting uses of container classes?” For the remainder of this article, I will use a string counting example to demonstrate how you can use the Intel Threading Tools, Intel Thread Checker, and Intel Thread Profiler (see http://www.intel.com/software/products/threading ), along with Intel Threading Building Blocks to find and replace these troublesome container classes.

A string_count Example
The code below uses an STL map table to count the occurrences of distinct strings in an array Data. The code is parallelized by creating Win32 Threads (NThread of them) and having each thread process a chunk of the array Data.

The method CountOccurrences creates the threads and passes to each thread the range of elements in the array that it should process. The method tally performs the actual calculation, in which each thread iterates through its assigned portion of Data and increments the corresponding element in the STL map table. Line 30 of the example prints the total time used to tally the occurrences as well as the total number of strings that were inspected (this total should always equal the data set size of N). The complete code for all versions of the example described in this section is shown in Listing 1.

00: const size_t N = 1000000;01: static string Data[N];02: typedef map StringTable;03: StringTable table;04: DWORD WINAPI tally ( LPVOID arg ) {05:    pair *range = 06:      reinterpret_cast< pair * >(arg);07:    for( int i=range->first; i < range->second; ++i )08:        table[Data[i]] += 1;09:    return 0;10: }11: static void CountOccurrences() {12:  HANDLE worker[8];13:  pair range[8];14:  int g = static_cast(ceil(static_cast(N) / NThread));15:  tick_count t0 = tick_count::now();    16:  for (int i = 0; i < NThread; i++) {17:    int start = i * g;18:    range[i].first = start;29:    range[i].second = (start + g) < N ? start + g : N;20:    worker[i] = 21:      CreateThread( NULL, 0, tally, &range[i], 0, NULL ); 22:  }23:  WaitForMultipleObjects ( NThread, worker, TRUE, INFINITE );24:  tick_count t1 = tick_count::now(); 25:  int n = 0;26:  for( StringTable::iterator i=table.begin(); 27:          i!=table.end(); ++i )28:    n+=i->second;39:  30:  printf("total=%d time = %g
",n,(t1-t0).seconds());31: }

Finding the Unsafe Container Uses

Figure 1. The Intel Thread Checker tool finds dependencies, making it easy to eliminate potential conflicts.

Compile this example with Microsoft Visual Studio .NET 2005 and execute it with four threads on a four-processor Intel Xeon server, and you’ll immediately see that something is wrong. The total reported by line 31 changes during each execution and is never equal to N (1000000). Perhaps there is a race condition or other threading error.

To find out, I ran this example through a tool that detects potential data races, deadlocks, and other threading errors in user code. This tool, Intel Thread Checker, was able to find a number of potential dependencies between threads and map those dependencies back to the source lines responsible for the errors. All of the potential dependencies point to the same place, line 08, where the threads access the STL map (see Figure 1).

Finding a Scalable Solution
Figure 1 makes clear that I need to synchronize the accesses to the map. I’ll first examine the performance of using a single Win32 Mutex or CRITICAL_SECTION object to guard the accesses. As described earlier, coarse-grain synchronization is the only way to ensure thread-safety with a thread-unsafe library.

A Win32 Mutex is a kernel object that is visible across processes. While guarding each access to the STL map with a single Mutex object will ensure thread safety, doing so will also incur a very high overhead. On the other hand, a Win32 CRITICAL_SECTION is a lightweight, user-space, intra-process mutex object and therefore a better option.

The speedups of the synchronized versions using STL map relative to a sequential implementation of the code is shown in Figure 2.


Figure 2. None of the three options shown provides an acceptable performance result.
?
Figure 3. Intel TBB concurrent_hash_map provides improves the performance of the STL map.

Without synchronization, the STL map provides good performance but is corrupted by the concurrent accesses, which leads to incorrect results. As expected, the high cost of a Win32 Mutex leads to dismal performance. And the coarse-grain synchronization of a single CRITICAL_SECTION object also results in degraded performance. Both thread-safe STL map-based implementations take longer to execute than the original sequential implementation. While the added coarse-grain synchronizations restore correctness, the lack of concurrency inhibits scaling.

Figure 3 shows the performance when the uses of STL map are replaced with uses of the TBB’s concurrent_hash_map.

Comparing the speed-up enabled by the concurrent_hash_map to that provided by the STL map without synchronization, it is clear that the thread-safety provided by Intel TBB does not come for free. But unlike the other thread-safe versions, it does provide a thread-safe implementation with enough concurrency to yield a speed-up greater than 1.

Therefore, the concurrency provided by TBB concurrent container enables faster performance where the coarse-grain locking of unsafe containers does not.

As an increasing number of developers face an imperative need to move to multicore systems, they will surely need to research the tooling available to make multithreaded application development less error-prone. The concurrent container classes included in TBB are one such asset; they can help ease a transition to multicore by providing safe, scalable implementations that allow you to avoid using thread unfriendly containers, such as those defined by the C++ STL, and all the problems that come with them.

Share the Post:
Share on facebook
Share on twitter
Share on linkedin

Related Posts