Login | Register   
RSS Feed
Download our iPhone app
Browse DevX
Sign up for e-mail newsletters from DevX


Enable Safe, Scalable Parallelism with Intel Threading Building Block's Concurrent Containers : Page 3

If you're among the many developers who are tackling the multicore future by writing multithreaded applications today, you'll soon learn that the container classes provided by the C++ STL are not thread friendly. Intel provides a C++ template library with thread-safe concurrent containers. Get a walk-through of what you can expect.

How to Find and Replace Uses of Unsafe Containers
As mentioned in the previous sections, the TBB containers can enable concurrency where other libraries such the STL are either unsafe or must be wrapped in coarse-grain synchronization to ensure safety. However, it must be stressed that these concurrent containers have more overhead than their thread-unsafe counterparts. So it is important to use these thread-safe variants only when required to ensure thread safety or to enable increased concurrency.

So the first question then is, "How do we find thread-unsafe or concurrency-limiting uses of container classes?" For the remainder of this article, I will use a string counting example to demonstrate how you can use the Intel Threading Tools, Intel Thread Checker, and Intel Thread Profiler (see http://www.intel.com/software/products/threading ), along with Intel Threading Building Blocks to find and replace these troublesome container classes.

A string_count Example
The code below uses an STL map<string, int> table to count the occurrences of distinct strings in an array Data. The code is parallelized by creating Win32 Threads (NThread of them) and having each thread process a chunk of the array Data.

The method CountOccurrences creates the threads and passes to each thread the range of elements in the array that it should process. The method tally performs the actual calculation, in which each thread iterates through its assigned portion of Data and increments the corresponding element in the STL map table. Line 30 of the example prints the total time used to tally the occurrences as well as the total number of strings that were inspected (this total should always equal the data set size of N). The complete code for all versions of the example described in this section is shown in Listing 1.

00: const size_t N = 1000000; 01: static string Data[N]; 02: typedef map<string,int> StringTable; 03: StringTable table; 04: DWORD WINAPI tally ( LPVOID arg ) { 05: pair<int, int> *range = 06: reinterpret_cast< pair<int,int> * >(arg); 07: for( int i=range->first; i < range->second; ++i ) 08: table[Data[i]] += 1; 09: return 0; 10: } 11: static void CountOccurrences() { 12: HANDLE worker[8]; 13: pair<int,int> range[8]; 14: int g = static_cast<int>(ceil(static_cast<double>(N) / NThread)); 15: tick_count t0 = tick_count::now(); 16: for (int i = 0; i < NThread; i++) { 17: int start = i * g; 18: range[i].first = start; 29: range[i].second = (start + g) < N ? start + g : N; 20: worker[i] = 21: CreateThread( NULL, 0, tally, &range[i], 0, NULL ); 22: } 23: WaitForMultipleObjects ( NThread, worker, TRUE, INFINITE ); 24: tick_count t1 = tick_count::now(); 25: int n = 0; 26: for( StringTable::iterator i=table.begin(); 27: i!=table.end(); ++i ) 28: n+=i->second; 39: 30: printf("total=%d time = %g\n",n,(t1-t0).seconds()); 31: }

Finding the Unsafe Container Uses
Figure 1. The Intel Thread Checker tool finds dependencies, making it easy to eliminate potential conflicts.
Compile this example with Microsoft Visual Studio .NET 2005 and execute it with four threads on a four-processor Intel Xeon server, and you'll immediately see that something is wrong. The total reported by line 31 changes during each execution and is never equal to N (1000000). Perhaps there is a race condition or other threading error.

To find out, I ran this example through a tool that detects potential data races, deadlocks, and other threading errors in user code. This tool, Intel Thread Checker, was able to find a number of potential dependencies between threads and map those dependencies back to the source lines responsible for the errors. All of the potential dependencies point to the same place, line 08, where the threads access the STL map (see Figure 1).

Finding a Scalable Solution
Figure 1 makes clear that I need to synchronize the accesses to the map. I'll first examine the performance of using a single Win32 Mutex or CRITICAL_SECTION object to guard the accesses. As described earlier, coarse-grain synchronization is the only way to ensure thread-safety with a thread-unsafe library.

A Win32 Mutex is a kernel object that is visible across processes. While guarding each access to the STL map with a single Mutex object will ensure thread safety, doing so will also incur a very high overhead. On the other hand, a Win32 CRITICAL_SECTION is a lightweight, user-space, intra-process mutex object and therefore a better option.

The speedups of the synchronized versions using STL map relative to a sequential implementation of the code is shown in Figure 2.

Figure 2. None of the three options shown provides an acceptable performance result.
Figure 3. Intel TBB concurrent_hash_map provides improves the performance of the STL map.

Without synchronization, the STL map provides good performance but is corrupted by the concurrent accesses, which leads to incorrect results. As expected, the high cost of a Win32 Mutex leads to dismal performance. And the coarse-grain synchronization of a single CRITICAL_SECTION object also results in degraded performance. Both thread-safe STL map-based implementations take longer to execute than the original sequential implementation. While the added coarse-grain synchronizations restore correctness, the lack of concurrency inhibits scaling.

Figure 3 shows the performance when the uses of STL map are replaced with uses of the TBB's concurrent_hash_map.

Comparing the speed-up enabled by the concurrent_hash_map to that provided by the STL map without synchronization, it is clear that the thread-safety provided by Intel TBB does not come for free. But unlike the other thread-safe versions, it does provide a thread-safe implementation with enough concurrency to yield a speed-up greater than 1.

Therefore, the concurrency provided by TBB concurrent container enables faster performance where the coarse-grain locking of unsafe containers does not.

As an increasing number of developers face an imperative need to move to multicore systems, they will surely need to research the tooling available to make multithreaded application development less error-prone. The concurrent container classes included in TBB are one such asset; they can help ease a transition to multicore by providing safe, scalable implementations that allow you to avoid using thread unfriendly containers, such as those defined by the C++ STL, and all the problems that come with them.

Michael Voss is a Senior Staff Software Engineer at Intel Corp. Voss received his Ph.D. in Electrical Engineering from Purdue University in 2001. He is currently a Senior Staff Software Engineer in the Performance Analysis and Threading Lab at Intel and an adjunct Professor in the Edward S. Rogers Sr. Department of Electrical and Computer Engineering at the University of Toronto. His interests include parallel programming and optimizing compilers.
Comment and Contribute






(Maximum characters: 1200). You have 1200 characters left.