Making the Most of Mutex Locks
Mutexes are often vilified as major performance snags in multi-threaded, concurrent application development; however, our experience suggests that mutex locks are the least evil among synchronization methods available today. Even though the nominal overhead appears large, you can harness them to your advantage if you use them in well-disciplined ways. Throughout this article, you'll see some of the lessons learned, stated as guidelines, the first two of which are:
Guideline 1: Being Frugal Always Pays Off
Minimize explicit uses of locks. Instead, use concurrent containers and concurrent algorithms provided by efficient, thread-safe concurrency libraries. If you still find places in your application that you think benefit from explicit use of locks, then:
Guideline 2: Make Critical Sections as Small as Possible
When a thread arrives at a critical section and finds that another thread is already in it, it must wait. Keep the critical section small, and you will get small waiting times for threads and better overall performance. Examine when a shared data in a critical section is made private and see if you can safely take some of the accesses to the data out of the critical section.
For example, the code in
Listing 1 implements a concurrent stack. It defines two methods
push() and
pop(), each protected using a TBB mutex lock (
smtx) that's acquired in the constructor and released in the destructor. The examples in
Listing 1 rely on the C++ scoping rules to delimit the critical sections.
A cursory look at
pop() shows that:
- If the stack is empty pop() returns false.
- If the stack is not empty, the code acquires the mutex lock and then re-examines the stack.
- If the stack has become empty since the previous test, pop() returns false.
- Otherwise, the code updates the top variable, and copies the old top element.
- Finally, pop() releases the lock, reclaims the popped node, and returns true.
Here's a closer look at the critical section. Copying type T may take a lot of time, depending on T. Because of the lock, you know that, once updated, the old
top value cannot be viewed by other threads; it becomes private and local to the thread inside the critical section. Therefore, you can safely yank the copy statement out of the critical section (following guideline 2) as follows.
bool pop( T& _e ) {
node* tp = NULL;
if( !top ) goto done;
{
tbb::spin_mutex::scoped_lock lock( smtx );
if( !top ) goto done;
tp = top;
top = top->nxt;
// move the next line...
// _e = tp->elt;
}
// ...to here
_e = tp->elt;
delete tp;
done:
return tp!=NULL;
}
As another example, consider implementing a tiny memory management routine. A thread allocates objects from its private blocks and returns objects to their parent block. It is possible for a thread to free objects allocated by another thread. Such objects are added to their parent block's public free list. In addition, a block with a non-empty public free list is added to a list (i.e., public block list) formed with
block_t::next_to_internalize and accessed through
block_bin_t::mailbox, if not already in.
The owner thread privatizes objects in a block's public free list, as needed. Function
internalize_next() implements the functionality and is invoked when a thread runs out of private blocks with free objects to allocate. It takes a block bin private to the caller thread as its argument and pops the front block from the list
bin->mailbox, if not empty. Then, it internalizes objects in the block's public free list:
block_t* internalize_next ( block_bin_t* bin )
{
block_t* block;
{
tbb::spin_mutex::scoped_lock scoped_cs(bin->mailbox_lock);
block = bin->mailbox;
if( block ) {
bin->mailbox = block->next_to_internalize;
block->next_to_internalize = NULL;
}
}
if( block )
internalize_returned_objects( block );
return block;
}
The function's critical section protects access to
bin->mailbox with
bin->mailbox_lock. Inside the critical section, if
bin->mailbox is not empty it pops the front block to block and resets the block's
next_to_internalize.
Note that
block is a local variable. By the time
bin->mailbox is updated,
block (which points to the old front block) becomes invisible to other threads and access to its
next_to_internalize field becomes race-free. Thus, you can safely move the reset statement outside the critical section:
block_t* internalize_next ( block_bin_t* bin )
{
block_t* block;
{
tbb::spin_mutex::scoped_lock scoped_cs(bin->mailbox_lock);
block = bin->mailbox;
if( block ) {
bin->mailbox = block->next_to_internalize;
// move the next statement...
// block->next_to_internalize = NULL;
}
if( block ) {
// ...to here
block->next_to_internalize = NULL;
internalize_returned_objects( block );
}
return block;
}
Guideline 3: Synchronize as Infrequently as Possible
The idea behind this guideline is that you can amortize the cost of a lock operation over a number of local operations. Doing so reduces the overall execution time because executing atomic instructions tends to consume an order of magnitude more cycles.
Again, suppose you're designing a memory allocator that allocates objects out of a block. To reduce the number of trips to the operating system to get more memory blocks, the allocator uses a function called
allocate_blocks() to get a big strip from the operating system, partition it into a number of blocks, and then put them in the global free block list shared among threads. The free block list
free_list is implemented as a concurrent stack (see
Listing 2).
Note that the code to push a newly carved-out block into
free_list is inside a
while loop. Also, note that
stack2::push() protects concurrent accesses to
stack2::top through a mutex lock. That means
allocate_blocks() acquires the lock
free_list.mtx N times for a strip containing N blocks.
You can reduce that frequency to one per strip by adding a few thread-local instructions. The idea is to build a thread-local list of blocks in the
while loop first (using two pointer variables head and tail) and then push the entire list into
free_list with a single lock acquisition (see
Listing 3). Finally, so that
allocate_blocks() can access
free_list's private fields, it's declared as a friend of
stack2.
Guideline 4: Most of All, Know Your Application.
The guideline that will help you most in practice is to analyze and understand your application using actual use scenarios and representative input sets. Then you can determine what kinds of locks are best used where. Performance analysis tools such as
Intel® Parallel Amplifier can help you identify where the hot spots are and fine-tune your application accordingly.