

Lots about Locks: Page 2

Despite claims by competing technologies, locks remain the best choice for implementing synchronization and protecting critical sections of your code.





Making the Most of Mutex Locks

Mutexes are often vilified as major performance snags in multi-threaded application development; in our experience, however, mutex locks are the least evil among the synchronization methods available today. Even though their nominal overhead appears large, you can harness them to your advantage if you use them in well-disciplined ways. Throughout this article, you'll see some of the lessons learned, stated as guidelines, the first two of which are:

Guideline 1: Being Frugal Always Pays Off

Minimize explicit uses of locks. Instead, use concurrent containers and concurrent algorithms provided by efficient, thread-safe concurrency libraries. If you still find places in your application that you think benefit from explicit use of locks, then:
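As a concrete illustration of minimizing explicit locks, consider a shared counter updated by many threads. The article recommends concurrent containers and algorithms from a library such as TBB; as a standard-library stand-in, the sketch below (a hypothetical example, not from the article) replaces a mutex-guarded counter with std::atomic, eliminating the explicit lock entirely:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical example: counting events from many threads.
// Instead of guarding a plain int with an explicit mutex, a
// std::atomic counter removes the lock from user code entirely.
long count_events(int nthreads, int per_thread) {
    std::atomic<long> hits{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t)
        workers.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i)
                hits.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& w : workers)
        w.join();
    return hits.load();
}
```

The same principle scales up: a concurrent container encapsulates this kind of synchronization internally, so application code carries no visible locking at all.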

Guideline 2: Make Critical Sections as Small as Possible

When a thread arrives at a critical section and finds that another thread is already in it, it must wait. Keep the critical section small, and you will get short waiting times for threads and better overall performance. Examine when shared data in a critical section becomes private to a thread, and see whether you can safely move some accesses to that data out of the critical section.

For example, the code in Listing 1 implements a concurrent stack. It defines two methods push() and pop(), each protected using a TBB mutex lock (smtx) that's acquired in the constructor and released in the destructor. The examples in Listing 1 rely on the C++ scoping rules to delimit the critical sections.
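Listing 1 is not reproduced on this page; the following is a minimal sketch of the same idea, with std::mutex and std::lock_guard standing in for the TBB mutex and its scoped lock. The names (node, top, smtx) follow the article; the quick unlocked check of top mirrors the article's pop(), which tolerates that preliminary read:

```cpp
#include <mutex>

// Minimal sketch of Listing 1's concurrent stack, using std::mutex
// in place of tbb::spin_mutex. Scoping delimits the critical sections.
template <typename T>
class concurrent_stack {
    struct node { T elt; node* nxt; };
    node* top = nullptr;
    std::mutex smtx;   // stands in for the TBB mutex
public:
    void push(const T& e) {
        node* n = new node{e, nullptr};
        std::lock_guard<std::mutex> lock(smtx);  // critical section begins
        n->nxt = top;
        top = n;
    }                                            // scope ends: lock released
    bool pop(T& e) {
        node* tp = nullptr;
        if (!top) return false;                  // quick check without the lock
        {
            std::lock_guard<std::mutex> lock(smtx);
            if (!top) return false;              // re-examine under the lock
            tp = top;
            top = top->nxt;
            e = tp->elt;                         // copy inside the critical section
        }
        delete tp;
        return true;
    }
    ~concurrent_stack() {
        while (top) { node* t = top; top = top->nxt; delete t; }
    }
};
```

Note that in this version the copy of the element still happens inside the critical section; the optimization discussed next moves it out.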

A cursory look at pop() shows that:

  1. If the stack is empty, pop() returns false without acquiring the lock.
  2. If the stack is not empty, the code acquires the mutex lock and then re-examines the stack.
  3. If the stack has become empty since the previous test, pop() returns false.
  4. Otherwise, the code updates the top variable, and copies the old top element.
  5. Finally, pop() releases the lock, reclaims the popped node, and returns true.
Here's a closer look at the critical section. Copying type T may take a lot of time, depending on T. Because of the lock, you know that once top is updated, the old top node can no longer be reached by other threads; it becomes private and local to the thread inside the critical section. Therefore, you can safely yank the copy statement out of the critical section (following guideline 2) as follows.

bool pop( T& _e ) {
    node* tp = NULL;
    if( !top ) goto done;
    {
        tbb::spin_mutex::scoped_lock lock( smtx );
        if( !top ) goto done;
        tp = top;
        top = top->nxt;
        // move the next line...
        // _e = tp->elt;
    }
    // ...to here
    _e = tp->elt;
    delete tp;
done:
    return tp!=NULL;
}

As another example, consider implementing a tiny memory management routine. A thread allocates objects from its private blocks and returns objects to their parent block. It is possible for a thread to free objects allocated by another thread; such objects are added to their parent block's public free list. In addition, a block with a non-empty public free list is added (if not already present) to a public block list, linked through block_t::next_to_internalize and accessed through block_bin_t::mailbox.

The owner thread privatizes objects in a block's public free list, as needed. Function internalize_next() implements this functionality; it is invoked when a thread runs out of private blocks with free objects to allocate. It takes a block bin private to the calling thread as its argument and, if the list bin->mailbox is not empty, pops the front block from it. Then it internalizes the objects on the block's public free list:

block_t* internalize_next ( block_bin_t* bin ) {
    block_t* block;
    {
        tbb::spin_mutex::scoped_lock scoped_cs(bin->mailbox_lock);
        block = bin->mailbox;
        if( block ) {
            bin->mailbox = block->next_to_internalize;
            block->next_to_internalize = NULL;
        }
    }
    if( block )
        internalize_returned_objects( block );
    return block;
}

The function's critical section protects access to bin->mailbox with bin->mailbox_lock. Inside the critical section, if bin->mailbox is not empty, the code pops the front block into block and resets the block's next_to_internalize.

Note that block is a local variable. By the time bin->mailbox is updated, block (which points to the old front block) is no longer visible to other threads, and access to its next_to_internalize field becomes race-free. Thus, you can safely move the reset statement outside the critical section:
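To make the privatization argument concrete, here is a runnable sketch of the mailbox pattern, with std::mutex standing in for tbb::spin_mutex and block_t reduced to a stub carrying only the link field (the public free list itself is omitted):

```cpp
#include <mutex>

// Stub of the article's block_t: only the public-list link matters here.
struct block_t {
    block_t* next_to_internalize = nullptr;
};

struct block_bin_t {
    block_t* mailbox = nullptr;
    std::mutex mailbox_lock;   // stands in for the TBB mutex
};

// Pop the front block under the lock; reset its link outside, where
// the block is already private to the calling thread.
block_t* internalize_next(block_bin_t* bin) {
    block_t* block;
    {
        std::lock_guard<std::mutex> cs(bin->mailbox_lock);
        block = bin->mailbox;
        if (block)
            bin->mailbox = block->next_to_internalize;
    }
    if (block) {
        block->next_to_internalize = nullptr;  // race-free: block is private now
        // internalize_returned_objects(block) would walk the public free list here
    }
    return block;
}
```

The reset happens after the lock is released, yet no other thread can observe it, because the block was unlinked from the mailbox while the lock was held.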

block_t* internalize_next ( block_bin_t* bin ) {
    block_t* block;
    {
        tbb::spin_mutex::scoped_lock scoped_cs(bin->mailbox_lock);
        block = bin->mailbox;
        if( block ) {
            bin->mailbox = block->next_to_internalize;
            // move the next statement...
            // block->next_to_internalize = NULL;
        }
    }
    if( block ) {
        // ...to here
        block->next_to_internalize = NULL;
        internalize_returned_objects( block );
    }
    return block;
}

Guideline 3: Synchronize as Infrequently as Possible

The idea behind this guideline is that you can amortize the cost of a lock operation over a number of local operations. Doing so reduces the overall execution time because executing atomic instructions tends to consume an order of magnitude more cycles than ordinary instructions.
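The amortization idea applies to any synchronized update, not just locks. The sketch below (a hypothetical example, not from the article) sums an array across threads: each thread accumulates into a plain local variable and performs a single atomic add at the end, one synchronized operation per thread rather than one per element:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Amortizing synchronization: accumulate locally, publish once per thread.
long sum_batched(const std::vector<int>& data, int nthreads) {
    std::atomic<long> total{0};
    std::vector<std::thread> workers;
    std::size_t chunk = (data.size() + nthreads - 1) / nthreads;
    for (int t = 0; t < nthreads; ++t)
        workers.emplace_back([&, t] {
            long local = 0;                       // thread-local work, no sync
            std::size_t lo = t * chunk;
            std::size_t hi = std::min(data.size(), lo + chunk);
            for (std::size_t i = lo; i < hi; ++i)
                local += data[i];
            total.fetch_add(local);               // one atomic op per thread
        });
    for (auto& w : workers)
        w.join();
    return total.load();
}
```

With N elements and T threads, this performs T atomic operations instead of N, which is exactly the trade the guideline advocates.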

Again, suppose you're designing a memory allocator that allocates objects out of a block. To reduce the number of trips to the operating system to get more memory blocks, the allocator uses a function called allocate_blocks() to get a big strip from the operating system, partition it into a number of blocks, and then put them in the global free block list shared among threads. The free block list free_list is implemented as a concurrent stack (see Listing 2).

Note that the code to push a newly carved-out block into free_list is inside a while loop. Also, note that stack2::push() protects concurrent accesses to stack2::top through a mutex lock. That means allocate_blocks() acquires the lock free_list.mtx N times for a strip containing N blocks.

You can reduce that frequency to one per strip by adding a few thread-local instructions. The idea is to build a thread-local list of blocks in the while loop first (using two pointer variables head and tail) and then push the entire list into free_list with a single lock acquisition (see Listing 3). Finally, so that allocate_blocks() can access free_list's private fields, it's declared as a friend of stack2.
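Listings 2 and 3 are not reproduced on this page; the sketch below shows the same transformation with std::mutex in place of the TBB lock. The names stack2 and free_list follow the article, while the block type is reduced to a stub carrying only its link field. allocate_blocks() links the blocks thread-locally and then splices the whole list into free_list with one lock acquisition:

```cpp
#include <mutex>

struct blk { blk* next = nullptr; };   // stub: only the link field matters

struct stack2 {
    blk* top = nullptr;
    std::mutex mtx;
    void push(blk* b) {                     // Listing 2 path: one lock per block
        std::lock_guard<std::mutex> lock(mtx);
        b->next = top;
        top = b;
    }
    void push_list(blk* head, blk* tail) {  // Listing 3 path: one lock per strip
        std::lock_guard<std::mutex> lock(mtx);
        tail->next = top;
        top = head;
    }
};

// Carve n blocks out of a strip, link them with thread-local pointer
// operations, then push the entire list with a single lock acquisition.
void allocate_blocks(stack2& free_list, blk* strip, int n) {
    blk* head = nullptr;
    blk* tail = nullptr;
    for (int i = 0; i < n; ++i) {
        blk* b = &strip[i];    // thread-local linking, no lock
        b->next = head;
        head = b;
        if (!tail) tail = b;
    }
    if (head)
        free_list.push_list(head, tail);
}
```

The per-block pointer manipulation is a few cheap local instructions, so trading N lock acquisitions for one is a clear win whenever N is more than a couple of blocks.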

Guideline 4: Most of All, Know Your Application

The guideline that will help you most in practice is to analyze and understand your application using actual use scenarios and representative input sets. Then you can determine what kinds of locks are best used where. Performance analysis tools such as Intel® Parallel Amplifier can help you identify where the hot spots are and fine-tune your application accordingly.
