Lots about Locks

emember the red telephone box, once a familiar sight on the streets of London? That’s a good example of mutually exclusive access to a shared resource, although you probably didn’t find any locks on them. Why? Because only one person at a time could use one to make a call, and civil persons would not listen to a stranger’s conversation while waiting outside.

Unfortunately, there are no guarantees that programs will be equally civil, so wise programmers use semaphores to keep processes from running amok and leaving shared resources, such as files and I/O devices, in inconsistent states.

Mutual exclusion locks (also, called “mutex locks” or simply locks or mutexes) are a special kind of semaphore. Each protects a single shared resource, qualifying it as a binary semaphore. Concurrent programs use locks to guarantee consistent communication among threads through shared variables or data structures. A piece of program code protected by a mutex lock is called a “critical section.”

Mutex locks are often implemented using an indivisible “test-and-set” instruction in today’s prevalent multi-core systems. Although generally deemed efficient, relying on an indivisible test-and-set instruction incurs a few hidden performance penalties. First, execution of such an instruction requires memory access, so it interferes with other cores’ progress—especially when the instruction is in a tight loop. The effect may be felt even more acutely on systems with a shared memory bus. Another penalty stems from cache-coherency. Because the cache line containing a lock object is shared among cores, one thread’s update to the lock invalidates the copies on the other cores. Each subsequent test of the lock on other cores triggers fetching the cache line. A related penalty is false sharing where an unrelated write to another part of the cache line invalidates the whole cache line. Even if the lock remains unchanged, the cache line must be fetched to test the lock on a different core.

Given all these problems, one might wonder “Why use locks at all? What are the alternatives?” One extreme alternative is to give up on communicating through shared variables and adopt the mantra of “no sharing.” That involves replicating data and communicating via message passing. Unfortunately, the cost of replication and message passing is even greater than the overhead associated with locks on today’s multi-core shared-memory architectures.

Another approach that has been actively pursued recently as an alternative to mutex locks is lock-free/non-blocking algorithms. Researchers have reported some isolated successes in designing practical non-blocking implementations. Nonetheless, non-blocking algorithms are hardly a “holy grail”. Designing efficient non-blocking data structures remains difficult, and the promised performance gain has been elusive at best. You’ll see more about non-blocking algorithms at the end of this article.

With no proven better alternatives at present, it makes sense to make the most of mutex locks until they are rendered no longer necessary. This article discusses some experiences with mutex locks in developing multi-threaded concurrent applications, using the mutex locks provided in Intel® Threading Building Blocks as examples.

Making the Most of Mutex Locks

Mutexes are often vilified as major performance snags in multi-threaded, concurrent application development; however, our experience suggests that mutex locks are the least evil among synchronization methods available today. Even though the nominal overhead appears large, you can harness them to your advantage if you use them in well-disciplined ways. Throughout this article, you’ll see some of the lessons learned, stated as guidelines, the first two of which are:

Guideline 1: Being Frugal Always Pays Off

Minimize explicit uses of locks. Instead, use concurrent containers and concurrent algorithms provided by efficient, thread-safe concurrency libraries. If you still find places in your application that you think benefit from explicit use of locks, then:

Guideline 2: Make Critical Sections as Small as Possible

When a thread arrives at a critical section and finds that another thread is already in it, it must wait. Keep the critical section small, and you will get small waiting times for threads and better overall performance. Examine when a shared data in a critical section is made private and see if you can safely take some of the accesses to the data out of the critical section.

For example, the code in Listing 1 implements a concurrent stack. It defines two methods push() and pop(), each protected using a TBB mutex lock (smtx) that’s acquired in the constructor and released in the destructor. The examples in Listing 1 rely on the C++ scoping rules to delimit the critical sections.

A cursory look at pop() shows that:

  1. If the stack is empty pop() returns false.
  2. If the stack is not empty, the code acquires the mutex lock and then re-examines the stack.
  3. If the stack has become empty since the previous test, pop() returns false.
  4. Otherwise, the code updates the top variable, and copies the old top element.
  5. Finally, pop() releases the lock, reclaims the popped node, and returns true.

Here’s a closer look at the critical section. Copying type T may take a lot of time, depending on T. Because of the lock, you know that, once updated, the old top value cannot be viewed by other threads; it becomes private and local to the thread inside the critical section. Therefore, you can safely yank the copy statement out of the critical section (following guideline 2) as follows.

   bool pop( T& _e ) {       node* tp = NULL;       if( !top ) goto done;       {           tbb::spin_mutex::scoped_lock lock( smtx );           if( !top ) goto done;           tp = top;           top = top->nxt;           // move the next line...           // _e = tp->elt;       }       // ...to here       _e = tp->elt;       delete tp;   done:       return tp!=NULL;   }

As another example, consider implementing a tiny memory management routine. A thread allocates objects from its private blocks and returns objects to their parent block. It is possible for a thread to free objects allocated by another thread. Such objects are added to their parent block’s public free list. In addition, a block with a non-empty public free list is added to a list (i.e., public block list) formed with block_t::next_to_internalize and accessed through block_bin_t::mailbox, if not already in.

The owner thread privatizes objects in a block’s public free list, as needed. Function internalize_next() implements the functionality and is invoked when a thread runs out of private blocks with free objects to allocate. It takes a block bin private to the caller thread as its argument and pops the front block from the list bin->mailbox, if not empty. Then, it internalizes objects in the block’s public free list:

   block_t* internalize_next ( block_bin_t* bin )   {       block_t* block;       {           tbb::spin_mutex::scoped_lock scoped_cs(bin->mailbox_lock);           block = bin->mailbox;           if( block ) {               bin->mailbox = block->next_to_internalize;               block->next_to_internalize = NULL;           }       }       if( block )           internalize_returned_objects( block );       return block;   }

The function’s critical section protects access to bin->mailbox with bin->mailbox_lock. Inside the critical section, if bin->mailbox is not empty it pops the front block to block and resets the block’s next_to_internalize.

Note that block is a local variable. By the time bin->mailbox is updated, block (which points to the old front block) becomes invisible to other threads and access to its next_to_internalize field becomes race-free. Thus, you can safely move the reset statement outside the critical section:

   block_t* internalize_next ( block_bin_t* bin )   {       block_t* block;       {           tbb::spin_mutex::scoped_lock scoped_cs(bin->mailbox_lock);           block = bin->mailbox;           if( block ) {               bin->mailbox = block->next_to_internalize;               // move the next statement...               // block->next_to_internalize = NULL;       }       if( block ) {           // ...to here           block->next_to_internalize = NULL;           internalize_returned_objects( block );       }       return block;   }

Guideline 3: Synchronize as Infrequently as Possible

The idea behind this guideline is that you can amortize the cost of a lock operation over a number of local operations. Doing so reduces the overall execution time because executing atomic instructions tends to consume an order of magnitude more cycles.

Again, suppose you’re designing a memory allocator that allocates objects out of a block. To reduce the number of trips to the operating system to get more memory blocks, the allocator uses a function called allocate_blocks() to get a big strip from the operating system, partition it into a number of blocks, and then put them in the global free block list shared among threads. The free block list free_list is implemented as a concurrent stack (see Listing 2).

Note that the code to push a newly carved-out block into free_list is inside a while loop. Also, note that stack2::push() protects concurrent accesses to stack2::top through a mutex lock. That means allocate_blocks() acquires the lock free_list.mtx N times for a strip containing N blocks.

You can reduce that frequency to one per strip by adding a few thread-local instructions. The idea is to build a thread-local list of blocks in the while loop first (using two pointer variables head and tail) and then push the entire list into free_list with a single lock acquisition (see Listing 3). Finally, so that allocate_blocks() can access free_list’s private fields, it’s declared as a friend of stack2.

Guideline 4: Most of All, Know Your Application.

The guideline that will help you most in practice is to analyze and understand your application using actual use scenarios and representative input sets. Then you can determine what kinds of locks are best used where. Performance analysis tools such as Intel® Parallel Amplifier can help you identify where the hot spots are and fine-tune your application accordingly.

A Smorgasbord of Lock Flavors

Intel® Threading Building Blocks offers a gamut of mutex locks with different traits, because critical sections with different access patterns call out mutex locks with different trade-offs. Other libraries may offer similar choices. You need to know your application to select the most appropriate lock flavor for each critical section.

Spin mutex vs. Queuing Mutex

The most prominent distinguishing property for locks is fairness—whether a lock allows “fair” access to the critical section or not. This is an important consideration when choosing a lock, but its importance may vary depending on circumstances. For example, an operating system should guarantee that no process gets unfairly delayed when multiple processes contend against each other to get into a critical section. By contrast, unfairness among threads in a user process may be tolerable to some degree if it helps boost the throughput.

TBB’s spin_mutex is an unfair lock. Threads entering a critical section with a spin_mutex repeatedly attempt to acquire the lock (they spin-wait until they get into the critical section, thus the name). In theory, the waiting time for a spin_mutex is unbounded. The TBB queuing_mutex, on the other hand, is a fair lock, because a thread arriving earlier at a critical section will get into it earlier than one arriving later. Waiting threads form a queue. A newly arriving thread puts itself at the end of the queue using a non-blocking atomic operation and spin-waits until its flag is raised. A thread leaving the critical section hands the lock over to the next in line by raising the latter’s flag.

Unfortunately there are no cast-in-stone guidelines or criteria that dictate when to use an unfair spin_mutex and when to use a fair queuing_mutex. In general though, guaranteeing fairness costs more. When a critical section is brief and contention is light, the chance of a thread being starved is slim and any additional overhead for unneeded fairness may not be warranted. In those cases, use a spin_mutex.

The TBB queuing_mutex spin-waits on a local cache line and does not interfere with other threads’ memory access. Consider using a queuing mutex for modestly sized critical sections and/or when you expect a fairly high degree of contention.

One report claims that, using a test program, “with spin locks, a difference of up to 2x runtime per thread was observed and some threads were unfairly granted the lock up to 1 million times on an 8-core Opteron machine.” If you suspect your application suffers from unfairness due to a spin_mutex, switching to a fair mutex such as queuing_mutex is your answer. But before switching, back up your decision with concrete measurement data.

Reader-Writer locks

Not all concurrent accesses need to be mutually exclusive. Indeed, accesses to many concurrent data structures are mostly read-accesses, and only occasionally need write-accesses. For these structures, keeping one reader spin-waiting while another reader is in the critical section is not necessary.

TBB reader/writer mutexes allow multiple readers to be in a critical section while giving writers exclusive access to it. The unfair version is called spin_rw_mutex, while the fair version is queuing_rw_mutex. These mutexes also allow readers to upgrade to writers and writers to downgrade to readers.

Under some circumstances, you can replace reader-side locks with less expensive operations (although potentially at the expense of writers). One such example is a sequential lock; another is read-copy-update lock. These locks are less-general reader-writer locks, so using them properly in applications requires more stringent scrutiny.

Mutex and Recursive_Mutex

TBB provides a mutex that wraps around the underlying OS locks, but compared to the native version, adds portability across all supported operating systems. In addition the TBB mutex releases the lock even when an exception is thrown from the critical section.

A sibling, recursive_mutex, permits a thread to acquire multiple locks on the same mutex. The thread must release all locks on a recursive_mutex before any other thread can acquire a lock on it.

Avoiding Lock Pitfalls

There is no shortage of references that warn about the inevitable dangers of using locks, such as deadlocks and livelocks. However, you can reduce the chances of getting ensnared by these problems considerably by instituting a few simple rules.

Avoid explicit use of locks. Instead use concurrent containers and concurrent algorithms provided in well-supported concurrency libraries such as Intel® Threading Building Blocks. If you think your application requires explicit use of locks, avoid implementing your own locks and use well-tested well-tuned locks such as TBB locks.

Avoid making calls to functions (particularly unknown ones) while holding a lock. In general, calling a function while holding a lock is not good practice. For one thing, it increases the size of the critical section, thus increasing the wait-times of other threads. More seriously, you may not know whether the function contains lock acquisition code. Even if it does not now, it may in the future. Such changes potentially lead to a deadlock situation, and when that happens, it’s very difficult to locate and fix. If possible, re-factor the critical section so that it computes the function arguments in the critical section but invokes the function outside the critical section.

Avoid holding multiple locks. Circular lock acquisition is a leading cause of dead lock problems. If you must hold multiple locks, always acquire the locks in the same order and then release them in the same order that they were acquired.

Avoid using recursive locks. You may be able to find some isolated cases where recursive locks make great sense. However, locks don’t compose well. Even a completely unrelated change to a part of your application may lead to a deadlock, and the problem will be very difficult to locate (see this newsgroup thread, and this entry in Joe Duffy’s blog). If, after reading those, you still believe you need recursive locks, use them at your own peril.

Even if you do everything you possibly can to avoid deadlocks and livelocks, problems may still occur. If you suspect your application has a deadlock or race condition, and you cannot locate it quickly, don’t get burned by trying to resolve it by yourself. Use tools such as Intel® Parallel Inspector.

Lock-Free and Non-Blocking Algorithms

As promised earlier, one strategy that avoids locks and their associated problems advocated by some researchers is to use non-blocking synchronization methods such as lock-free/wait-free programming techniques and software transactional memory. These techniques aim to provide “wait-freedom”, thereby addressing issues stemming from the blocking nature of locks without compromising performance.

Unfortunately, our experience with non-blocking algorithms has been (so far) disappointing, and many other developers and researchers agree. Almost all non-blocking algorithms invariably use one or more hardware-supported atomic operations (such as compare-and-swap (CAS) and load-link/store-conditional (LL/SC)). Some even use double-word CAS (DCAS).

Dependence on these atomic primitives makes them difficult to write (see Doherty, Simon, et al, “DCAS is not a silver bullet for nonblocking algorithm design.” Proceedings of the 16th Annual ACM Symposium on Parallelism in Algorithms and Architectures. 2004, and Herb Sutter’s article “Lock-Free Code: A False Sense of Security“), difficult to validate for correctness (see Gotsman, Alexey, et al, “Proving That Non-Blocking Algorithms Don’t Block,” Symposium on Principles of Programming Languages to appear in 2009), and difficult to port to other platforms. This is probably one reason why non-blocking algorithms have been limited to simple data structures. Furthermore, improved performance over lock-based implementations seems hard to get.

Arguments for the other benefits are not compelling enough to warrant the pain of switching to non-blocking algorithms. Fairness is contingent upon the underlying atomic operations; in some cases, livelock is still possible. For many user applications, benefits such as real-time support and fault tolerance are a good-to-have, not a must-have. In other cases, solutions provided by operating systems are sufficient (e.g., priority inheritance for priority inversion).

Software Transactional Memory (STM) is another alternative to lock-based synchronization. It abstracts away the use of low-level atomic primitives using the notion of transactions, and simplifies synchronizing access to shared variables through optimistic execution and a roll-back mechanism. Like non-blocking algorithms, STM promises performance gains over lock-based synchronization, and also promises to avoid many common locking pitfalls. The results so far are not so favorable. One publication “observes that the overall performance of TM is significantly worse at low levels of parallelism” (see Cascaval, Calin, et al, “Software Transactional Memory: Why is it only a research toy?” ACM Queue. 2008, Vol. 6, 5.) However, STM is a relatively young research area, so the jury is still out.

Lock It Up

Locks have been unfairly vilified as a hindrance to the development of efficient concurrent applications on burgeoning multi-core platforms. However, our experiences suggest that rather than discouraging the use of mutex locks, one should instead promote their well-disciplined use. More often than not, implementations with such locks outperform those with non-blocking algorithms or STM.

The most important consideration for making the best use of mutex locks is understanding the application well, using tools to aid that understanding where necessary, and selecting the best-fitting synchronization method for each critical section. When you do choose a mutex, use it with the recommended guidelines, but keep flexibility in mind. Doing so will prevent most common mutex-related pitfalls without incurring unwarranted performance penalties. Finally, shun the do-it-yourself temptation, and delegate work to well-supported concurrency libraries.

Share the Post:
Share on facebook
Share on twitter
Share on linkedin

Related Posts