Plan for the Future: Express Parallelism, Don’t Manage It

he days of large increases in clock frequency are behind us. It was great while it lasted, but it’s now clear that we have finally hit a frequency wall. Contrary to popular opinion, that wall is due to heat and energy concerns, not the end of Moore’s Law. Instead, the current trend is to use the additional transistors predicted by Moore’s Law to provide multiple processing cores and to support multiple threads per core.

For developers, this change in architecture means there is extra performance potential available if needed, but it also means that developers must do some extra work to take advantage of it. But wait! Don’t jump off the cliff unless you need to. Not every application needs to be parallel. If performance is not an issue for a particular application, then there’s no need to add parallelism. Even if your application is single-threaded, users can still benefit, because their multicore platforms can execute multiple applications concurrently. However, if exploiting the available performance potential increases application usability or allows you to add new features, it makes sense to take the plunge into parallel programming.

There are many ways to add parallelism to an application. For example, it’s very tempting to use the operating system’s native threading library directly on a multicore platform, such as WinThreads on Windows or Posix Threads on Linux. But this choice is a mistake. You’ll see why in the rest of this article. Instead, an important key to unlocking the performance potential of current and future multicore platforms is to choose a model that lets you express concurrency without requiring that you explicitly manage it. To do that, choose a high-level concurrency platform like Intel Threading Building Blocks (TBB), the OpenMP API, or any one of the many available data parallel languages (see the sidebar “Data Parallel Languages” for links).

By using these platforms, developers express the concurrency in their applications and let the runtime library manage the low-level, platform-specific details of implementing and scheduling concurrency on the hardware. That’s because the implementation and scheduling details will need to change as architectures change. By using a concurrency platform rather than a native threading library, applications are generally easier to write, are more likely to be correct, and can even gain performance as platforms evolve.

This article describes the process of adding parallelism to an example application using both a native threading library and the TBB library. You’ll see that using native threading requires more code, is more difficult to write, has more room for errors, and is less able to adapt to new architectures as they arrive.

A Running Example

The article example is a small search application that uses a genetic algorithm. It’s true that the program is somewhat contrived; it needed to be small but still present multiple opportunities for adding parallelism. You can find the complete set of programs in the accompanying downloadable source code, which contains three versions:

  • serial_ga.cpp: the serial version
  • threaded_ga.cpp: a version parallelized using WinThreads
  • tbb_ga.cpp: a version parallelized using TBB

In the example, each individual in a population has a bit vector of genes. Each gene is a single bit. The goal is to find the fittest individual. The example code uses a trivial fitness function that simply compares a given individual to an ideal solution and returns the number of genes they share. A solution that shares more bits with the ideal solution is deemed fitter than one that shares fewer. While there are simpler ways to search for a match than using a genetic algorithm for this particular fitness function, this simple fitness function suffices for demonstration purposes.

The main loop in serial_ga.cpp is as follows:

   population p(population_size);     for (int g = 0; g < num_generations; ++g) {       p.generate_children();       p.assign_fitness();       p.sort_population();     }

The population evolves through num_generations generations. In each generation, the program selects individuals from the population using a tournament scheme and crosses them to generate new children (generate_children). It also randomly selects some children for mutation, and then evaluates their fitness (assign_fitness) of all children in that generation. Finally, it sorts the entire population, both parents and children, so that the fittest individuals rise to the top and are subsequently more likely to reproduce in future generations (sort_population).

Next, you'll see each step in this loop to see how it might be parallelized, first using native WinThreads, and then the TBB library.

Parallelizing generate_children

Before attempting to parallelize the generate_children function, it's worth studying the serial version briefly.

Serial Version of generate_children

The following code shows the implementation of generate_children in serial_ga.cpp:

   inline void    population::generate_children() {     for ( int i = 0; i < population_size; ++i ) {       my_individuals[population_size + i] =          individual ( select_parent().chromosome,                      select_parent().chromosome );     }   }

Each individual has a bit vector (chromosome) that represents its genes. The preceding loop doubles the population of individuals by creating new individuals and adding them to the population (the space is reserved elsewhere to allow for the doubling of the population). Each loop iteration selects two parents and generates a new child. You can find code for both the selection and the generation of new individuals in the downloadable source but you don't need those for this discussion. For parallelization, it suffices to know that each loop iteration is independent from all others; therefore, this loop can be parallelized.

Natively Threaded Version of generate_children

You need three modifications to the serial code to implement a natively threaded version of this code:

  1. Package the loop so that it can be passed to a thread.
  2. Devise a scheduling scheme to break up the work between these threads.
  3. Add code to make the main thread wait for the worker threads to complete before continuing with the rest of the loop.

The method generate_children creates num_threads native threads using the Windows API call _beginthread. The arguments to this call include the function start_generate_children and t (the thread ID). Each _beginthread call creates a new thread that executes start_generate_children(t). The function generate_children ends by invoking the Windows API call WaitForMultipleObjects to join the threads:

   inline void    population::generate_children() {     for ( int t = 0; t < num_threads; ++t ) {       handles[t] = _beginthread(       &start_generate_children, 0, (void *)t );     }     WaitForMultipleObjects(        num_threads, (HANDLE *)handles,        true, INFINITE );   }

The function start_generate_children is a wrapper that calls in to a population_helper helper class:

   inline void start_generate_children( void *x ) {     population_helper::generate_children( int(x) );   }

In this natively threaded version, the real work of the loop has been moved to the function start_generate_children of this helper class:

   static void    population_helper::generate_children( const int thread_id ) {     size_t begin = 0;     size_t end = population_size;     adjust_begin_end( thread_id, begin, end );     for ( size_t i = begin; i < end; ++i ) {       (*my_individuals)[population_size + i] =          individual (  select_parent().chromosome,                        select_parent().chromosome );     }   }

This code looks very similar to the serial code; there is a singly nested loop that repeatedly creates new individuals from randomly selected parents. The difference is the loop bounds, which are set by a new function named adjust_begin_end:

   static void    population_helper::adjust_begin_end(      int thread_id, size_t &begin, size_t &end ) {         size_t max_end = end;      size_t range_size = end - begin + 1;      size_t chunk_size = range_size / num_threads;      size_t remainder = range_size % num_threads;         if ( remainder >= thread_id ) {         begin = begin + (chunk_size + 1) * thread_id;         if (remainder)            end = begin + chunk_size + 1;         else            end = begin + chunk_size ;         if ( end > max_end ) end = max_end;      } else {           begin = begin + chunk_size * thread_id + remainder;           end = begin + chunk_size;           if ( end > max_end ) end = max_end;      }   }

This scheduling function breaks up the loop range among the worker threads based on their thread ids (0..num_threads). Each thread gets roughly (end—begin+1)/num_threads iterations. The above code is naïve; it does nothing to dynamically detect load imbalance or deal with data locality issues.

This threaded version of generate_children contains a significant amount of new code to implement parallelism. While most of that code is relatively straightforward, it doesn't preclude developers from adding bugs, and the function adjust_begin_end requires some thought.

Author's Note: In fact, our first implementation of adjust_begin_end did have a bug in it—we lost iterations!

Secondly, the total number of worker threads created and the scheduling of iterations are based on num_threads, which raises the issue of how to set num_threads to a good value. During development, we set it to 8, because we were developing on a system with eight cores. But of course, for a real-world application you aren't likely to know how many processors your user's machine has. You also can't know whether the users are willing to commit all of their system resources to this particular application—their systems may be already heavily loaded when this code is run. You can let num_threads be a runtime argument, but users are unlikely to know the appropriate value or even want to set the value. It's clear that setting the num_threads parameter is troublesome.

TBB Version of generate_children

The third version of generate_children uses Intel Threading Building Blocks (TBB) as its concurrency platform. Implementing a parallel version using TBB also requires three modifications to the serial code:

  1. Modify the main thread so it initializes a task_scheduler_init object.
  2. Package the loop for use by the TBB parallel_for template function.
  3. Invoke the parallel_for template function in place of the loop in generate_children.

To accomplish step one, create a task_scheduler_init object in the main function of tbb_ga.cpp. That initializes the TBB worker pool, which is managed by the library:

   tbb::task_scheduler_init init;
Author's Note: The preceding step is required for all applications that use TBB algorithms.

Next package the generate_children loop as a function object that defines a method operator()( const blocked_range & ) const. This method implements the loop, using the range object's begin and end methods to access the range assigned to this call by the TBB library:

   void    generate_children_body::operator() (        const tbb::blocked_range< int > &range ) const {       for ( int i = range.begin(); i != range.end(); ++i) {            my_individuals[population_size + i] =              individual ( select_parent().chromosome,                              select_parent().chromosome );       }   }

Finally, modify the generate_children function to use the TBB parallel_for template function. This template function accepts three arguments:

  1. A range object that represents the iteration space
  2. A function object that implements the loop body
  3. A partitioner object that provides a hint about how to divide up the iterations
   inline void    population::generate_children() {     tbb::parallel_for(tbb::blocked_range(0,           int(population_size) ),           generate_children_body( my_individuals ),          tbb::auto_partitioner() );   }

In the preceding function, the blocked_range object describes a one-dimensional iteration space of integers from 0 to population_size-1. The function object generate_children_body implements the loop body, whose function operator() was provided earlier. The third argument is an auto_partitioner object that informs the library to use its own policy to divide up the iteration space.

The TBB library automatically manages the concurrency, taking the provided range object and using its own auto_partitioner policy to create a number of independent subranges. The independent subranges will have the function object applied to them by whichever of the worker thread(s) that the TBB library deems most appropriate. So using TBB, you express the concurrency but avoid all the error-prone work of managing it.

Note that the TBB code eliminates the need to specify num_threads. This is a key point because it allows the runtime library to make intelligent decisions: It can decide how many threads it will create and how many threads will participate in this particular parallel algorithm. The auto_partitioner also lets it decide how to divide up the work among the participating threads to best balance load and maintain data locality. As new hardware platforms become available, your application source code can stay the same; the TBB library makes all of the necessary adjustments.

Parallelizing assign_fitness

After creating each generation, the sample application must assign a fitness value to each individual in that new generation. The serial implementation of assign_fitness in serial_ga.cpp uses std::for_each to iterate through the new children in the my_individuals vector to assign fitness values to them:

   inline void    population::assign_fitness() {     std::for_each( my_individuals.begin() + population_size,                     my_individuals.end(), set_fitness() );   }

The process to implement a natively threaded version of assign_fitness is similar to that you saw in the preceding section to implement the threaded version of generate_children, so you'll see a condensed version here. First, add code to create and join a set of native threads:

   inline void    population::assign_fittness() {     for ( int t = 0; t < num_threads; ++t ) {       handles[t] = _beginthread( &start_set_fitness, 0, (void *)t );     }     WaitForMultipleObjects( num_threads, (HANDLE *)handles,                              true, INFINITE );   }

Next, package the loop so you can pass it through the Windows threading API:

   inline void start_set_fitness( void *x ) {      population_helper::set_fitness( int(x) );   }

Finally, modify the loop code to use the same adjust_begin_end scheduling routine as generate_children:

   static void    population_helper::set_fitness( const int thread_id ) {     size_t begin = population_size;     size_t end = my_individuals->size();     adjust_begin_end( thread_id, begin, end );     for ( size_t i = begin; i < end; ++i ) {       (*my_individuals)[i].set_fitness();     }   }

The TBB implementation simply replaces std::for_each with tbb::parallel_for by using a blocked_range of iterators:

   inline void assign_fitness() {     tbb::parallel_for( tbb::blocked_range(                 my_individuals.begin() + population_size,                my_individuals.end() ),                set_fitness_body(),               tbb::auto_partitioner() );   }

Implementing set_fitness_body is straightforward:

   struct set_fitness_body {      void operator() (const tbb::blocked_range <       vector_type::iterator >       &range ) const {         for ( vector_type::iterator i = range.begin();             i != range.end(); ++i) {             i->set_fitness();         }      }   };

Again, the natively threaded code is more difficult to write, uses a naive scheduling policy and is tied to the use of num_threads. Of course it's possible to write a set of advanced support routines using native threads to do a better job of managing the concurrency—but if you did that, you'd be writing a concurrency platform instead of focusing on the application's features.

Parallelizing sort_population

The final call in the main generations loop of serial_ga.cpp calls sort_population. The serial implementation of sort_population is just a wrapper around std::sort:

   inline void sort_population() {     std::sort( my_individuals.begin(), my_individuals.end() );   }

Just as with sequential sorting, parallel sorting has been studied for years, and the literature contains many efficient implementations, each with different tradeoffs. To implement a parallel sort_population function using native threads, you could select and hand-code one of those known algorithms.

But before doing that, here's the code for a parallel version of sort_population using TBB:

   inline void sort_population() {     tbb::parallel_sort( my_individuals.begin(), my_individuals.end() );   }

As you can see TBB supplies a parallel sort, so all you need to do is replace the std::sort call in the serial version with the tbb::parallel_sort call. Even without completing the exercise to implement a parallel sort using native threads, it's safe to conclude it will require more code, will likely be more prone to bugs, and the resulting code will suffer from the same performance portability problems as the two previous native code snippets.

Use Your Concurrency Platform Wisely

As the examples in this article show, using a concurrency platform over native threads has several advantages:

  • You can focus more on application features.
  • It's easier to write the code.
  • The library manages the concurrency, allowing the same code to adapt to new hardware platforms (almost) for free as the numbers of cores increase, the memory hierarchies change, or even if architectures change in more radical ways.

But those advantages don't remove the requirement to use concurrency libraries wisely and they don't automatically guarantee future scalability on new platforms. Most libraries provide some features that are more future-proof than others. For example, the TBB library provides tbb::tbb_thread, which is just a thin wrapper around the native OS threading library. That means it's easy to misuse the interface, potentially creating the same types of problems that you saw with the natively threaded example.

To avoid that, here are a few recommended guidelines to consider when using a concurrency platform. Following these will help ensure that your code's performance can adapt to new architectures as they arrive:

  • Avoid using "threads" in your application directly. Use the higher-level parallel algorithms provided by the concurrency platform of your choice instead—or, if you need to create a new algorithm, use its more flexible tasking interfaces.
  • Avoid using thread ids to divide up the data items and/or distribute work. For example, the OpenMP API provides routines that return the executing thread's id as well as the total number of threads in use by the system. It's easy to fall into the trap of scheduling work yourself to improve efficiency by using these APIs. But as future machine architectures emerge, your hand-coded scheduling may no longer be the best. Leave it to the runtime library to make these decisions.
  • Avoid using persistent thread-local storage. Making specific threads hold specific information may force you into scheduling computations in a particular way. The OpenMP API provides a threadprivate directive and the open-source version of TBB provides tbb::enumerable_thread_specific. Both these features have safe uses, but use them sparingly (if at all) to avoid tying specific computations to specific threads.

As multicore architectures become increasingly available, getting increased performance out of these architectures requires some additional work from programmers. A wisely-chosen programming model can result in applications that continue to scale as platforms advance. The key design choice is to express the concurrency in an application without explicitly managing the scheduling of that concurrency onto the hardware. Using a concurrency platform allows such designs; the platform's high-level algorithms and tasking interfaces allow you to express parallelism while leaving the low-level scheduling decisions to the runtime libraries.

Share the Post:
Share on facebook
Share on twitter
Share on linkedin

Related Posts