RSS Feed
Download our iPhone app
Browse DevX
Sign up for e-mail newsletters from DevX


Plan for the Future: Express Parallelism, Don't Manage It  : Page 2

When adding parallelism, the key design choice is to express concurrency in an application without explicitly managing the scheduling of that concurrency onto the hardware.


Parallelizing generate_children

Before attempting to parallelize the generate_children function, it's worth studying the serial version briefly.

Serial Version of generate_children

The following code shows the implementation of generate_children in serial_ga.cpp:

   inline void 
   population::generate_children() {
     for ( int i = 0; i < population_size; ++i ) {
       my_individuals[population_size + i] = 
         individual ( select_parent().chromosome,
                      select_parent().chromosome );
Each individual has a bit vector (chromosome) that represents its genes. The preceding loop doubles the population of individuals by creating new individuals and adding them to the population (the space is reserved elsewhere to allow for the doubling of the population). Each loop iteration selects two parents and generates a new child. You can find code for both the selection and the generation of new individuals in the downloadable source but you don't need those for this discussion. For parallelization, it suffices to know that each loop iteration is independent from all others; therefore, this loop can be parallelized.

Natively Threaded Version of generate_children

You need three modifications to the serial code to implement a natively threaded version of this code:

  1. Package the loop so that it can be passed to a thread.
  2. Devise a scheduling scheme to break up the work between these threads.
  3. Add code to make the main thread wait for the worker threads to complete before continuing with the rest of the loop.
The method generate_children creates num_threads native threads using the Windows API call _beginthread. The arguments to this call include the function start_generate_children and t (the thread ID). Each _beginthread call creates a new thread that executes start_generate_children(t). The function generate_children ends by invoking the Windows API call WaitForMultipleObjects to join the threads:

   inline void 
   population::generate_children() {
     for ( int t = 0; t < num_threads; ++t ) {
       handles[t] = _beginthread(
       &start_generate_children, 0, (void *)t );
       num_threads, (HANDLE *)handles, 
       true, INFINITE );
The function start_generate_children is a wrapper that calls in to a population_helper helper class:

   inline void start_generate_children( void *x ) {
     population_helper::generate_children( int(x) );
In this natively threaded version, the real work of the loop has been moved to the function start_generate_children of this helper class:

   static void 
   population_helper::generate_children( const int thread_id ) {
     size_t begin = 0;
     size_t end = population_size;
     adjust_begin_end( thread_id, begin, end );
     for ( size_t i = begin; i < end; ++i ) {
       (*my_individuals)[population_size + i] = 
         individual (  select_parent().chromosome, 
                       select_parent().chromosome );
This code looks very similar to the serial code; there is a singly nested loop that repeatedly creates new individuals from randomly selected parents. The difference is the loop bounds, which are set by a new function named adjust_begin_end:

   static void 
      int thread_id, size_t &begin, size_t &end ) {
      size_t max_end = end;
      size_t range_size = end - begin + 1;
      size_t chunk_size = range_size / num_threads;
      size_t remainder = range_size % num_threads;
      if ( remainder >= thread_id ) {
         begin = begin + (chunk_size + 1) * thread_id;
         if (remainder)
            end = begin + chunk_size + 1;
            end = begin + chunk_size ;
         if ( end > max_end ) end = max_end;
      } else {
           begin = begin + chunk_size * thread_id + remainder;
           end = begin + chunk_size;
           if ( end > max_end ) end = max_end;
This scheduling function breaks up the loop range among the worker threads based on their thread ids (0..num_threads). Each thread gets roughly (end—begin+1)/num_threads iterations. The above code is naïve; it does nothing to dynamically detect load imbalance or deal with data locality issues.

This threaded version of generate_children contains a significant amount of new code to implement parallelism. While most of that code is relatively straightforward, it doesn't preclude developers from adding bugs, and the function adjust_begin_end requires some thought.

Author's Note: In fact, our first implementation of adjust_begin_end did have a bug in it—we lost iterations!

Secondly, the total number of worker threads created and the scheduling of iterations are based on num_threads, which raises the issue of how to set num_threads to a good value. During development, we set it to 8, because we were developing on a system with eight cores. But of course, for a real-world application you aren't likely to know how many processors your user's machine has. You also can't know whether the users are willing to commit all of their system resources to this particular application—their systems may be already heavily loaded when this code is run. You can let num_threads be a runtime argument, but users are unlikely to know the appropriate value or even want to set the value. It's clear that setting the num_threads parameter is troublesome.

TBB Version of generate_children

The third version of generate_children uses Intel Threading Building Blocks (TBB) as its concurrency platform. Implementing a parallel version using TBB also requires three modifications to the serial code:

  1. Modify the main thread so it initializes a task_scheduler_init object.
  2. Package the loop for use by the TBB parallel_for template function.
  3. Invoke the parallel_for template function in place of the loop in generate_children.
To accomplish step one, create a task_scheduler_init object in the main function of tbb_ga.cpp. That initializes the TBB worker pool, which is managed by the library:

   tbb::task_scheduler_init init;
Author's Note: The preceding step is required for all applications that use TBB algorithms.

Next package the generate_children loop as a function object that defines a method operator()( const blocked_range & ) const. This method implements the loop, using the range object's begin and end methods to access the range assigned to this call by the TBB library:

   generate_children_body::operator() ( 
       const tbb::blocked_range< int > &range ) const {
       for ( int i = range.begin(); i != range.end(); ++i) { 
           my_individuals[population_size + i] = 
             individual ( select_parent().chromosome, 
                             select_parent().chromosome );
Finally, modify the generate_children function to use the TBB parallel_for template function. This template function accepts three arguments:

  1. A range object that represents the iteration space
  2. A function object that implements the loop body
  3. A partitioner object that provides a hint about how to divide up the iterations
   inline void 
   population::generate_children() {
          int(population_size) ), 
          generate_children_body( my_individuals ),
          tbb::auto_partitioner() );
In the preceding function, the blocked_range object describes a one-dimensional iteration space of integers from 0 to population_size-1. The function object generate_children_body implements the loop body, whose function operator() was provided earlier. The third argument is an auto_partitioner object that informs the library to use its own policy to divide up the iteration space.

The TBB library automatically manages the concurrency, taking the provided range object and using its own auto_partitioner policy to create a number of independent subranges. The independent subranges will have the function object applied to them by whichever of the worker thread(s) that the TBB library deems most appropriate. So using TBB, you express the concurrency but avoid all the error-prone work of managing it.

Note that the TBB code eliminates the need to specify num_threads. This is a key point because it allows the runtime library to make intelligent decisions: It can decide how many threads it will create and how many threads will participate in this particular parallel algorithm. The auto_partitioner also lets it decide how to divide up the work among the participating threads to best balance load and maintain data locality. As new hardware platforms become available, your application source code can stay the same; the TBB library makes all of the necessary adjustments.

Close Icon
Thanks for your registration, follow us on our social networks to keep up-to-date