Parallelizing sort_population
The last operation in the main generations loop of serial_ga.cpp is a call to sort_population. The serial implementation of sort_population is just a wrapper around std::sort:
    inline void sort_population() {
        std::sort( my_individuals.begin(), my_individuals.end() );
    }
Just as with sequential sorting, parallel sorting has been studied for years, and the literature contains many efficient implementations, each with different tradeoffs. To implement a parallel sort_population function using native threads, you could select and hand-code one of those known algorithms. But before doing that, here's the code for a parallel version of sort_population using TBB:
    inline void sort_population() {
        tbb::parallel_sort( my_individuals.begin(), my_individuals.end() );
    }
As you can see, TBB supplies a parallel sort, so all you need to do is replace the std::sort call in the serial version with a call to tbb::parallel_sort. Even without completing the exercise of implementing a parallel sort using native threads, it's safe to conclude that the hand-coded version would require more code, would likely be more prone to bugs, and would suffer from the same performance portability problems as the two previous native code snippets.
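To make the comparison concrete, here is a minimal sketch of what a hand-coded version might look like. It uses std::thread for brevity (a raw OS threading API would be longer still), and the native_parallel_sort name, the two-way split, and the 2048-element cutoff are illustrative choices rather than anything from the original program:

    // A minimal sketch, not a tuned implementation: recursively sort the
    // two halves of the range on separate threads, then merge the results.
    #include <algorithm>
    #include <thread>

    template <typename RandomIt>
    void native_parallel_sort( RandomIt first, RandomIt last ) {
        const auto n = last - first;
        if ( n < 2048 ) {              // illustrative cutoff: small ranges
            std::sort( first, last );  // aren't worth a thread
            return;
        }
        RandomIt mid = first + n / 2;
        std::thread left( [=] { native_parallel_sort( first, mid ); } );
        native_parallel_sort( mid, last );  // right half on this thread
        left.join();
        std::inplace_merge( first, mid, last );
    }

Even this toy version has to choose its own cutoff and split strategy, and it spawns a fresh thread at every level of recursion; tbb::parallel_sort makes those scheduling decisions through its runtime instead.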
Use Your Concurrency Platform Wisely
As the examples in this article show, using a concurrency platform over native threads has several advantages:
- You can focus more on application features.
- It's easier to write the code.
- The library manages the concurrency, allowing the same code to adapt to new hardware platforms (almost) for free as the number of cores increases, as memory hierarchies change, or even as architectures change in more radical ways.
But those advantages don't remove the requirement to use concurrency libraries wisely, and they don't automatically guarantee future scalability on new platforms. Most libraries provide some features that are more future-proof than others. For example, the TBB library provides tbb::tbb_thread, which is just a thin wrapper around the native OS threading library. That means the interface is easy to misuse, potentially creating the same types of problems that you saw in the natively threaded example.
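As an illustration of the kind of misuse to watch for, consider the following hedged sketch. The evaluate_range worker, the hand_scheduled_evaluation function, and the hard-coded thread count are all hypothetical, but the pattern (partitioning the work yourself across a fixed number of threads) is the one to avoid:

    // Hypothetical misuse: hand-scheduling work across a fixed thread count.
    // The hard-coded num_threads bakes today's hardware into the program.
    #include "tbb/tbb_thread.h"

    void evaluate_range( int begin, int end ) {
        // stand-in for evaluating individuals [begin, end) of the population
    }

    void hand_scheduled_evaluation( int population_size ) {
        const int num_threads = 4;  // assumes a four-core machine
        tbb::tbb_thread* workers[num_threads];
        const int chunk = population_size / num_threads;
        for ( int i = 0; i < num_threads; ++i ) {
            int begin = i * chunk;
            int end = ( i == num_threads - 1 ) ? population_size
                                               : begin + chunk;
            workers[i] = new tbb::tbb_thread( evaluate_range, begin, end );
        }
        for ( int i = 0; i < num_threads; ++i ) {
            workers[i]->join();
            delete workers[i];
        }
    }

The compiler accepts this just as readily as the high-level alternatives, but the thread count and the partitioning are now frozen into the source.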
To avoid that, here are a few guidelines to consider when using a concurrency platform. Following them will help ensure that your code's performance can adapt to new architectures as they arrive:
- Avoid using "threads" in your application directly. Use the higher-level parallel algorithms provided by the concurrency platform of your choice instead—or, if you need to create a new algorithm, use its more flexible tasking interfaces.
- Avoid using thread IDs to divide up data items or distribute work. For example, the OpenMP API provides routines (omp_get_thread_num and omp_get_num_threads) that return the executing thread's ID and the number of threads in the current team. It's easy to fall into the trap of using these to schedule work yourself in the name of efficiency, but as future machine architectures emerge, your hand-coded scheduling may no longer be the best. Leave those decisions to the runtime library (see the sketch after this list).
- Avoid using persistent thread-local storage. Making specific threads hold specific information may force you into scheduling computations in a particular way. The OpenMP API provides a threadprivate directive and the open-source version of TBB provides tbb::enumerable_thread_specific. Both these features have safe uses, but use them sparingly (if at all) to avoid tying specific computations to specific threads.
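As a sketch of the first two guidelines in TBB terms, here is what a fitness-evaluation loop might look like when the runtime, rather than thread IDs, divides the range. Individual, evaluate_fitness, and evaluate_population are stand-ins for the application's own names, and the code assumes a TBB version with the lambda-friendly parallel_for overload:

    // Instead of dividing my_individuals among threads by thread ID,
    // hand the whole range to tbb::parallel_for and let the scheduler
    // choose the grain size and which worker runs each piece.
    #include "tbb/parallel_for.h"
    #include "tbb/blocked_range.h"
    #include <vector>

    struct Individual {
        double fitness;
        // ... genome, etc.
    };

    double evaluate_fitness( const Individual& ind ) {
        return ind.fitness;  // stand-in for the real evaluation
    }

    void evaluate_population( std::vector<Individual>& my_individuals ) {
        tbb::parallel_for(
            tbb::blocked_range<size_t>( 0, my_individuals.size() ),
            [&]( const tbb::blocked_range<size_t>& r ) {
                for ( size_t i = r.begin(); i != r.end(); ++i )
                    my_individuals[i].fitness =
                        evaluate_fitness( my_individuals[i] );
            } );
    }

Because the code names no threads and no thread IDs, the same source can run on four cores today and forty tomorrow without modification.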
As multicore architectures become increasingly common, getting more performance out of them requires some additional work from programmers. A wisely chosen programming model can yield applications that continue to scale as platforms advance. The key design choice is to express the concurrency in an application without explicitly managing how that concurrency is scheduled onto the hardware. A concurrency platform enables such designs: its high-level algorithms and tasking interfaces let you express parallelism while leaving the low-level scheduling decisions to the runtime libraries.