

Plan for the Future: Express Parallelism, Don't Manage It

When adding parallelism, the key design choice is to express concurrency in an application without explicitly managing the scheduling of that concurrency onto the hardware.

The days of large increases in clock frequency are behind us. It was great while it lasted, but it's now clear that we have finally hit a frequency wall. Contrary to popular opinion, that wall is due to heat and energy concerns, not the end of Moore's Law. Instead, the current trend is to use the additional transistors predicted by Moore's Law to provide multiple processing cores and to support multiple threads per core.

For developers, this change in architecture means there is extra performance potential available if needed, but it also means that developers must do some extra work to take advantage of it. But wait! Don't jump off the cliff unless you need to. Not every application needs to be parallel. If performance is not an issue for a particular application, then there's no need to add parallelism. Even if your application is single-threaded, users can still benefit, because their multicore platforms can execute multiple applications concurrently. However, if exploiting the available performance potential increases application usability or allows you to add new features, it makes sense to take the plunge into parallel programming.

There are many ways to add parallelism to an application. For example, it's very tempting to use the operating system's native threading library directly on a multicore platform, such as WinThreads on Windows or Posix Threads on Linux. But this choice is a mistake. You'll see why in the rest of this article. Instead, an important key to unlocking the performance potential of current and future multicore platforms is to choose a model that lets you express concurrency without requiring that you explicitly manage it. To do that, choose a high-level concurrency platform like Intel Threading Building Blocks (TBB), the OpenMP API, or any one of the many available data parallel languages (see the sidebar "Data Parallel Languages" for links).

By using these platforms, developers express the concurrency in their applications and let the runtime library manage the low-level, platform-specific details of implementing and scheduling concurrency on the hardware. That's because the implementation and scheduling details will need to change as architectures change. By using a concurrency platform rather than a native threading library, applications are generally easier to write, are more likely to be correct, and can even gain performance as platforms evolve.
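To see what "explicitly managing" concurrency means in practice, here is a minimal sketch of a hand-partitioned parallel sum. It is written with portable std::thread rather than the WinThreads API used by the article's threaded_ga.cpp, and the function name and thread count are illustrative assumptions; the point is how much scheduling the programmer must do by hand.

```cpp
#include <thread>
#include <vector>
#include <numeric>

// Manually managed parallelism: the programmer picks the thread count,
// partitions the range, and joins every thread explicitly. A concurrency
// platform such as TBB would make these scheduling decisions itself.
long parallel_sum(const std::vector<long>& data, unsigned num_threads) {
    std::vector<long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    std::size_t chunk = data.size() / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end =
            (t == num_threads - 1) ? data.size() : begin + chunk;
        workers.emplace_back([&, t, begin, end] {
            partial[t] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0L);
        });
    }
    for (auto& w : workers) w.join();  // explicit lifetime management
    return std::accumulate(partial.begin(), partial.end(), 0L);
}
```

Every line of bookkeeping here is code that must be revisited as core counts change, which is exactly the burden a concurrency platform takes off the developer.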

This article describes the process of adding parallelism to an example application using both a native threading library and the TBB library. You'll see that using native threading requires more code, is more difficult to write, has more room for errors, and is less able to adapt to new architectures as they arrive.

A Running Example

The article example is a small search application that uses a genetic algorithm. It's true that the program is somewhat contrived; it needed to be small but still present multiple opportunities for adding parallelism. You can find the complete set of programs in the accompanying downloadable source code, which contains three versions:

  • serial_ga.cpp: the serial version
  • threaded_ga.cpp: a version parallelized using WinThreads
  • tbb_ga.cpp: a version parallelized using TBB

In the example, each individual in a population has a bit vector of genes; each gene is a single bit. The goal is to find the fittest individual. The example code uses a trivial fitness function that simply compares a given individual to an ideal solution and returns the number of genes they share: a solution that shares more bits with the ideal is deemed fitter than one that shares fewer. A genetic algorithm is hardly the simplest way to search for a match against a known ideal, but this fitness function suffices for demonstration purposes.
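The fitness idea above can be sketched in a few lines. The names, the fixed gene count, and the std::bitset representation here are assumptions for illustration, not the code from the downloadable source:

```cpp
#include <bitset>
#include <cstddef>

// Hypothetical sketch: an individual's genes as a fixed-width bit vector.
const std::size_t num_genes = 32;
using genes = std::bitset<num_genes>;

// Fitness = number of genes shared with the ideal solution.
// XOR sets the bits that differ; the shared genes are the remainder.
std::size_t fitness(const genes& individual, const genes& ideal) {
    return num_genes - (individual ^ ideal).count();
}
```

An individual identical to the ideal scores num_genes, and every differing bit lowers the score by one.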

The main loop in serial_ga.cpp is as follows:

   population p(population_size);
   for (int g = 0; g < num_generations; ++g) {
     // loop body sketched from the description below; see serial_ga.cpp
     p.generate_children();
     p.assign_fitness();
     p.sort_population();
   }
The population evolves through num_generations generations. In each generation, the program selects individuals from the population using a tournament scheme and crosses them to generate new children (generate_children). It also randomly selects some children for mutation, and then evaluates the fitness of all children in that generation (assign_fitness). Finally, it sorts the entire population, both parents and children, so that the fittest individuals rise to the top and are subsequently more likely to reproduce in future generations (sort_population).
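The tournament-and-crossover step described above can be sketched as follows. This is a hypothetical illustration, not the generate_children code from serial_ga.cpp; the struct layout, the use of std::rand, and the single-point crossover are all assumptions:

```cpp
#include <vector>
#include <cstdlib>
#include <cstddef>
#include <algorithm>

// Hypothetical individual: a gene vector plus a cached fitness score.
struct individual {
    std::vector<int> genes_;
    int fitness_;
};

// Tournament selection: pick two individuals at random and keep the fitter.
const individual& tournament(const std::vector<individual>& pop) {
    const individual& a = pop[std::rand() % pop.size()];
    const individual& b = pop[std::rand() % pop.size()];
    return (a.fitness_ >= b.fitness_) ? a : b;
}

// Single-point crossover: the child takes the first half of one parent's
// genes and the second half of the other's.
individual cross(const individual& p1, const individual& p2) {
    individual child = p1;
    std::size_t cut = child.genes_.size() / 2;
    std::copy(p2.genes_.begin() + cut, p2.genes_.end(),
              child.genes_.begin() + cut);
    return child;
}
```

Because each child is produced independently of the others, this step is a natural candidate for parallelization, which is what the rest of the article explores.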

Next, you'll walk through each step in this loop to see how it might be parallelized, first using native WinThreads, and then the TBB library.
