Getting Started with the .NET Task Parallel Library: Multi-Core Case Studies

There's more to any programming tool than just theory—and the TPL is no exception. Learn about the techniques Rod Stephens used and the problems he encountered in these parallelism case studies.

The forerunner to this article, Getting Started with Task Parallel Library, explained how to use the Task Parallel Library (TPL) to launch swarms of tasks on separate threads that execute on all your computer's available CPUs. That article used basic, stripped-down examples to demonstrate TPL concepts. But as usual, there's a difference between theory and practice: while simple examples make the TPL portions of the code easier to understand, they're not very realistic.
As a bridge between learning and practice, this article describes a series of programs originally created as sequential programs, but converted to use the TPL. They demonstrate various techniques for preventing different threads from interfering with each other.

If you've read many of my DevX articles, then you know that I like algorithms. They're interesting, challenging, and can often produce impressive results that would be hard to produce by hand. Many interesting algorithms use big For loops, so I thought they would be good candidates for parallelization with Parallel.For.
Unfortunately, many of those algorithms also use data structures that have scope greater than the For loop. For example, each trip through the loop might refer to the same array of possible solutions or to the same bitmap.

Naively moving those data structures to a more global scope so all the threads could see them caused all manner of conflict and grief. In most cases, I needed to restructure the programs to allow the threads to access the data without interfering with each other.
The following sections describe the steps I took to parallelize some of the programs I've previously written for DevX and for other articles. They also describe the benefit I got on my dual-core system. Some of the programs are more amenable to particular approaches than others, so I tried a variety of methods. You can download the source code to follow along or test the various solutions on your own system.

Packing in Parallel
The article Solving the 2D Packing Problem contains the original, sequential code and a full explanation of the problem, but briefly, in Two-Dimensional Bin Packing (2BP), the goal is to pack a set of shapes (in this case, rectangles) in as little area as possible. For example, you might want to cut shapes out of cloth or a sheet of wood with as little waste as possible.
This is an extremely difficult problem to solve in general, and the time required grows exponentially with the number of shapes. In other words, adding just one more shape can increase the time needed to examine every possible solution by a factor of two or more. An exhaustive search takes so long that examining every possibility is practical only for relatively small problems.

Figure 1. Perfect Packing: A parallel exhaustive search finds the best possible solution for 2BP.

For larger problems, you need to use heuristics that are likely to give good solutions but that are not guaranteed to produce the best possible result. See the original article for more details on the problem and heuristic solutions.
Figure 1 shows example program Parallel2BP displaying one possible optimal solution for the rectangles shown. (Note that the solution may not be unique.)

To perform a sequential exhaustive search, the program tries placing the first rectangle in each of the positions where it will fit on the sheet of stock. First it places that rectangle in the upper left corner, then tries moving it one square right, then another square right, and so forth. When it reaches the right edge of the stock, the program moves it back to the left edge and down one row. It repeats these steps until it has considered every possible position for the first rectangle.
For each of those positions, the program recursively positions the remaining rectangles in every possible position. The program continues until it has tried every conceivable arrangement.
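The recursive search described above can be sketched roughly as follows. This is an illustrative reconstruction, not the article's actual code; the class, member names, and the unit-square occupancy grid are all assumptions.

```csharp
using System;

// Illustrative sketch of the sequential exhaustive 2BP search.
// Assumption: the stock is a grid of unit squares, StockW wide, at most MaxH tall.
class SequentialPacker
{
    int[] W, H;                   // rectangle widths and heights
    int[] X, Y;                   // current positions while searching
    int[] BestX, BestY;           // best arrangement found so far
    int BestHeight = int.MaxValue;
    bool[,] Used;                 // which unit squares of the stock are occupied
    int StockW, MaxH;

    public SequentialPacker(int stockW, int maxH, int[] w, int[] h)
    {
        StockW = stockW; MaxH = maxH; W = w; H = h;
        X = new int[w.Length]; Y = new int[w.Length];
        BestX = new int[w.Length]; BestY = new int[w.Length];
        Used = new bool[stockW, maxH];
    }

    bool FitsAt(int i, int x, int y)
    {
        for (int dx = 0; dx < W[i]; dx++)
            for (int dy = 0; dy < H[i]; dy++)
                if (Used[x + dx, y + dy]) return false;
        return true;
    }

    void Mark(int i, int x, int y, bool value)
    {
        for (int dx = 0; dx < W[i]; dx++)
            for (int dy = 0; dy < H[i]; dy++)
                Used[x + dx, y + dy] = value;
    }

    // Try every position for rectangle i, then recursively place the rest.
    public void PlaceRectangle(int i)
    {
        if (i == W.Length)
        {
            // All rectangles placed: record the solution if it's the best so far.
            int height = 0;
            for (int j = 0; j < W.Length; j++)
                height = Math.Max(height, Y[j] + H[j]);
            if (height < BestHeight)
            {
                BestHeight = height;
                X.CopyTo(BestX, 0);
                Y.CopyTo(BestY, 0);
            }
            return;
        }

        // Scan left to right, then back to the left edge and down one
        // row, exactly as the text describes.
        for (int y = 0; y <= MaxH - H[i]; y++)
            for (int x = 0; x <= StockW - W[i]; x++)
                if (FitsAt(i, x, y))
                {
                    X[i] = x; Y[i] = y;
                    Mark(i, x, y, true);      // place the rectangle
                    PlaceRectangle(i + 1);    // recursively place the rest
                    Mark(i, x, y, false);     // remove it and try the next spot
                }
    }
}
```

Note how the shared X, Y, and Used arrays carry state between recursive calls; that sharing is exactly what makes the naive parallel conversion troublesome.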

As it works, the program records each rectangle's position in an array. Unfortunately, when I tried to convert this program into a parallel version, that array posed a problem. Each combination being considered requires a different arrangement of values in the array, so if the program used different threads to consider different combinations, the array would be a huge source of contention.
To solve this problem, I used a trick that proved useful while parallelizing several other programs: I gave each thread a separate copy of the data structures used by the sequential version. In this program, each thread gets its own positions array. To let each thread find its copy, I built an array of arrays. The call to Parallel.For passes each thread its index into that array of arrays, so it knows which copy to use.

The parallel version of the algorithm makes a list of all of the possible positions where the first rectangle can fit. It then uses Parallel.For to explore each of the possible combinations, beginning with the first rectangle in each of its possible positions.
In the test shown in Figure 1, that first 6x6 rectangle could be put in 45 different positions (the program assumed that the solution would be at most 14 units tall, so the 6x6 rectangle could fit in 5 positions horizontally and 9 vertically). Therefore, the program used Parallel.For to run 45 tasks (not necessarily 45 simultaneous threads) exploring the combinations. After positioning the first rectangle, each task performs the rest of its search sequentially.
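Combining the per-thread-copy trick with this decomposition might look roughly like the sketch below. Again, the names are illustrative assumptions rather than the article's code; the point is that each Parallel.For iteration receives an index that selects both its starting position and its private arrays.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Illustrative sketch of the parallel decomposition the text describes:
// enumerate every legal position for the first rectangle, give each
// starting position private copies of the working arrays, and let
// Parallel.For search the remaining combinations sequentially.
class ParallelSearchSketch
{
    public static void Search(int stockW, int maxH, int[] w, int[] h)
    {
        // 1. List every position where the first rectangle fits on the
        //    empty stock (45 positions in the article's test run).
        var firstPositions = new List<(int x, int y)>();
        for (int y = 0; y <= maxH - h[0]; y++)
            for (int x = 0; x <= stockW - w[0]; x++)
                firstPositions.Add((x, y));

        // 2. Build the "array of arrays": one private set of position
        //    arrays per task, so the tasks never share mutable state.
        int n = firstPositions.Count;
        int[][] xs = new int[n][];
        int[][] ys = new int[n][];
        for (int i = 0; i < n; i++)
        {
            xs[i] = new int[w.Length];
            ys[i] = new int[w.Length];
        }

        // 3. Parallel.For passes each task its index i, which tells it
        //    which first position to use and which arrays belong to it.
        Parallel.For(0, n, i =>
        {
            (xs[i][0], ys[i][0]) = firstPositions[i];
            // Place rectangles 1..N-1 sequentially, touching only xs[i],
            // ys[i], and this task's own occupancy grid (not shown), then
            // report the best height found through a lock-protected
            // best-so-far variable.
        });
    }
}
```

Because every task writes only to its own pair of arrays, no locking is needed inside the hot search loop; synchronization is confined to the rare moments when a task finds a new best solution.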

The parallel and sequential algorithms found the same solution, but the sequential version took roughly 29.05 seconds while the parallel version took only about 16.28 seconds. That's a speedup of roughly 1.8 on a dual-core system, a pretty impressive improvement!
This is still a very difficult problem, however, so even cutting the run time nearly in half won't help much for bigger problems. For more rectangles, you'll still need to turn to heuristics. The heuristics described in the original article are very fast, so there's really no need to parallelize them.