he forerunner to this article, Getting Started with Task Parallel Library, explained how to use the Task Parallel Library (TPL) to launch swarms of tasks on separate threads that execute on all your computer’s available CPUs. That article used basic stripped-down examples to demonstrate TPL concepts. But as usual, there’s a difference between theory and practice. While simple examples make understanding the TPL parts of the code easier, they’re not very realistic.

As a bridge between learning and practice, this article describes a series of programs originally created as sequential programs, but converted to use the TPL. They demonstrate various techniques for preventing different threads from interfering with each other.

If you’ve read many of my DevX articles, then you know that I like algorithms. They’re interesting, challenging, and can often produce impressive results that would be hard to produce by hand. Many interesting algorithms use big For loops so I thought they would be good candidates for parallelization with Parallel.For.

Unfortunately, many of those algorithms also use data structures that have scope greater than the For loop. For example, each trip through the loop might refer to the same array of possible solutions or to the same bitmap.

Naively moving those data structures to a more global scope so all the threads could see them caused all manner of conflict and grief. In most cases, I needed to restructure the programs to allow the threads to access the data without interfering with each other.

The following sections describe the steps I took to parallelize some of the programs I’ve previously written for DevX and for other articles. They also describe the benefit I got on my dual-core system. Some of the programs are more amenable to particular approaches than others, so I tried a variety of methods. You can download the source code to follow along or test the various solutions on your own system.

**Packing in Parallel**

The article Solving the 2D Packing Problem contains the original, sequential code and a full explanation of the problem, but briefly, in Two-Dimensional Bin Packing (2BP), the goal is to pack a set of shapes (in this case, rectangles) in as little area as possible. For example, you might want to cut shapes out of cloth or a sheet of wood with as little waste as possible.

This is an extremely difficult problem to solve in general, and the time required grows exponentially depending on the number of shapes. In other words, if you add one more shape, the time needed to examine every possible solution may increase by a factor of two or more. Examining every possible solution takes so long that an exhaustive calculation of all the possibilities is possible only for relatively small problems.

? | |

Figure 1. Perfect Packing: A parallel exhaustive search finds the best possible solution for 2BP. |

For larger problems, you need to use heuristics that are *likely* to give good solutions but that are not *guaranteed* to produce the best possible result. See the original article for more details on the problem and heuristic solutions.

Figure 1 shows example program Parallel2BP displaying one possible optimal solution for the rectangles shown. (Note that the solution may not be unique.)

To perform a sequential exhaustive search, the program tries placing the first rectangle in each of the positions where it will fit on the sheet of stock. First it places that rectangle in the upper left corner, then tries moving it one square right, then another square right, and so forth. When it reaches the right edge of the stock, the program moves it back to the left edge and down one row. It repeats these steps until it has considered every possible position for the first rectangle.

For each of those positions, the program recursively positions the remaining rectangles in every possible position. The program continues until it has tried every conceivable arrangement.

As it works, the program records each rectangle’s position in an array. Unfortunately when I tried to convert this program into a parallel version, that array posed a problem. Each combination of positions required a different arrangement of positions in that array. If the program tried to use different threads to consider different combinations, that array would be a huge source of contention.

To solve this problem, I used a trick that I found useful while parallelizing several other programs: I gave each thread a separate copy of the data structures used by the sequential version. In this program, each thread gets its own array. To let each thread know where its array is, I built an array of arrays. The call to Parallel.For passes each thread its index in this two-dimensional array so it knows which piece it should use.

The parallel version of the algorithm makes a list of all of the possible positions where the first rectangle can fit. It then uses Parallel.For to explore each of the possible combinations, beginning with the first rectangle in each of its possible positions.

In the test shown in Figure 1, that first 6×6 rectangle could be put in 45 different positions (the program assumed that the solution would be at most 14 units tall, so the 6×6 rectangle could fit in 5 positions horizontally and 9 vertically). Therefore, the program used 45 threads to explore the combinations. After positioning the first rectangle, each thread performs the rest of its search sequentially.

The parallel and sequential algorithms found the same solution but the sequential algorithm took roughly 29.05 seconds while the parallel version took only about 16.28 seconds, a pretty impressive improvement!

This is still a very difficult problem however, so even decreasing the run time by 50 percent won’t help much for bigger problems. For more rectangles, you’ll still need to turn to heuristics. The heuristics described in the original article are very fast so there’s really no need to parallelize them.

**Traveling Parallel Paths**

The article Four Heuristic Solutions to the Traveling Salesperson Problem contains the original sequential code and a full explanation of the problem. Applying parallel techniques to this problem works quite well.

In the traveling salesperson problem (TSP), the goal is to find the shortest route that visits a set of points and then returns to its starting point. Figure 2 shows the sample program ParallelTSP displaying an optimal route found by a parallel exhaustive search. If you look at the numbers, you’ll see that the exhaustive and parallel exhaustive searches both found an optimal solution with the same total route length, but the parallel search took a lot less time. (Notice also that the “Two-Opt” heuristic got lucky and found an optimal solution in a *lot* less time.)

? | |

Figure 2. Remarkable Routing: Program ParallelTSP finds the shortest route that visits a set of points and returns to its starting point. |

Like 2BP, the exhaustive search for TSP uses an array to keep track of test solutions. To build the parallel TSP algorithm, I used an approach similar to the one I used for 2BP, giving each thread its own copy of the array.

In TSP, you can pick any point as the route’s starting point and the parallel search then considers every other point as the second point to visit. For each choice of second point, the program used a separate thread to look for good solutions assuming the assignments for the first and second points. If the points are numbered 1, 2, 3, and so forth, then the threads looked for solutions starting with 1-2, 1-3, 1-4, 1-5, and so forth.

Although the parallel version of this program is much faster than the sequential version, the time needed still grows exponentially as the problem size grows. If you add a few more points to the problem, you won’t be able to perform a parallel exhaustive search even if you have a fast quad-core computer.

As is the case with 2BP, for larger problems you’ll need to turn to heuristics. Fortunately the heuristics for TSP are reasonably good, so you can get reasonable solutions for fairly large problems.

The heuristics for TSP are random. They pick a random solution and then try to improve it by making small changes until they decide they have gotten a reasonably good solution. They then start over with a new random solution and improve that one. The process continues until you decide you’re not going to get a better solution in a reasonable amount of time.

These sorts of random algorithms are ideal for parallelization. Each random solution is independent of the others so different threads can work on different possible solutions without relying on shared data structures. For example, the program could launch a series of calls to the Two-Opt or Mix heuristics on separate threads. For big problems, that would allow you to find better solutions in a reasonable amount of time.

**Multi-Core Mandelbrot Sets**

To draw a Mandelbrot set, you consider each pixel in the image separately. Each pixel corresponds to a point in the complex plane C = x + y * i. To find a pixel’s color, the program repeatedly iterates this equation:

` Z(n) = Z(n-1) * Z(n-1) + C`

In that equation, Z(0) = 0 + 0 * i and C is the value x + y * i corresponding to the pixel at position (x, y). The program iterates this equation until it reaches some maximum number of iterations (the sample program uses 64 by default) or the magnitude |Z(n)| starts heading towards infinity.

At that point, the program uses the number of steps it took |Z(n)| to exceed 2 to determine the pixel’s color. For example, if |Z(n)| exceeds 2 after 30 iterations and you are drawing the Mandelbrot set with 16 colors, then you would assign this pixel color number 30 mod 8, or 6.

Figure 3 shows example program ParallelMandelbrot in action.

? | |

Figure 3. Marvelous Mandelbrot: Program ParallelMandelbrot generates the Mandelbrot set. |

You can find some other Mandelbrot programs on my VB Helper Web site. Go to the search page and look for “Mandelbrot.” With a little effort, you can also find other Mandelbrot programs scattered all over the Web.

Because the program calculates each pixel’s color independently, generating a Mandelbrot set seems like an ideal candidate for parallelization. You can simply divide the image up into pieces and build each piece in a different thread. Each pixel’s value is independent on the values of the others, so the threads should not interfere with each other.

Unfortunately there’s a big problem: to draw the pieces, every thread must access the same bitmap. You might think they would not interfere, because the threads would be writing to separate pieces of the image, but sadly, the Bitmap class is not thread-safe, so accessing it with multiple threads does cause problems.

You could use a lock to limit access to the bitmap to just one thread at a time?but that would require a lot of locks. The 600×600 pixel image in Figure 3 contains 36,000 pixels, meaning the program would have to create and release 36,000 locks, and that would hurt performance.

Alternatively, you could try chopping the image into separate bitmaps and making each thread build one bitmap. But there’s a lot of overhead in chopping the image apart and merging the results back together. I tried this approach and the simplest implementation was slower than the sequential program.

Rather than following either of these approaches, I built an array of integers to hold the pixels’ iteration counts. Each thread fills in the values for its pixels without interfering with the other threads. When all the threads finish, the main program uses the iteration counts to set the pixels’ colors in the bitmap.

This parallel version of the algorithm divides the image into one-pixel wide vertical stripes, generating each in a separate thread. The image in Figure 3 is 600 pixels wide, so the program builds it with 600 threads.

Note that it takes much longer to generate some parts of the Mandelbrot set than others. The dark green pixels on the left edge of the figure head towards infinity quickly, but the red pixels in the middle of the figure have still not headed towards infinity even after 512 iterations.

Because some strips take longer to generate than others, it’s a good thing the program uses a lot of threads. If I had divided the image into big chunks, for example, it’s possible that a thread might be assigned to a chunk that took a very long time to build. In the worst case, other threads might finish their chunks first and the program would be left running only the slow thread on one CPU, while other CPUs were idle. Chopping the image into lots of smaller pieces reduces the time to build any one piece, so that approach guarantees that all CPUs will be kept relatively busy.

However, it’s also important not to break the work into too many threads. Generating each pixel’s image in a separate thread would require a lot more overhead and would probably hurt performance. The lesson here is that dividing a task increasingly finely reaches a point of diminishing returns.

For any particular task, you may need to do some experimentation to find the right level of parallelism. Keep in mind that you may someday want to run the program on the latest 64-core system, so you may want to use some extra threads. Alternatively, you may want to make the number of threads easy to change so you can increase the number when you get hardware with more cores. (The ParallelBootyDivider program described later uses that approach.)

As the title bar in Figure 3 shows, it took the program 0.27 seconds to generate the image. The sequential version took 0.43 seconds?almost twice as long. You can press the P and S keys to switch between parallel and sequential calculations.

**Image Filtering**

When you apply a filter to an image, you consider each pixel individually, multiplying the values of the pixels in an area surrounding that pixel by values in a matrix. You add up the results, possibly divide by a scale factor, and then use the final result to determine the new color for that pixel.

The values used in the filter’s matrix determine the filter type. For example, some filters highlight edges, others make the image smoother, and others sharpen changes in color.

? | |

Figure 4. Interesting Images: Program ParallelImageFilter makes an embossed image of my cat Merlin shredding the Rocky Mountains. |

Figure 4 shows the sample program ParallelImageFilter after applying an embossing filter to a picture.

Like the Mandelbrot set, the filtering algorithm generates its image one pixel at a time, but the algorithms use the image’s pixels differently. In the Mandelbrot set, each pixel’s value depends only on its coordinates; the algorithm doesn’t need to consider the neighboring pixels. In contrast, an image filtering algorithm must look at the neighboring pixels. The filtering algorithm doesn’t need to modify the neighboring pixels, just to look at them. It sets only the color of the pixel it is generating at the time.

Because this program also needs to create or modify a bitmap, an approach similar to the one used by the ParallelMandelbrot program would work for this example as well?you could make an array containing the numeric red, green, and blue values for each pixel and then let each thread fill in a piece of the array.

However, ParallelImageFilter uses a different approach. Rather than creating the array of pixel values, it lets the Bitmap object do that, using its LockBits method, which copies the bitmap’s pixel values into an array. Conversely, the UnlockBits method copies the values in an array back into the bitmap.

It turns out that those methods get and set pixel values much faster than external code. The ParallelMandelbrot program uses the Bitmap.SetPixel method to set pixel values. Even though that’s a lot slower than UnlockBits, that program spends most of its time *calculating* pixel values, not setting them. Using UnlockBits would save a little time but overall, the process would still be slow. In contrast, image filtering requires relatively quick and simple calculations, so building the bitmap takes a large part of the total time used by the program. Therefore, using UnlockBits for image filtering is much more worthwhile. (You might find it an interesting exercise to convert the Mandelbrot program to use UnlockBits to compare the two approaches.)

The top button in Figure 4 makes ParallelImageFilter filter the image sequentially by using the Bitmap object’s GetPixel and SetPixel methods. The middle button also makes the program filter the image sequentially, but it uses LockBits and UnlockBits instead. You can see from Figure 4 that it makes a huge difference.

When you click the bottom button, the program uses LockBits to make an array of bitmap data. It then divides the image into vertical strips and processes them in parallel. In this case, the difference in time is not as great as it is when moving from GetPixel/SetPixel to LockBits/UnlockBits, but it’s still significant.

**Parallelizing the Partition Problem**

This program turned out to be the most interesting (in other words, most difficult and annoying) of the programs I converted with TPL.

The article Making Delicate Decisions explains how you can model many decisions as searches in a tree.

For example, in the Booty Division Problem (which more serious researchers who are less interested in pirates call the Partition Problem) you have a pile of treasure containing items of various values. You need to divide the treasure into two piles of equal value so you can split the booty evenly with your first mate. (A more generalized problem divides the treasure into more piles for other crewmembers but this version is complicated enough.)

You can model this problem as a binary tree. Each node in the tree corresponds to an item in the treasure. The left branch leaving that node corresponds to putting the item in the first pile and the right branch corresponds to putting the item in the second pile. A complete path through the tree represents assignments of every item to one of the two piles. Your job is to find a path that gives the two piles equal values.

If you only have a few items, the problem is easy, but the decision tree grows exponentially, so it’s impossible to search the entire tree for big treasures. If the pile of booty contains N items, then the tree contains 2^N nodes. For example, if you have 100 items, you would need to examine more than 1.26E30 paths through the tree.

See the original article for more discussion, other decision tree problems, and a description of the original sequential program.

Because each path through the tree seems like a possible solution, it would make sense to try to search those paths in parallel. However, when I actually tried to do this, I encountered a couple of problems.

First, the original program uses an array to keep track of which path it is searching in the tree. The Kth entry in the array tells whether item K is in the first or second pile. That is similar to previous examples in this article where a program uses an array to keep track of what it is doing, and when you run across such problems, you can use the same solution described here: give each thread its own copy of the array.

The parallel version of the program builds the top part of the decision tree. It makes arrays representing each path through this partial tree and launches a thread to search further down the bigger tree from the current location.

For example, suppose the pile contains 10 items. Consider only the first two, giving you four possible combinations. If the items are numbered 0 and 1, then the first pile could contain any of the combinations {0, 1}, {0}, {1}, or {}. The second pile would contain the remaining items in each case. After generating these four combinations, the program would launch four threads to consider the remaining eight items in the treasure and see which produced the best division of the spoils.

? | |

Figure 5. Puzzling Performance: In this version, the parallel versions of the Exhaustive and “Branch & Bound” (B & B) searches are slower than the sequential versions. |

At first I tried using only two threads (I have a dual-core system). Oddly, the parallel version took a lot longer than the sequential version. At that point, I thought that perhaps I hadn’t used enough threads so I rewrote the program so I could easily change the number of items that are pre-assigned, and thus change the number of threads.

But that didn’t help. The program showed that the parallel and sequential versions were visiting roughly the same number of nodes (there’s some difference in counting the pre-assigned nodes so they don’t match exactly) but the discrepancy isn’t nearly big enough to account for the difference shown in Figure 5.

The Difference column in Figure 5 shows the difference in value between the two piles of treasure. In this case, there is no perfect division of the treasure so the best you can do is make the piles differ by one doubloon.

The node counts in Figure 5 for the parallel and sequential version of the exhaustive and branch & bound searches are close enough to be correct but the parallel versions are still slower.

After many piratical oaths and threatening to keelhaul my laptop, I finally figured out what was happening. The code that examined the tree’s nodes used a simple counter to keep track of the number of nodes visited. Because the nodes were visited by different threads, I knew that I couldn’t just let them all update the counter value or they would cause race conditions, so (like a good little programmer) I used the Interlocked object. The following code shows how each thread updated the m_NodesVisited variable.

` Interlocked.Add(m_NodesVisited, 2)`

That is correct but behind the scenes the Interlocked.Add method must create a lock to coordinate with other threads. Normally, that wouldn’t be a big deal, but if you look at Figure 5 you’ll see that the exhaustive search visited more than 33 million nodes, meaning that the program was creating and freeing more than 33 million locks! While each lock is relatively quick, using locks that many times was killing the performance.

? | |

Figure 6. Performance Perfected: By not tracking nodes visited, this version avoids millions of locks, and the parallel version out-performs the sequential version. |

Figure 6 shows a new version of the program that doesn’t track the number of nodes visited and thus doesn’t need to create all of those millions of locks. In this version, the parallel algorithms beat the sequential versions handily.

As a wrap-up, my initial attempt at parallelizing the program was more or less correct. It split up the shared data so the threads wouldn’t interfere with each other and it used Interlocked.Add to avoid race conditions. However, it took me a while to realize that the performance problems stemmed from creating so many locks. Completely removing the calculation that caused the potential race condition let me drop the locks, and now the parallel version provides a nice speed improvement.

**A Few Things to Remember**

As you can see from the case studies here, turning your sequential single-core code into multi-core capable code using the Task Parallel library can provide substantial performance improvements. But the conversion isn’t quite as straightforward as you might wish. Here are some points to take away:

- Different threads can share variables safely as long as they don’t try to read and write the same values. In the Mandelbrot and image filtering examples, threads write into different parts of the same arrays.
- Giving threads different copies of shared data is a useful technique for parallelizing complex algorithms. That technique worked well for the 2D Bin Packing, Traveling Salesperson, and the Booty Division problems. As long as each thread works on its own copy, they shouldn’t interfere.
- Taking exceptional action to prevent race conditions can get you into trouble. Using Interlocked.Add properly prevents race conditions but, as I learned the hard way, too many locks can ruin performance. In such cases, look for other ways to prevent contention, such as completely removing the calculation.

This article showed several useful techniques for parallelizing complex algorithms. But even if you’re currently developing applications for single-CPU systems, you might want to keep these techniques in mind as you write future applications. If you isolate the data used across passes through For loops, you’ll be one step closer to making the changes you need to take advantage of the Task Parallel Library. Then, when you finally get that 64-core system for your birthday, you’ll be ready to hit the ground running.