TODAY'S HEADLINES  |   ARTICLE ARCHIVE  |   FORUMS  |   TIP BANK
 Specialized Dev Zones Research Center eBook Library .NET Java C++ Web Dev Architecture Database Security Open Source Enterprise Mobile Special Reports 10-Minute Solutions DevXtra Blogs Slideshow

# Getting Started with the .NET Task Parallel Library: Multi-Core Case Studies : Page 3

## There's more to any programming tool than just theory—and the TPL is no exception. Learn about the techniques Rod Stephens used and the problems he encountered in these parallelism case studies.

 by Rod Stephens
 Sep 29, 2008
 Page 3 of 3
Image Filtering
When you apply a filter to an image, you consider each pixel individually, multiplying the values of the pixels in an area surrounding that pixel by values in a matrix. You add up the results, possibly divide by a scale factor, and then use the final result to determine the new color for that pixel.

The values used in the filter's matrix determine the filter type. For example, some filters highlight edges, others make the image smoother, and others sharpen changes in color.

 Figure 4. Interesting Images: Program ParallelImageFilter makes an embossed image of my cat Merlin shredding the Rocky Mountains.
Figure 4 shows the sample program ParallelImageFilter after applying an embossing filter to a picture.

Like the Mandelbrot set, the filtering algorithm generates its image one pixel at a time, but the algorithms use the image's pixels differently. In the Mandelbrot set, each pixel's value depends only on its coordinates; the algorithm doesn't need to consider the neighboring pixels. In contrast, an image filtering algorithm must look at the neighboring pixels. The filtering algorithm doesn't need to modify the neighboring pixels, just to look at them. It sets only the color of the pixel it is generating at the time.

Because this program also needs to create or modify a bitmap, an approach similar to the one used by the ParallelMandelbrot program would work for this example as well—you could make an array containing the numeric red, green, and blue values for each pixel and then let each thread fill in a piece of the array.

However, ParallelImageFilter uses a different approach. Rather than creating the array of pixel values, it lets the Bitmap object do that, using its LockBits method, which copies the bitmap's pixel values into an array. Conversely, the UnlockBits method copies the values in an array back into the bitmap.

It turns out that those methods get and set pixel values much faster than external code. The ParallelMandelbrot program uses the Bitmap.SetPixel method to set pixel values. Even though that's a lot slower than UnlockBits, that program spends most of its time calculating pixel values, not setting them. Using UnlockBits would save a little time but overall, the process would still be slow. In contrast, image filtering requires relatively quick and simple calculations, so building the bitmap takes a large part of the total time used by the program. Therefore, using UnlockBits for image filtering is much more worthwhile. (You might find it an interesting exercise to convert the Mandelbrot program to use UnlockBits to compare the two approaches.)

The top button in Figure 4 makes ParallelImageFilter filter the image sequentially by using the Bitmap object's GetPixel and SetPixel methods. The middle button also makes the program filter the image sequentially, but it uses LockBits and UnlockBits instead. You can see from Figure 4 that it makes a huge difference.

When you click the bottom button, the program uses LockBits to make an array of bitmap data. It then divides the image into vertical strips and processes them in parallel. In this case, the difference in time is not as great as it is when moving from GetPixel/SetPixel to LockBits/UnlockBits, but it's still significant.

Parallelizing the Partition Problem
This program turned out to be the most interesting (in other words, most difficult and annoying) of the programs I converted with TPL.

The article Making Delicate Decisions explains how you can model many decisions as searches in a tree.

For example, in the Booty Division Problem (which more serious researchers who are less interested in pirates call the Partition Problem) you have a pile of treasure containing items of various values. You need to divide the treasure into two piles of equal value so you can split the booty evenly with your first mate. (A more generalized problem divides the treasure into more piles for other crewmembers but this version is complicated enough.)

You can model this problem as a binary tree. Each node in the tree corresponds to an item in the treasure. The left branch leaving that node corresponds to putting the item in the first pile and the right branch corresponds to putting the item in the second pile. A complete path through the tree represents assignments of every item to one of the two piles. Your job is to find a path that gives the two piles equal values.

If you only have a few items, the problem is easy, but the decision tree grows exponentially, so it's impossible to search the entire tree for big treasures. If the pile of booty contains N items, then the tree contains 2^N nodes. For example, if you have 100 items, you would need to examine more than 1.26E30 paths through the tree.

See the original article for more discussion, other decision tree problems, and a description of the original sequential program.

Because each path through the tree seems like a possible solution, it would make sense to try to search those paths in parallel. However, when I actually tried to do this, I encountered a couple of problems.

First, the original program uses an array to keep track of which path it is searching in the tree. The Kth entry in the array tells whether item K is in the first or second pile. That is similar to previous examples in this article where a program uses an array to keep track of what it is doing, and when you run across such problems, you can use the same solution described here: give each thread its own copy of the array.

The parallel version of the program builds the top part of the decision tree. It makes arrays representing each path through this partial tree and launches a thread to search further down the bigger tree from the current location.

For example, suppose the pile contains 10 items. Consider only the first two, giving you four possible combinations. If the items are numbered 0 and 1, then the first pile could contain any of the combinations {0, 1}, {0}, {1}, or {}. The second pile would contain the remaining items in each case. After generating these four combinations, the program would launch four threads to consider the remaining eight items in the treasure and see which produced the best division of the spoils.

 Figure 5. Puzzling Performance: In this version, the parallel versions of the Exhaustive and "Branch & Bound" (B & B) searches are slower than the sequential versions.
At first I tried using only two threads (I have a dual-core system). Oddly, the parallel version took a lot longer than the sequential version. At that point, I thought that perhaps I hadn't used enough threads so I rewrote the program so I could easily change the number of items that are pre-assigned, and thus change the number of threads.

But that didn't help. The program showed that the parallel and sequential versions were visiting roughly the same number of nodes (there's some difference in counting the pre-assigned nodes so they don't match exactly) but the discrepancy isn't nearly big enough to account for the difference shown in Figure 5.

The Difference column in Figure 5 shows the difference in value between the two piles of treasure. In this case, there is no perfect division of the treasure so the best you can do is make the piles differ by one doubloon.

The node counts in Figure 5 for the parallel and sequential version of the exhaustive and branch & bound searches are close enough to be correct but the parallel versions are still slower.

After many piratical oaths and threatening to keelhaul my laptop, I finally figured out what was happening. The code that examined the tree's nodes used a simple counter to keep track of the number of nodes visited. Because the nodes were visited by different threads, I knew that I couldn't just let them all update the counter value or they would cause race conditions, so (like a good little programmer) I used the Interlocked object. The following code shows how each thread updated the m_NodesVisited variable.

``````   Interlocked.Add(m_NodesVisited, 2)
``````
That is correct but behind the scenes the Interlocked.Add method must create a lock to coordinate with other threads. Normally, that wouldn't be a big deal, but if you look at Figure 5 you'll see that the exhaustive search visited more than 33 million nodes, meaning that the program was creating and freeing more than 33 million locks! While each lock is relatively quick, using locks that many times was killing the performance.

 Figure 6. Performance Perfected: By not tracking nodes visited, this version avoids millions of locks, and the parallel version out-performs the sequential version.
Figure 6 shows a new version of the program that doesn't track the number of nodes visited and thus doesn't need to create all of those millions of locks. In this version, the parallel algorithms beat the sequential versions handily.

As a wrap-up, my initial attempt at parallelizing the program was more or less correct. It split up the shared data so the threads wouldn't interfere with each other and it used Interlocked.Add to avoid race conditions. However, it took me a while to realize that the performance problems stemmed from creating so many locks. Completely removing the calculation that caused the potential race condition let me drop the locks, and now the parallel version provides a nice speed improvement.

A Few Things to Remember
As you can see from the case studies here, turning your sequential single-core code into multi-core capable code using the Task Parallel library can provide substantial performance improvements. But the conversion isn't quite as straightforward as you might wish. Here are some points to take away:

• Different threads can share variables safely as long as they don't try to read and write the same values. In the Mandelbrot and image filtering examples, threads write into different parts of the same arrays.
• Giving threads different copies of shared data is a useful technique for parallelizing complex algorithms. That technique worked well for the 2D Bin Packing, Traveling Salesperson, and the Booty Division problems. As long as each thread works on its own copy, they shouldn't interfere.
• Taking exceptional action to prevent race conditions can get you into trouble. Using Interlocked.Add properly prevents race conditions but, as I learned the hard way, too many locks can ruin performance. In such cases, look for other ways to prevent contention, such as completely removing the calculation.
This article showed several useful techniques for parallelizing complex algorithms. But even if you're currently developing applications for single-CPU systems, you might want to keep these techniques in mind as you write future applications. If you isolate the data used across passes through For loops, you'll be one step closer to making the changes you need to take advantage of the Task Parallel Library. Then, when you finally get that 64-core system for your birthday, you'll be ready to hit the ground running.

Rod Stephens is a consultant and author who has written more than a dozen books and two hundred magazine articles, mostly about Visual Basic. During his career he has worked on an eclectic assortment of applications for repair dispatch, fuel tax tracking, professional football training, wastewater treatment, geographic mapping, and ticket sales. His VB Helper web site receives more than 7 million hits per month and provides three newsletters and thousands of tips, tricks, and examples for Visual Basic programmers.
Email Author