When your only tool is a hammer, every problem looks like a nail, the old saw goes. Threading may be the most viable, if unwieldy, solution today for maximizing multi-core performance, but in some cases the processing problem really is a nail to threading's hammer. Take Vegas Pro 8, the award-winning nonlinear video editor released this month by Sony Creative Software, a subsidiary of Sony Corporation of America (www.sonycreativesoftware.com). Equipped with a new 32-bit floating point video engine, the fully Windows Vista-compatible software delivers broadcast-ready, high-definition content.
John Freeborg, director of engineering at the Madison, Wisc.-based company, and Dennis Adams, the software development lead on Vegas Pro, were more than willing to share their threading insights with DevX. Freeborg, who has a BS in computer science and math from the University of Wisconsin at Madison, and an MBA from Edgewood College, has been with Sony for six years; Adams has been with the company since 1999, and specializes in video processing with a background in computer graphics.
DevX: What's your experience with multithreading and concurrency?
Dennis: The audio engine in Vegas Pro has always been multithreaded, but in the early days, customers could only take advantage of that if they bought some heavy iron with dual processors. Then we went through the hyperthreading years and we started to see that this could be big. 'Round about version 5—three years ago—we multithreaded the video engine, which provided big wins on video processing. And then about a year ago we began working with Intel application engineers, exploring the OpenMP library. That helped us add some fine-grained multithreading which got us a lot of wins, such as improved playback rates.
DevX: Has this given you a competitive advantage?
John: We've always been a software-only nonlinear editor, and some of our competition would bundle themselves to a video or add-in card. We always focused on the concept that the processors were going to get a lot faster and instead spent time optimizing the application architecture with multithreaded performance in mind. As a result, we don't have legacy support problems, and that's paid off for us as well. We've also been really attractive for laptop use in the field because of our lightweight install, but in the past you always had that tradeoff of poor battery life and processor speed. Now even laptops have multi-core chips.
DevX: Is the problem your software solves particularly suited to concurrency?
Dennis: We count ourselves extremely lucky. The "single instruction, multiple data" type of problem we face plays into SSE and OpenMP really well. It spreads well across multiple cores with OpenMP, or across vector registers with SSE.
DevX: What's your approach to threading?
John: We use OpenMP as well as coarse-grained, or very manual, threading—directly starting threads for various things. We'll also start up OpenMP and see its effect, pro or con, on the program. I know Intel supports OpenMP and developed their Threading Building Blocks, which we've just started looking at. We were skeptical at first, but we've seen a big improvement.
DevX: Such as?
Dennis: Playback rates improved, as we mentioned. Also, some of this multithreading enabled new features in the latest version, such as a multicamera editing mode. You can shoot an event with multiple cameras and then when you put it on the timeline, say you have four cameras, you can see them as tiles in a two-by-two grid. You're effectively calling camera switches during playback with the mouse or keyboard and when you stop, Vegas Pro has adjusted the project accordingly. The multithreading allows us to play back multiple video streams and scale them with enough performance to enable the feature.
Also, the video engine has an option to run in 32-bit floating point mode. That means processing four times the bandwidth of the 8-bit integer mode. If it weren't for the threading, I don't think we could have those features.
DevX: How do you decide what to thread?
John: We developed some workloads that we could run repetitively, allowing us to regression test with different mixes of OpenMP parameters to get a nice adaptive approach to threading. OpenMP has a thread pool that essentially spreads the tasks that you tell it to across the number of available cores. In order to fine tune, we used Intel VTune and the Intel Thread Profiler quite a bit.
Dennis: Based on that profiling, we ID'd some hot spots, like audio and video digital signal processing. You look at the code, and if it doesn't have a lot of dependencies, you can literally sprinkle in #pragma omp parallel for [clauses].
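As a rough illustration of the kind of loop Dennis describes, here is a minimal sketch. It is not Sony's code: the function name `apply_gain` and the choice of an audio gain stage are assumptions made for the example. Each iteration reads and writes only its own element, so there are no dependencies between iterations and the pragma can be added without further changes.

```cpp
#include <vector>

// Hypothetical hot spot: scale every audio sample by a gain factor.
// Iteration i touches only samples[i], so iterations are independent.
void apply_gain(std::vector<float>& samples, float gain) {
    // A signed loop index keeps the loop valid for older OpenMP versions.
    const long n = static_cast<long>(samples.size());

    // Splits the iteration range across the threads in OpenMP's pool.
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        samples[i] *= gain;
    }
}
```

Built with OpenMP enabled (for example `-fopenmp` on GCC or Clang, `/openmp` on MSVC), the loop is divided among threads. Without that flag the pragma is ignored and the same code runs serially, which makes it easy to compare the two builds.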
DevX: What can go wrong in such a scenario?
Dennis: You need to check the loop and see if there's any shared data, and if so, you need to be a lot more careful about what the loop is doing. In graphics processing, it's common to loop over a bunch of video data. Say you have a pair of nested for loops, for y and for x. In typical C, you might initialize a pointer outside that loop and increment it inside the loop. It turns out that pointer is now really a piece of shared data. The very easy solution is, don't initialize outside the loop.
Inside the loop, for each scan line you process, recompute the pointer. Now it's an index to a scan line and no longer a contended, shared variable.
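The pointer fix Dennis describes might look like the sketch below. The function name `invert_image`, the single-channel 8-bit layout, and the `stride` parameter (bytes between the starts of consecutive scan lines) are assumptions for illustration, not details from the interview.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Inverts an 8-bit, single-channel image in place (hypothetical example).
// The racy pattern would be: declare `p = pixels.data()` before the y loop
// and do `*p++ = ...` inside it, so every thread advances the same pointer.
void invert_image(std::vector<std::uint8_t>& pixels,
                  int width, int height, int stride) {
    assert(width >= 0 && height >= 0 && stride >= width);
    assert(pixels.size() >= static_cast<std::size_t>(stride) * height);

    std::uint8_t* const base = pixels.data();  // read-only, safe to share

    #pragma omp parallel for
    for (int y = 0; y < height; ++y) {
        // Recomputed for each scan line and declared inside the loop body,
        // so every iteration (and therefore every thread) has its own copy.
        std::uint8_t* row = base + static_cast<std::size_t>(y) * stride;
        for (int x = 0; x < width; ++x) {
            row[x] = static_cast<std::uint8_t>(255 - row[x]);
        }
    }
}
```

Because `row` is derived from `y` alone, no iteration depends on where a previous one left off, and any padding bytes between `width` and `stride` are left untouched.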
DevX: One argument Herb Sutter and others have made is that threading, as you're doing, is fine for a small number of cores but will become totally impractical once you get to tens of cores. Do you agree?
Dennis: No. I think the number of cores is irrelevant. Our multi-core threaded engine has a certain amount of state data, which limits it. It took us no additional work to go from two to four cores, but because of state data (memory addressing and other resources), we can't easily go to eight cores with a 32-bit operating system. However, with fine-grained parallelism, which has less state data, you are far less limited in the number of cores you can scale to.
I think there's an ultimate scaling runoff that has to do with any shared resources, but I don't see 16 or 32 cores as being an insurmountable issue.
John: Maybe with 32 and beyond, I think the scaling issues do become a challenge as the synchronization overhead starts to become significant. I would expect cache invalidation across lots of cores to become a real issue with memory access, and we may have to figure out other ways to partition the problem space to be more NUMA-friendly at that point.
DevX: Is learning to deal with concurrency too complex for most developers?
John: It is a new level of requirement for an engineer to understand concurrency, but people kind of get used to it and move forward. It's like moving from functional to object-oriented programming. You think you're kind of approaching it, and then you reach these little epiphany steps.
Whenever you're looking at code, you're always thinking, "Which thread is this running on?"—at least with the coarse-grained stuff. OpenMP is another kind of tool in the tool bag.
I used to think the neat part of threads was the launching of them and what they're working on. Now I realize it's all about the synchronization points, and when do you have to have a thread gracefully kill itself.
Dennis: Having learned programming from a linear functions and object-oriented point of view, you approach problems in that linear fashion. Gaining insight into how a problem should be divided up is definitely a skill that you have to learn. You can certainly write a lot of bad multithreading code that doesn't understand communication, passing messages, stuffing values into shared resources.
DevX: Do you use many locks and critical sections to avoid race conditions and other bugs?
John: We try to minimize locks and critical sections as much as we can. Clearly, for some of our operations we require them. The locking problem is 90% design. You can think through how much contention there will be. But from an OpenMP standpoint, you don't need locks.
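One common way an OpenMP loop avoids an explicit lock, even when it accumulates into a single value, is the `reduction` clause. The sketch below is an illustration under that assumption, not a description of what Vegas Pro does; `sum_pixels` is a hypothetical helper.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: total of all 8-bit pixel values in a buffer.
long long sum_pixels(const std::vector<std::uint8_t>& pixels) {
    long long total = 0;
    const long n = static_cast<long>(pixels.size());

    // Each thread accumulates into its own private copy of `total`;
    // OpenMP combines the copies once at the end of the loop, so the
    // loop body needs no lock or critical section.
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; ++i) {
        total += pixels[i];
    }
    return total;
}
```

Integer addition is used here so the result is identical whether the loop runs serially or in parallel. With floating point sums, the order of combination can change the rounding slightly.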
DevX: Speaking of bugs, how do you find them?
John: With the coarse-grained threading, whenever you're writing or debugging you always keep in mind which code is running on which threads. Are the threads mindless or purposeful? In the debugger there's a thread window—it's getting used to what a deadlock looks like in the window, and examining the different threads to see where the problem is.
Dennis: Most of the time in a deadlock, they're stuck waiting for a resource. I usually grab a sheet of paper and write down what all those resources are. The point of contact is the area that requires your focus.
We also have to be disciplined about shared memory among threads. It is a difficult problem because the bugs you have are not simple; they're only going to hit you occasionally.
DevX: Could someone take the C++ Standard Template Library and parallelize it?
Dennis: You could think of syntax to do that. That's sort of equivalent to putting OpenMP on all those functions.
John: I think Threading Building Blocks has something like that.
Dennis: But most of the interesting things that you do in a program are not found in the STL.
John: It would be like parallelizing cout or printf; you're not going to be spending much time there.
Dennis: The STL does have a rich set of containers and iterators. A way to make those parallelizable would be interesting, but you could also just wrap that with OpenMP.
John: Some of those sorting or searching algorithms could be interesting.
DevX: To ensure thread-safety, have you had to code lockless queues, where the writing thread modifies the head pointer and the reading thread modifies the tail?
Dennis: There's still a lock there, even if it's buried under three levels of abstraction. Your atomic memory update in this case is a lock, but it's extremely lightweight. Some of our threads communicate through queues, but by using standard operating system locks such as critical sections.
John: I wouldn't go to that fine-tuning expense unless your application requires it. Clearly you're into assembly language tuning there. The memory model among different processors can get a little dicey. If you get it wrong, there are huge consequences.
Dennis: A lockless queue is important where every single element you process goes through that. In the cases where we use coarse threading, the units of work are large enough that the communication between the threads isn't a hot spot. We don't, but we could choose the slowest possible lock and it wouldn't affect the performance.
We mainly use the synchronization mechanisms that are built into the OS, like message queues and critical sections.
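A queue guarded by an ordinary lock, as opposed to a lockless one, can be sketched in portable C++ as follows. This is an assumption-laden stand-in: it uses `std::mutex` and `std::condition_variable` rather than the operating system critical sections and message queues mentioned above, and the class name `WorkQueue` is hypothetical.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

// Hypothetical coarse-grained work queue: one lock protects the whole queue.
template <typename T>
class WorkQueue {
public:
    // Called by a producer thread to hand over a unit of work.
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(item));
        }
        ready_.notify_one();  // wake one waiting consumer, if any
    }

    // Called by a consumer thread; blocks until an item is available.
    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty(); });
        T item = std::move(items_.front());
        items_.pop();
        return item;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> items_;
};
```

When each item represents a large unit of work, such as a whole frame, the time spent holding the lock is tiny compared with the processing itself, which is the situation Dennis describes.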
DevX: Now that you've been using it for a while, what would you say is the biggest problem with OpenMP?
Dennis: OpenMP's biggest problem is they have a 250-page spec. If you're doing protein folding or climate modeling, you need to read that whole spec. But if not, you can learn very little of OpenMP and get significant wins.