Think about parallelism in desktop software, and you should be thinking small: small systems, of course, but also the small intervals of time you can run in parallel between user interactions, a small number of cores (for now), and, in most cases, a small threaded codebase. Some of these "small" factors are the reasons that multi-core processors will make an even more decisive impact on client and consumer applications than they have on server software. On desktop and laptop systems, the move to parallel software is just getting underway, so the opportunity to differentiate with parallel performance has never been greater. And if you can keep it scalable, your application can continue to dazzle users as these machines track new processor generations with more cores per package.
From a half-empty perspective, you could say the opportunity for parallel improvement is so great because desktop teams are starting with so little. In comparison to HPC and server software, few client or consumer applications are threaded. Those that are threaded are most commonly threaded by task, with tasks such as communicating with a long-running server process, loading graphics textures, or handling I/O relegated to a background thread to keep the main thread from blocking. That is not the kind of threading that unlocks scalable performance on multi-core systems. For that, the compute-intensive operations themselves must be threaded.
Before multi-core, there was little incentive for data parallelism on the desktop. Now it's critical if you intend to scale out to more cores, not only in obvious application categories like games and graphic effects, but in thick client and productivity packages as well. As the development manager, how do you guide your team through that transition? In these last two segments of our management series, we'll discuss approaches to introducing parallelism, and some of the factors that make the desktop so different from parallel HPC and server configurations.
Finding the Right Targets
Let's look at a single-threaded legacy application from two perspectives, code-centric and data-centric. You can find opportunities for threading from both perspectives, but each also imposes its own constraints on how threading changes can be made.
From a code-centric perspective, an application looks like a stack of modules, core modules on the bottom, application-specific modules that make use of core services on top. The natural break for parallelism occurs at different levels in this stack for different applications. A couple of examples may make this more concrete.
In a game, the first level of parallelism is very coarse. You've got critters running around, you want each critter to operate in parallel. You can thread a game like this at a high level, if not one thread per critter, then perhaps one thread for a group of interacting critters. In any case, the change occurs far away from the core modules, up in game-specific logic.
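As a rough illustration of threading at this level, the sketch below runs one thread per group of interacting critters. The Critter type, the grouping, and the trivial movement rule are all hypothetical stand-ins invented for the example. The sketch also assumes that critters in different groups never touch each other's state during an update.

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical game types, invented for this sketch.
struct Critter {
    float x = 0.0f;   // position
    float vx = 1.0f;  // velocity
};

// A group holds critters that interact only with each other,
// so no other thread needs to touch them during an update.
using CritterGroup = std::vector<Critter>;

// Game-specific logic: advance every critter in one group.
void update_group(CritterGroup& group, float dt) {
    for (Critter& c : group) {
        c.x += c.vx * dt;
    }
}

// One thread per group; the caller blocks until every group has finished.
void update_world(std::vector<CritterGroup>& groups, float dt) {
    assert(dt >= 0.0f);
    std::vector<std::thread> workers;
    workers.reserve(groups.size());
    for (CritterGroup& g : groups) {
        workers.emplace_back(update_group, std::ref(g), dt);
    }
    for (std::thread& t : workers) {
        t.join();
    }
}
```

Note that nothing below update_group had to change. A production game would more likely hand groups to a fixed pool of worker threads than create new threads every frame, but the place where the change is made would be the same.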
On the other hand, you won't find such high-level parallelism in a media player; you're only playing back one DVD, after all. Effects are instead better handled in a pipeline, with a thread per stage. Unlike the game, the natural place to thread the media player is at a very low level, down in the video codec itself.
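A thread-per-stage pipeline can be sketched in a few dozen lines. In the example below, the StageQueue class and the decode and apply_effect functions are hypothetical placeholders, with integers standing in for video frames. It shows only the shape of the technique, not how any real codec is built.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Minimal blocking queue that connects two pipeline stages.
template <typename T>
class StageQueue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(value));
        }
        ready_.notify_one();
    }
    // Signals that no more items will arrive.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }
    // Blocks until an item is available; returns false once closed and drained.
    bool pop(T& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty() || closed_; });
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop();
        return true;
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> items_;
    bool closed_ = false;
};

// Placeholder stages standing in for real codec and effects work.
int decode(int raw) { return raw * 2; }
int apply_effect(int frame) { return frame + 1; }

// Runs the two stages on separate threads, preserving frame order.
std::vector<int> play(const std::vector<int>& raw_frames) {
    StageQueue<int> decoded;
    std::vector<int> output;

    std::thread decoder([&] {
        for (int raw : raw_frames) decoded.push(decode(raw));
        decoded.close();
    });
    std::thread effects([&] {
        int frame = 0;
        while (decoded.pop(frame)) output.push_back(apply_effect(frame));
    });
    decoder.join();
    effects.join();
    assert(output.size() == raw_frames.size());
    return output;
}
```

While the effects thread works on one frame, the decoder is already producing the next. Because each stage is a single thread, frames leave the pipeline in the order they entered it.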
The level at which parallelism naturally occurs will determine how you can best introduce threading. If there's high-level parallelism, you can thread at that level with little direct impact on the rest of the application. (There may be an indirect impact, though, if you find that underlying modules are not thread-safe and would need to be made thread-safe or replaced with thread-safe versions.) If parallelism is at a low level, and you need to thread a core module, the direct cost of the change may be greater. The benefits of threading the application must be measured against that cost.
Similarly, from a data-centric perspective, threading changes must be measured against how they affect the data model. Rearranging the data model to accommodate more parallel structure may be a high-cost choice, or it may simply not be an option in a legacy system. However, it may be possible to thread the application without modifying the data model, if application code can be refactored so that the threaded sections work only with isolated portions of the data. In this case, the data model can be partitioned into threaded-access and non-threaded-access segments.
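One way to picture that partition is sketched below, using a hypothetical Document structure invented for the example. The structure itself is unchanged. The refactored code simply hands each worker a disjoint slice of one member and keeps everything else on the calling thread.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical legacy data model, left unchanged by the threading work.
struct Document {
    std::string title;           // non-threaded-access segment: calling thread only
    std::vector<float> samples;  // threaded-access segment: split among workers
};

// A worker sees only its own slice, never the Document itself.
void scale_slice(float* first, std::size_t count, float gain) {
    for (std::size_t i = 0; i < count; ++i) first[i] *= gain;
}

// Splits doc.samples into contiguous, non-overlapping slices, one per thread.
void scale_samples(Document& doc, float gain, std::size_t num_threads) {
    assert(num_threads > 0);
    const std::size_t n = doc.samples.size();
    const std::size_t chunk = (n + num_threads - 1) / num_threads;  // ceiling division
    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < n; begin += chunk) {
        const std::size_t count = (begin + chunk < n) ? chunk : n - begin;
        workers.emplace_back(scale_slice, doc.samples.data() + begin, count, gain);
    }
    for (std::thread& t : workers) t.join();
}
```

Because the slices never overlap and no worker can reach the rest of the structure, no locking is needed in this sketch. The cost is the discipline of keeping the threaded sections confined to their own segment.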
One or Two Experts
As we've discussed before in this series, the ideal parallel software development team has parallel experience at every position, from the software architect to the test engineer. It's hard to get near that ideal in client and consumer software, where threading veterans are hard to find. Instead, most desktop development teams must rely on the advice and experience of one or two threading experts as they make the move to threaded development.
A common practice when there's a large team and a few threading experts is to give the experts global access to the code and simply have them make all thread-related changes. There are two problems with this approach. First, threading experts can't have deep experience with most modules, yet they are expected to make changes with deep impact. Second, it's difficult to coordinate thread-related and feature-related or functional changes when both must be done in the same revision cycle, which is often the case.
A better first place to cultivate parallel expertise is in the application expert, the member of the technical team with the greatest understanding of the application domain and user expectations. This position may have different titles in different companies, but the intent is to move parallel understanding to the team member who first translates new creative ideas into technical designs.
By making the application expert the threading expert, you are more likely to get the parallelism model right, to apply threading at the right level, and to minimize the impact on the data model. With limited expertise to go around, that is a better option than focusing experts on threaded implementation at the expense of design.
Digital Content Creation and the Threading Process
Digital Content Creation (DCC) software includes tools for 3D modeling, rendering, graphic effects, and animation. DCC software demands performance, both to respond quickly to user input and to cut run times on effects generation and high-quality renders. Intel application engineers have worked with several software vendors in this category to thread their applications for effective multi-core performance.
A typical experience is illustrated by one such vendor who has enjoyed significant success in threading portions of its DCC application since kicking off its multi-core threading effort nearly two years ago. Its product is an integrated package with a number of modules, and the development team's choices of where to introduce threading (and where to leave it out) are especially interesting.
The team's first target was a partial differential equation solver used in simulations for graphics effects. The solver runs equations across large array data and is a natural for threading through domain decomposition. The solver was an ideal place to introduce threading, not only because of its readily-parallelized data, but because of its limited impact on the core data model. It took two person-months to thread this first module.
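The vendor's solver is not public, so the following is only a generic sketch of domain decomposition. It uses a single Jacobi relaxation sweep over a one-dimensional grid as a stand-in for the real equations. The function names, the fixed boundary cells, and the even split into contiguous sub-domains are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// One Jacobi relaxation sweep over cells [begin, end) of a 1D grid.
// Reads from `in` and writes to `out`, so sub-domains never write the same cell.
void sweep_range(const std::vector<double>& in, std::vector<double>& out,
                 std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) {
        out[i] = 0.5 * (in[i - 1] + in[i + 1]);
    }
}

// Splits the interior cells into contiguous sub-domains, one per thread.
std::vector<double> jacobi_sweep(const std::vector<double>& in, std::size_t num_threads) {
    assert(in.size() >= 3 && num_threads > 0);
    std::vector<double> out(in);  // the copy keeps the two boundary values fixed
    const std::size_t interior = in.size() - 2;
    const std::size_t chunk = (interior + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (std::size_t begin = 1; begin <= interior; begin += chunk) {
        const std::size_t end = std::min(begin + chunk, interior + 1);
        workers.emplace_back(sweep_range, std::cref(in), std::ref(out), begin, end);
    }
    for (std::thread& t : workers) t.join();
    return out;
}
```

Each thread reads its neighbors' edge cells from the input array but writes only to its own range of the output array. That is what makes array-based solvers of this kind straightforward to decompose without touching the rest of the data model.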
With new releases of the product, additional sections have been threaded. But some key elements have not been (and may never be) threaded. These are elements that touch a great many nodes in the data model. Any modification would have implications throughout the system, and the development team decided that the benefits of threading these modules, at least to this point, weren't worth the impact on the product.
The team, with Intel engineers in a consulting role, followed a pattern of slow and careful introduction of parallelism, in actual threading implementation as well as in choosing what to thread. They found that OpenMP was an excellent way to quickly prototype parallel changes. Since OpenMP is so easy to turn on and off, it also provided an easy means to compare parallel performance to that of the serial implementation.
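The appeal of OpenMP for prototyping is easy to see in a small case. The function below is a hypothetical example, not taken from the vendor's code. A single pragma marks the loop as parallel, and the same source builds as the original serial loop when OpenMP is not enabled in the compiler (for example, when GCC or Clang is run without -fopenmp).

```cpp
#include <cassert>
#include <vector>

// With OpenMP enabled, iterations are divided among threads and the
// per-thread partial sums are combined by the reduction clause.
// Without it, the pragma is ignored and the loop runs serially.
double sum_of_squares(const std::vector<double>& values) {
    double total = 0.0;
    const long n = static_cast<long>(values.size());
    #pragma omp parallel for reduction(+ : total)
    for (long i = 0; i < n; ++i) {
        total += values[i] * values[i];
    }
    return total;
}
```

Building the same file twice, once with and once without the OpenMP flag, gives matching serial and parallel binaries to time against each other. One caveat is that a floating-point reduction may differ in the last digits between the two builds, because the additions happen in a different order.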
When implementing multiple pipelines, engineers prototyped parallel sections with OpenMP and later went back and re-coded using Intel TBB. In this case, Intel TBB gave an added performance edge that the company's engineers decided was worth the additional coding effort. TBB is still relatively new, and the team expects to be able to use TBB for prototyping as it becomes more established.
Introducing threading is a significant challenge, but it's one that can be met with a careful, pragmatic approach. The first steps are choosing the right level for threading and educating your team. In our next installment we'll look at some additional multi-core desktop issues, including changes to the development cycle, and some design considerations unique to client and consumer software.