Intel Go Parallel

Managing Multi-Core Projects, Part 6: Diving Deeper on Consumer Software
Threading raises some not-so-obvious design issues in desktop software, issues that might have been overlooked before the era of multi-core. Part 6 concludes the series with a discussion of these issues, and with a closer look at how the threading process relates to development methodology. 

More Resources
  • Part 1 of this Series: HPC on the Parallelism Frontier
  • Part 2 of this Series: HPC Project Implementation
  • Part 3 of this Series: Multi-Core Development in the Enterprise
  • Part 4 of this Series: The Enterprise Development Cycle Meets Multi-Core
  • Part 5 of this Series: Desktop Software Considerations
  • Product Review: Intel Threading Building Blocks

    This final installment in our multi-core management series is all about diving deeper. We've discussed Intel Developer Products Division's (DPD) steps for threaded development before, but here we'll go further into how these steps might be adapted to a specific development methodology. Then, returning our focus to consumer and client software, we'll get into some of the architectural and design issues of threading desktop applications.

    By way of review, Intel DPD has a four-step design and development process for parallel programs. Intel DPD's process has emerged from the years of experience the group has in threading HPC, visualization, and other customer applications. Briefly, the four steps are to discover the natural mode of parallelism for an application, to express that mode as a parallel programming model, to gain confidence that the model is effective, and finally to optimize for performance.

    The four-step process is general enough that it can be applied within any programming system. It's a way to emphasize the importance of the parallelism model and the importance of continuous revision in the creation of threaded programs.

    Let's take a look at how these four steps might work within Extreme Programming (XP). Even if you're in an XP shop, your team may follow some, but not all, XP practices, and not all XP practices are sensitive to parallel vs. serial development (for example, keeping a sustainable pace is unrelated to the type of code you are creating). What follows are selected XP practices with notes on how they relate to Intel DPD's four-step process. The items in this list are adapted from the Rules and Practices page at ExtremeProgramming.org.

    Small Releases: XP breaks larger projects down into deliverable iterations that can be completed in a short period of time. Keep the scope of threading changes small by focusing on single modules if possible. In that way, programmers can run through each step within a single XP iteration.

    Spike Solutions: Spike solutions are small, independent programs written to explore solutions outside the main body of code. These may be used as part of the discover step to test approaches to parallel decomposition. The "parallel design patterns" you develop through spike solutions are important. They help in understanding which sections are scalable and in estimating performance improvements using Amdahl's law. They can also help outline where synchronization is needed, and what synchronization methods are appropriate. Often, the best way to improve parallel performance is to restructure the data or algorithm to reduce synchronization.
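As a rough illustration of the Amdahl's law estimate mentioned above, here is a minimal sketch. The function name and parameters are illustrative, not from the article, and the formula ignores synchronization and scheduling overhead, so treat the result as an upper bound.

```cpp
#include <cassert>

// Amdahl's law: predicted speedup when a fraction `parallel_fraction`
// of the serial run time is spread evenly over `cores` cores and the
// remainder stays serial. Overheads are ignored.
double amdahl_speedup(double parallel_fraction, int cores) {
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(cores >= 1);
    const double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / cores);
}
```

For a module that a spike solution shows to be 80% parallelizable, this predicts a speedup of about 1.67 on two cores and 2.5 on four, with a ceiling of 5 no matter how many cores are added.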

    No Early Functionality: The natural mode of parallelism may vary from module to module. In the express step, programmers should focus on proper implementation of the parallel model, not on performance optimization or general-purpose routines.

    Refactor: The parallel development process is iterative. Refactor the parallel model when appropriate, moving back to further discovery when you see potential for improvement later on. Refactoring to avoid synchronization is a good example. In a larger sense, "refactoring" may apply to the threading of serial functions. That larger refactoring from serial to parallel would include working through all four steps.

    Unit Test First: In XP, initial unit tests are created along with the code. For threaded work, that means writing tests either in discovery, to test the scalability of the parallel model, or in the express step, to ensure that threaded code properly represents the model. Additional tests will be needed to find threading conflicts as you test for confidence.
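One way such an express-step test might look is sketched below: a threaded routine is checked against its serial reference across several thread counts. The routine (a simple sum) and all names are hypothetical stand-ins chosen for brevity. A test like this checks that the threaded code represents the model; it does not by itself prove the absence of threading conflicts.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Serial reference: the behavior the parallel model must reproduce.
long long serial_sum(const std::vector<int>& data) {
    long long total = 0;
    for (int v : data) total += v;
    return total;
}

// Hypothetical threaded version under test. Each thread sums its own
// contiguous slice into its own slot, so no locking is needed.
long long parallel_sum(const std::vector<int>& data, unsigned num_threads) {
    assert(num_threads >= 1);
    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = std::min(data.size(), t * chunk);
            const std::size_t end = std::min(data.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i) partial[t] += data[i];
        });
    }
    for (auto& w : workers) w.join();
    long long total = 0;
    for (long long p : partial) total += p;
    return total;
}

// Unit test: the threaded code must match the serial model for every
// thread count tried, including counts that do not divide the data
// evenly and counts larger than the data size.
bool parallel_matches_serial(const std::vector<int>& data) {
    const long long expected = serial_sum(data);
    for (unsigned n = 1; n <= 8; ++n) {
        if (parallel_sum(data, n) != expected) return false;
    }
    return true;
}
```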

    Optimize Last: Both XP and Intel DPD's four steps advise against premature optimization. Only after testing has developed confidence in the model should you move on to tuning and optimization.

    XP and the four-step process (discover, express, confidence, optimize) share the central idea that development is iterative. In building parallel software, it's essential that you get the parallel model right. Ideally this happens in discovery, but the model can be adjusted later on in the process. Use the four-step process within an XP iteration to deliver threaded changes built on a tested parallel model. Further refine and refactor the parallel model in subsequent XP iterations.

    Architectural Options
    Desktop applications are unlike server software and typical HPC programs in lots of ways, but the critical difference for our purposes is that desktop applications must be interactive. There's a user on the other side of that monitor and keyboard, and users like to see real-time responses. In an interactive application, you have only short intervals in which you can run parallel before you need to synchronize threads to update the user's view. The shorter the interval, the more responsive your application will appear, but the less time it can spend in pure parallel sections.

    Games and animation packages will generally need to synchronize once per frame, for two reasons. First, the view needs to be in a consistent state at the start of each frame so that it is ready for display. Second, both OpenGL and DirectX require that only a single thread access each library. With a frame rate of 60-100 FPS, the parallel intervals are short (roughly 10 to 17 ms per frame), and synchronization overhead may become significant. Contrast this artificial synchronization requirement with that of a long-running HPC process, where threads or nodes are only synchronized as the algorithm requires.

    You may not need to synchronize once per frame in a productivity package, but you've still got to be responsive. We know anecdotally that users prefer longer execution time with feedback to absolute minimum wall-clock time-to-solution. Sometimes it will be necessary to synchronize threads in order to provide that feedback on screen, even at the expense of greater parallelism.

    The synchronization requirement limits your architectural options in designing the parallel model for desktop systems. One option is data parallelism, where data is split across several threads during a parallel section but all data threads rejoin the main thread at the end of each interval (each frame, for a game). Then the main thread synchronizes with a rendering thread to update the display. This approach is relatively straightforward, and it's a natural fit both for OpenMP's fork-join model and for Intel Threading Building Blocks (TBB).
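The fork-join shape of the data-parallel option can be sketched with standard C++ threads, as below. This is an illustration only: the per-element update (positions advanced by velocities) and all names are assumptions, and it stands in for what an OpenMP parallel loop or a TBB parallel algorithm would express more compactly.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// One parallel interval (one frame, for a game): fork data threads over
// slices of the arrays, then join them all before returning. After the
// join, `positions` is consistent and can be handed to the renderer.
void update_frame(std::vector<float>& positions,
                  const std::vector<float>& velocities,
                  float dt, unsigned num_threads) {
    assert(positions.size() == velocities.size());
    assert(num_threads >= 1);
    const std::size_t n = positions.size();
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        workers.emplace_back([&positions, &velocities, dt, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                positions[i] += velocities[i] * dt;
        });
    }
    // The per-interval synchronization point: wait for every data thread.
    for (auto& w : workers) w.join();
}
```

In production code the data threads would normally come from a reusable pool rather than being created every frame; creating them here simply keeps the sketch self-contained.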

    Data parallelism is effective on dual-core systems, but performance improvements begin to level out on more than two cores. With data parallelism, all data threads must wait on the slowest data thread as they synchronize at the end of each interval, limiting scalability.
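The cost of waiting on the slowest data thread can be put in numbers with a small helper like the following. The function is a hypothetical illustration, not a measurement tool: it assumes you already know how long each data thread was busy during one interval.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Fraction of available thread time spent on useful work in one interval,
// given each data thread's busy time in milliseconds. All threads wait at
// the join for the slowest one, so the interval lasts as long as the
// maximum busy time.
double interval_efficiency(const std::vector<double>& busy_ms) {
    assert(!busy_ms.empty());
    const double slowest = *std::max_element(busy_ms.begin(), busy_ms.end());
    assert(slowest > 0.0);
    const double useful = std::accumulate(busy_ms.begin(), busy_ms.end(), 0.0);
    return useful / (slowest * static_cast<double>(busy_ms.size()));
}
```

With four threads busy for 4, 4, 4, and 8 ms, the interval takes 8 ms and only 20 of the 32 available thread-milliseconds do useful work, an efficiency of 0.625. The more threads there are, the more likely it is that one of them runs long.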

    For games, or other performance-critical software, domain decomposition provides a more scalable alternative. In domain decomposition, related tasks are grouped into domains ("related" so that communication, and therefore synchronization, is minimized between domains). The application repartitions data across threads for load-balancing at the start, and periodically after that, but not with every frame. Domain decomposition eliminates synchronization between data threads and the main thread, leaving only the rendering thread synchronization to be completed for each frame.
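The periodic repartitioning step might be sketched as follows. The article does not say how to balance the load, so this uses a simple greedy heuristic (most expensive items first, each into the least-loaded domain) purely for illustration; the function name and the per-item cost estimates are assumptions. It covers only the load-balancing half of the problem. Grouping related tasks to minimize communication is application-specific and not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Assign each work item to one of `num_domains` domains so that the
// estimated cost per domain is roughly even. Returns owner[i], the domain
// of item i. Greedy heuristic: place the most expensive items first, each
// into the currently least-loaded domain.
std::vector<std::size_t> repartition(const std::vector<double>& cost,
                                     std::size_t num_domains) {
    assert(num_domains >= 1);
    std::vector<std::size_t> order(cost.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(),
              [&cost](std::size_t a, std::size_t b) { return cost[a] > cost[b]; });

    std::vector<double> load(num_domains, 0.0);
    std::vector<std::size_t> owner(cost.size(), 0);
    for (std::size_t item : order) {
        const std::size_t target = static_cast<std::size_t>(
            std::min_element(load.begin(), load.end()) - load.begin());
        owner[item] = target;
        load[target] += cost[item];
    }
    return owner;
}
```

An application following the approach described above would call something like this at startup and again every so often, as measured costs drift, rather than on every frame.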

    Design for Threading
    Design new code for threading. That's not the same as saying that new code should be threaded wherever possible—while in many cases that might be a good idea, there are cases where it's not. If you're fielding a new, complex algorithm, it's probably better to make sure the serial version works as expected before attempting to derive a parallel implementation.

    But, as we've said before, parallelism is a design consideration, not a performance optimization. So, whether code is threaded or not, design it to be threadable. For example, partition modules so that tasks with frequent shared-memory communication are in a single module, and minimize communication between modules.

    It's often the case that new features or lead customer requirements are compute-intensive and anticipate the latest hardware. This may help to raise the importance of considering threading issues in design, among both developers and customers.

    If your desktop application supports plug-ins for user customizations, you know that plug-in compatibility puts a constraint on changes you can make to your software, both internally and to the plug-in API. Multi-core adds another consideration: plug-ins may themselves be threaded, and plug-in threads plus internal threads make oversubscription more likely.

    Consider exposing thread management as part of your plug-in API. That way, you can ensure that both application and plug-in threads come from a common pool of the appropriate size. You'll also make threading easier for plug-in developers.

    Take an evolutionary approach to adding thread support. Start with the basics, including just a GetThread call, a ReleaseThread, and a critical section. You can follow that with advanced capabilities and other synchronization objects in later releases.
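A first release of such an API might look something like the sketch below. The article names only a GetThread call, a ReleaseThread call, and a critical section; everything else here is an assumption, including the class name, the choice to model GetThread as granting a slot from a shared budget rather than returning a thread handle, and the InCriticalSection helper.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical host-side thread budget shared by the application and its
// plug-ins, so that the total thread count stays at an appropriate size.
class ThreadBudget {
public:
    explicit ThreadBudget(unsigned max_threads)
        : capacity_(max_threads), available_(max_threads) {}

    // Returns true if the caller may start one more thread; false if the
    // shared pool is already fully subscribed.
    bool GetThread() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (available_ == 0) return false;
        --available_;
        return true;
    }

    // Returns a slot previously granted by GetThread.
    void ReleaseThread() {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(available_ < capacity_);
        ++available_;
    }

    // The basic critical section offered to plug-ins: run `fn` while
    // holding a host-owned lock.
    template <typename Fn>
    void InCriticalSection(Fn&& fn) {
        std::lock_guard<std::mutex> lock(critical_);
        fn();
    }

private:
    const unsigned capacity_;
    unsigned available_;
    std::mutex mutex_;     // protects available_
    std::mutex critical_;  // the lock handed out via InCriticalSection
};
```

The host could size the budget from std::thread::hardware_concurrency(), falling back to a fixed value when that call returns 0. A plug-in that is refused a slot would do its work on the calling thread instead of oversubscribing the machine.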

    Multi-Core and Management
    There are a few points that we've made throughout this series that apply as well to HPC and server software as they do to client and consumer applications. No matter what kind of development project you're steering, you can take best advantage of the power of multi-core if you keep these items in mind:

    Develop Parallel Skills Throughout the Team: Parallelism needs to be considered in every phase of development. The team is made much more effective when each member has the right amount of parallel expertise for his or her role.

    Take an Incremental Approach: Discover the natural parallel model for your project, then incrementally improve the model as you test and gain experience.

    Consider Libraries and Tools: Don't underestimate the importance of libraries, debuggers, and optimization tools. These are especially valuable as your team makes the transition from single-threaded to multi-threaded, and as you thread legacy applications.

    Parallelism is Design, not Optimization: It's most important that you get the parallel model right, for performance, for scalability, and to reduce conflicts. Threading is too fundamental a change to be handled as an optimization.

       
    Steve Apiki is senior developer at Appropriate Solutions, Inc., a Peterborough, NH consulting firm that builds server-based software solutions for a wide variety of platforms using an equally wide variety of tools. Steve has been writing about software and technology for over 15 years.