Threaded Building Blocks: The Insider's Perspective (cont'd)

Does TBB apply optimizations at runtime?

James Reinders: TBB uses a dynamic scheduling system that can split tasks and steal them between processors, even after the work has been distributed, to keep idle processors busy. This is very effective.

The concept of runtime optimization was applied even further when 'auto partitioning' was added to TBB, which made one of the difficult parts of TBB (specification of grainsize) optional. TBB can now 'guess' a value, and then dynamically refine it during a run to approach optimal based on observed behavior. This is all in keeping with making TBB effective and easy to use.
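To make the grainsize idea concrete, here is a minimal sketch of recursive range splitting. It uses only the C++ standard library and does not call TBB; `Range` and `split_range` are hypothetical names chosen for illustration. It assumes the simplest policy: keep halving a range until a piece holds no more than `grainsize` elements.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper type (not a TBB API): a half-open index range
// [first, second).
using Range = std::pair<std::size_t, std::size_t>;

// Recursively halves [begin, end) until a piece holds at most `grainsize`
// elements, and appends each final piece to `out`. In a real task scheduler
// each final piece would become a unit of work that an idle worker could take.
void split_range(std::size_t begin, std::size_t end, std::size_t grainsize,
                 std::vector<Range>& out) {
    assert(grainsize > 0 && begin <= end);
    if (end - begin <= grainsize) {
        out.push_back({begin, end});  // small enough: process this piece serially
        return;
    }
    const std::size_t mid = begin + (end - begin) / 2;
    split_range(begin, mid, grainsize, out);  // left half
    split_range(mid, end, grainsize, out);    // right half
}
```

Calling `split_range(0, 1000, 100, chunks)` yields 16 pieces of 62 or 63 elements each. A grainsize that is too small produces many tiny pieces whose scheduling overhead dominates, and one that is too large leaves too few pieces to keep every core busy. That tuning burden is what auto partitioning is meant to remove.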

What happens if the code runs on a system with a single CPU (one core)?

James Reinders: It will work, with some overhead because the program 'can' use multiple cores. It is very important in the design of TBB that it works correctly and reasonably efficiently across any number of cores (even one). The overhead of using TBB in a program running on a single core is very small (I expect well under 10 percent, probably under 1 percent for most applications).

Please tell us... have you already had a chance to play with one of those 80-core Teraflops Research Processors? How was it?

James Reinders: I have seen it, and I held one. Since I helped design the first TeraFLOP system (ASCI Red) a decade ago, it was a humbling and exciting thing to hold a TeraFLOP chip. I have not run any code on it, although I know people who have done so. The 80-core chip is much more about looking at certain key hardware functions and issues than about trying software. We understand a lot of what software needs the hardware to be able to do, but building it is a different matter. The test chip helps us test particular hardware solutions we are thinking of and characterize how well they work in practice.

How much is having future technology available useful when you develop software for the present?

James Reinders: Thinking hard about what the future will be like is very important, and I don't think you can do that very well if you don't have people investigating the future. For our customers, "future proofing" is an extraordinarily important concept: we need to have a sense that our investment in adding concurrency today won't need to be reinvented over and over. That is much more likely to hold when you use abstractions (TBB, threaded libraries, OpenMP, etc.) than when you code with raw threading packages (pthreads, etc.). This is a big factor in why we promote using abstractions so much: I'm very afraid of how little, in just a few years, people will like the code they write today without abstractions for parallelism.

Considering that the number of cores per computer is going up fast, software written today for and with dual-core CPUs might run on systems with eight or more cores in a few years. I am sure that using TBB might help to be future-proof, but how much does TBB performance actually scale?

James Reinders: TBB itself imposes no limit on how well a program using it can scale. The limits will be due to the application itself and the work it has to do. Amdahl's Law can predict gloom here, but only if you don't expect to want more and more work done by computers in the future. Assuming that in the future we will run only what we run today is not reasonable. So I often point to Gustafson's observations about Amdahl's Law as the better way to think about things. I cover this in chapter 2 of my book.
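The contrast between the two laws is easy to see with numbers. The sketch below encodes the standard textbook forms of both; the function names are hypothetical, and `p` is an assumed parallel fraction rather than a figure from the interview.

```cpp
#include <cassert>

// Amdahl's Law: the problem size is fixed. `p` is the fraction of the serial
// run time that can be parallelized, `n` is the number of cores.
double amdahl_speedup(double p, int n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / n);
}

// Gustafson's Law: the problem grows with the machine. `p` is the fraction of
// the run time on the n-core machine that is spent in parallel work.
double gustafson_speedup(double p, int n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return (1.0 - p) + p * n;
}
```

With `p = 0.9` on 8 cores, the Amdahl form gives a speedup of about 4.7 and can never exceed 10 however many cores are added, while the Gustafson form gives 7.3 and keeps growing with the core count. The difference comes entirely from whether the workload is assumed to stay fixed or to grow.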

TBB includes Scalable Memory allocator code developed by the researchers of the Tera-scale research team at Intel. What can you tell us about it?

James Reinders: A group at Intel Research that is looking at 'many core' challenges wrote a 'many core run time' (McRT). The scalable allocator was part of their work. Chapter 12 of my book has the references to that work; they have published a paper on it.

Did you think about security too when developing TBB? Not only security in your code, but also avoiding concurrency problems that might become vulnerabilities... this paper is a recent example.

James Reinders: Yes, we think about them. Our current thinking is that TBB itself gives the application writer enough control that we aren't forcing any vulnerabilities. Some of our other products, especially crypto libraries in IPP (Integrated Performance Primitives), have had to do specific things related to concurrency to avoid issues. Unlike the attacks in the paper you point to, the attacks on applications tend to focus on observing the timing/behavior of other code/processes to infer enough information to reduce the complexity of an attack. The threat is that the observation will greatly increase the odds that an attack will succeed.

How does TBB interact with the OS scheduler? I guess you read the recent discussion about two Linux schedulers...

James Reinders: TBB sits on top of the threading interfaces offered by operating systems. An enhancement we are working on is to tackle the problem of interaction with the OS by providing 'affinity' requests to lock threads to particular processors. This seems like an obvious optimization, but once you try it you find it is sometimes anything but obvious. There was a paper a couple of years ago which showed that using affinity made runtimes more predictable but raised the average runtime, because it took control away from the OS, and the OS was making optimizations which the program did not. Left to the OS, runtimes varied more widely and the worst and best cases were more extreme, but the average was better; overall it looked less predictable than using affinity. This whole area needs a lot more investigation.

What is your opinion on threads scheduling management? Should the OS be the only interested party, or should we be able to choose how to distribute them at application level?

James Reinders: If I could get ONE wish fulfilled, it would be for OS scheduling to focus on processes, not threads, and to demand that processes manage the scheduling of their own threads. Why? Because an effective parallel program is going to assume, in general, that all threads are either running or stopped. It is messy to write a parallel program when the OS may be scheduling and unscheduling individual threads which are trying to cooperate.

What type of control can developers have with TBB in their software? For example, can they limit the number of cores used by their software? Or maybe ask to map it to a particular core?

James Reinders: Limiting the number of cores: yes, that is possible, but we hope the main use is debugging. The initialization routine takes a parameter which is normally omitted. If specified, it overrides the default behavior of creating a thread for each processor and instead creates the number of threads specified by the argument.
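In TBB releases of that period the initialization routine in question was, to the best of my knowledge, the `tbb::task_scheduler_init` constructor. The sketch below does not call TBB. It only illustrates the same "optional argument overrides a per-processor default" pattern with the C++ standard library, and `choose_worker_count` is a hypothetical name.

```cpp
#include <cassert>
#include <thread>

// Hypothetical helper (not a TBB API): decides how many worker threads to
// start. A request of 0 means "use the default", one worker per hardware
// thread; any other value overrides that default.
unsigned choose_worker_count(unsigned requested = 0) {
    unsigned count = requested;
    if (count == 0) {
        // The standard allows hardware_concurrency() to return 0 when the
        // value is unknown, so fall back to a single worker in that case.
        count = std::thread::hardware_concurrency();
        if (count == 0) {
            count = 1;
        }
    }
    assert(count >= 1);
    return count;
}
```

Calling `choose_worker_count()` follows the machine, while `choose_worker_count(1)` forces a single worker, which is the kind of setting that makes a parallel bug easier to reproduce while debugging.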

TBB doesn't offer interfaces to "lock down to processors." We think that should be outside TBB.

There is a lot of opportunity for operating systems to offer these types of control in the "running of applications" interfaces. I'd like an OS to let me specify the 'world' my application runs in (which processors, how many, etc.).

These interfaces are available in Windows at run time (the task manager will let you adjust where a running task can go).

I'd like to have more global tools to specify and adjust policies (8-core machine—run "only Outlook" here, run applications on these 4 cores, OS only here, explorer here, etc.)

Is there any context where TBB shows better results? What about video games?

James Reinders: The key to parallelism is scaling. A sequential program will only use a single core, and so it won't speed up at all. A program using TBB can expect to scale; how well depends on the program and on how much parallelism it expresses for TBB to exploit.

Programs which process a lot of data, including videogames, are the candidates most likely to show good results easily.

I heard that a lot of game developers have problems sharing the load among the various cores of Playstation 3. Considering that Linux can run on PS3, does Intel have any plan to support Cell too?

James Reinders: I've actually had this conversation with a few people who might try to do a port. It will probably take more interest. Our forums on threadingbuildingblocks.org would be a good place for people interested in helping to announce themselves and look for others.

I don't think a simple "port" of TBB will be effective with the current Cell architecture, because of the complexities of moving data to and from the cores. TBB will probably need some extensions, and that requires some serious thinking. The lack of a true shared memory for all the cores brings up interesting issues. The same issues which affect programmers today also affect our ability to get a TBB port for Cell.

I hope some will decide to contribute, take a serious look at these issues, and maybe even suggest whether a few extensions to TBB are needed to help them implement support for Cell. This is one of the reasons we open sourced TBB 2.0: to give others the opportunity to do additional ports.
