Does TBB apply optimizations at runtime?
James Reinders:
TBB uses a dynamic scheduling system which can split tasks and steal
tasks between processors, even after the work has been distributed, to
keep idle processors busy. This is very effective.
The concept of runtime optimization was applied even further when 'auto
partitioning' was added to TBB, which made one of the difficult parts of
TBB (specification of grainsize) optional. TBB can now 'guess' a value,
and then dynamically refine it during a run to approach optimal based on
observed behavior. This is all in keeping with making TBB effective
and easy to use.
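To make the role of grainsize concrete, here is a minimal sketch of recursive range splitting. It is an illustration only, not TBB's scheduler or partitioner: `split_range`, `chunks`, and `Range` are hypothetical names, and the sketch assumes a range is simply halved until each piece holds at most `grainsize` iterations.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helpers for illustration; none of these names are TBB API.
using Range = std::pair<std::size_t, std::size_t>;  // half-open [first, second)

// Recursively halve [begin, end) until each piece holds at most
// `grainsize` iterations, appending the pieces to `out` in order.
void split_range(std::size_t begin, std::size_t end, std::size_t grainsize,
                 std::vector<Range>& out) {
    assert(grainsize > 0 && begin <= end);
    if (end - begin <= grainsize) {
        if (begin < end) out.push_back({begin, end});  // skip empty ranges
        return;
    }
    std::size_t mid = begin + (end - begin) / 2;
    split_range(begin, mid, grainsize, out);
    split_range(mid, end, grainsize, out);
}

// Convenience wrapper: split the iteration space [0, n).
std::vector<Range> chunks(std::size_t n, std::size_t grainsize) {
    std::vector<Range> out;
    split_range(0, n, grainsize, out);
    return out;
}
```

Calling `chunks(100, 25)` yields four pieces of 25 iterations each. A smaller grainsize produces more pieces that idle processors could pick up, at the cost of more per-piece overhead; choosing that trade-off by hand is the difficulty that auto partitioning is described as removing.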
What happens if the code runs on a system with a single CPU (one core)?
James Reinders:
It will work, but there will be some overhead because the program 'can'
use multiple cores. It is very important in the design of TBB that it
works correctly and reasonably efficiently across any number of cores
(even one). The overhead of using TBB in a program and running on a
single core is very small (I expect well under 10 percent, probably
under 1 percent for most applications).
Please tell us... have you already had a chance to play with one of
those 80-core Teraflops Research Processors? How was it?
James Reinders:
I have seen it, and I held one. Since I helped design the first
TeraFLOP system (ASCI Red) a decade ago, it was a humbling and exciting
thing to hold a TeraFLOP chip.
I have not run any code on it, although I know people who have done so.
The 80-core chip is much more about looking at certain key hardware
functions and issues than trying software. We understand a lot of what
we'd like the hardware to be able to do—which software needs—but
building it is a different matter. The test chip helps us test out
particular solutions we are thinking of in the hardware and characterize
how well they work in practice.
How much is having future technology available useful when you develop
software for the present?
James Reinders:
Thinking hard about what the future will be like is very
important, and I don't think you can do that very well if you don't have
people investigating the future. For our customers, "Future proofing" is
an extraordinarily important concept: we need to have a sense that our
investment in adding concurrency today won't need to be reinvented over
and over. It is very certain that the investment is much more likely to
last when you use abstractions (TBB, threaded libraries, OpenMP, etc.)
than if you code with raw threading packages (pthreads, etc.). This is a
big factor in why we promote using abstractions so much: I'm very afraid
of how poorly, in just a few years, people will like the code they write
today without abstractions for parallelism.
Considering that the number of cores per computer is going up fast,
software written today for and with dual-core CPUs might run on systems with
eight or more cores in a few years. I am sure that using TBB might help to be
future-proof, but how much does TBB performance actually scale?
James Reinders:
There is no limit, due to TBB, as to how well a program using TBB could
scale. The limits will be due to the application itself and the work it
has to do.
Amdahl's Law can predict gloom here, but only if you don't expect to
want more and more work done in the future with computers. Assuming that
in the future we will only run what we are running today is not
reasonable. So I often point to Gustafson's observations about Amdahl's
Law as the better way to think about things. I cover this in chapter 2
of my book.
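The contrast between the two views can be put in numbers. The sketch below uses the standard textbook forms of both laws; the function names are hypothetical. It assumes `p` is the fraction of the work that runs in parallel (for Gustafson, measured on the parallel run) and `n` is the number of cores.

```cpp
#include <cassert>

// Amdahl: the problem size is fixed, so the serial fraction (1 - p)
// caps the speedup at 1 / (1 - p) no matter how many cores are added.
double amdahl_speedup(double p, int n) {
    assert(n >= 1 && p >= 0.0 && p <= 1.0);
    return 1.0 / ((1.0 - p) + p / n);
}

// Gustafson: the problem grows with the machine, so the parallel part
// scales with n while the serial part stays constant.
double gustafson_speedup(double p, int n) {
    assert(n >= 1 && p >= 0.0 && p <= 1.0);
    return (1.0 - p) + p * n;
}
```

With `p = 0.9` on 8 cores, `amdahl_speedup` gives about 4.7 and can never reach 10 however many cores are added, while `gustafson_speedup` gives 7.3 and keeps growing with the core count. That difference is the "gloom" versus the more optimistic reading described above.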
TBB includes Scalable Memory allocator code developed by the
researchers of the Tera-scale research team at Intel. What can you tell us about it?
James Reinders:
A group at Intel Research that is looking at 'many core' challenges
wrote a 'many core run time' (McRT). The scalable allocator was part of
their work. Chapter 12 of my book has the references to the work; they
have published a paper on it.
Did you think about security too when developing TBB?
Not only security in your code, but also avoiding concurrency problems that
might become vulnerabilities... This paper is a recent example.
James Reinders:
Yes, we think about them. Our current thinking is that TBB itself
allows enough control from the application writer that we aren't forcing
any vulnerabilities. Some of our other products, especially crypto libraries
in IPP (Integrated Performance Primitives), have had to do specific things related to concurrency to avoid
issues. Unlike the attacks in the paper you point to, the attacks on
applications tend to focus on observing the timing/behavior of other
code/processes to infer enough information to reduce the complexity of
an attack. The threat is that the observation will greatly increase
the odds that an attack will succeed.
How does TBB interact with the OS scheduler?
I guess you read the recent discussion about two Linux schedulers...
James Reinders:
TBB sits on top of the threading interfaces offered by operating systems.
An enhancement we are working on is to tackle the problem of
interaction with the OS by providing 'affinity' requests to lock threads
to particular processors. This seems like an obvious optimization, but
once you try it you find it is sometimes anything but obvious. There
was a paper a couple of years ago which showed that using affinity made
runtimes more predictable but raised the average runtime, because it
took control away from the OS, and the OS was making optimizations which
the program did not. So when scheduling was left to the OS, the runtimes
varied more widely: the worst and best cases were more extreme and the
average was better, but overall it looked less predictable than using
affinity. This whole area needs a lot more investigation.
What is your opinion on threads scheduling management? Should the OS
be the only interested party, or should we be able to choose how to distribute
them at application level?
James Reinders:
If I could get ONE wish fulfilled, it would be for the OS to schedule
processes, not threads, and to demand that processes manage the
scheduling of their own threads. Why? Because an
effective parallel program is going to assume, in general, that all
threads are either running or stopped. It is messy to write a parallel
program when the OS may be scheduling and unscheduling individual
threads which are trying to cooperate.
What type of control can developers have with TBB in their software?
For example, can they limit the number of cores used by their software?
Or maybe ask to map it to a particular core?
James Reinders:
Limiting the number of cores—yes, that is possible but we hope the
main use is debugging. The initialization routine takes a parameter
which is normally omitted. If specified, it overrides the default
behavior of creating threads for each processor—and it creates the
number of threads specified by the argument.
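The behavior described here (an optional argument that overrides the one-thread-per-processor default) can be sketched with the standard library alone. This is not TBB code: `choose_thread_count` and `kAutomatic` are hypothetical names for illustration. In TBB releases of this period, the initialization routine referred to is, as far as I know, the constructor of `tbb::task_scheduler_init`.

```cpp
#include <cassert>
#include <thread>

// Hypothetical sentinel meaning "let the library decide"; not a TBB constant.
constexpr int kAutomatic = -1;

// Hypothetical helper: return the number of worker threads to create.
// With no argument it follows the default of one thread per processor;
// an explicit argument overrides that default (useful when debugging).
int choose_thread_count(int requested = kAutomatic) {
    if (requested != kAutomatic) {
        assert(requested >= 1);
        return requested;
    }
    // hardware_concurrency() may return 0 when the count is unknown.
    unsigned hw = std::thread::hardware_concurrency();
    return hw == 0 ? 1 : static_cast<int>(hw);
}
```

Calling `choose_thread_count()` follows the hardware, while `choose_thread_count(1)` forces a single thread, which is the kind of setting that helps when tracking down a bug that only appears, or disappears, under concurrency.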
TBB doesn't offer interfaces to "lock down" threads to particular
processors. We think that should be outside TBB.
There is a lot of opportunity for operating systems to offer these types
of control in the "running of applications" interfaces. I'd like an OS
to let me specify the 'world' my application runs in (which processors,
how many, etc.)
These interfaces are available in Windows at run time (the task manager
will let you adjust where a running task can go).
I'd like to have more global tools to specify and adjust policies
(8-core machine—run "only Outlook" here, run applications on these 4
cores, OS only here, explorer here, etc.)
Is there any context where TBB shows better results? What about
video games?
James Reinders:
The key to parallelism is scaling. A sequential program will only use a
single core, and so it won't speed up at all. A program using TBB can
expect to scale—how well is a function of the program and how much
parallelism is expressed for TBB to access.
Programs which process a lot of data—including videogames—would be
good candidates to show good results most easily.
I heard that a lot of game developers have problems sharing the load
among the various cores of Playstation 3. Considering that Linux can run on
PS3, does Intel have any plan to support Cell too?
James Reinders:
I've actually had this conversation with a few people who might try to
do a port. It will probably take more interest—our forums on
threadingbuildingblocks.org would be a good place for persons interested
in helping to announce this and look for others.
I don't think a simple "port" of TBB will be effective with the current
Cell architecture—because of the complexities of moving data to/from
the cores. TBB will probably need some extensions, and that requires
some serious thinking. The lack of a true shared memory for all the
cores brings up interesting issues. The same issues which affect
programmers today affect our ability to get a TBB port for Cell.
I hope some will decide to contribute—and will take a serious look at
these issues, and maybe even suggest if a few extensions to TBB are
needed to help them implement support for Cell. This is one of the
reasons we open sourced TBB 2.0—to give others the opportunity to do
additional ports.