Intel officially released the source code of Threading Building Blocks (TBB) during OSCON 2007.
Federico Biancuzzi interviewed the Chief Evangelist for Intel's Software Development Products to learn more about the project.
Could you introduce yourself?
James Reinders:
I'm Intel's Chief Evangelist for Intel's Software
Development Products, and Director of Sales and Marketing (for Intel
Software Development Products). Another way to look at it is I'm an
engineer who joined Intel in 1989 'cause I thought it would be a cool
place to work for a few years. I joined working on parallel
supercomputers, and now I get to work on multi-core which brings
parallelism to everyone. Not a bad deal.
What are Intel Threading Building Blocks?
James Reinders:
Extensions for C/C++ for parallel programming. Most important, these
extensions offer an abstraction which removes the programmer from thread
management, and that is very important. These extensions work with all
C++ compilers, because they are implemented as a template library.
Why did you choose to use templates?
James Reinders:
We wanted to work with every C++ compiler immediately. C++ templates
offer the perfect method to do this—immediate and still very efficient
due to the strong support for generic programming that C++ offers.
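To make the template-library idea concrete, here is a minimal sketch in standard C++ only. It is not TBB and does not use TBB's API: `parallel_for_sketch` and `square_all` are hypothetical names invented for this illustration, and the fixed split of the range is a simplification that TBB itself does not rely on. The point is only that an ordinary function template can hide thread creation and joining from the caller, so no compiler extension is needed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper, not part of TBB: calls body(i) for every i in
// [begin, end), splitting the range across a few std::thread workers.
// The caller never creates or joins a thread directly.
template <typename Body>
void parallel_for_sketch(std::size_t begin, std::size_t end, Body body) {
    assert(begin <= end);
    const std::size_t n = end - begin;
    if (n == 0) return;

    // hardware_concurrency() may return 0 when unknown; fall back to 2.
    std::size_t workers = std::thread::hardware_concurrency();
    if (workers == 0) workers = 2;
    workers = std::min(workers, n);
    const std::size_t chunk = (n + workers - 1) / workers;  // ceil(n / workers)

    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t lo = begin + w * chunk;
        const std::size_t hi = std::min(end, lo + chunk);
        if (lo >= hi) break;  // nothing left for the remaining workers
        pool.emplace_back([lo, hi, &body] {
            for (std::size_t i = lo; i < hi; ++i) body(i);
        });
    }
    for (std::thread& t : pool) t.join();
}

// Example use: square every element in place. Each index is touched by
// exactly one worker, so no locking is needed.
inline void square_all(std::vector<int>& v) {
    parallel_for_sketch(0, v.size(), [&v](std::size_t i) { v[i] *= v[i]; });
}
```

Calling `square_all(v)` on a vector reads like a normal loop over indices; everything thread-related lives inside the template.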
Extensions such as OpenMP take years to be universally available. In
the case of OpenMP, it was about a decade from when we first implemented
it before it was available from all the popular compilers. We didn't want
to wait ten years, and we didn't see a need to wait.
What type of projects would get the biggest advantages using TBB?
I am thinking about existing codebases, where maybe they are already using
native functions to manage concurrency, or maybe they are not taking
advantage of multi-core at all.
James Reinders:
C or C++ programs with parallelism which is not just simple
loop-oriented data-parallel parallelism.
If you have loop-oriented data-parallel parts in a program (C or
Fortran)—use OpenMP if possible. Otherwise, when you have a little
more complex program in C or C++, you'll find TBB offers a great deal
of flexibility that makes many forms of parallelism easy to represent.
OpenMP cannot offer that.
How much time can developers save using TBB?
James Reinders:
They save time four ways—implementation, debugging, tuning and
updating for the future. What we actually see is TBB makes the
difference between the problem being approachable, and not getting done
at all because it is not approachable.
That said, I've seen accomplishments by programmers new to parallelism
in a day which I doubt the same programmer would have gotten done in two
weeks if they had studied hard and worked each day. And that's just
considering the work to write, debug and tune the application to be
equivalent. It ignores the "future proofing" which an abstraction (like
TBB) offers—because you can assume that TBB will evolve to support new
hardware with little or no effort for applications using it. Whereas,
handwritten code will need to be rewritten as the hardware changes in
ways originally not anticipated.
How would you debug software that is using TBB?
James Reinders:
You can reasonably expect to get your job done with current tools
because TBB leads you to programs more likely to 'just work' than less
abstract ways of implementing parallelism.
However, I would strongly recommend some debugging tools and tuning
tools to help.
Does Intel provide any special tool?
James Reinders:
Yes, I recommend the Intel Thread Checker (to directly pinpoint
potential deadlock and potential race conditions) and Intel VTune
Performance Analyzer (with the Intel Thread Profiler included with VTune)
for performance tuning.
TBB also has an option that enables a few extra hooks for the
checking tool, which makes the task even easier.
Since Intel is also a major CPU creator, is there any secret that TBB
is using to provide better performance?
James Reinders:
I assume you mean "secrets connected to using the hardware".
The thing closest to the hardware is the very careful implementation of
locks and atomic operations. Getting those right means knowing the best
way to do them for a particular piece of hardware. If that interests
you, then look at the implementation in the source for x86, x86-64,
Itanium, or G5 that are already coded up there.
But outside that very low level aspect—TBB's power is higher level
constructs which don't lean on a precise hardware coding method. The
coolest things are probably: (a) task stealing, (b) coding to be cache
invariant, (c) scalable memory allocation. If these are of interest, I
speak to them some in my book on TBB along with references to further
reading... and the source code is there to look at too.
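As a rough illustration of what "locks and atomic operations" means at this level, here is a minimal test-and-set spin lock written with portable C++11 atomics. This is a sketch under stated assumptions, not TBB's implementation: `SpinLock` and `count_with_lock` are hypothetical names, and a tuned per-architecture lock like the ones the answer refers to would differ in exactly the hardware-specific details this sketch leaves to the compiler.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical minimal spin lock built on std::atomic_flag.
class SpinLock {
public:
    void lock() {
        // test_and_set returns the previous value, so we keep looping
        // until we are the thread that changed the flag from clear to set.
        while (flag_.test_and_set(std::memory_order_acquire)) {
            std::this_thread::yield();  // let another thread run while we wait
        }
    }
    void unlock() { flag_.clear(std::memory_order_release); }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Increments a plain (non-atomic) counter from several threads. The lock
// is the only thing keeping the increments from racing with each other.
long count_with_lock(unsigned threads, long per_thread) {
    assert(per_thread >= 0);
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &counter, per_thread] {
            for (long i = 0; i < per_thread; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (std::thread& th : pool) th.join();
    return counter;
}
```

With four threads each doing 10,000 increments, `count_with_lock(4, 10000)` returns 40000; without the lock the result would be unpredictable.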
I was reading the FAQ, and there
I found that "Sun and Intel are working together to support the Solaris
platform using Sun Studio software. This contribution is expected during
the latter half of 2007." Any news?
James Reinders:
The current build works for Solaris on x86. Sun is looking at building
with their compilers (instead of gcc) and supporting SPARC as well. The
effort is on track to have binaries posted this year, hopefully sooner
than later. It's a bit of a 'time available' effort—community fashion—so I don't have a fixed date. I know there aren't any issues other
than finding time to finish it up and post it.
Intel CPUs share a big L2 cache among the cores, while AMD ones have
a smaller L2 cache for each core. Does TBB handle this too? How?
James Reinders:
TBB works to use cache invariant algorithms—which means algorithms
which automatically fit into the cache available by design (not by
computing size and adjusting). Our math libraries are designed the same
way.
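One way to picture an algorithm that fits the cache "by design" is recursive subdivision, a technique usually called cache-oblivious in the literature; assuming that is the idea meant here, the sketch below transposes a matrix by repeatedly halving the longer side. It is an illustration only, not code from TBB or from Intel's math libraries, and `transpose_block` and `transpose` are hypothetical names. No cache size appears anywhere: at some recursion depth the sub-blocks fit whatever cache the machine has.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: writes the transpose of the block rows [r0, r1) x
// columns [c0, c1) of `src` (row-major, rows x cols) into `dst`
// (row-major, cols x rows).
void transpose_block(const std::vector<double>& src, std::vector<double>& dst,
                     std::size_t rows, std::size_t cols,
                     std::size_t r0, std::size_t r1,
                     std::size_t c0, std::size_t c1) {
    const std::size_t h = r1 - r0;
    const std::size_t w = c1 - c0;
    // Base case: a tiny block. The threshold only bounds recursion
    // overhead; it is not derived from any cache size.
    if (h * w <= 16) {
        for (std::size_t i = r0; i < r1; ++i)
            for (std::size_t j = c0; j < c1; ++j)
                dst[j * rows + i] = src[i * cols + j];
        return;
    }
    // Otherwise halve the longer side and recurse on both halves.
    if (h >= w) {
        const std::size_t mid = r0 + h / 2;
        transpose_block(src, dst, rows, cols, r0, mid, c0, c1);
        transpose_block(src, dst, rows, cols, mid, r1, c0, c1);
    } else {
        const std::size_t mid = c0 + w / 2;
        transpose_block(src, dst, rows, cols, r0, r1, c0, mid);
        transpose_block(src, dst, rows, cols, r0, r1, mid, c1);
    }
}

// Returns the cols x rows transpose of a rows x cols row-major matrix.
std::vector<double> transpose(const std::vector<double>& src,
                              std::size_t rows, std::size_t cols) {
    assert(src.size() == rows * cols);
    std::vector<double> dst(src.size());
    transpose_block(src, dst, rows, cols, 0, rows, 0, cols);
    return dst;
}
```

The same source runs unchanged on a machine with one large shared cache or several small private ones, which is the property the answer describes.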
The other thing is that the load balancing in TBB handles the differences
in performance that different tasks will have, juggling work to keep all
processors busy. This is really the bigger one, I think, for TBB,
because cache variations make static partitioning of a problem unlikely
to be optimal everywhere, so dynamic adjustments are very important.
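The difference between static partitioning and dynamic adjustment can be sketched in a few lines of standard C++. This is not how TBB's scheduler works internally (TBB uses task stealing, as mentioned earlier in the interview); it is the simplest dynamic scheme available, a shared counter from which each thread claims its next item, and `run_dynamic` is a hypothetical name. Nothing is assigned up front, so a thread that draws cheap items simply comes back for more.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: runs work(i) for every i in [0, n) on `workers`
// threads. Indices are not pre-assigned; each thread claims the next
// unclaimed index when it finishes its current one.
template <typename Work>
void run_dynamic(std::size_t n, unsigned workers, Work work) {
    assert(workers > 0);
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&next, &work, n] {
            for (;;) {
                // fetch_add hands every caller a distinct index.
                const std::size_t i = next.fetch_add(1, std::memory_order_relaxed);
                if (i >= n) break;  // all items have been claimed
                work(i);
            }
        });
    }
    for (std::thread& t : pool) t.join();
}
```

If item costs vary widely, or if cores run at different effective speeds because of cache effects, the threads still finish at about the same time, which a fixed up-front split cannot guarantee.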
What type of resources does TBB require? It cuts down the time spent
to write the software, but how much does it weigh on performance and
memory usage?
James Reinders:
Minimal impact on footprint (the libraries are not very large), and the
only 'overhead' on performance would come from the 'dynamic' manner of
coding parallelism so TBB can do load balancing. Most generally, this
is a performance win which can't be eliminated. However, if a static
schedule will work for you and you have a lot of processing to do (very
coarse grain)—the overhead of TBB can be noticeable—but I think
we're still talking 10-20 percent at most. Such programs might be better
written in OpenMP using the static scheduling directive. I've seen some
people port an OpenMP application where they used static scheduling with
OpenMP, and then complain about slowdowns. This is not the right test
for TBB, since TBB can code many, many problems which OpenMP cannot. We
made TBB coexist with OpenMP and hand-coded threads, so you can use
each, if that is your preference, even in a single application.