Data Races Are Always Dangerous
In C++0x, a data race is a particular kind of race condition in which two threads access the same non-atomic variable without synchronization, and at least one of those accesses is a write. A data race results in undefined behavior, so if your C++0x program has a data race, it really could do anything at all.
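To make the definition concrete, here is a minimal sketch (using C++11, the standardized form of C++0x) of two threads that both write the same variable but do not race, because a std::mutex synchronizes every access. The names counter, counter_mutex, add_locked and run_locked are hypothetical, chosen for illustration only.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical example: both threads write `counter`, but every access happens
// while holding `counter_mutex`, so the accesses are synchronized and there is
// no data race. Deleting the lock_guard line would turn this into exactly the
// kind of data race defined above, and therefore undefined behavior.
unsigned counter = 0;
std::mutex counter_mutex;

void add_locked(unsigned n)
{
    for (unsigned k = 0; k < n; ++k) {
        std::lock_guard<std::mutex> lock(counter_mutex); // held across the whole read-modify-write
        ++counter;
    }
}

unsigned run_locked(unsigned per_thread)
{
    counter = 0;                             // no other thread exists yet, so this is safe
    std::thread t1(add_locked, per_thread);
    std::thread t2(add_locked, per_thread);
    t1.join();
    t2.join();                               // join() synchronizes with each thread's completion
    assert(counter == 2 * per_thread);       // with the lock, no increment is ever lost
    return counter;
}
```

With the lock in place, run_locked(100000) always returns 200000; without it, the standard makes no promise about the result at all.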
It is a disturbingly common misconception that such data races are not a problem in practice. They are. In the absence of synchronization, such as a mutex lock or atomic operations, compilers are free to optimize code so that variable accesses occur in a different order from the one written in your source. Not only that, but even if the instructions are generated in the sequence you expect, the CPU may perform the actual memory accesses out of order. This is a particular issue with modern CPUs that have long instruction pipelines, branch prediction, and prefetching. For instance, the memory read for a load may be performed several instructions before the load instruction itself.
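One way to see how synchronization restores a usable ordering is the following sketch, which uses a std::atomic<bool> flag with release/acquire ordering. A plain write made before the release store is guaranteed to be visible to a thread whose acquire load sees the flag set. The names payload, ready, producer, consumer and run_publish are hypothetical. If ready were an ordinary bool, the program would contain a data race, and the compiler or CPU would be free to reorder the two writes, or the compiler could even hoist the flag load out of the waiting loop.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names, for illustration only.
int payload = 0;                // plain, non-atomic data
std::atomic<bool> ready(false); // atomic flag that carries the synchronization

void producer()
{
    payload = 42;                                 // (1) plain write
    ready.store(true, std::memory_order_release); // (2) release: (1) may not be moved after this
}

int consumer()
{
    while (!ready.load(std::memory_order_acquire)) // acquire pairs with the release store
        std::this_thread::yield();
    return payload;                                // no data race: guaranteed to read 42
}

int run_publish()
{
    payload = 0;        // reset before any thread starts
    ready.store(false);
    int seen = -1;
    std::thread c([&seen] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();           // after join(), reading `seen` here is safe
    assert(seen == 42);
    return seen;
}
```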
In order to demonstrate some of the problems with data races in C++0x, I wrote the following simple program:
unsigned const increment_count=2000000;
unsigned const thread_count=2;
std::cout<<thread_count<<" threads, Final i="<<i
         <<", increments="<<(thread_count*increment_count)<<std::endl;
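Only fragments of the program appear in the listing above. A complete version might look roughly like the following sketch. It assumes std::thread for the worker threads and a global unsigned i that func increments in a loop; run_demo is a hypothetical stand-in for the body of the original main(). The code deliberately keeps the data race under discussion, so its output is not guaranteed to be anything in particular.

```cpp
#include <cassert>
#include <iostream>
#include <thread>
#include <vector>

unsigned const increment_count=2000000;
unsigned const thread_count=2;

unsigned i=0; // shared and non-atomic: every thread modifies it without synchronization

void func()
{
    for(unsigned c=0;c<increment_count;++c)
    {
        ++i; // DATA RACE when run on several threads at once: undefined behavior
    }
}

// Hypothetical wrapper for what the original main() presumably did.
unsigned run_demo()
{
    i=0; // reset before any worker thread starts
    std::vector<std::thread> threads;
    for(unsigned t=0;t<thread_count;++t)
        threads.emplace_back(func);
    for(unsigned t=0;t<thread_count;++t)
        threads[t].join();
    std::cout<<thread_count<<" threads, Final i="<<i
             <<", increments="<<(thread_count*increment_count)<<std::endl;
    return i;
}
```

Calling run_demo() from main() prints one line in the same format as the runs shown below. Note that an optimizing compiler is allowed to collapse the loop in func into a single addition, which changes the observed results, but not the fact that the program has undefined behavior.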
If you compile this code with optimization turned off, the generated assembly language looks much as you would expect: on an x86 CPU, the body of the loop in func is essentially a single INC instruction. You might therefore expect the final value of the global variable i to be simply the total number of increments performed by all threads (thread_count * increment_count), but this is not the case. An INC instruction on a memory operand is not atomic unless it carries the LOCK prefix, so if you run this code on a multicore or multiprocessor system, the final value of i will often be much less than the number of increments.
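For comparison, here is a sketch of the same experiment with the counter declared as std::atomic<unsigned>. Its fetch_add performs the whole read-modify-write as one indivisible operation (on x86 it typically compiles to a LOCK-prefixed instruction), so no increments are lost. The names counter, atomic_func and run_atomic are hypothetical. memory_order_relaxed is enough here because the threads exchange no other data through the counter, and join() makes the final value visible to the calling thread.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

std::atomic<unsigned> counter(0); // atomic: concurrent increments cannot be lost

void atomic_func(unsigned increment_count)
{
    for (unsigned c = 0; c < increment_count; ++c)
        counter.fetch_add(1, std::memory_order_relaxed); // one indivisible read-modify-write
}

unsigned run_atomic(unsigned thread_count, unsigned increment_count)
{
    counter.store(0);
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < thread_count; ++t)
        threads.emplace_back(atomic_func, increment_count);
    for (auto& th : threads)
        th.join();
    unsigned result = counter.load();
    assert(result == thread_count * increment_count); // every increment is counted
    return result;
}
```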
To demonstrate this point, here is the output of five consecutive runs of this code on my dual-core x86 laptop:
2 threads, Final i=2976075, increments=4000000
2 threads, Final i=3097899, increments=4000000
2 threads, Final i=4000000, increments=4000000
2 threads, Final i=3441342, increments=4000000
2 threads, Final i=2942251, increments=4000000
Because i starts at zero and the code increments it 4,000,000 times (2,000,000 times on each thread), you might naively expect a final value of 4,000,000. Only one of the five runs produced that value; the other four fell far short, losing between roughly 560,000 and 1,060,000 increments. This is because the non-atomic increments on the different threads interfere with each other.
On x86 architectures, a non-atomic increment is just a simple memory read followed by a simple memory write of the new value. If another thread updates the value between the read and the write, that thread's update is overwritten and lost. The consequences might be different on other architectures. For instance, you might get a value made up of parts of several different writes, or you might get a processor exception.
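The lost update can be reproduced deterministically by spelling out, on a single thread, the interleaving just described: each "thread" loads i into a register, and both loads happen before either store. This is only a model of the x86 behavior described above (lost_update_demo and the register variables are hypothetical names); because it runs on one thread, it is not itself a data race.

```cpp
#include <cassert>

// Models two threads each executing ++i as a separate load and store,
// interleaved so that both loads happen before either store.
unsigned lost_update_demo()
{
    unsigned i = 0;

    unsigned reg_a = i; // thread A reads 0
    unsigned reg_b = i; // thread B reads 0 before A has written back
    i = reg_a + 1;      // thread A writes 1
    i = reg_b + 1;      // thread B writes 1, overwriting A's update

    return i;           // 1, although two increments were performed
}
```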