16 June 2009

Multi-Threaded Synchronization Performance in C#

I spend a lot of time studying and debating the many ways application performance can be improved. For applications that involve network usage or any other high-latency read process, making them multi-threaded is one of the best ways to decrease turnaround time. While waiting for more data to come in over the network, it is almost universally a Good Thing to start processing the jobs you already have in the background. The next few posts will be dedicated to multi-threading performance, detailing the creation of a simple producer/consumer system and a reader/writer lock.

Let's start with the basics: A worker thread needs to communicate with its parent and while these interactions should be minimized, they are, at some point, necessary. What is the best way to get information from one thread to another? What is the least expensive synchronization mechanism?

After perusing the internet, there appears to be a shortage of real-world benchmarks for different methods of synchronization. This likely comes from the fact that performance is so incredibly varied. The same test may perform better with one algorithm on a dual-core system and with another on a single-core. Algorithms that work for all cases on an x86 or x64 platform may flat-out fail on other architectures such as PowerPC. Universally, the choice of synchronization solution depends on variables and conditions not directly related to the algorithm. But, being an American, I love to see how things fare in head-to-head combat.

To have our battle to the death of algorithms, we must first choose what our code should accomplish. I've chosen to increment a shared long integer, as the actual operation we are attempting will take relatively few cycles compared to the synchronization mechanisms used to guarantee a correct result. A shared long also ensures that, on a 32-bit platform, memory reads and writes cannot happen atomically without extra overhead.

In summary: There will be 1 000 threads incrementing a shared long integer 100 000 times.

Just so you know how things are working, this is basically how threads will be run:

TimeSpan ThreadTest(ThreadStart threadStart, out long value)
{
    // start our long at 0
    m_value = 0;

    // create all the threads to point at threadStart --
    // which should be a function pointer to an incrementer
    List<Thread> threads = new List<Thread>();
    for (int i = 0; i < 1000; i++)
    {
        Thread t = new Thread(threadStart);
        threads.Add(t);
    }

    // collect garbage now to minimize interference
    GC.Collect();

    // start each of the threads
    DateTime start = DateTime.Now;
    foreach (Thread t in threads)
        t.Start();

    // join each of the threads until completion
    foreach (Thread t in threads)
        t.Join();
    DateTime finish = DateTime.Now;

    value = m_value;
    return finish - start;
}
So: how can we accomplish this goal? The simplest solution would be to have 1000 threads looping and incrementing 100 000 times, like so:

private void NoConcurrency()
{
    for (int i = 0; i < INCREMENTS; i++)
        m_value++;
}
This process, however, will not give you consistent results; with this many threads and increments, m_value will never finish with the correct value of 100 000 000. The reason for this is that m_value++ is not an atomic operation. It actually looks more like:
long tempVal = m_value;
tempVal = tempVal + 1;
m_value = tempVal;
With many threads all reading and writing, the value of m_value could easily have changed between the read and the write-back. Ultimately, the final value will be much less than the desired 100 000 000.

The classic solution to this problem is to use explicit locks through mutexes. C# has a convenient construct for this, wherein one simply calls lock(object) { ... } and a mutex designated by the given reference object is acquired before the code block and released afterwards. Changing our original non-concurrent method is quite simple:
private void MutexBlock()
{
    for (int i = 0; i < INCREMENTS; i++)
        lock (lockObject)
            m_value++;
}
Wow! That was easy! Now m_value will end at the desired value. This is .NET's convenient mutual-exclusion lock, and it can be used with any reference object. It means that only one thread can hold the lock on the given lock object at a time. Since we're assuming all access to m_value happens in this function, the read, increment and write-back of the value happen atomically (technically not 100% true: if somebody Ctrl+Alt+Deletes the process, that notion flies out the window).
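Under the hood, the lock statement is just syntactic sugar over the Monitor class. Here is a minimal, self-contained sketch of the equivalent Monitor.Enter/Exit pattern - the class and thread counts are made up for illustration, not part of the benchmark code:

```csharp
using System;
using System.Threading;

class MonitorDemo
{
    static long m_value = 0;
    static readonly object lockObject = new object();

    static void Increment()
    {
        // roughly what lock (lockObject) { m_value++; } expands to
        Monitor.Enter(lockObject);
        try
        {
            m_value++;
        }
        finally
        {
            // always released, even if the body throws
            Monitor.Exit(lockObject);
        }
    }

    static void Main()
    {
        Thread[] threads = new Thread[10];
        for (int i = 0; i < threads.Length; i++)
            threads[i] = new Thread(() => { for (int j = 0; j < 1000; j++) Increment(); });
        foreach (Thread t in threads) t.Start();
        foreach (Thread t in threads) t.Join();
        Console.WriteLine(m_value); // 10 threads * 1000 increments = 10000
    }
}
```

The try/finally is what guarantees the release; newer compilers also thread a lockTaken flag through an Enter overload, but the shape is the same.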

But incrementing something is such a common operation and locking is slow...surely there must be a faster way of performing this operation without the use of locks. Of course! Enter .NET's convenient batch of functions inside the Interlocked class.
private void InterlockedIncrement()
{
    for (int i = 0; i < INCREMENTS; i++)
        Interlocked.Increment(ref m_value);
}
That's neat - somebody made it so you can just increment a value, and in a single operation, too! Be wary, though: this does not compile directly to an x86 assembly statement like lock inc [eax]. There is a lot more going on behind the scenes, with memory barriers and some volatile writing, so that when we get off of x86 and move to something more interesting like PowerPC, our assumption that this works still holds true.

I'm sure you can see the shortcoming of the Interlocked.Increment function: you can only increment an integer or a long by 1. Okay, so there is an Interlocked.Add function which lets you add (or subtract) values, but none of these are very interesting. No, it doesn't matter for the current operation, but I'm sure you can see that it would be beneficial to have some flexibility.
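For the record, Interlocked.Add looks like this in practice - a minimal sketch with made-up thread counts and amounts, not part of the benchmarked code:

```csharp
using System;
using System.Threading;

class InterlockedAddDemo
{
    static long total = 0;

    static void Main()
    {
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.Length; i++)
            threads[i] = new Thread(() =>
            {
                // atomically add 5 at a time instead of incrementing by 1;
                // Interlocked.Add returns the new value after the addition
                for (int j = 0; j < 1000; j++)
                    Interlocked.Add(ref total, 5);
            });
        foreach (Thread t in threads) t.Start();
        foreach (Thread t in threads) t.Join();
        Console.WriteLine(total); // 8 threads * 1000 adds * 5 = 40000
    }
}
```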

What we need is a system of locking that doesn't actually use the lock construct. Luckily, there are a million ways to do this. One of them is through the use of Interlocked.Exchange.
private void InterlockedExchangeBlock()
{
    for (int i = 0; i < INCREMENTS; i++)
    {
        long val;
        while ((val = Interlocked.Exchange(ref m_value, -1)) == -1) ;
        Interlocked.Exchange(ref m_value, val + 1);
    }
}
What is happening here? First, we're swapping out the value at m_value with -1 and checking if the value we received is -1. We keep trying again and again until we get a value that is not -1. Then, we increment the value we found in m_value and save it back.

The keen observer will notice that this is basically the lock construct, except we are wasting CPU cycles by spinning. Why would we do this instead of using an explicit lock? The answer is actually quite complicated. When a .NET thread is unable to obtain a mutex, it yields and lets the operating system reschedule it. That is fine, but the thread might not be rescheduled until well into the future; if we know the value is going to become available very soon, we should spin-wait until we can grab it. Yes, we'll waste some CPU cycles, but it's all in the name of making a super bad-ass program.
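If you want spinning without burning cycles indefinitely, a common compromise is to spin briefly and then start yielding the time slice. Here is a sketch of that idea applied to the sentinel trick above - SpinThenYieldDemo and its thresholds are hypothetical, not one of the benchmarked methods:

```csharp
using System;
using System.Threading;

class SpinThenYieldDemo
{
    static long m_value = 0; // -1 acts as the "locked" sentinel

    static void Increment()
    {
        long val;
        int spins = 0;
        // spin briefly while the sentinel is in place, then start
        // yielding our time slice so we don't starve the holder
        while ((val = Interlocked.Exchange(ref m_value, -1)) == -1)
        {
            if (++spins < 100)
                Thread.SpinWait(20); // stay scheduled, burn a few cycles
            else
                Thread.Sleep(0);     // let the OS run somebody else
        }
        Interlocked.Exchange(ref m_value, val + 1); // write back, releasing
    }

    static void Main()
    {
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.Length; i++)
            threads[i] = new Thread(() => { for (int j = 0; j < 1000; j++) Increment(); });
        foreach (Thread t in threads) t.Start();
        foreach (Thread t in threads) t.Join();
        Console.WriteLine(m_value); // 4 threads * 1000 increments = 4000
    }
}
```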

But we've definitely freed ourselves of the limitation of Interlocked.Increment - sure, we're only incrementing here, but we could just as easily have multiplied m_value by 10 or found the m_value-th Fibonacci number before storing it back in its position. The only real problem we run across is if m_value changes while we're performing the math on it. Is there an atomic way to check for this sort of thing? Of course! That's what Interlocked.CompareExchange is for - it takes one more argument, a comparand, which is compared against the value in memory; the exchange is only performed if the two are equal, and the original value is returned either way.
private void InterlockedCompareExchange()
{
    for (int i = 0; i < INCREMENTS; i++)
    {
        long orig, newval;
        do
        {
            orig = m_value;
            newval = orig + 1;
        } while (Interlocked.CompareExchange(ref m_value, newval, orig) != orig);
    }
}
This is a very strange-looking function. We grab the original value from the memory location, perform the math, and then do something funny. What that while clause means in English is:
  1. Compare the current value of m_value with what we originally pulled
  2. If they are equal, put the new value into m_value
  3. If they are not, retry the loop
This is true lockless programming. At no point is it possible for a thread to stall - if something bad happens and m_value is changed, the loop will be repeated, which is not really that big of a deal. This operation is persistent and atomic.
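To show the promised flexibility, the same compare-exchange loop can apply any pure function to the shared value - multiply by 10, add 7, whatever. AtomicTransform below is a hypothetical helper I'm sketching for illustration, not part of the test code:

```csharp
using System;
using System.Threading;

class CasDemo
{
    static long m_value = 3;

    // hypothetical helper: atomically replace m_value with f(m_value),
    // retrying until no other thread changed it mid-computation
    static long AtomicTransform(Func<long, long> f)
    {
        long orig, newval;
        do
        {
            orig = Interlocked.Read(ref m_value); // atomic 64-bit read
            newval = f(orig);                     // do the math off to the side
        } while (Interlocked.CompareExchange(ref m_value, newval, orig) != orig);
        return newval;
    }

    static void Main()
    {
        Console.WriteLine(AtomicTransform(v => v * 10)); // 3 * 10 = 30
        Console.WriteLine(AtomicTransform(v => v + 7));  // 30 + 7 = 37
    }
}
```

Note that f may be called more than once under contention, which is exactly why it must be pure - the point made in the next paragraph.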

But it turns out that the biggest advantage of this algorithm is also its biggest weakness. In this case, redoing the operation is trivial - reading from memory and adding 1 is really, really cheap. When that work is thrown out, we have wasted very few cycles. But if we were doing something more expensive, such as downloading a web site and performing actions based on the received information, throwing out that information represents a huge waste. Also, as the time span between the initial read and the final write-back increases, the likelihood of that original read becoming invalid increases. It's really an art.

Okay, enough talk. What are the results? I ran the tests described here as well as "batch" versions, which obtain the lock and then perform the math. See the link at the bottom of the post for the exact source code. All of the tests were run under DEBUG and RELEASE compilation, both with the Visual Studio debugger attached and by themselves...here is a table of results:

                                      Running Time (in seconds)
Test Name                             DEBUG, Attached  DEBUG, Clean  RELEASE, Attached  RELEASE, Clean
No Concurrency *                      3.374 (91%)      0.499 (75%)   3.843 (93%)        0.281 (34%)
AutoResetEvent **                     1:35.655         1:54.999      2:22.670           1:02.773
No Concurrency (batch) *              4.203 (76%)      0.218 (50%)   3.630 (78%)        0.910 (51%)
Interlocked.CompareExchange (batch)   4.218            0.390         3.383              0.137
lock(object) (batch)                  4.620            0.421         4.656              0.265
AutoResetEvent (batch)                2.828            0.437         2.812              0.281
Interlocked.Exchange (batch)          4.765            0.374         4.593              0.780
* Non-synchronizing algorithm yields incorrect results -- the number in parentheses gives the percentage of the correct total actually reached.
** The AutoResetEvent tests perform 1/100th of the increments of the other algorithms, due to general slowness.

The most interesting results are the batch operations of Interlocked.Exchange and Interlocked.CompareExchange: they perform significantly faster than their non-concurrent counterparts. How could that be? These algorithms spend a portion of their time spinning in while loops and repeating work. There are a couple of things going on here to cause this result.

The first thing is the CPU burst - the slice of time each thread gets before being interrupted by the operating system. Remember, these 1000 threads are not actually executing in parallel; each runs for a while and is then preempted. This means that the internal for loop takes approximately 300 000 cycles to complete (100 000 iterations * (2 arithmetic instructions + 1 conditional jump per iteration)), assuming it compiled into something like:
mov ecx, 100000
mov ebx, 1
$lStart: add eax, ebx
sub ecx, ebx
jnz SHORT $lStart
; whatever...

On my 3GHz computer, this would take approximately 0.1 milliseconds to complete.

This is where the second factor comes in: locality for run-time optimization. Both of these functions perform math on a stack-allocated local variable that is not passed to any other function. This is an ideal candidate to simply be put in a register at run-time, like in my assembly example. The CLR does not compile the byte-code to exactly my assembly, but you get the idea. Knowing the general idea of what the runtime environment is doing and knowing how you can help to optimize it is an important component in high-performance managed coding for any language.

Another interesting result is that the batch job for no concurrency takes longer than the straight non-concurrent method. However, this only seems to apply in the RELEASE build. In DEBUG, the batch non-concurrent method is consistently faster than its own RELEASE build and is always the fastest algorithm in its column. The IL output for the two methods is identical, so something fishy must be going on. The Internet seems to be lacking information here, so if anybody knows why emitting debug information seems to speed up the process, I would really like to know. The result is wrong either way, but it would still be nice to know.

The best conclusion that we can make, however, is that RELEASE without an attached debugger is the best measure of threading behavior. It generally performs better than the other build-and-run combinations (as it should) and, if nothing else, it is the configuration you will eventually be deploying to a production environment.

So what happens if we vary things like how much we're incrementing with each pass or how many threads are performing the incrementing? I modified the code to run through each of the tests with 10, 50, 100, 500, 1000, 2000 and 5000 threads and 25000, 50000, 75000, 100000, 150000 and 200000 increments. I also stopped testing AutoResetEvent because it sucks.

We could look directly at the numbers, but it's hard to decipher what the data means. Let's look at some pretty graphs brought to you by our good friend Excel! I thought the most interesting statistic would be a measure of throughput: How many effective increments are being performed each second?
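For reference, the throughput metric is just (threads * increments) divided by the running time. A quick sketch of the calculation behind the graphs - the running time here is made up for illustration:

```csharp
using System;

class ThroughputCalc
{
    static void Main()
    {
        int threads = 1000;
        int increments = 100000;
        double seconds = 5.0; // illustrative running time, not a measured result

        // effective increments per second across all threads
        double throughput = (double)threads * increments / seconds;
        Console.WriteLine((long)throughput); // 1000 * 100000 / 5 = 20000000
    }
}
```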

First, the control group: Our non-concurrent method:

Non-concurrent performance graph
As you can see, throughput is higher with fewer threads and fewer increments. Standard allocation prefers to grab and let go. Perhaps this is more meaningful with something to compare to:

Interlocked.CompareExchange performance graph
With Interlocked.CompareExchange, you can see that it remains pretty consistently around 20 million increments/second. There is a slight dip with 50 threads and 75000 increments, although it certainly doesn't break the overall trend.

Interlock.CompareExchange batch performance graph
The batch version, however, almost doubles the throughput of the non-concurrent method and appears to increase steadily as the number of increments increases. Not only that, but the number of threads does not seem to affect the performance of the algorithm in the slightest.

Interlocked.Exchange batch performance graph
The algorithm with the highest throughput is none other than the batch version using Interlocked.Exchange, peaking at over a billion increments each second. The most likely reason for this is that we're only performing 200000 increments at a time, which is quite a cheap operation, all things considered.

If we constrain the number of threads to 1000 (enough threads for heavy contention without completely thrashing the scheduler), we can make a nice 2-D graph of how the different algorithms behave:
Performance summary chart

That graph pretty much summarizes it: batch operations are way faster than trying to synchronize on each increment. But if you cannot divide the work into batches, the Interlocked.Exchange method is the fastest, trailed at quite a distance by explicit locking. Good to know.

Source Code for the shared long test
The book of results

Come back next week to use this information for something useful!

1 comment:

  1. Oh my god! These are pretty graphs! You must have spent at least 10 or more hours making them.