
09 August 2009

Multi-Threaded Synchronization Performance on Xbox 360

In June, I did a series of micro-benchmarks using the .NET platform on a Windows PC. As fun as that is, I decided to take my tests to the Xbox 360 to see how the same algorithms on a different architecture would play out.

The first thing I noticed when porting my StaticLongTest to Xbox 360 is that my favorite functions from the Interlocked class are suddenly missing. In fact, the methods only operate on two 32-bit types: reference objects and signed integers. This leaves us with explicit locks and wait handles (ManualResetEvent and AutoResetEvent), neither of which is very interesting. So, I resigned myself to the fact that I would have to alter my test to be the SharedIntTest, which really decreases my interest in the subject, since 32-bit values can almost universally be read atomically. Bleh.

However, I journeyed onward, knowing that something interesting would have to come out of this. There really is not too much to expand on: the algorithms are all identical to my previous post, they just use integers instead of long integers. I must say that I had quite a frustrating time, as I came across a unique error whose appearance coincided with the most recent Xbox Live update, wherein Game Studio Connect informs me: "The game you're trying to play is missing or corrupt" (someone else on the XNA forums had the problem, too). Symptoms include your breakpoints suddenly failing to work - the IDE might tell you that the assembly you're trying to run is different from the one you have. Or it might not. In any case, the solution I found was to delete the file on the Xbox and delete any and all cache files and binaries on my PC.

Once I actually started deploying things on the 360, I noticed that my old testing classes would not work. The reason? The Xbox cannot handle more than around 500 threads - and even then it does not handle them very well. After much trial-and-error and repeatedly overloading my Xbox (it would not respond to X gem presses), I landed on a 100 thread limit. Much more than this and performance slows to a crawl. Realistically, this is not a big deal, as you really should never have more than 4 threads (one per available core) going.

But enough complaining, let's see some results!

Running Time (seconds)

Test Name                           | PC             | Xbox 360       | % Slower
No Concurrency *                    | 0.109 (92.7%)  | 1.639 (96.5%)  | 1 504%
Interlocked.Increment               | 0.141          | 1.425          | 1 011%
Interlocked.CompareExchange         | 0.203          | 3.612          | 1 779%
lock (object)                       | 0.328          | 136.989        | 41 765%
Interlocked.Exchange                | 0.625          | 93.868         | 15 018%
No Concurrency (batch) *            | 0.031 (100.0%) | 0.467 (54.0%)  | 1 506%
Interlocked.CompareExchange (batch) | 0.203          | 2.168          | 1 068%
lock(object) (batch)                | 0.047          | 1.990          | 4 234%
AutoResetEvent (batch)              | 0.047          | 2.343          | 4 985%
Interlocked.Exchange (batch)        | 0.031          | 0.926          | 2 987%
* Non-synchronizing algorithm yields incorrect results -- the number in parentheses gives the percentage of the correct total actually reached.

Something caught my eye almost immediately: despite having six hardware threads (three cores with two threads each, as opposed to my PC's two cores), the PC blows the Xbox out of the water on every single test. The 360 is at least 10 times slower than the PC. Why? An interesting question - it seems synchronization is just more difficult to do on the Xbox. But I'm not a fan of hand-waving a problem off, so I'll turn to my old pal, the XNA Remote Performance Monitor.

Perhaps you've heard of the .NET Compact Framework Garbage Collector and how it is not your friend and it is horrible and slow? Well, it's really not that bad (at least here). With the performance monitor, I can see that for the 10 tests run, the garbage collector ran 11 times, all but one of which were explicitly called by me right before the test was run. The one I did not call ran during the lock (object) test and really only occurred because the test took so damn long. As it turns out, this doesn't affect anything, as my code boxes 1 value type per frame, which is a non-issue for the garbage collector to clean.

I'm really not sure why Xbox synchronization performance is so slow. The problem most likely stems from the fact that the architecture of this particular gaming console centers around the GPU, whereas x86 architecture centers around the CPU - and thread synchronization is a very CPU-oriented problem.



As you can see, all memory access goes through the GPU, whereas on your PC, memory access is basically direct from the CPU:



This provides the Xbox with huge advantages when it comes to graphics performance, as many graphics calls that take time on the PC are essentially free on the 360. Much like on your typical GPU, there is an advanced memory store operation that performs swizzling for free. However, as far as I can tell, we lowly XNA developers do not have access to this functionality (aside from in your HLSL shaders).

Since it does not appear as if Microsoft has any plans to open the 360 for assembly-level work, we have to make do with what we have. There are two functions that do not totally suck: Interlocked.Increment and the batch version of Interlocked.Exchange.

Interlocked.Increment is not very interesting. As I've said before, it is a highly-specialized function that you have to be extremely creative with to use to do anything more than increment some variable. That said, it is very good at incrementing a variable.

However, for those of us that are sick, twisted individuals that enjoy a good time programming with some non-specific functions, the Interlocked.Exchange looks very neat. Some might recall that this very method was the uncontested winner of the Threaded Throughput Throw-down as the amount of data grew significantly large. This actually demonstrates an important lesson for Xbox programming: Never leave a core idle. Even though we are duplicating work (as opposed to relying on explicit locking mechanisms), this method of do-it-now-and-check-later outstrips the competition.

Okay, not all the competition. The non-concurrent batch method is twice as fast on the 360 as the Interlocked.Exchange method. Why? The answer is simple: the cache. All interlocked operations require leaving the CPU environment and actually interacting with outside things. Sure, the values are certainly in the cache, but other significant values could have changed outside our happy little environment. From this, there is another important lesson: try to make your processes interact as little as possible. As the saying goes: multithreading is not inherently difficult, it is the interaction which causes the problems.

In conclusion, this was a pretty useless post. I was really hoping to find a solution that really blew the others out of the water or at least showed a significant variation from running on the PC. Well, I guess now we know not to rely on explicit locks.

16 June 2009

Multi-Threaded Synchronization Performance in C#

I spend a lot of time studying and debating the many ways application performance can be improved. When working with applications that involve network usage or any other high-latency read process, making them multi-threaded is one of the best ways to decrease turnaround time. While waiting for more data to come in through the network, it is almost universally a Good Thing to start processing those jobs in the background. The next few posts will be dedicated to multi-threading performance, detailing the creation of a simple producer/consumer system and a reader/writer lock.

Let's start with the basics: A worker thread needs to communicate with its parent and while these interactions should be minimized, they are, at some point, necessary. What is the best way to get information from one thread to another? What is the least expensive synchronization mechanism?

After perusing the internet, there appears to be a shortage of real-world benchmarks for different methods of synchronization. This likely comes from the fact that performance is so incredibly varied: the same test may perform better with one algorithm on a dual-core system and with another on a single-core. Algorithms that work in all cases on an x86 or x64 platform may flat-out fail on other architectures such as Power. Universally, the choice of synchronization solution depends on variables and conditions not directly related to the algorithm. But, being an American, I love to see how things fare in head-to-head combat.

To have our battle to the death of algorithms, we must first choose what our code should accomplish. I've chosen to increment a shared long integer, as the actual operation we are attempting to accomplish will take relatively few cycles as compared to the synchronization mechanisms used to generate the correct result. A shared long also ensures that memory reads and writes cannot happen atomically without extra overhead.

In summary: There will be 1 000 threads incrementing a shared long integer 100 000 times.

Just so you know how things are working, this is basically how threads will be run:

TimeSpan ThreadTest(ThreadStart threadStart, out long value)
{
    // start our long at 0
    m_value = 0;

    // create all the threads to point at threadStart --
    // which should be a function pointer to an incrementer
    List<Thread> threads = new List<Thread>();
    for (int i = 0; i < 1000; i++)
    {
        Thread t = new Thread(threadStart);
        threads.Add(t);
    }

    // collect garbage now to minimize interference
    GC.Collect();

    // start each of the threads
    DateTime start = DateTime.Now;
    foreach (Thread t in threads)
        t.Start();

    // join each of the threads until completion
    foreach (Thread t in threads)
        t.Join();
    DateTime finish = DateTime.Now;

    value = m_value;
    return finish - start;
}
So: how can we accomplish this goal? The simplest solution would be to have 1000 threads looping and incrementing 100 000 times, like so:

private void NoConcurrency()
{
    for (int i = 0; i < INCREMENTS; i++)
        m_value++;
}
This process, however, will not give you consistent results: with this many threads and increments, m_value will never finish with the correct value of 100 000 000. The reason is that m_value++ is not an atomic operation; it actually looks more like:
long tempVal = m_value + 1;
m_value = tempVal;
With many threads all reading and writing, the value of m_value could easily have changed between when it was read and when the new value is written back. Ultimately, the final value will be much less than the desired 100 000 000.
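To see the lost updates in action, here is a minimal, self-contained sketch (the `LostUpdateDemo` class name and the 8-thread count are mine, not from the test harness): a handful of threads hammer an unsynchronized counter, and the final tally almost always falls short of the ideal total.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

class LostUpdateDemo
{
    static long m_value = 0;
    const int INCREMENTS = 100000;

    static void Main()
    {
        var threads = new List<Thread>();
        for (int i = 0; i < 8; i++)
            threads.Add(new Thread(() =>
            {
                for (int j = 0; j < INCREMENTS; j++)
                    m_value++;   // read-modify-write: NOT atomic
            }));

        foreach (Thread t in threads) t.Start();
        foreach (Thread t in threads) t.Join();

        // Expected 800 000; the actual result is almost always less,
        // because two threads can read the same value and both write
        // back that value plus one, losing an increment.
        Console.WriteLine("Expected {0}, got {1}", 8L * INCREMENTS, m_value);
    }
}
```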

The classic solution to this problem is to use explicit locks through mutexes. C# has a convenient construct for this: one simply writes lock(object) { ... } and a mutex designated by the given reference object is acquired before the code block and released afterwards. Changing our original non-concurrent method is quite simple:
private void MutexBlock()
{
    for (int i = 0; i < INCREMENTS; i++)
        lock (lockObject)
            m_value++;
}
Wow! That was easy! Now m_value will end at the desired value. This is .NET's convenient mutual-exclusion lock, and it can be used with any reference object. It means that only one thread can obtain a lock on the given lock object at a time. Since we're assuming all access to m_value is done in this function, the read, increment and write-back of the value happen atomically (technically not 100% true: if somebody Ctrl+Alt+Deletes the process, that notion flies out the window).

But incrementing something is such a common operation and locking is slow...surely there must be a faster way of performing this operation without the use of locks. Of course! Enter .NET's convenient batch of functions inside the Interlocked class.
private void InterlockedIncrement()
{
    for (int i = 0; i < INCREMENTS; i++)
        Interlocked.Increment(ref m_value);
}
That's neat - somebody made it so you can just increment a value - and in a single operation, too! Be wary, though, this does not compile directly to an x86 assembly statement like lock inc [eax]. There is a lot more stuff going on behind the scenes, with memory barriers and some volatile writing so when we get off of x86 and move to something more interesting like PowerPC, our assumptions of this working hold true.
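One handy detail the loop above doesn't use: Interlocked.Increment also returns the incremented value, which gives you a free atomic ticket dispenser. A quick single-threaded sketch (the `NextTicket` name is my own invention) just to show the return value:

```csharp
using System;
using System.Threading;

class TicketDemo
{
    static int m_ticket = 0;

    // Every caller gets a unique ticket number, even under contention:
    // the increment and the read of the new value are one atomic step.
    static int NextTicket()
    {
        return Interlocked.Increment(ref m_ticket);
    }

    static void Main()
    {
        Console.WriteLine(NextTicket()); // 1
        Console.WriteLine(NextTicket()); // 2
    }
}
```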

I'm sure you can see the shortcoming of the Interlocked.Increment function: you can only increment an integer or a long by 1. Okay, so there is an Interlocked.Add function which lets you add (or subtract) values, but none of these are very interesting. No, it doesn't matter for the current operation, but I'm sure you can see that it would be beneficial to have some flexibility.

What we need is a system of locking that doesn't actually use the lock construct. Luckily, there are a million ways to do this. One of them is through the use of Interlocked.Exchange.
private void InterlockedExchangeBlock()
{
    for (int i = 0; i < INCREMENTS; i++)
    {
        long val;
        // swap in the sentinel -1; spin until we get a real value back
        while ((val = Interlocked.Exchange(ref m_value, -1)) == -1) ;
        val++;
        Interlocked.Exchange(ref m_value, val);
    }
}
What is happening here? First, we're swapping out the value at m_value with -1 and checking if the value we received is -1. We keep trying again and again until we get a value that is not -1. Then, we increment the value we found in m_value and save it back.

The keen observer will notice that this is basically the lock construct, except we are wasting CPU cycles by spinning. Why would we do this instead of using an explicit lock? The answer is actually quite complicated. When a .NET thread is unable to obtain a lock for a mutex, it yields to let the operating system reschedule the thread. That is fine, but the thread could be rescheduled a long time in the future and if we know that the value is going to become readable very soon, we should spin-wait until we can grab it. Yes, we'll waste some CPU cycles, but it's all in the name of making a super bad-ass program.
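To make the trade-off concrete, here is a bare-bones spin lock built on Interlocked.CompareExchange (a sketch of my own, not from the benchmark code; `Acquire`/`Release` and the 4-thread demo are my names and numbers). Thread.SpinWait burns a few cycles between attempts instead of yielding to the scheduler:

```csharp
using System;
using System.Threading;

class SpinLockDemo
{
    static int m_held = 0;    // 0 = free, 1 = held
    static long m_value = 0;

    static void Acquire()
    {
        // Atomically flip 0 -> 1; if someone else holds the lock,
        // spin briefly and retry rather than yielding to the OS.
        while (Interlocked.CompareExchange(ref m_held, 1, 0) != 0)
            Thread.SpinWait(20);
    }

    static void Release()
    {
        // Exchange acts as a full memory barrier on the way out.
        Interlocked.Exchange(ref m_held, 0);
    }

    static void Main()
    {
        var threads = new Thread[4];
        for (int i = 0; i < threads.Length; i++)
            threads[i] = new Thread(() =>
            {
                for (int j = 0; j < 100000; j++)
                {
                    Acquire();
                    m_value++;      // protected by the spin lock
                    Release();
                }
            });
        foreach (var t in threads) t.Start();
        foreach (var t in threads) t.Join();
        Console.WriteLine(m_value); // 400000
    }
}
```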

But we've definitely freed ourselves of the problem with using Interlocked.Increment - sure, we're only incrementing, but we could have multiplied m_value by 10 or found the m_valueth Fibonacci number before storing it back in its position. The only real problem we run across is if m_value changes while we're performing the math on it. Is there an atomic way to check for this sort of thing? Of course! That's what Interlocked.CompareExchange is for - it takes one more argument, which is compared to the value in memory, and the exchange is only performed if the two are equal.
private void InterlockedCompareExchange()
{
    for (int i = 0; i < INCREMENTS; i++)
    {
        long orig, newval;
        do
        {
            orig = m_value;
            newval = orig + 1;
        } while (Interlocked.CompareExchange(ref m_value, newval, orig) != orig);
    }
}
This is a very strange-looking function. We grab the original value from the memory location, perform the math and then do something funny. What that while condition means in English is:
  1. Compare the current value of m_value with what we originally pulled
  2. If they are equal, put the new value into m_value
  3. If they are not, retry the loop
This is true lockless programming. At no point is it possible for a thread to stall - if something bad happens and m_value is changed, the loop will be repeated, which is not really that big of a deal. This operation is persistent and atomic.

But it turns out that the biggest advantage of this algorithm is also its biggest weakness. In this case, redoing the operation is trivial - reading from memory and adding 1 is really, really cheap. When that work is thrown out, we have wasted very few cycles. But if we were doing something more expensive, such as downloading a web site and performing actions based on the received information, throwing out that information represents a huge waste. Also, as the time span between the initial read and the final write-back increases, the likelihood of that original read becoming invalid increases. It's really an art.
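That flexibility can be wrapped in a tiny helper (again a sketch of my own, not part of the benchmark; `Atomic.Update` is a hypothetical name): hand it any pure function and the compare-exchange loop retries until the update lands atomically.

```csharp
using System;
using System.Threading;

static class Atomic
{
    // Applies f to location atomically, retrying on contention.
    // f may run more than once, so it must be side-effect free.
    public static long Update(ref long location, Func<long, long> f)
    {
        long orig, updated;
        do
        {
            orig = Interlocked.Read(ref location);  // atomic 64-bit read
            updated = f(orig);
        } while (Interlocked.CompareExchange(ref location, updated, orig) != orig);
        return updated;
    }
}

class AtomicDemo
{
    static long m_value = 7;

    static void Main()
    {
        Atomic.Update(ref m_value, v => v * 10); // the "multiply by 10" case
        Console.WriteLine(m_value); // 70
    }
}
```

The same helper works for the Fibonacci example or anything else, with the caveat from above: the longer f takes, the more likely the loop is to retry.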

Okay, enough talk. What are the results? I ran the tests described here as well as "batch" versions, which obtain the lock and then perform the math. See the link at the bottom of the post for the exact source code. All of the tests were run under DEBUG and RELEASE compilation, both with the Visual Studio debugger attached and by themselves...here is a table of results:

Running Time (seconds)

Test Name                           | DEBUG, Attached | DEBUG, Clean | RELEASE, Attached | RELEASE, Clean
No Concurrency *                    | 3.374 (91%)     | 0.499 (75%)  | 3.843 (93%)       | 0.281 (34%)
Interlocked.Increment               | 12.124          | 3.687        | 12.890            | 3.749
Interlocked.CompareExchange         | 10.249          | 4.812        | 11.156            | 4.249
lock(object)                        | 14.460          | 4.531        | 13.671            | 4.390
Interlocked.Exchange                | 15.265          | 11.796       | 16.234            | 9.421
AutoResetEvent **                   | 1:35.655        | 1:54.999     | 2:22.670          | 1:02.773
No Concurrency (batch) *            | 4.203 (76%)     | 0.218 (50%)  | 3.630 (78%)       | 0.910 (51%)
Interlocked.CompareExchange (batch) | 4.218           | 0.390        | 3.383             | 0.137
lock(object) (batch)                | 4.620           | 0.421        | 4.656             | 0.265
AutoResetEvent (batch)              | 2.828           | 0.437        | 2.812             | 0.281
Interlocked.Exchange (batch)        | 4.765           | 0.374        | 4.593             | 0.780
* Non-synchronizing algorithm yields incorrect results -- the number in parentheses gives the percentage of the correct total actually reached.
** The AutoResetEvent test is calculating 1/100th of the data of the other algorithms, due to general slowness
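The "batch" variants are described but never shown in this post, so here is a sketch of what the lock(object) batch version presumably looks like (the `BatchDemo` scaffolding and thread count are mine, assuming the same fields as the per-increment tests): grab the lock once per thread and do all the increments inside it.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

class BatchDemo
{
    static readonly object lockObject = new object();
    static long m_value = 0;
    const int INCREMENTS = 100000;

    // Per-increment version: 100 000 lock acquisitions per thread.
    // Batch version: ONE acquisition per thread, all the math inside.
    static void MutexBatchBlock()
    {
        lock (lockObject)
        {
            for (int i = 0; i < INCREMENTS; i++)
                m_value++;
        }
    }

    static void Main()
    {
        var threads = new List<Thread>();
        for (int i = 0; i < 10; i++)
            threads.Add(new Thread(MutexBatchBlock));
        foreach (var t in threads) t.Start();
        foreach (var t in threads) t.Join();
        Console.WriteLine(m_value); // 1000000
    }
}
```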



The most interesting results are the batch operations of Interlocked.Exchange and Interlocked.CompareExchange: they perform significantly faster than their non-concurrent counterparts. How could that be? These algorithms spend a portion of their time spinning in while loops and repeating work. There are a couple things going on here to cause this result.

The first thing is CPU burst. Remember, these 1000 threads are not actually executing in parallel, but each is given a certain amount of time with the CPU before getting interrupted by the operating system. This means that the internal for loop takes approximately 300000 cycles to complete (100000 loops * (2 increments/loop + 1 JZ/loop)), assuming it compiled into something like:
mov ecx, 100000
mov ebx, 1
$lStart: sub ecx, ebx
jz SHORT $lFinish
add eax, ebx
jmp SHORT $lStart
$lFinish: ; whatever...

On my 3GHz computer, this would take approximately 0.094 milliseconds to complete.

This is where the second factor comes in: locality for run-time optimization. Both of these functions perform math on a stack-allocated local variable that is not passed to any other function. This is an ideal candidate to simply be put in a register at run-time, like in my assembly example. The CLR does not compile the byte-code to exactly my assembly, but you get the idea. Knowing the general idea of what the runtime environment is doing and knowing how you can help to optimize it is an important component in high-performance managed coding for any language.

Another interesting result is the batch job for no concurrency takes longer than the straight non-concurrent method. However, this only seems to apply during the RELEASE build. In DEBUG, the batch non-concurrent method is consistently faster than the RELEASE build and always the fastest algorithm. The IL output for the two methods is identical, so something fishy must be going on. The Internet seems to be lacking information, so if anybody knows why emitting debug information seems to speed up the process, I would really like to know. The result is wrong, but it would still be nice to know.

The best conclusion that we can make, however, is that RELEASE without an attached debugger is the best monitor of threading behavior. It generally performs better than other build and run combinations (as it should) and, if nothing else, is the build and run environment that you should eventually be deploying to a production environment.

So what happens if we vary things like how much we're incrementing with each pass or how many threads are performing the incrementing? I modified the code to run through each of the tests with 10, 50, 100, 500, 1000, 2000 and 5000 threads and 25000, 50000, 75000, 100000, 150000 and 200000 increments. I also stopped testing AutoResetEvent because it sucks.

We could look directly at the numbers, but it's hard to decipher what the data means. Let's look at some pretty graphs brought to you by our good friend Excel! I thought the most interesting statistic would be a measure of throughput: How many effective increments are being performed each second?

First, the control group: Our non-concurrent method:

Non-concurrent performance graph
As you can see, throughput is higher with fewer threads and fewer increments. Standard allocation prefers to grab and let go. Perhaps this is more meaningful with something to compare to:

Interlocked.CompareExchange performance graph
With Interlocked.CompareExchange, you can see that it remains pretty consistently around 20 million increments/second. There is a slight dip with 50 threads and 75000 increments, although it certainly doesn't break the overall trend.

Interlocked.CompareExchange batch performance graph
The batch version, however, almost doubles the throughput of the non-concurrent method and appears to increase steadily as the number of increments increases. Not only that, but the number of threads does not seem to affect the performance of the algorithm in the slightest.

Interlocked.Exchange batch performance graph
The algorithm with the highest throughput is none other than the batch version using Interlocked.Exchange, peaking at over a billion increments each second. The best reason for this is that we're only performing 200000 increments at a time, which is quite a cheap operation, all things considered.

If we constrain the amount of threads to 1000 (enough threads for heavy contention without completely thrashing the scheduler), we can make a nice 2-D graph of how different algorithms behave:
Performance summary chart

That graph pretty much summarizes it: batch operations are way faster than trying to synchronize on each increment. But if you cannot divide the work into separate modules, the Interlocked.Exchange method is the fastest, trailed at quite a distance by explicit locking. Good to know.

Source Code for the shared long test
The book of results

Come back next week to use this information for something useful!