09 August 2009

Multi-Threaded Synchronization Performance on Xbox 360

In June, I did a series of micro-benchmarks using the .NET platform on a Windows PC. As fun as that is, I decided to take my tests to the Xbox 360 to see how the same algorithms on a different architecture would play out.

The first thing I noticed when porting my StaticLongTest to Xbox 360 is that my favorite functions from the Interlocked class are suddenly missing. In fact, we are only allowed to operate the methods on two 32-bit types: reference objects and integers (not unsigned integers). This leaves us with explicit locks and wait handles (ManualResetEvent and AutoResetEvent), neither of which are very interesting. So, I resigned to the fact that I would have to alter my test to be the SharedIntTest, which really decreases my interest in the subject, since 32-bit values can almost univerally be read atomically. Bleh.

However, I journeyed onward, knowing that something interesting would have to come out of this. There really is not too much to expand on, the algorithms are all identical from my previous post, they are just using integers instead of long integers. I must say that I had quite a frustrating time, as I came across a unique error whose appearance coincides with the most recent Xbox Live update, wherein the Game Studio Connect informs me: "The game you're trying to play is missing or corrupt" (someone else on the XNA forms had the problem, too). Symptoms include your breakpoints suddenly failing to work - the IDE might tell you that the assembly you're trying to run is different from the one you have. Or it might not. In any case, the solution I found was to delete the file on the Xbox and delete any and all cache files and binaries on my PC.

Once I actually started deploying things on the 360, I noticed that my old testing classes would not work. The reason? The Xbox cannot handle more than around 500 threads - and even then it does not handle them very well. After much trial-and-error and repeatedly overloading my Xbox (it would not respond to X gem presses), I landed on a 100 thread limit. Much more than this and performance slows to a crawl. Realistically, this is not a big deal, as you really should never have more than 4 threads (one per available core) going.

But enough complaining, let's see some results!

Test NameRunning Time (seconds)
PCXbox 360% Slower
No Concurrency *0.109 (92.7%)1.639 (96.5%)1 504%
Interlocked.Increment0.1411.4251 011%
Interlocked.CompareExchange0.2033.6121 779%
lock (object)0.328136.98941 765%
Interlocked.Exchange0.62593.86815 018%
No Concurrency (batch)*0.031 (100.0%)0.467 (54.0%)1 506%
Interlocked.CompareExchange (batch)0.2032.1681 068%
lock(object) (batch)0.0471.9904 234%
AutoResetEvent (batch)0.0472.3434 985%
Interlocked.Exchange (batch)0.0310.9262 987%
* Non-synchronizing algorithm yeilds incorrect results -- number in paraenthesis gives correct percentage.

Something caught my eye almost immediately: Despite having 6 cores (as opposed to my PC's 2), the PC blows the Xbox out of the water on every single test. The 360 is at least 10 times slower than the PC. Why? An interesting question - it seems synchronization is just more difficult to do on the Xbox. But I'm not a fan of hand-waving a problem off, so I'll turn to my old pal, the XNA Remote Performance Monitor.

Perhaps you've heard of the .NET Compact Framework Garbage Collector and how it is not your friend and it is horrible and slow? Well, it's really not that bad (at least here). With the performance monitor, I can see that for the 10 tests run, the garbage collector ran 11 times, all but one of which were explicitly called by me right before the test was run. The one I did not call ran during the lock (object) test and really only occurred because the test took so damn long. As it turns out, this doesn't affect anything, as my code boxes 1 value type per frame, which is a non-issue for the garbage collector to clean.

I'm really not sure why Xbox synchronization performance is so slow. The problem most likely stems from the fact that the architecture of this particular gaming console centers around the GPU, whereas x86 architecture centers around the CPU - and thread synchronization is a very CPU-oriented problem.



As you can see, all memory access goes through the GPU, whereas on your PC, memory access is basically direct from the CPU:



This provides the Xbox with huge advantages when it comes to graphics performance, as many graphics calls that take time on the PC are essentially free on the 360. Much like on your typical GPU, there is an advanced memory store operation that performs swizzling for free. However, as far as I can tell, we lowly XNA developers do not have access to this functionality (aside from in your HLSL shaders).

Since it does not appear as if Microsoft has any plans to open the 360 for assembly-level work, we have to make do with what we have. There are two functions that do not totally suck: Interlocked.Increment and the batch version of Interlocked.Exchange.

Interlocked.Increment is not very interesting. As I've said before, it is a highly-specialized function that you have to be extremely creative with to use to do anything more than increment some variable. That said, it is very good at incrementing a variable.

However, for those of us that are sick, twisted individuals that enjoy a good time programming with some non-specific functions, the Interlocked.Exchange looks very neat. Some might recall that this very method was the uncontested winner of the Threaded Throughput Throw-down as the amount of data grew significantly large. This actually demonstrates an important lesson for Xbox programming: Never leave a core idle. Even though we are duplicating work (as opposed to relying on explicit locking mechanisms), this method of do-it-now-and-check-later outstrips the competition.

Okay, not all the competition. The non-concurrent batch method is twice as fast on the 360 than the Interlocked.Exchange method. Why? The answer is simple: The cache. All interlocked operations require leaving the CPU environment and actually interacting with outside things. Sure, the values are certainly in the cache, but other significant values could have changed outside our happy little environment. From this, there is another important lesson: Try to make your processes interact as little as possible. As the saying goes: multithreading is not inherently difficult, it is the interaction which causes the problems.

In conclusion, this was a pretty useless post. I was really hoping to find a solution that really blew the others out of the water or at least showed a significant variation from running on the PC. Well, I guess now we know not to rely on explicit locks.