24 June 2009

Open Source and Cheating

I've been thinking a lot about Vapor's main project: Orion. For those of you who do not yet know, Orion is a real-time strategy game built on the XNA Framework. We're taking inspiration from Starcraft, Red Alert, Supreme Commander, Open Real-Time Strategy, Team Fortress 2 and a million other places. Basically, we are attempting to build an incredibly skill-oriented, luckless, fast-paced and fun RTS. I'm sure I'll talk more about it in the future.

Vapor Gaming is meant to be an open-source gaming company, but I am having serious second thoughts about releasing the source code to the "flagship product." There are about a million reasons why I would really like to, a billion that do not influence me either way, and a single reason why I do not want to. That single reason is: cheaters.

It isn't that I care about potential hackers looking at the source code, since security through obscurity is a flawed concept. What I care about is guaranteeing that when two people are playing over the internet, those two people are running the same executable. More specifically, I care that every assembly in the application's current run domain is supposed to be there and has not been altered. The .NET Framework has quite a few systems to help with this in a local context, using a mixture of obfuscation, strong-name signatures and watermarking (and these are pretty tough; a 1024-bit RSA key has never been publicly broken: see Bruce Schneier). However, all of these systems are designed to protect the user from the code, not the other way around.

Let's assume that we release the source code and people can rebuild it at will. Since we would not include the private keys used to encrypt and sign the assembly, a hacker cannot produce a legitimately signed build, but he can recompile the assembly to do whatever he wants and simply tell the remote system that it is perfectly valid. Is there a way to prevent this? Even John Carmack can't figure it out.

I do not want to treat all people as "potential hackers," but that is the reality of the situation. Treating users (or people in general) poorly is not a good thing, but reducing in-game hacking is an overall benefit to the community (as opposed to DRM systems), especially since rampant hacking can completely destroy a game - it is no fun to lose because the other player hacked his way to victory.

Client-side solutions will not work, because code can be trivially recompiled. A method such as TestAllAssembliesValid() could simply be replaced with one that returns true (or returns true after a slight delay, to emulate actually performing the checks). The craziest scheme I could come up with is the server sending a scripted algorithm to run against the assemblies and return the result under a secret key, but a "bad" assembly could just load up a legitimate copy and run the algorithm against that. It would not matter if we changed things, as it would just be a matter of time before the next round of hacks came out, and I would rather not fight an arms race.
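
To make the failure mode concrete, here is a sketch of what a recompiled client could substitute for that hypothetical TestAllAssembliesValid():

private static bool TestAllAssembliesValid()
{
    // A hacked build keeps the signature but guts the body:
    System.Threading.Thread.Sleep(250); // fake the time a real scan would take
    return true;                        // ...and claim everything checks out
}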

A solution that would work decently is to host all ranked games on Vapor-owned servers. In this case, if a client attempts something shady, such as raising its resource count, the server would not know or care: there is no Vapor Engine network message for "Please set my resources to ____" and all gameplay state changes have to go through the server, so the hacker would not be able to reap any benefit. However, this might not be feasible from a cost standpoint, as the number of servers needed to run such a system has yet to be determined. Furthermore, it would not prevent client-only hacks, such as map hacks and map reveals.
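
As a sketch of what "all gameplay state changes have to go through the server" means (every name here is invented for illustration, not actual Vapor Engine code):

// Hypothetical server-side handler: clients may only *request* actions;
// the authoritative state lives on the server.
void HandleGatherCommand(Player player, int workerId)
{
    Worker worker = player.GetWorker(workerId);
    if (worker == null || !worker.IsOnResourceNode)
        return; // ignore impossible requests instead of trusting the client

    // The server computes the outcome itself; no message lets a client
    // write its own resource count.
    player.Resources += worker.GatherRate;
}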

Another option is to use server-side heuristics to determine whether someone is cheating. Since ranked match results will be communicated to the server, it would be possible to run some sort of algorithm over the match data to analyze whether someone is cheating. However, the accuracy of such a method is questionable at best. I am not convinced that it is possible to determine anything accurately, even if an entire game replay is uploaded (as the replay system could be tinkered with on recompilation).

It's really starting to look like there isn't a good way to programmatically catch cheaters. So what about an easy system for reporting cheaters? The ultimate issue here is the sheer amount of noise generated by users. I've played enough RTS in my day to know that any time there is a cheating board, it quickly fills with "OMG HE MAP HACKS!", and enough FPS to know there are a ton of "OMG WALL HACK" and "OMG AIM BOT" accusations. Giving people an easier means to commit this stupidity would only generate more noise.

However, there are already a ton of systems for noise filtering. Some of these are community-based: YouTube's inappropriate content flag is a good example; Slashdot's moderation system is a more advanced example of collaborative flagging. These solve the problem of noise filtration by shifting much of the work to the people playing the game, letting community members participate in game moderation. A system could feasibly analyze "cheat" reports: if a user's last three games have all been flagged as "cheating" by three different people, then the user is probably cheating.
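
A minimal sketch of that heuristic (all types and properties here are invented for illustration):

bool LooksLikeCheater(PlayerRecord player)
{
    var lastThree = player.RankedGames
        .OrderByDescending(g => g.Date)
        .Take(3)
        .ToList();

    // Count distinct accusers across the player's last three games.
    int distinctAccusers = lastThree
        .SelectMany(g => g.CheatFlags)
        .Select(f => f.ReporterId)
        .Distinct()
        .Count();

    return lastThree.Count == 3
        && lastThree.All(g => g.CheatFlags.Count > 0)
        && distinctAccusers >= 3;
}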

So what do we do when we catch a cheater? Ban him for life. We are trying to build a game based on game skills, not hacking skills (those are good for other things). If you are not going to have fun playing by the rules, we do not want you in our community.

16 June 2009

Multi-Threaded Synchronization Performance in C#

I spend a lot of time studying and debating the many ways application performance can be improved. For applications that involve network usage or any other high-latency read process, multi-threading is one of the best ways to decrease turnaround time. While waiting for more data to come in over the network, it is almost universally a Good Thing to start processing the data you already have in the background. The next few posts will be dedicated to multi-threading performance, detailing the creation of a simple producer/consumer system and a reader/writer lock.

Let's start with the basics: a worker thread needs to communicate with its parent, and while these interactions should be minimized, they are, at some point, necessary. What is the best way to get information from one thread to another? What is the least expensive synchronization mechanism?

After perusing the internet, there appears to be a shortage of real-world benchmarks for different methods of synchronization. This likely comes from the fact that performance is so incredibly varied: the same test may perform better with one algorithm on a dual-core system and with another on a single-core. Algorithms that work in all cases on an x86 or x64 platform may flat-out fail on other architectures such as PowerPC. Universally, the choice of synchronization solution depends on variables and conditions not directly related to the algorithm. But, being an American, I love to see how things fare in head-to-head combat.

To stage our algorithmic battle to the death, we must first choose what our code should accomplish. I've chosen incrementing a shared long integer, as the actual operation we are attempting to accomplish takes relatively few cycles compared to the synchronization mechanisms used to guarantee a correct result. Using a shared long also ensures that memory reads and writes cannot happen atomically without extra overhead (on a 32-bit platform, a 64-bit value takes two machine operations to read or write).

In summary: there will be 1000 threads, each incrementing a shared long integer 100 000 times.

Just so you know how things are working, this is basically how threads will be run:

TimeSpan ThreadTest(ThreadStart threadStart, out long value)
{
    // start our long at 0
    m_value = 0;

    // create all the threads to point at threadStart --
    // which should be a function pointer to an incrementer
    List<Thread> threads = new List<Thread>();
    for (int i = 0; i < 1000; i++)
    {
        Thread t = new Thread(threadStart);
        threads.Add(t);
    }

    // collect garbage now to minimize interference
    GC.Collect();

    // start each of the threads
    DateTime start = DateTime.Now;
    foreach (Thread t in threads)
        t.Start();

    // join each of the threads until completion
    foreach (Thread t in threads)
        t.Join();
    DateTime finish = DateTime.Now;

    value = m_value;
    return finish - start;
}
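
For completeness, the snippets in this post rely on a few members that are never shown; these declarations match the names used throughout:

private const int INCREMENTS = 100000;             // increments per thread
private long m_value;                              // the shared counter
private readonly object lockObject = new object(); // used by the lock() tests
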
So: how can we accomplish this goal? The simplest solution would be to have 1000 threads looping and incrementing 100 000 times, like so:

private void NoConcurrency()
{
    for (int i = 0; i < INCREMENTS; i++)
        m_value++;
}
This process, however, will not give you consistent results; with this many threads and increments, m_value will never finish with the correct value of 100 000 000. The reason is that m_value++ is not an atomic operation. It actually looks more like:
long tempVal = m_value;   // read
tempVal = tempVal + 1;    // modify
m_value = tempVal;        // write
With many threads all reading and writing, the value of m_value could easily have changed between the read and the write-back. Ultimately, the final value will be much less than the desired 100 000 000.

The classic solution to this problem is explicit locking. C# has a convenient construct for this: one simply writes lock(object) { ... } and a mutual-exclusion lock designated by the given reference object is acquired before the code block and released afterwards. Changing our original non-concurrent method is quite simple:
private void MutexBlock()
{
    for (int i = 0; i < INCREMENTS; i++)
        lock (lockObject)
            m_value++;
}
Wow! That was easy! Now m_value will end at the desired value. This is .NET's convenient mutual-exclusion lock and it can be used with any reference object: only one thread at a time can hold the lock on a given lock object. Since we're assuming all access to m_value goes through this function, the read, increment and write-back of the value happen atomically (technically not 100% true: if somebody Ctrl+Alt+Deletes the process, that notion flies out the window).
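
Under the hood, the lock statement is syntactic sugar for the Monitor class; the block above expands to roughly this:

Monitor.Enter(lockObject);
try
{
    m_value++;
}
finally
{
    Monitor.Exit(lockObject);
}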

But incrementing something is such a common operation and locking is slow...surely there must be a faster way of performing this operation without the use of locks. Of course! Enter .NET's convenient batch of functions inside the Interlocked class.
private void InterlockedIncrement()
{
    for (int i = 0; i < INCREMENTS; i++)
        Interlocked.Increment(ref m_value);
}
That's neat - somebody made it so you can just increment a value - and in a single operation, too! Be wary, though: this does not compile directly to an x86 assembly statement like lock inc [eax]. There is a lot more going on behind the scenes, with memory barriers and volatile writes, so that when we get off of x86 and move to something more interesting like PowerPC, our assumption that this works still holds.

I'm sure you can see the shortcoming of the Interlocked.Increment function: you can only increment an integer or a long by 1. Okay, so there is an Interlocked.Add function which lets you add (or subtract) values, but none of these are very interesting. No, it doesn't matter for the current operation, but I'm sure you can see that it would be beneficial to have some flexibility.
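
For reference, the delta version looks like this:

Interlocked.Add(ref m_value, 10);  // atomically adds 10
Interlocked.Add(ref m_value, -3);  // subtraction is just a negative delta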

What we need is a system of locking that doesn't actually use the lock construct. Luckily, there are a million ways to do this. One of them is through the use of Interlocked.Exchange.
private void InterlockedExchangeBlock()
{
    for (int i = 0; i < INCREMENTS; i++)
    {
        long val;
        // spin, swapping in the sentinel -1, until we pull out a real value
        while ((val = Interlocked.Exchange(ref m_value, -1)) == -1) ;
        val++;
        Interlocked.Exchange(ref m_value, val);
    }
}
What is happening here? First, we're swapping out the value at m_value with -1 and checking if the value we received is -1. We keep trying again and again until we get a value that is not -1. Then, we increment the value we found in m_value and save it back.

The keen observer will notice that this is basically the lock construct, except we are wasting CPU cycles by spinning. Why would we do this instead of using an explicit lock? The answer is actually quite complicated. When a .NET thread is unable to obtain a lock, it yields to let the operating system reschedule it. That is fine, but the thread might not be rescheduled until a long time in the future; if we know that the value is going to become readable very soon, we should spin-wait until we can grab it. Yes, we'll waste some CPU cycles, but it's all in the name of making a super bad-ass program.
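
If you want a middle ground, a common pattern (a sketch, not part of the benchmark code) is to spin a bounded number of times before falling back to yielding:

long val;
int spins = 0;
while ((val = Interlocked.Exchange(ref m_value, -1)) == -1)
{
    if (++spins < 100)
        Thread.SpinWait(20); // burn a few cycles without a context switch
    else
        Thread.Sleep(0);     // give up the time slice so someone else can run
}
// ...perform the math on val, then Interlocked.Exchange it back...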

But we've definitely freed ourselves of the limitation of Interlocked.Increment - sure, we're only incrementing here, but we could just as easily have multiplied m_value by 10 or found the m_valueth Fibonacci number before storing it back. The only real problem we run into is if m_value changes while we're performing the math on it. Is there an atomic way to check for this sort of thing? Of course! That's what Interlocked.CompareExchange is for - it takes one more argument, a comparand, and the exchange is only performed if the value in memory is equal to it.
private void InterlockedCompareExchange()
{
    for (int i = 0; i < INCREMENTS; i++)
    {
        long orig, newval;
        do
        {
            orig = m_value;
            newval = orig + 1;
        } while (Interlocked.CompareExchange(ref m_value, newval, orig) != orig);
    }
}
This is a very strange-looking function. We grab the original value from the memory location, perform the math and then do something funny. What that while thing means in English is:
  1. Compare the current value of m_value with what we originally pulled
  2. If they are equal, put the new value into m_value
  3. If they are not, retry the loop
This is true lockless programming. At no point is it possible for a thread to stall - if something bad happens and m_value is changed, the loop will be repeated, which is not really that big of a deal. This operation is persistent and atomic.

But it turns out that the biggest advantage of this algorithm is also its biggest weakness. In this case, redoing the operation is trivial - reading from memory and adding 1 is really, really cheap. When that work is thrown out, we have wasted very few cycles. But if we were doing something more expensive, such as downloading a web site and performing actions based on the received information, throwing out that information represents a huge waste. Also, as the time span between the initial read and the final write-back increases, the likelihood of that original read becoming invalid increases. It's really an art.
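
To generalize the pattern, here is the sort of helper you could write (a sketch; Func<long, long> requires .NET 3.5):

// Retry-loop template for atomically applying any pure function to a long.
static void AtomicUpdate(ref long target, Func<long, long> compute)
{
    long orig, updated;
    do
    {
        orig = Interlocked.Read(ref target); // atomic 64-bit read
        updated = compute(orig);             // may be thrown out and redone
    } while (Interlocked.CompareExchange(ref target, updated, orig) != orig);
}

// Usage: AtomicUpdate(ref m_value, v => v * 10);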

Okay, enough talk. What are the results? I ran the tests described here as well as "batch" versions, which obtain the lock and then perform the math. See the link at the bottom of the post for the exact source code. All of the tests were run under DEBUG and RELEASE compilation, both with the Visual Studio debugger attached and by themselves...here is a table of results:
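
The batch versions are not shown above; as a sketch (the linked source has the exact code), a batch variant takes the lock once and performs all the math on a local variable:

private void MutexBlockBatch()
{
    lock (lockObject)
    {
        long local = m_value;
        for (int i = 0; i < INCREMENTS; i++)
            local++;
        m_value = local;
    }
}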

Test Name                              DEBUG,       DEBUG,       RELEASE,     RELEASE,
                                       Attached     Clean        Attached     Clean
No Concurrency *                       3.374 (91%)  0.499 (75%)  3.843 (93%)  0.281 (34%)
Interlocked.Increment                  12.124       3.687        12.890       3.749
Interlocked.CompareExchange            10.249       4.812        11.156       4.249
lock(object)                           14.460       4.531        13.671       4.390
Interlocked.Exchange                   15.265       11.796       16.234       9.421
AutoResetEvent **                      1:35.655     1:54.999     2:22.670     1:02.773
No Concurrency (batch) *               4.203 (76%)  0.218 (50%)  3.630 (78%)  0.910 (51%)
Interlocked.CompareExchange (batch)    4.218        0.390        3.383        0.137
lock(object) (batch)                   4.620        0.421        4.656        0.265
AutoResetEvent (batch)                 2.828        0.437        2.812        0.281
Interlocked.Exchange (batch)           4.765        0.374        4.593        0.780

All times are running times in seconds (m:ss.fff for the AutoResetEvent rows).

* The non-synchronizing algorithm yields incorrect results -- the number in parentheses is the percentage of the correct total actually reached.
** The AutoResetEvent tests perform 1/100th of the increments of the other algorithms, due to general slowness.

The most interesting results are the batch versions of Interlocked.Exchange and Interlocked.CompareExchange: they run significantly faster than their non-concurrent counterpart. How can that be? These algorithms spend a portion of their time spinning in while loops and repeating work. There are a couple of things going on here to cause this result.

The first factor is CPU burst. Remember, these 1000 threads are not actually executing in parallel; each is given a certain amount of time with the CPU before being interrupted by the operating system. This means that the internal for loop takes roughly 300 000 cycles to complete (100 000 loops * (2 arithmetic operations + 1 conditional jump per loop), ignoring the unconditional jmp), assuming it compiled into something like:
         mov ecx, 100000
         mov ebx, 1
$lStart: sub ecx, ebx
         jz SHORT $lFinish
         add eax, ebx
         jmp SHORT $lStart
$lFinish: ; whatever...

On my 3GHz computer, this works out to approximately 0.1 milliseconds per burst.

This is where the second factor comes in: locality for run-time optimization. Both of these functions perform math on a stack-allocated local variable that is not passed to any other function - an ideal candidate to simply be kept in a register at run-time, as in my assembly example. The CLR does not compile the bytecode to exactly my assembly, but you get the idea. Knowing roughly what the runtime environment is doing, and how you can help it optimize, is an important component of high-performance managed coding in any language.

Another interesting result is that the batch no-concurrency job takes longer than the straight non-concurrent method. However, this only seems to apply to the RELEASE build; in DEBUG, the batch non-concurrent method is consistently faster than in RELEASE, and is always the fastest algorithm. The IL output for the two methods is identical, so something fishy must be going on. The Internet seems to be lacking information, so if anybody knows why emitting debug information seems to speed up the process, I would really like to know. The result is wrong either way, but it would still be nice to know.

The best conclusion that we can draw, however, is that RELEASE without an attached debugger is the best environment for measuring threading behavior. It generally performs better than other build-and-run combinations (as it should) and, if nothing else, is the configuration you should eventually be deploying to production.

So what happens if we vary things like how much we're incrementing with each pass or how many threads are performing the incrementing? I modified the code to run through each of the tests with 10, 50, 100, 500, 1000, 2000 and 5000 threads and 25000, 50000, 75000, 100000, 150000 and 200000 increments. I also stopped testing AutoResetEvent because it sucks.

We could look directly at the numbers, but it's hard to decipher what the data means. Let's look at some pretty graphs brought to you by our good friend Excel! I thought the most interesting statistic would be a measure of throughput: How many effective increments are being performed each second?
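
Concretely, the metric plotted below is computed like so (threadCount and elapsed stand in for the obvious harness values):

// Effective increments per second across all threads.
double throughput = (double)threadCount * INCREMENTS / elapsed.TotalSeconds;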

First, the control group: Our non-concurrent method:

Non-concurrent performance graph
As you can see, throughput is higher with fewer threads and fewer increments. Standard allocation prefers to grab and let go. Perhaps this is more meaningful with something to compare to:

Interlocked.CompareExchange performance graph
With Interlocked.CompareExchange, you can see that it remains pretty consistently around 20 million increments/second. There is a slight dip with 50 threads and 75000 increments, although it certainly doesn't break the overall trend.

Interlocked.CompareExchange batch performance graph
The batch version, however, almost doubles the throughput of the non-concurrent method and appears to steadily increase as the number of increments increases. Not only that, but the number of threads does not seem to affect the performance of the algorithm in the slightest.

Interlocked.Exchange batch performance graph
The algorithm with the highest throughput is none other than the batch version using Interlocked.Exchange, peaking at over a billion increments each second. The most likely reason is that we're only performing 200000 increments at a time, which is quite a cheap operation, all things considered.

If we constrain the number of threads to 1000 (enough threads for heavy contention without completely thrashing the scheduler), we can make a nice 2-D graph of how the different algorithms behave:
Performance summary chart

That graph pretty much summarizes it: batch operations are way faster than trying to synchronize on each increment. But if you cannot divide the work into batches, the Interlocked.Exchange method is the fastest, trailed at quite a distance by explicit locking. Good to know.

Source Code for the shared long test
The book of results

Come back next week to use this information for something useful!

08 June 2009

Open Source Software

Vapor Gaming's first release, currently known under the code name Aquila, will be released as open-source software. We're hosting it through Google Project Hosting; our repository can be found here. But that's all just facts. How did we get to that point?

After a fairly lengthy discussion with my fellow developer, Sean, we decided on what we want the Aquila project to be. Yes, we want to make a web game and we want it to be cool. We want a user not to have to use any plug-ins -- requiring only a supported browser. Flash is off-limits too, even though support for it is almost ubiquitous. We want to use only the browser, which means pushing Javascript as far as possible. But that's really just client-side.

We want an open model where people are free to build their own extensions, even their own client. We'll provide the service and the first client, but what we are really trying to do is get people to have fun together. We are all on the Internet, most of us use Facebook, some of us use Twitter and by the time this post is published, there will be a million people on some new service. We want to make it possible for anyone to integrate our application anywhere it is conceivable. There is no way that two developers can possibly track and develop plug-ins for all the places people could want them.

From observation, when people use and enjoy a service, programmers among them love to integrate that service into other things, even if it is difficult. When I first started playing with programming (beyond "Hello World"), I created a console-based AIM client in C by adapting some awful VisualBASIC 6.0 library. When I started playing on the Internet, I spent quite some time crafting GETs for XmlHttpRequests and picking apart the HTML returned to me. Was it useful? Probably not, but it was fun to message my friends using something I had created.

But, while hand-crafting and parsing strings was doable, it was really just a means to an end. I ended up with applications that were not exactly hardy and could break whenever someone changed the page layout of their web site. It was neat to see, but creating a robust application that could qualify as "releasable" was impossible. When things are exposed as web services, where people can easily send querystring GETs, SOAP or JSON and get back some easy-to-parse data, it is easy for people to create applications with the information. Most people enjoy creating their own extension applications more than writing and maintaining adapter code for other people's junk. More enjoyment in creating extensions translates into more people creating extensions, which means the application reaches a wider audience. Twitter is a great example of this - by exposing posting and reading as simple POSTs and RSS feeds, Twitter can go anywhere with a web connection (and with a little bit of adaptation, even to places that can't: SMS).

For this reason (and many others), there has been a large push for companies to expose functionality as web services. But there is nothing inherently open-source about this action. The Yahoo and Google query APIs are easy to use, yet the backing data is provided by a secretive, proprietary system. So we're really back at square one: Why open-source?

When it comes down to it, many of the advantages of open-source software are lost when the ultimate product is provided as a service. The chance of forking drops severely, because forking the project means losing all the users, who are the real value of any community. Modifications and security updates eventually have to go back to one spot: the server. One company effectively has the power to control all development, because updates not applied to trunk will never reach the majority of the community.

In reality, the two largest factors in our decision to open source the Aquila Engine come down to vanity and laziness. We genuinely think what we are doing is cool and we think that you will think it is cool, too. We hope that when you use whatever Aquila turns out to be, you will think "That's neat," and wonder how we did it. And guess what? The source code will be available for you to see how! A little self-centered, I know.

Furthermore, we enjoy writing software, not writing API documentation. We document in code as much as we can, because that is easy and genuinely helpful, but creating an independent page dedicated to documentation? Yuck! Having to copy and paste that documentation from our code to that page? Spare me! If someone wants to create better documentation than we can, great! Until that time, we can just point people towards the JSP pages that field service calls and say: here's how it's done. The self-documenting code that people see will be superior to some semi-descriptive documentation on an API reference page, as I am not particularly skilled at writing API docs.

Beyond this, there are some philosophical reasons for releasing software as free and open-source, but these topics have been covered a million times over by people who are better with words than I.

In summary, Aquila will be released as open source under the GPL v3 license (although I've been debating the AGPL). In either case, things are looking fantastic!