01 November 2009

Visual Studio 2010: Property Sheets and C++ Directories

I love external libraries.  A good library (one that is well-tested and interoperable) can save years of development time.  Since this post revolves around C++, I'm going to mention Boost as one of the most helpful libraries ever made.  Visual Studio provides a fairly obtuse (but working) system for referencing libraries on a per-user, per-system basis, accessible from the Tools > Options menu:


It isn't the most user-friendly dialog and it's a little painful to get to (especially the first time), but it gets the job done.  Theoretically, you only go into this menu (or write a script to do it for you) once per library and never touch it again.

Naturally, when I began testing Visual Studio 2010, the first thing I wanted to do was set up my include and library paths.  So, I opened the options dialog only to find that Microsoft saw fit to remove this feature.  My first reaction was: "WHAT? WHY?" and I went on a hunt to see where they had moved that option (because it is basically impossible to live without).  It turns out they had moved it to the property page of the project (just right-click the project and open Properties):

This is basically an extension (duplicate?) of the Visual Studio 200x options for the C++ compiler settings, which let you specify more directories to search through when looking for header files (a similar option exists for the linker in "Additional Library Directories"):



But before you start adding references to "C:\Path\To\Library" in all of your projects, let me tell you about a feature of Visual C++ called property sheets.  Personally, I have not seen them used much in various projects, for a number of reasons: first, they are quite difficult to find; second, their visibility is not enabled by default; third, you do not have to know about them to get something compiling in VC++; and lastly, their functionality is a little odd.

How to Use Property Sheets
These instructions will work with Visual Studio 2005, 2008 and 2010 (and probably beyond).

Anyway, the first thing you need to do is turn on the Property Manager toolbox.  Do this by hitting "Property Manager" under the "View" menu.  The toolbox will probably attach to your Solution Explorer, but it could just appear in the middle of the screen.  In either case, you should see a tree list of your projects; each project's children are the cross product of all solution configurations and platforms, and each of those nodes holds a collection of property sheets:


Okay, let's add a property sheet that defines the symbol _DEBUG and add it to all the debug configurations.  Right-click on any debug configuration and select "Add New Project Property Sheet."

If you are planning on using this sheet in multiple projects, it is a good idea to put the file in a directory that is not project-specific.  I made a directory called "Properties" in the solution directory to store all my property sheets.  A special note: At the time of writing, Visual Studio will throw an exception if you try to make a new property sheet in a directory that does not exist (error reported to Microsoft Connect), so you have to make the folder in Explorer or something first.

Open the newly-created file by double-clicking and you should see a window that looks almost identical to what you would see if you opened the "Properties" dialog from the Solution Explorer, save that there are a million options to configure.  The reason there are so many is that property sheets are not tied to any project, so every configurable option shows up in the GUI.

Go into C++ > Preprocessor settings, add "_DEBUG" to the definition list and click OK.  Now you can add this .vsprops file to multiple projects.  You can actually share .vsprops files between Visual Studio 2005 and Visual Studio 2008, if you are planning on maintaining two solutions for two separate consumer groups.  Unfortunately, Visual Studio 2010 creates a .props file, which is completely different XML and cannot be shared with previous versions of the IDE (although 2010 can convert the old format).  A fun quirk is that you have to right-click on the individual sheet to explicitly save it (do not worry if you forget to do this; you will be asked when you exit the IDE).
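
For the curious, the generated sheet is just a small XML file.  A VS2008-style .vsprops carrying this setting looks roughly like the following (a sketch from memory; the exact attributes the IDE emits may differ slightly):

<?xml version="1.0" encoding="Windows-1252"?>
<VisualStudioPropertySheet
    ProjectType="Visual C++"
    Version="8.00"
    Name="DebugDefines"
    >
    <!-- Merged into the preprocessor definitions of any project
         that inherits this sheet -->
    <Tool
        Name="VCCLCompilerTool"
        PreprocessorDefinitions="_DEBUG"
    />
</VisualStudioPropertySheet>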

Okay, so making a whole file and screwing around with all this stuff just to define another preprocessor macro is pretty pointless.  However, you can see that you have complete control over all the properties of VC++ projects, which means you could make another sheet for release builds that defines specific optimization settings for every project, or a general sheet that points all projects at shared output and intermediate directories (I use a variation of "$(SolutionDir)/bin/$(ConfigurationName)/$(PlatformName)/$(ProjectName)" and "$(SolutionDir)/obj/$(ConfigurationName)/$(PlatformName)/$(ProjectName)").  Property sheets let you consolidate shared settings to arbitrary levels of granularity.

So if you have two property sheets, one which sets the output directory to "$(SolutionDir)/bin/$(ConfigurationName)/$(PlatformName)" and one that sets it to "$(SolutionDir)/bin/$(ConfigurationName)-$(PlatformName)", what is the value of $(OutDir)?  The answer is whichever one was evaluated last.  You can tell the order of evaluation by the order the sheets are listed in the Property Manager toolbox.  Evaluation occurs in a bottom-up order, so those sheets on top are evaluated last.

Back to the original problem: how can we use property sheets to make up for the lack of global VC++ directories?  Even if we make a property sheet that adds an include search path for "F:\Applications\Dev\boost_1_40_0", what happens when another developer keeps their Boost library in "C:\SDKs\boost1.40.0"?  Project property pages have not solved this problem, since changes to the file would get propagated over source control and screw up the settings for everyone else.  Wouldn't it be nice if there were a user-specific property sheet defined outside of the project?

In Visual Studio 2010 there is such a file: Microsoft.Cpp.Win32.user, which lives in your user application settings.  The good part about this change is that these settings now reside in an overridable format.  If your project uses a modified version of some library, the project property sheet can add a search path to "$(SolutionDir)/ModifiedLibrary", which will be searched before the user-level paths.  Granted, this is a significant overhaul of the configuration system (in a way that takes a while to explain), but I think it's for the best.  There are also still some quirks with how adding new configurations propagates certain sheets, but it's still in beta, so hopefully those will be smoothed out.

13 October 2009

Model-Context-Interface

I have long been an opponent of the Model-View-Controller (MVC) philosophy of user interface design. Okay, it is not a horrible thing and it is certainly a good philosophy to know: keep your programming model separate from the way your users interact with it. The idea is to represent the domain model with objects, present views of said objects and allow a controller to manipulate those objects.



The first question is: should the "model" be the actual underlying structure of the program or the user's mental model of how the program works? The answer is quite obvious: the model is the actual program model. A user's mental model is not only crude and most likely incorrect, but it also varies from user to user. However, since UI design should be user-centric, we must never forget that the user's mental model of the process is important.

On to the View, which provides the user with a representation of the model. The separation allows one to easily add multiple views of the same data model, like viewing data through a pie chart or a bar chart. When Trygve Reenskaug conceived of MVC in 1979, the ability to easily view data in different charts was impressive. Modern users expect so much more! They want the ability to control and manipulate or otherwise interact with the data model. And the view has to be consistent across a desktop GUI application, a text interface, an iPhone and Twitter.

This is where the Controller comes in: It is an abstraction of a method to manipulate the model. Unfortunately, even the Pragmatic Programmer will admit that the Controller and View are tightly coupled. This makes a whole lot of intuitive sense, since the things users see should directly relate to the things users do, but I am not sure it ever entered the minds of Xerox employees (I do not blame them, they were pretty busy making copiers and printers).

So we are really stuck with two systems: a Model and a View-Controller. Okay, so there is also Model-View-Presenter, which does not solve any problems, because it is pretty much the same thing (okay, the arrows are pointed a little differently). There has to be a better way to write interface code!

I first considered a new system wherein all data in the system is considered part of the Model, including data for interfaces. A model forms the foundation of information and the small-scale processing of that data. From pragmatic philosophy and MVC, the idea is that the model should only interact with what it owns. Data processing and functions therein can only interact with data that they create. For example, a Vector may hold values for X, Y and Z and could have a function named Length, which returns the length of the vector. However, it should not have an interaction function like DistanceFrom(Vector) : float, which returns the distance between this vector and another. Keep in mind that "interaction" means that objects should not work with things they do not own; something like a line, which consists of (owns) two points, can validly find the distance between them.
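
In code, the ownership rule looks something like this minimal C# sketch (the types are mine, purely for illustration):

using System;

// A model object only computes with data it owns.
struct Vector
{
    public float X, Y, Z;

    // Fine: Length only touches this vector's own fields.
    public float Length
    {
        get { return (float)Math.Sqrt(X * X + Y * Y + Z * Z); }
    }

    // Deliberately absent: DistanceFrom(Vector) would make this
    // object interact with data it does not own.
}

// A line owns both of its endpoints, so measuring the distance
// between them is a legitimate model operation here.
struct Line
{
    public Vector Start, End;

    public float Length
    {
        get
        {
            float dx = End.X - Start.X;
            float dy = End.Y - Start.Y;
            float dz = End.Z - Start.Z;
            return (float)Math.Sqrt(dx * dx + dy * dy + dz * dz);
        }
    }
}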

So, that simple example does not really show anything useful, but imagine a more complex system like a bank account transfer. There are rules regarding what the various accounts can and cannot do. For this world, let us say that it is illegal for a BankAccount to have less than $0 or more than $1,000,000 and that all transactions must be logged. We could strap methods like GiveTo(BankAccount) and Receive(Money) onto BankAccount, but this leaves unanswered questions, the most important of which is: who is responsible for rolling back on error? Anyone familiar with databases already knows the answer: this should obviously take place inside of a managed transaction, where the system is responsible for rolling back failures.

A generalization of this transaction system is a Context, which provides a management layer through which the data objects communicate. This answers the second question of our previous problem: the context is responsible for logging the account transfer. This provides a convenient security mechanism as well: since the context is responsible for all interactions, we can easily make a rule that does not allow transfers to occur unless they are logged (and this is especially trivial if the logging system is integrated into the database).
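
To make that concrete, here is a rough C# sketch of what such a context might look like (the class and method names are hypothetical, purely for illustration):

using System;

class BankAccount
{
    public decimal Balance { get; private set; }

    // The account only enforces rules about its own state.
    public void Apply(decimal delta)
    {
        decimal result = Balance + delta;
        if (result < 0m || result > 1000000m)
            throw new InvalidOperationException("Balance out of range");
        Balance = result;
    }
}

class TransferContext
{
    // The context owns the interaction: ordering, logging and rollback.
    public void Transfer(BankAccount from, BankAccount to, decimal amount)
    {
        from.Apply(-amount);
        try
        {
            to.Apply(amount);
        }
        catch
        {
            from.Apply(amount); // roll back the withdrawal on failure
            throw;
        }
        Log(from, to, amount);  // no transfer completes without a log entry
    }

    private void Log(BankAccount from, BankAccount to, decimal amount)
    {
        Console.WriteLine("Transferred {0:C} at {1}", amount, DateTime.Now);
    }
}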

Context provides a layer of abstraction for your objects to communicate through and can be seen as the network of communicating objects in your model. While the model represents things, context represents a system of communication, which raises the question: why is this layer not named "communication"? Not only would it keep the same 'C' for the acronym, but it seems like a better description. However, the word "communication" would imply that this layer merely conveys information, which is not true. While the context layer provides a means for communication and performs communication by sending messages between systems, it is a stateful management layer - much more than a means of communication.

The final layer is how the context communicates the model to the outside world - the Interface. It is not an interface in the C-family sense of the word; it encompasses all of what you present to the user. An easy mistake is to assume that this only means a graphical or console user interface, but an interface can be a SOAP or JSON web service, a shared object library, an operating system service or whatever method you use to interface with the world outside of your program.

This fits well into a service-oriented architecture, where the model provides local representations of domain objects, the context provides a system for those objects to communicate and an interface to work with the outside world. It is also quite reasonable to expect a context to work with other interfaces aside from the public one your application is presenting. For example, the context of a database-backed system would interact with the interface of an SQL server to provide the data. If the context is properly designed for configuration (the Pragmatic Programmer says "Configure, Don't Integrate"), the interface does not need to know where the context gets its information from, nor should it matter, and the system will be easily testable and debuggable. The next obvious step is to realize that we can chain these systems together however we need, so long as we do not let an interface directly manipulate the model (perhaps a better name would be (MC)*M(CI)*CI or the more catchy M(CM)*(CI)+).

But this whole thing was started with talk of end-user interfaces in terms of graphics and consoles. It is a natural and correct assumption that these user interfaces fit into the "interface" component of MCI. The goal is that it should be incredibly simple to swap out a given interface, be it a "system interface" that other system components work with, a "programmer interface" that others are meant to build against or a "user interface" for regular people. Since they share underlying context components, it should be easy for multiple interface systems to seamlessly interact - changes made through one interface should be automatically reflected in another. However, do not forget that these components are also software. A web application which needs to remember user states or any other packet of information probably needs a model and context to remember it with. Just keep in mind that this should be a different model and context from the data you are serving, since they are fundamentally different systems. This is a good design decision, mostly due to the separation of concerns, and will lead to a more service-oriented architecture (which will help when you make the AJAX or mobile version of the site).

Other tips for building data and context? Fill them with as much structural and descriptive metadata as possible. Java annotations (@interfaces) and .NET custom attributes are wonderful for this. In languages that do not support assembly-attached metadata, this functionality can be duplicated with automatically-generated XML files (if you're using Make, MSBuild, Ant or Maven, it is pretty easy to plug in a custom build step) or you could come up with another cool way, so long as semantic information is preserved in a runtime-readable format. This way, it becomes possible to automatically generate user-interface elements based on the program's metadata. Sound too good to be true? In the .NET Windows Forms system, there is a PropertyGrid, a single-object editor which designs itself using run-time reflection of attributes from the System.ComponentModel namespace. Java does not have a component like this built-in, but people have made programs on the Java Platform such as L2FProd's JPropertySheet (Apache-licensed), which duplicates the functionality of PropertyGrid, and the Java-Bean-Examiner (GPL-licensed), which is specialized for EJBs.
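
As a quick illustration of the PropertyGrid approach (the Person class here is hypothetical, but PropertyGrid and the System.ComponentModel attributes are the real APIs):

using System.ComponentModel;
using System.Windows.Forms;

// The attributes are the metadata; PropertyGrid reflects over them
// at run-time to build the editor.
class Person
{
    [Category("Identity"), Description("The person's full name")]
    public string Name { get; set; }

    [Category("Identity"), Description("Age in years")]
    public int Age { get; set; }
}

static class Demo
{
    [System.STAThread]
    static void Main()
    {
        Form form = new Form { Text = "Auto-generated editor" };
        PropertyGrid grid = new PropertyGrid { Dock = DockStyle.Fill };
        grid.SelectedObject = new Person { Name = "Ada", Age = 36 };
        form.Controls.Add(grid);
        Application.Run(form);
    }
}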

There is another architectural pattern which is very similar to the MCI system, playing off the three-letter acronym (3LA) system of naming architectural patterns. It is called Data-Context-Interaction (DCI) and a fairly large article was dedicated to it in March of 2009. If you are interested, I would highly recommend reading the linked page. Another good read is "The Common Sense of Object-Oriented Programming." Keep in mind that both of these papers are quite Squeak-centric (the second, much more so), to the point that it is very difficult to abstract the core concepts out of them (and you really have to continually bypass statements like: "Smalltalk does everything for everyone in a clean and efficient system"). If you really like Smalltalk/Squeak, try out BabyIDE, which is a perspective-based IDE for Squeak (those familiar with Eclipse will feel close enough to home).

What is the difference between MCI and DCI? Aside from the names, not much (I almost prefer "data" over "model" since it more clearly distances itself from MVC, but I maintain that the first letter of the acronym is much more than simple data). In reality, neither of these patterns is what I would call "mind-blowing." Essentially, each better defines the "model" from the MVC pattern by splitting it into two parts, the "model" (or "data") and the "context." Furthermore, each gives up on the idea that the "view" and "controller" were ever separate entities by compressing them into an "interface" (or "interaction"). It is a fairly subtle distinction, but it's the context that makes all the difference in the world.

09 August 2009

Multi-Threaded Synchronization Performance on Xbox 360

In June, I did a series of micro-benchmarks using the .NET platform on a Windows PC. As fun as that is, I decided to take my tests to the Xbox 360 to see how the same algorithms on a different architecture would play out.

The first thing I noticed when porting my StaticLongTest to Xbox 360 is that my favorite functions from the Interlocked class are suddenly missing. In fact, we are only allowed to use the methods on two 32-bit types: object references and integers (not unsigned integers). This leaves us with explicit locks and wait handles (ManualResetEvent and AutoResetEvent), neither of which is very interesting. So, I resigned myself to the fact that I would have to alter my test to be the SharedIntTest, which really decreases my interest in the subject, since 32-bit values can almost universally be read atomically. Bleh.
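
For reference, this is roughly what that limitation forces the ported test into (a sketch assuming the int overloads the Compact Framework keeps, per the restriction above; it mirrors the code from the previous post):

using System.Threading;

class SharedIntTest
{
    private const int INCREMENTS = 100000;
    private int m_value; // int instead of long -- the 64-bit overloads are gone

    private void IncrementTest()
    {
        for (int i = 0; i < INCREMENTS; i++)
            Interlocked.Increment(ref m_value); // int overload only
    }

    private void CompareExchangeTest()
    {
        for (int i = 0; i < INCREMENTS; i++)
        {
            int orig, newval;
            do
            {
                orig = m_value; // 32-bit reads are already atomic
                newval = orig + 1;
            } while (Interlocked.CompareExchange(ref m_value, newval, orig) != orig);
        }
    }
}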

However, I journeyed onward, knowing that something interesting would have to come out of this. There really is not too much to expand on; the algorithms are all identical to those in my previous post, they just use integers instead of long integers. I must say that I had quite a frustrating time, as I came across a unique error whose appearance coincides with the most recent Xbox Live update, wherein Game Studio Connect informs me: "The game you're trying to play is missing or corrupt" (someone else on the XNA forums had the problem, too). Symptoms include your breakpoints suddenly failing to work - the IDE might tell you that the assembly you're trying to run is different from the one you have. Or it might not. In any case, the solution I found was to delete the file on the Xbox and delete any and all cache files and binaries on my PC.

Once I actually started deploying things on the 360, I noticed that my old testing classes would not work. The reason? The Xbox cannot handle more than around 500 threads - and even then it does not handle them very well. After much trial and error and repeatedly overloading my Xbox (it would not even respond to presses of the X gem), I landed on a 100-thread limit. Much more than this and performance slows to a crawl. Realistically, this is not a big deal, as you really should never have more than 4 threads (one per available core) going.

But enough complaining, let's see some results!

Running Time (seconds)

Test Name                            |             PC |       Xbox 360 | % Slower
No Concurrency *                     |  0.109 (92.7%) |  1.639 (96.5%) |   1 504%
Interlocked.Increment                |  0.141         |  1.425         |   1 011%
Interlocked.CompareExchange          |  0.203         |  3.612         |   1 779%
lock (object)                        |  0.328         | 136.989        |  41 765%
Interlocked.Exchange                 |  0.625         | 93.868         |  15 018%
No Concurrency (batch) *             | 0.031 (100.0%) |  0.467 (54.0%) |   1 506%
Interlocked.CompareExchange (batch)  |  0.203         |  2.168         |   1 068%
lock (object) (batch)                |  0.047         |  1.990         |   4 234%
AutoResetEvent (batch)               |  0.047         |  2.343         |   4 985%
Interlocked.Exchange (batch)         |  0.031         |  0.926         |   2 987%

* Non-synchronizing algorithm yields incorrect results -- the number in parentheses is the percentage of the correct total actually reached.

Something caught my eye almost immediately: despite the Xbox having six hardware threads (as opposed to my PC's two cores), the PC blows the Xbox out of the water on every single test. The 360 is at least 10 times slower than the PC. Why? An interesting question - it seems synchronization is just more difficult to do on the Xbox. But I'm not a fan of hand-waving a problem away, so I'll turn to my old pal, the XNA Remote Performance Monitor.

Perhaps you've heard of the .NET Compact Framework garbage collector and how it is not your friend and it is horrible and slow? Well, it's really not that bad (at least here). With the performance monitor, I can see that for the 10 tests run, the garbage collector ran 11 times, all but one of which I explicitly invoked right before a test was run. The one I did not call ran during the lock (object) test and really only occurred because the test took so damn long. As it turns out, this doesn't affect anything, as my code boxes 1 value type per frame, which is a non-issue for the garbage collector to clean up.

I'm really not sure why Xbox synchronization performance is so slow. The problem most likely stems from the fact that the architecture of this particular gaming console centers around the GPU, whereas x86 architecture centers around the CPU - and thread synchronization is a very CPU-oriented problem.



As you can see, all memory access goes through the GPU, whereas on your PC, memory access is basically direct from the CPU:



This provides the Xbox with huge advantages when it comes to graphics performance, as many graphics calls that take time on the PC are essentially free on the 360. Much like on your typical GPU, there is an advanced memory store operation that performs swizzling for free. However, as far as I can tell, we lowly XNA developers do not have access to this functionality (aside from in your HLSL shaders).

Since it does not appear as if Microsoft has any plans to open the 360 for assembly-level work, we have to make do with what we have. There are two functions that do not totally suck: Interlocked.Increment and the batch version of Interlocked.Exchange.

Interlocked.Increment is not very interesting. As I've said before, it is a highly-specialized function that you have to be extremely creative with to use to do anything more than increment some variable. That said, it is very good at incrementing a variable.

However, for those of us that are sick, twisted individuals that enjoy a good time programming with some non-specific functions, the Interlocked.Exchange looks very neat. Some might recall that this very method was the uncontested winner of the Threaded Throughput Throw-down as the amount of data grew significantly large. This actually demonstrates an important lesson for Xbox programming: Never leave a core idle. Even though we are duplicating work (as opposed to relying on explicit locking mechanisms), this method of do-it-now-and-check-later outstrips the competition.

Okay, not all the competition. The non-concurrent batch method is twice as fast on the 360 as the Interlocked.Exchange method. Why? The answer is simple: the cache. All interlocked operations require leaving the comfort of the CPU and actually interacting with the outside world. Sure, the values are certainly in the cache, but other significant values could have changed outside our happy little environment. From this, there is another important lesson: try to make your processes interact as little as possible. As the saying goes: multithreading is not inherently difficult, it is the interaction which causes the problems.

In conclusion, this was a pretty useless post. I was really hoping to find a solution that really blew the others out of the water or at least showed a significant variation from running on the PC. Well, I guess now we know not to rely on explicit locks.

24 June 2009

Open Source and Cheating

I've been thinking a lot about Vapor's main project: Orion. For those of you that do not yet know, Orion is a real-time strategy game built on the XNA Framework. We're taking inspiration from Starcraft, Red Alert, Supreme Commander, Open Real-Time Strategy, Team Fortress 2 and a million other places. Basically, we are attempting to build an incredibly skill-oriented, luckless, fast-paced and fun RTS. I'm sure I'll talk more about it in the future.

Vapor Gaming is meant to be an open-source gaming company, but I am having serious thoughts about releasing the source code to the "flagship product." There are about a million reasons why I would really like to, a billion that do not influence me either way and a single reason why I do not want to. The single reason is: cheaters.

It isn't that I care about potential hackers looking at the source code, since security through obscurity is a flawed concept. What I care about is guaranteeing that when two people are playing over the internet, those two people are running the same executable. More specifically, I care that every process in the application's current run domain is supposed to be there and has not been altered. The .NET Framework has quite a few systems to help with this in a local context, using a mixture of obfuscation, strong-name signatures and watermarking (and these are pretty tough; a 1024-bit RSA key has never been broken: see Bruce Schneier). However, all of these systems are designed to protect the user from the code, not the other way around.

Let's assume that we release the source code and people can rebuild it at will. Even without the private keys used to encrypt and sign the assembly, a hacker can recompile the assembly to do whatever he wants, yet tell the remote system that it is perfectly valid. Is there a way to prevent this? Even John Carmack can't figure it out.

I do not want to treat all people as "potential hackers," but that is the reality of the situation. Ill treatment of users (or people in general) is not a good thing, but reducing in-game hacking is an overall benefit to the community (as opposed to DRM systems), especially since too much hacking can completely destroy a game - it is no fun to lose because the other player hacked their way to victory.

Client-side solutions will not work because code can be trivially recompiled. A method such as TestAllAssembliesValid() could just be replaced with one that returns true (or true after a slight delay, so as to emulate actually performing the checks). The craziest case I could come up with is the server sending an algorithm in script to run against the assemblies and return the result under a secret key, but a "bad" assembly could just load up a legitimate copy and run the algorithm against that. It would not matter if things were changed, as it would just be a matter of time before the next round of hacks came out, and I would rather not have an arms race.

A solution that would work decently is to host all ranked games on Vapor-owned servers. In this case, if a client attempts to do something seedy, such as raise its resource count, the server would not know or care about it, so the hacker would not be able to reap the benefits (there is no Vapor Engine network message for 'Please set my resources to ____' and all gameplay state changes have to go through the server). However, this might not be feasible from a cost standpoint, as the number of servers needed to run such a system has yet to be determined. Furthermore, this would not prevent client-only hacks, such as map hacks and map reveals.

Another option is to use server-side heuristics to determine if someone is cheating. Since ranked match results will be communicated to the server, it would be possible to apply some sort of algorithm to the reported conditions to analyze whether someone is cheating. However, the accuracy of such a method is questionable at best. I am not entirely convinced that it is possible to accurately determine anything, even if an entire game replay is uploaded (as the replay system could be tinkered with on recompilation).

It's really starting to look like there isn't a good way to programmatically catch cheaters. So what if there is an easy system to report cheaters in? The ultimate issue with this is the sheer amount of noise generated by users. I've played enough RTS in my day to know that any time there is a cheating board, it is quickly filled with "OMG HE MAP HACKS!" and enough FPS to know there are a ton of "OMG WALL HACK" and "OMG AIM BOT" accusations. Giving people an easier means to commit this stupidity would only generate more noise.

However, there are already a ton of systems for noise filtering. Some of these are community-based. YouTube's inappropriate content flag is a good example of one; Slashdot's moderation system is a more advanced example of collaborative flagging. These solve the problem of noise filtration by shifting much of the work to the people playing the game. It allows community members to participate in game moderation. A system could feasibly analyze "cheat" reports. If a user's last three games have been flagged as "cheating" by three different people, then the user is probably cheating.

So what do we do when we catch a cheater? Ban him for life. We are trying to build a game based on game skills, not hacking skills (those are good for other things). If you are not going to have fun playing by the rules, we do not want you in our community.

16 June 2009

Multi-Threaded Synchronization Performance in C#

I spend a lot of time studying and debating the many ways application performance can be improved. When working with applications that involve network usage or any other high-latency read process, making them multi-threaded is one of the best ways to decrease turnaround time. While waiting for more data to come in over the network, it is almost universally a Good Thing to start processing the completed jobs in the background. The next few posts will be dedicated to multi-threading performance, detailing the creation of a simple producer/consumer system and a reader/writer lock.

Let's start with the basics: A worker thread needs to communicate with its parent and while these interactions should be minimized, they are, at some point, necessary. What is the best way to get information from one thread to another? What is the least expensive synchronization mechanism?

After perusing the internet, there appears to be a shortage of real-world benchmarks for different methods of synchronization. This likely comes from the fact that performance is so incredibly varied. The same test may perform better with one algorithm on a dual-core system and the reverse on a single-core. Algorithms that work for all cases on an x86 or x64 platform may flat-out fail on other architectures such as Power. Universally, the choice of synchronization solution depends on variables and conditions not directly related to the algorithm. But, being an American, I love to see how things fare in head-to-head combat.

To have our battle to the death of algorithms, we must first choose what our code should accomplish. I've chosen to increment a shared long integer, as the actual operation we are attempting will take relatively few cycles compared to the synchronization mechanisms used to generate the correct result. A shared long also ensures that memory reads and writes cannot be assumed to happen atomically without extra overhead (on a 32-bit platform, a 64-bit value spans two machine words).

In summary: There will be 1 000 threads incrementing a shared long integer 100 000 times.

Just so you know how things are working, this is basically how threads will be run:

TimeSpan ThreadTest(ThreadStart threadStart, out long value)
{
    // start our long at 0
    m_value = 0;

    // create all the threads to point at threadStart --
    // which should be a function pointer to an incrementer
    List<Thread> threads = new List<Thread>();
    for (int i = 0; i < 1000; i++)
    {
        Thread t = new Thread(threadStart);
        threads.Add(t);
    }

    // collect garbage now to minimize interference
    GC.Collect();

    // start each of the threads
    DateTime start = DateTime.Now;
    foreach (Thread t in threads)
        t.Start();

    // join each of the threads until completion
    foreach (Thread t in threads)
        t.Join();
    DateTime finish = DateTime.Now;

    value = m_value;
    return finish - start;
}
So: how can we accomplish this goal? The simplest solution would be to have 1000 threads looping and incrementing 100 000 times, like so:

private void NoConcurrency()
{
    for (int i = 0; i < INCREMENTS; i++)
        m_value++;
}
This process, however, will not give you consistent results; with this many threads and increments, m_value will never finish with the correct value of 100 000 000. The reason is that m_value++ is not an atomic operation; it actually looks more like:
long tempVal = m_value + 1;
m_value = tempVal;
With many threads all reading and writing, the value of m_value could easily have changed between the read and the write-back. Ultimately, the final value will be much less than the desired 100 000 000.

The classic solution to this problem is to use explicit locks. C# has a convenient construct for this, wherein one simply writes lock(object) { ... } and a mutual-exclusion lock designated by the given reference object is acquired before the code block and released afterwards. Changing our original non-concurrent method is quite simple:
private void MutexBlock()
{
    for (int i = 0; i < INCREMENTS; i++)
        lock (lockObject)
            m_value++;
}
Wow! That was easy! Now m_value will end at the desired value. This is .NET's convenient mutual-exclusion lock and it can be used with any reference object. It means that only one thread can hold the lock on the given lock object at a time. Since we're assuming all access to m_value is done in this function, the read, increment and write-back of the value happen atomically (technically not 100% true: if somebody Ctrl+Alt+Deletes the process, that notion flies out the window).

But incrementing something is such a common operation and locking is slow...surely there must be a faster way of performing this operation without the use of locks. Of course! Enter .NET's convenient batch of functions inside the Interlocked class.
private void InterlockedIncrement()
{
    for (int i = 0; i < INCREMENTS; i++)
        Interlocked.Increment(ref m_value);
}
That's neat - somebody made it so you can just increment a value - and in a single operation, too! Be wary, though: this does not compile directly to an x86 assembly statement like lock inc [eax]. There is a lot more going on behind the scenes, with memory barriers and some volatile writes, so that when we get off of x86 and move to something more interesting like PowerPC, our assumptions still hold true.

I'm sure you can see the shortcoming of the Interlocked.Increment function: you can only increment an integer or a long by 1. Okay, so there is an Interlocked.Add function which lets you add (or subtract) values, but none of these are very interesting. No, it doesn't matter for the current operation, but I'm sure you can see that it would be beneficial to have some flexibility.

What we need is a system of locking that doesn't actually use the lock construct. Luckily, there are a million ways to do this. One of them is through the use of Interlocked.Exchange.
private void InterlockedExchangeBlock()
{
    for (int i = 0; i < INCREMENTS; i++)
    {
        // swap -1 into m_value and spin until the value we get
        // back is not the -1 "locked" marker
        long val;
        while ((val = Interlocked.Exchange(ref m_value, -1)) == -1) ;
        val++;
        Interlocked.Exchange(ref m_value, val);
    }
}
What is happening here? First, we're swapping out the value at m_value with -1 and checking if the value we received is -1. We keep trying again and again until we get a value that is not -1. Then, we increment the value we found in m_value and save it back.

The keen observer will notice that this is basically the lock construct, except we are wasting CPU cycles by spinning. Why would we do this instead of using an explicit lock? The answer is actually quite complicated. When a .NET thread is unable to obtain a lock for a mutex, it yields to let the operating system reschedule the thread. That is fine, but the thread could be rescheduled a long time in the future and if we know that the value is going to become readable very soon, we should spin-wait until we can grab it. Yes, we'll waste some CPU cycles, but it's all in the name of making a super bad-ass program.

But we've definitely freed ourselves of the problem with Interlocked.Increment - sure, we're only incrementing, but we could have multiplied m_value by 10 or found the m_valueth Fibonacci number before storing it back in its place. The only real problem we run across is if m_value changes while we're performing the math on it. Is there an atomic way to check for this sort of thing? Of course! That's what Interlocked.CompareExchange is for - it takes one more argument, which is compared to the value in memory, and the exchange is only performed if the two are equal.
private void InterlockedCompareExchange()
{
    for (int i = 0; i < INCREMENTS; i++)
    {
        long orig, newval;
        do
        {
            orig = m_value;
            newval = orig + 1;
        } while (Interlocked.CompareExchange(ref m_value, newval, orig) != orig);
    }
}
This is a very strange-looking function. We grab the original value from the memory location, perform the math and then do something funny. What that while clause means in English is:
  1. Compare the current value of m_value with what we originally pulled
  2. If they are equal, put the new value into m_value
  3. If they are not, retry the loop
This is true lockless programming. At no point is it possible for a thread to stall - if something bad happens and m_value is changed, the loop will be repeated, which is not really that big of a deal. This operation is persistent and atomic.

But it turns out that the biggest advantage of this algorithm is also its biggest weakness. In this case, redoing the operation is trivial - reading from memory and adding 1 is really, really cheap. When that work is thrown out, we have wasted very few cycles. But if we were doing something more expensive, such as downloading a web site and performing actions based on the received information, throwing out that information represents a huge waste. Also, as the time span between the initial read and the final write-back increases, the likelihood of that original read becoming invalid increases. It's really an art.
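
The loop generalizes nicely, too. Here is a little helper I might wrap around it (a sketch; AtomicUpdate is my own name, not a framework method, and it assumes .NET 3.5 for Func):

using System;
using System.Threading;

static class AtomicHelper
{
    // Applies an arbitrary pure function to a shared long, retrying
    // until no other thread has touched the value mid-computation.
    public static long AtomicUpdate(ref long location, Func<long, long> transform)
    {
        long orig, newval;
        do
        {
            orig = Interlocked.Read(ref location); // atomic 64-bit read
            newval = transform(orig);              // may execute more than once!
        } while (Interlocked.CompareExchange(ref location, newval, orig) != orig);
        return newval;
    }
}

// Usage: AtomicHelper.AtomicUpdate(ref m_value, v => v * 10);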

Okay, enough talk. What are the results? I ran the tests described here as well as "batch" versions, which obtain the lock and then perform the math. See the link at the bottom of the post for the exact source code. All of the tests were run under DEBUG and RELEASE compilation, both with the Visual Studio debugger attached and by themselves...here is a table of results:

Running Time (in seconds)

Test Name                            | DEBUG, Attached | DEBUG, Clean | RELEASE, Attached | RELEASE, Clean
No Concurrency *                     | 3.374 (91%)     | 0.499 (75%)  | 3.843 (93%)       | 0.281 (34%)
Interlocked.Increment                | 12.124          | 3.687        | 12.890            | 3.749
Interlocked.CompareExchange          | 10.249          | 4.812        | 11.156            | 4.249
lock(object)                         | 14.460          | 4.531        | 13.671            | 4.390
Interlocked.Exchange                 | 15.265          | 11.796       | 16.234            | 9.421
AutoResetEvent **                    | 1:35.655        | 1:54.999     | 2:22.670          | 1:02.773
No Concurrency (batch) *             | 4.203 (76%)     | 0.218 (50%)  | 3.630 (78%)       | 0.910 (51%)
Interlocked.CompareExchange (batch)  | 4.218           | 0.390        | 3.383             | 0.137
lock(object) (batch)                 | 4.620           | 0.421        | 4.656             | 0.265
AutoResetEvent (batch)               | 2.828           | 0.437        | 2.812             | 0.281
Interlocked.Exchange (batch)         | 4.765           | 0.374        | 4.593             | 0.780

* Non-synchronizing algorithm yields incorrect results -- the number in parentheses is the percentage of the correct total actually reached.
** The AutoResetEvent test performs 1/100th of the work of the other algorithms, due to general slowness.



The most interesting results are the batch operations of Interlocked.Exchange and Interlocked.CompareExchange: they perform significantly faster than their non-concurrent counterparts. How could that be? These algorithms spend a portion of their time spinning in while loops and repeating work. There are a couple things going on here to cause this result.

The first factor is CPU burst. Remember, these 1000 threads are not actually executing in parallel; each is given a certain amount of time with the CPU before being interrupted by the operating system. This means the internal for loop takes approximately 300000 cycles to complete (100000 loops * (2 increments/loop + 1 JZ/loop)), assuming it compiled into something like:
         mov ecx, 100000
         mov ebx, 1
$lStart: sub ecx, ebx
         jz SHORT $lFinish
         add eax, ebx
         jmp SHORT $lStart
$lFinish: ; whatever...

On my 3GHz computer, this would take approximately 0.094 milliseconds to complete.

This is where the second factor comes in: locality for run-time optimization. Both of these functions perform math on a stack-allocated local variable that is not passed to any other function. This is an ideal candidate to simply be kept in a register at run-time, as in my assembly example. The CLR does not compile the byte-code to exactly my assembly, but you get the idea. Knowing generally what the runtime environment is doing, and how you can help it optimize, is an important component of high-performance managed coding in any language.

Another interesting result is that the batch job with no concurrency takes longer than the straight non-concurrent method. However, this only seems to apply in the RELEASE build. In DEBUG, the batch non-concurrent method is consistently faster than the RELEASE build and always the fastest algorithm. The IL output for the two methods is identical, so something fishy must be going on. The Internet seems to be lacking information, so if anybody knows why emitting debug information seems to speed up the process, I would really like to know. The result is wrong either way, but it would still be nice to know.

The best conclusion that we can make, however, is that RELEASE without an attached debugger is the best monitor of threading behavior. It generally performs better than other build and run combinations (as it should) and, if nothing else, is the build and run environment that you should eventually be deploying to a production environment.

So what happens if we vary things like how much we're incrementing with each pass or how many threads are performing the incrementing? I modified the code to run through each of the tests with 10, 50, 100, 500, 1000, 2000 and 5000 threads and 25000, 50000, 75000, 100000, 150000 and 200000 increments. I also stopped testing AutoResetEvent because it sucks.

We could look directly at the numbers, but it's hard to decipher what the data means. Let's look at some pretty graphs brought to you by our good friend Excel! I thought the most interesting statistic would be a measure of throughput: How many effective increments are being performed each second?

First, the control group: Our non-concurrent method:

Non-concurrent performance graph
As you can see, throughput is higher with fewer threads and fewer increments. Standard allocation prefers to grab and let go. Perhaps this is more meaningful with something to compare to:

Interlocked.CompareExchange performance graph
With Interlocked.CompareExchange, you can see that it remains pretty consistently around 20 million increments/second. There is a slight dip with 50 threads and 75000 increments, although it certainly doesn't break the overall trend.

Interlock.CompareExchange batch performance graph
The batch version, however, almost doubles the throughput of the non-concurrent method and appears to increase steadily as the number of increments increases. Not only that, but the number of threads does not seem to affect the performance of the algorithm in the slightest.

Interlocked.Exchange batch performance graph
The algorithm with the highest throughput is none other than the batch version using Interlocked.Exchange, peaking at over a billion increments each second. The best reason for this is that we're only performing 200000 increments at a time, which is quite a cheap operation, all things considered.

If we constrain the amount of threads to 1000 (enough threads for heavy contention without completely thrashing the scheduler), we can make a nice 2-D graph of how different algorithms behave:
Performance summary chart

That graph pretty much summarizes it: batch operations are way faster than trying to synchronize on each increment. But if you cannot divide the work into separate batches, the Interlocked.Exchange method is the fastest, trailed at quite a distance by explicit locking. Good to know.

Source Code for the shared long test
The book of results

Come back next week to use this information for something useful!

08 June 2009

Open Source Software

Vapor Gaming's first release, currently known under the code name Aquila, will be released as open-source software. We're hosting it through Google Project Hosting; our repository can be found here. But that's all just facts. How did we get to that point?

After a fairly lengthy discussion with my fellow developer, Sean, we decided what we want the Aquila project to be. Yes, we want to make a web game and we want it to be cool. We want users not to need any plug-ins -- requiring only a supported browser. Even Flash is off-limits, even though support for it is almost ubiquitous. We want to use only the browser, which means pushing Javascript as far as possible. But that's really just client-side.

We want an open model where people are free to build their own extensions, even their own client. We'll provide the service and the first client, but what we are really trying to do is get people to have fun together. We are all on the Internet, most of us use Facebook, some of us use Twitter and by the time this post is published, there will be a million people on some new service. We want to make it possible for anyone to integrate our application anywhere it would be conceivable. There is no way that two developers can possibly track and develop plug-ins for all the places that people could want them.

From observation, when people use and enjoy a service, programmers among them love to integrate that service into other things, even if it is difficult. When I first started playing with programming (beyond "Hello World"), I created a console-based AIM client in C by adapting some awful VisualBASIC 6.0 library. When I started playing on the Internet, I spent quite some time crafting GETs for XmlHttpRequests and picking apart the HTML returned to me. Was it useful? Probably not, but it was fun to message my friends using something I had created.

But, while hand-crafting and parsing strings was doable, it was really just a means to an end. I ended up with applications that were not exactly hardy and could break when someone changed the page layout of their web site. It was neat to see, but creating a robust application that could qualify as "releasable" was impossible. When things are exposed as web services, where people can easily send querystring GETs, SOAP or JSON and get back some easy-to-parse XML data, it is easy to create applications with the information. Most people enjoy creating their own extension applications more than writing and maintaining adaptor code for other people's junk. More enjoyment in creating extensions means more people creating extensions, which means the application will reach a wider audience. Twitter is a great example of this - by exposing posting and reading as simple POSTs and RSS feeds, Twitter can go anywhere with a web connection (and with a little bit of adaptation, even to places that can't: SMS).

For this reason (and many others), there has been a large push for companies to expose functionality as web services. But there is nothing inherently open-source about this action. The Yahoo and Google query APIs are easy to use, yet the backing data is provided by a secretive, proprietary system. So we're really back at square one: Why open-source?

When it comes down to it, many of the advantages of open-source software are lost when providing the ultimate product as a service. The chance of forking drops severely because forking the project means losing all the users, which are the real value of any community. Modifications and security updates have to eventually go back to one spot: the server. One company effectively has the power to control all development, because updates not applied to trunk will never reach the majority of the community.

In reality, the two largest factors in our decision to open source the Aquila Engine come down to vanity and laziness. We genuinely think what we are doing is cool and we think that you will think it is cool, too. We hope that when you use whatever Aquila turns out to be, you will think "That's neat," and wonder how we did it. And guess what? The source code will be available for you to see how! A little self-centered, I know.

Furthermore, we enjoy writing software, not writing API documentation. We document in code as much as we can, because that is easy and genuinely helpful, but creating an independent page dedicated to documentation? Yuck! Having to copy and paste that documentation from our code to that page? Spare me! If someone wants to create better documentation than we can, great! Until that time, we can just point people towards the JSP pages that field service calls and say: here's how it's done. The self-documenting code that people see will be superior to some semi-descriptive documentation on an API reference page, as I am not particularly skilled at writing API docs.

Beyond this, there are some philosophical reasons for releasing software as free and open-source, but these topics have been covered a million times over by people who are better with words than I.

In summary, Aquila will be released open source, under the GPL v3 license (although I've been debating AGPL). In either case, things are looking fantastic!