The Basics of Software Transactional Memory

Jan 22 2012 | Concurrent Programming, Programming

As promised, it's time for software transactional memory!

A bit of background, first. For most of the history of computers, the way that we've built software is very strongly based on the fact that a computer has a processor - a single CPU, which can do one thing at a time, in order. Until recently, that was true. But now it really isn't anymore. There's the internet - which effectively means that no computer is ever really operating in isolation - it's part of a huge computational space shared with billions of other computers. And even ignoring the internet, we're rapidly approaching the point where tiny devices, like cellphones, will have more than one CPU.

The single processor assumption makes things easy. We humans tend to think very sequentially - that is, the way that we describe how to do things is: do this, then do that. We're not so good at thinking about how to do lots of things at the same time. Just think about human language. If I want to bake a cake, what I'll do is: measure out the butter and sugar, put them in the mixer, mix them until they're creamed. Then add milk. Then in a separate bowl, measure out and sift the flour and the baking powder. Then slowly pour the dry stuff into the wet, and mix it together. Etc.

I don't need to fully specify that order. If I had multiple bakers, they could do many of the steps at the same time. But how, in English, can I clearly say what needs to be done? I can say it, but it's awkward. It's harder to say, and harder to understand, than the sequential instructions.

What I need to do is identify the families of things that can be done at the same time, and then the points at which those things will merge.

All of the measurements can be done at the same time. In any mixing step, the mixing can't be done until all of the ingredients are ready. Ingredients being ready could mean two things. It could mean that the ingredients were measured out; or it could mean that one of the ingredients for the mix is the product of one of the previous mixing steps, and that that previous step is complete. In terms of programming, we'd say that the measurement steps are independent; and the points at which we need to wait for several things to get done (like "we can't mix the dry ingredients in to the wet until the wet ingredients have all been mixed and the dry ingredients have been measured"), we'd call synchronization points.

It gets really complicated, really quickly. In fact, it gets even more complicated than you might think. You see, if you were to write out the parallel instructions for this, you'd probably leave a bunch of places where you didn't quite fully specify things - because you'd be relying on stuff like common sense on the part of your bakers. For example, you'd probably say to turn on the oven to preheat, but you wouldn't specifically say to wait until it reached the correct temperature to put stuff into it; you wouldn't mention things like "open the oven door, then put the cake into it, then close it".

When we're dealing with multiple processors, we get exactly those kinds of problems. We need to figure out what can be done at the same time; and we need to figure out what the synchronization points are. And we also need to figure out how to do the synchronization. When we're talking about human bakers, "don't mix until the other stuff is ready" is fine. But in software, we need to consider things like "How do we know when the previous steps are done?".

And it gets worse than that. When you have a set of processors doing things at the same time, you can get something called a race condition which can totally screw things up!

For example, imagine that we're counting ingredients as they're measured. We could imagine the mixer process as watching a counter, waiting until all five ingredients have been measured. Each measuring process would do its measurement, and then increment the counter.

  var measureCount = 0
  process Mixer() {
    wait until measureCount == 5
  }

  process Measurer() {
     do_measure
     measureCount = measureCount + 1
  }

What happens if two measurers finish at almost the same time? The last statement in Measurer actually consists of three steps: retrieve the value of measureCount; add one; store the incremented value. So we could wind up with:

  Time | Measurer1              | Measurer2              | measureCount
  -----+------------------------+------------------------+-------------
    0  |                        |                        |      1
    1  | Fetch measureCount(=1) |                        |      1
    2  | Increment(=2)          | Fetch measureCount(=1) |      1
    3  | Store updated(=2)      | Increment(=2)          |      2
    4  |                        | Store updated(=2)      |      2

Now, Mixer will never get run! Because of the way that the two Measurers overlapped, one of the increments effectively got lost, and so the count will never reach 5. And the way it's written, there's absolutely no way to tell whether or not that happened. Most of the time, it will probably work - because the two processes have to hit the code that increments the counter at exactly the same time in order for there to be a problem. But every once in a while, for no obvious reason, the program will fail - the mixing will never get done. It's the worst kind of error to try to debug - one which is completely unpredictable. And if you try to run it in a debugger, because the debugger slows things down, you probably won't be able to reproduce it!
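You can replay that interleaving deterministically by splitting the increment into its three steps by hand. Here's a runnable Python sketch of exactly the schedule in the table above (the variable names are mine, standing in for the two Measurer processes):

```python
# Replay the lost-update interleaving from the table, one step at a time.
# Each "process" does fetch, increment, and store as three separate steps.
measure_count = 1       # one measurer has already finished

m1 = measure_count      # time 1: Measurer1 fetches (=1)
m2 = measure_count      # time 2: Measurer2 fetches the same value (=1)
m1 = m1 + 1             # time 2: Measurer1 increments (=2)
measure_count = m1      # time 3: Measurer1 stores (=2)
m2 = m2 + 1             # time 3: Measurer2 increments (=2)
measure_count = m2      # time 4: Measurer2 stores (=2) - an increment is lost

print(measure_count)    # 2, not the 3 we wanted
```

In a real program the scheduler picks the interleaving for you, which is why the bug only shows up once in a while.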

This kind of issue always comes down to coordination or synchronization of some kind - that is, the main issue is how do the different processes interact without stomping on each other?

The simplest approach is to use something called a lock. A lock is an object which signals ownership of some resource, and which has one really important property: in concept, it points at at most one process, and updating it is atomic, meaning that when you want to look at it and update it, nothing can intervene between the read and the write. So if you want to use the resource managed by the lock, you can look at the lock, and see if anyone is using it; and if not, set it to point to you. That process is called acquiring the lock.

In general, we wrap locks up in libraries to make them easier to use. If "l" were a lock, you'd take it by using a function something like "lock(l)", which really expands to something like:

def lock(L: Lock) {
  while (L != me)
     atomically do if L == no one then L = me
}
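On real hardware, that "atomically do" step is a single instruction, usually a compare-and-swap. Here's a toy sketch of the idea in runnable Python. The SpinLock class and all of its names are mine, invented for illustration - and since Python doesn't expose the hardware instruction, the atomicity of the compare-and-swap is faked with a tiny internal lock:

```python
import threading

NOBODY = None

class SpinLock:
    """A toy lock: acquiring it is a loop around one atomic compare-and-swap."""
    def __init__(self):
        self._owner = NOBODY
        # On real hardware, the compare-and-swap below is one atomic
        # instruction; here we fake its atomicity with an internal lock.
        self._cas_guard = threading.Lock()

    def _compare_and_swap(self, expected, new_owner):
        with self._cas_guard:
            if self._owner is expected:
                self._owner = new_owner
                return True
            return False

    def lock(self):
        me = threading.get_ident()
        # "while (L != me) atomically do if L == no one then L = me"
        while not self._compare_and_swap(NOBODY, me):
            pass  # spin until the current owner releases

    def unlock(self):
        self._owner = NOBODY

# Usage: two threads hammer on a shared counter; the lock prevents
# the lost-update race from the earlier example.
count = 0
spin = SpinLock()

def worker():
    global count
    for _ in range(10000):
        spin.lock()
        count += 1
        spin.unlock()

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(count)  # 20000 - no updates lost
```

Spinning in a loop like this wastes CPU, which is why real lock implementations put waiting threads to sleep instead - but the acquire-by-atomic-update idea is the same.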

So the earlier code becomes:

var measureCount = 0
val l = new Lock()

process Mixer() {
  wait until measureCount == 5
}

process Measurer() {
  do_measure
  lock(l)
  measureCount = measureCount + 1
  unlock(l)
}
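For comparison, here's the same Mixer/Measurer structure as runnable Python, using the standard library's threading module. A Condition stands in for the "wait until" - the pseudocode doesn't say how the waiting works, so that part is my choice:

```python
import threading

measure_count = 0
lock = threading.Lock()
counter_changed = threading.Condition(lock)

def measurer():
    global measure_count
    # do_measure would happen here
    with lock:                  # lock(l)
        measure_count += 1      # measureCount = measureCount + 1
        counter_changed.notify_all()
                                # unlock(l) happens when the with-block exits

def mixer():
    with counter_changed:       # "wait until measureCount == 5"
        counter_changed.wait_for(lambda: measure_count == 5)
    # all five ingredients are measured; safe to mix

mixer_thread = threading.Thread(target=mixer)
mixer_thread.start()
measurers = [threading.Thread(target=measurer) for _ in range(5)]
for t in measurers: t.start()
for t in measurers: t.join()
mixer_thread.join()
print(measure_count)  # 5
```

Note that even the "wait until" needs coordination machinery of its own - the condition variable - which is exactly the kind of thing that's easy to get wrong by hand.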

In a simple example like that, locks are really easy. Unfortunately, real examples get messy. For instance, there's a situation called deadlock. A classic demonstration is something called the dining philosophers. You've got four philosophers sitting at a dinner table. Between each pair, there's a chopstick. In order to eat, a philosopher needs two chopsticks. When they get two chopsticks, they'll use them to take a single bite of food, and then they'll put down the chopsticks. If each philosopher starts by grabbing the chopstick to their right, then no one gets to eat. Each has one chopstick, and there's no way for them to get a second one.

That's exactly what happens in a real system. You lock each resource that you want to share. For some operations, you're going to use more than one shared resource, and so you need two locks. If you have multiple tasks that need multiple resources, it's easy to wind up with a situation where each task holds some subset of the locks that it needs, and they all wait forever for the rest.
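One standard way out is to make every task grab its locks in the same global order, which makes the deadlocked cycle impossible. Here's a runnable Python sketch of the dining philosophers with that fix (the setup is mine, and ordered acquisition is just one of several possible fixes):

```python
import threading

# Four philosophers, four chopsticks. Acquiring locks in a fixed global
# order (lowest-numbered chopstick first) breaks the circular wait that
# causes deadlock in the "everyone grabs their right chopstick" strategy.
chopsticks = [threading.Lock() for _ in range(4)]
bites = [0] * 4

def philosopher(i, n_bites=10):
    left, right = i, (i + 1) % 4
    first, second = min(left, right), max(left, right)  # global order
    for _ in range(n_bites):
        with chopsticks[first]:
            with chopsticks[second]:
                bites[i] += 1  # take one bite
        # both chopsticks go back on the table here

threads = [threading.Thread(target=philosopher, args=(i,)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(bites)  # [10, 10, 10, 10]
```

The catch, of course, is that "always acquire in the same order" is a discipline the programmer has to maintain by hand across the whole program - nothing enforces it, which is part of why raw locks get hairy.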

Things like deadlock mean that simple locks get hairy really quickly. Not that any of the more complex coordination strategies make deadlocks impossible; you can always find a way of creating a deadlock in any system - but it's a lot easier to create accidental deadlocks using simple locks than, say, actors.

So there's a ton of methods that try to make it easier to do coordination between multiple tasks. Under the covers, these ultimately rely on primitives like locks (or semaphores, another similar primitive coordination tool). But they provide a more structured way of using them. Just like structured control flow makes code cleaner, clearer, and more maintainable, structured coordination mechanisms make concurrency cleaner, clearer, and more maintainable.

Software transactional memory is one approach to this problem, which is currently very trendy. It's still not entirely clear to me whether or not STM is really quite up to the real world - current implementations remain promising, but inefficient. But before getting in to any of that, we need to talk about just what it is.

As I see it, STM is based on two fundamental concepts:

  1. Optimism. In software terms, by optimism, we mean that we're going to plow ahead and assume that there aren't any errors; when we're done, we'll check if there was a problem, and clean up if necessary. A good example of this from my own background is source code control systems. In the older systems like RCS, you'd lock a source file before you edited it; then you'd make your changes, and check them in, and release the lock. That way, you know that you're never going to have two people making changes to the same file at the same time. But the downside is that you end up with lots of people sitting around doing nothing, waiting to get the lock on the file they want to change. Odds are, they weren't going to change the same part of the file as the guy who has the lock. But in order to make sure that they can't, the locks also block a lot of real work. Eventually, the optimistic systems came along, and what they did was say: "go ahead and edit the file all you want. But before I let you check in (save) the edited file to the shared repository, I'm going to make sure that no one changed it in a way that will mess things up. If I find out that someone did, then you're going to have to fix it."
  2. Transactions. A transaction is a concept from (among other places) databases. In a database, you often make a collection of changes that are, conceptually, all part of the same update. By making them a transaction, they become one atomic block - and either the entire collection succeeds, or the entire collection fails. Transactions guarantee that you'll never end up in a situation where half of the changes in an update got written to the database, and the other half didn't.

What happens in STM is that you have some collection of special memory locations or variables. You're only allowed to edit those variables in a transaction block. Whenever a program enters a transaction block, it gets a copy of the transaction variables, and just plows ahead, and does whatever it wants with its copy of the transaction variables. When it gets to the end of the block, it tries to commit the transaction - that is, it tries to update the master variables with the values of its copies. But it checks them first, to make sure that the current value of the master copies haven't changed since the time that it made its copy. If they did, it goes back and starts over, re-running the transaction block. So if anyone else updated any of the transaction variables, the transaction would fail, and then get re-executed.
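To make the copy, validate, commit, retry cycle concrete, here's a toy STM in runnable Python. All of the names (TVar, atomically, read, write) are made up for illustration, and a real implementation would be far more careful and far more efficient; this just shows the mechanism described above:

```python
import threading

class TVar:
    """A transactional variable: a value plus a version number."""
    def __init__(self, value):
        self.value = value
        self.version = 0

# One global lock that serializes only the validate-and-commit step;
# the body of a transaction runs without holding it.
_commit_lock = threading.Lock()

def atomically(block):
    """Run block optimistically, re-running it until it commits cleanly.

    block receives read and write functions that work on private copies,
    so the master TVars are untouched until commit.
    """
    while True:
        reads = {}   # TVar -> version number seen when first read
        writes = {}  # TVar -> locally updated value

        def read(tv):
            if tv in writes:
                return writes[tv]
            if tv not in reads:
                reads[tv] = tv.version
            return tv.value

        def write(tv, new_value):
            if tv not in reads:
                reads[tv] = tv.version
            writes[tv] = new_value

        result = block(read, write)

        with _commit_lock:
            # Validate: no one has committed to anything we read.
            if all(tv.version == seen for tv, seen in reads.items()):
                for tv, new_value in writes.items():
                    tv.value = new_value
                    tv.version += 1
                return result
        # Conflict detected - throw the copies away and re-run the block.

# Usage: five "measurers" each bump the counter 400 times, concurrently.
measure_count = TVar(0)

def measurer():
    for _ in range(400):
        atomically(lambda read, write: write(measure_count, read(measure_count) + 1))

threads = [threading.Thread(target=measurer) for _ in range(5)]
for t in threads: t.start()
for t in threads: t.join()
print(measure_count.value)  # 2000 - no increments lost
```

Notice that nothing in the measurer code mentions locks: the conflict detection and the retries all happen inside atomically, which is the whole appeal.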

In terms of our baking example, both of the measurers would enter the transaction block at the same time; and then whichever finished first would commit its transaction, which would update the master count variable. Then when the second transaction finished, it would check the count variable, see that it changed, and go back and start over - fetching the new value of the master count variable, incrementing it, and then committing the result. In terms of code, you'd just do something like:

transactional var measureCount = 0

process Mixer() {
  wait until measureCount == 5
}

process Measurer() {
  do_measure
  atomically {
    measureCount = measureCount + 1
  }
}

It's really that simple. You just mark all the shared resources as transactional, and then wrap the code that modifies them in a transaction block. And it just works. It's a very elegant solution.

Of course there's a lot more to it, but that's the basic idea. In the code, you identify the transactional variables, and only allow them to be updated inside of a transaction block. At runtime, when you encounter a transaction block, charge ahead, and do whatever you want. Then when you finish, make sure that there isn't an error. If there was, try again.

So what's it look like in a non-contrived programming language? These days, I'm doing most of my coding in Scala. There's a decent STM implementation for Scala as a part of a package called Akka.

In Akka, the way that you define a transactional variable is by using a Ref type. A Ref is basically a cell that wraps a value. (It's almost like a pointer value in C.) So, for example, in our Baker example:

val count: Ref[Int] = Ref(0)

Then in code, to use it, you literally just wrap the code that modifies the Refs in "atomic". Alas, you don't quite get to treat the refs like normal variables - to access the value of a ref, you need to call Ref.get; to change the value, you need to use a method alter, which takes a function that computes the new value in terms of the old.

class Measurer {
  def doMeasure() {
    // do the measuring stuff
    atomic {
      count.alter(_ + 1)
    }
  }
}

The "(_ + 1)" probably needs a quick explanation. In Scala, you can define a single expression function using "_" to mark the slot where the parameter should go. So "(_ + 1)" is equivalent to the lambda expression { x => x + 1}.

You can see, just from this tiny example, why STM is such an attractive approach. It's so simple! It's very easy to read and write. It's a very simple, natural model. It's brilliant in its simplicity. Of course, there's more to it than what I've written about here - error handling, voluntary transaction abort, retry management, transaction waits - but this is the basics, and it really is this simple.

What are the problems with this approach?

  1. Impurity. If not all variables are transactional, and you can modify a non-transactional variable inside of a transaction block, then you've got a mess on your hands. Values of transactional variables can "leak" out of transaction blocks.
  2. Inefficiency. You've got to either copy all of the transactional variables at the entry to the transaction block, or you've got to use some kind of copy-on-write strategy. However you do it, you've got grief - aggressive copying, copy-on-write, memory protection - they've all got non-trivial costs. And re-doing the entire transaction block every time it fails can eat a lot of CPU.
  3. Fairness. This is a fancy term for "what if one guy keeps getting screwed?" You've got lots of processes all working with the same shared resources, which are protected behind the transactional variables. It's possible for timing to work out so that one process keeps getting stuck doing the re-tries. This is something that comes up a lot in coordination strategies for concurrency, and the implementations can get pretty hairy trying to make sure that they don't dump all of the burden of retries on one process.

14 responses so far

  • Joe Shelby says:

    I had an interesting exercise a few (ok, almost 10) years ago in Java, Threads, and Synchronization. For an agent (simulation) system that had to do some modeling of reality, it became important that the next item to receive the locked element (Java threads follow a standard semaphore model) was the next item that had asked for it. Normally, a java semaphore, once it is released, can randomly go to any of the other threads asking for it - there's no set rule, the ultimate in "race condition" 🙂. I had to code a wrapper around the semaphore system, so that items would ask for it by joining a queue, and then wait - the wait was on the queue itself rather than the item. When the item was released, the queue would wake up, find the next in line, and release them with a reference to the object it was waiting for. Yes, there was a race to get into the queue, but the queue maintained the integrity of the order after that.

    Next steps would have been priority queue overrides and detached waiting - being informed that a thread had a resource via an event rather than blocking (useful when it was the core UI thread that needed the asset), but the core use case had been met so I never did much more with it. The UI case was having a UI element check the value of the asset as it changes without having to code UI callbacks into the separate other threads working on it. I could attach a monitor to the system during development and have it work without having to code UI actions (and all their single-threaded concerns) into the core framework that wouldn't need it in production, and also without having to code such UI alerts into the main worker threads.

  • billb says:

    In your second example code, you've defined the "take" function, but you've used "lock" everywhere else. (Feel free to delete this comment after it's irrelevant.)

  • John Millwe says:

    I have found Haskell's STM implementation to be quite well thought out and to not suffer from at least problem #1. Having a pure language helps. See http://research.microsoft.com/en-us/um/people/simonpj/papers/stm/stm.pdf for more information.

  • Ørjan Johansen says:

    Hi, I usually read this site via RSS but today I wanted to look at the comments on this page, and discovered that Scientopia seems to have entered Microsoft's dangerous site filter. I reported it and trust you enough to override it, but I guess you might want to do something about it yourself.

  • Victor Eijkhout says:

    Thanks for the clear explanation.

    Related but unrelated question. Why does the whole of CS only concern itself with shared memory parallelism? Is distributed memory too hard? Too easy? Practical parallel computing is now using close to half a million processors, largely with distributed memory, yet somehow this seems to have escaped the theorists. Why?

    • MarkCC says:

      Distributed memory is just a different problem.

      For distributed memory, there's been lots of work for a long time. (Including, coincidentally, my own dissertation.) And that work produced a lot of really great stuff. In particular, there's mapreduce - both the Google version, and the Hadoop open-source version. Mapreduce is, without a doubt, the most successful parallel programming model ever, and it's not shared memory. And there's stuff like MPI, parallel fortran, etc., which do a really good job at highly parallel numeric code. In that sense, we've got a solution that's good enough for now.

      When it comes to shared memory, the simple fact is that the way that the computing world is now, it's the problem most in need of a solution. My laptop has 4 processors. My tablet has two processors. Pretty much every computing device that you can put your hands on these days has more than one processor. But programming in a way that allows a single application to make really effective use of those processors is still extremely difficult.

      For 1 or 2 processors, letting different apps each grab a processor is fine. For my two-processor tablet, you don't need to specifically write multithreaded code. But a 16-CPU server is not at all uncommon these days. And the number of cores per machine is increasing. How do we write code to take advantage of that? The answer is, we still don't really have a good solution.

      • Shadonis says:

        Won't all this stuff change dramatically anyway once we start pushing up against the limits of Moore's Law here shortly?

        • Pushing up against the limits of Moore's Law is why shared-memory parallelism has become interesting.

          Distributed memory has always been a research topic - anyone processing large data sets (either as input, output, or an intermediate stage - so Google processing the web sites it's cached for indexing, or a petroleum company processing data from a potential drill site, or a physicist simulating a sub-atomic particle collision) has needed machines that are bigger than you can have built as a single system. They've thus had to build distributed shared memory machines, and work out how to program them, allowing for things like network delays and the like.

          Because of this history, distributed memory machines are well understood - if you need to program one, you reach for one of the libraries Mark mentioned and the problem, while not solved, is dealt with.

          • Victor Eijkhout says:

            Distributed memory libraries exist, true. And any number of shared memory threading libraries exist. What puzzles me is that for shared memory there are any number of theoretical papers, while I can't find anything for distributed memory.

            @Shadonis: yes, parallelism is going to be very hard. Right now you have shared and distributed, but no mix. In a few years the only way to go up will be to mix the two, and then there is no theory and little software beyond declaring "MPI+X" where X is OpenMP, CUDA, Cilk, whatever.

            @MarkCC: thanks for pointing out Mapreduce. It's an interesting model. My main problem with it is that it hides the architecture completely, which is not a good idea if you have algorithms that differ in orders of magnitude performance behaviour depending on how you lay them out across your machine. In Google's case everything is randomly distributed so one layout is as good as the next one. In scientific applications that is usually not the case.

  • Snarkyxanf says:

    Recipes are also a good example of a solution to a class of problems that come up frequently, in that they are basically a domain specific language, with declarative parts. There is rarely a procedure for the measurements given, just a list of the results. An instruction like "preheat the oven" is more of an optimization directive for the later declared constraint of "cook at X degrees".

    Unlike a procedural language, most of the ordering is actually imposed by the person executing it. Of course, just like good and bad compilers, some cooks are much better optimizers than others.

    Sorry for pushing the metaphor too far.

    • Engywuck says:

      well, even the results are somewhat vague in a recipe. "bake on medium temperature until light brown" - what is "medium" and "light brown" in this case? Or "add milk until it has the desired consistency", "use a bit cinnamon for better taste".

  • Peter says:

    Hi Mark,

    I'm an old school IMS programmer where a transaction is any data: field or record, that is passed between screens, programs, or database. We have dozens of people updating databases real time on our IBM mainframe without issues. The system hasn't really changed in decades.

    My question may be a little off topic, or perhaps a new topic altogether. I guarantee if you can solve this issue you will be a billionaire. Besides being an IMS programmer I am an avid gamer. I love my PS3. The graphics are great. No improvements are needed now. The games are as realistic as you would want. The major frustration is the load times. You play a small level and go through a door and bam, loading delay. I don't see Sony or Microsoft solving this problem anytime soon. So what would be necessary to solve this, massive amounts of memory with a CPU always loading the next scene before you get there ($$$)? If you've played Skyrim you know what I am talking about. Beautiful game, but every few minutes there is a loading delay. I would like to know more about the cause and possible solutions to this issue.

  • Filip Szczepański says:

    I think there is a problem with the mixer and measurer example. Introducing locks or the atomically block doesn't actually guarantee that Measurer won't execute twice between the executions of Mixer, so Mixer can still miss when the value is 5.
