Concurrent garbage collection

July 31, 2019

In the previous post I looked a little bit at the fundamental principles of garbage collection, and now I’d like to talk a little about concurrent garbage collection and why it’s a much harder proposition.

Up until now our assumption has been that whilst the GC is running the data in the heap is constant - this can either be because the application has been paused (and thus nothing is modifying it), or because some mechanism has been used to take a snapshot of the heap in a consistent state (we’ll talk a little about that later). The “concurrent” case is where neither of those are true, and garbage collection is happening at the same time as the application is running (and thus modifying the heap).

On an algorithmic level, it’s only the mark phase of a mark/sweep GC that has interesting (read: awkward) concurrency properties - the sweep phase is purely concerned with objects that have been marked as unreachable, and thus by definition cannot be accessed by the application. This isn’t true with more advanced collectors that perform heap compaction and other shenanigans involving relocating live objects, however.

Depending on the situation the application and GC can actually be truly concurrent (as in running simultaneously on different threads), or time-sliced in some fashion (where execution alternates between doing garbage collection and regular application work - often referred to as “incremental” garbage collection as each chunk of GC work continues from where the last left off). The first of these two scenarios is slightly more awkward as you need to care about the atomicity of pointer writes and other “micro-level” concurrency headaches, but most of the fundamental issues are the same - to generalise, any situation in which the heap can be modified (“mutated”) between the start and end of the marking phase will have difficulties.

So essentially the situation we’re dealing with looks something like this:

  1. Mark phase starts and does some work
  2. Application interrupts GC and modifies heap
  3. Mark phase completes the rest of its work and finishes

Obviously there can be multiple iterations of the modification/GC work cycle (and in the truly parallel case there is a constant stream of modifications), but all the interesting properties can be observed with a single interruption.
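
This interleaving can be sketched as a toy incremental mark loop (all names here are invented for illustration, not taken from any real collector):

```python
# Toy model: a heap of objects, each holding a list of children.
# The mark phase runs in bounded increments so the application
# ("mutator") can interleave heap modifications between them.

class Obj:
    def __init__(self, name):
        self.name = name
        self.children = []
        self.marked = False

def mark_increment(worklist, budget):
    """Mark up to `budget` objects, then yield back to the mutator."""
    while worklist and budget > 0:
        obj = worklist.pop()
        if not obj.marked:
            obj.marked = True
            worklist.extend(obj.children)
            budget -= 1
    return worklist  # non-empty means there is more marking to do

# 1. Mark phase starts and does some work
a, b = Obj("a"), Obj("b")
a.children.append(b)
roots = [a]
pending = mark_increment(list(roots), budget=1)   # marks only `a`

# 2. Application interrupts the GC and modifies the heap
b.children.append(Obj("c"))                       # mutation mid-cycle

# 3. Mark phase completes the rest of its work and finishes
mark_increment(pending, budget=100)
```

In this particular run the mutation is harmless, because it touched an object the GC hadn't scanned yet; the interesting failure modes come when that isn't true, as discussed below.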

What can the application do during that interruption, then? Well, there are a few operations the GC may care about:

  1. Creating new objects
  2. Loading pointers into registers (or, by extension, the stack)
  3. Storing pointers into memory from registers or the stack

Considering each of these in turn:

Creating new objects is easy to handle - if a new object is created during a GC cycle, then that object simply has to be marked as reachable upon creation. This is a conservative assumption (it is of course possible that a newly-created object could be discarded immediately), but that’s fine because it will get picked up the next time a GC occurs.
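
One way to implement this (a minimal sketch, with invented names) is to have the allocator consult a flag saying whether a mark phase is currently in progress:

```python
class GC:
    def __init__(self):
        self.marking = False      # true while a mark phase is running
        self.objects = []

    def allocate(self, name):
        # Conservatively mark objects born during a mark phase: even if
        # they die immediately, the *next* cycle will reclaim them.
        obj = {"name": name, "marked": self.marking}
        self.objects.append(obj)
        return obj

gc = GC()
before = gc.allocate("before")   # created outside a cycle: unmarked
gc.marking = True
during = gc.allocate("during")   # created mid-cycle: marked reachable
```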

Loading pointers is by itself generally uninteresting - by definition the pointer must already exist in the heap (for it to be loaded), and thus the object can be shown to be reachable without needing to know about the copy held in a register. However, as previously discussed it is necessary to know about pointers which are held in registers/stack at some point in the GC cycle (generally the end), as it is possible the heap location they were originally loaded from was subsequently modified and thus the GC never saw the pointer value that was loaded.

Storing pointers, therefore, is what causes most of the problems. Storing a pointer can essentially do one of three things:

  1. Replace a null with a new pointer
  2. Replace a pointer with a null
  3. Replace a pointer with another (different) pointer

Another way to think about these is in terms of references - storing a new pointer is in effect adding a reference to an object, whilst removing (or replacing) one gets rid of a reference. Since the GC process is trying to ascertain “does this object have any references from another live object”, we can essentially make two statements:

  1. When a pointer to an object is written, that object must be reachable.
  2. When a pointer to an object is removed (overwritten), that object may have become unreachable.

Both of these statements are fairly self-evident - in the first, since the definition of a reachable object is “the application has some way to get a pointer to it”, the fact that it was able to store that pointer somewhere means it must currently have that pointer. Thus the object is reachable.

The second statement is even simpler - all it says is that it is possible that the pointer that is being overwritten was the last valid pointer to the object in question, and if it was then that object is now unreachable (but we have no easy way to ascertain that, hence the need for complex GC algorithms!).

As an aside, reference-counting based GC algorithms work by keeping track of the number of references to each object and updating that count each time a pointer write occurs, allowing them to determine semi-accurately when the last reference to an object is removed under the second statement’s rule. Where this gets awkward, however, is that the “last reference” is not the same thing as the “last reachable reference”, and so it’s possible to have a loop of objects that refer to each other but are still unreachable because nothing outside the loop has a reference to any of them.
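
The cycle problem can be demonstrated with a minimal hand-rolled reference count (a toy, not any real RC implementation; all names are invented):

```python
# Minimal manual reference counting to show the cycle problem.

class RCObj:
    def __init__(self):
        self.refcount = 0
        self.ref = None          # one outgoing pointer, for simplicity

def store_ref(holder, target):
    """RC write barrier: adjust counts on every pointer write."""
    if holder.ref is not None:
        holder.ref.refcount -= 1     # statement 2: a reference removed
    holder.ref = target
    if target is not None:
        target.refcount += 1         # statement 1: a reference added

a, b = RCObj(), RCObj()
store_ref(a, b)      # a -> b
store_ref(b, a)      # b -> a: a two-object cycle

# Even when the application holds no other reference to either object,
# each still has refcount 1 (from the other), so a pure RC collector
# never reclaims either of them.
```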

At this point, you may be thinking that statement 1 is redundant, as in order to have obtained the pointer in the first place, the program must have a copy of it somewhere, and thus it should already be marked (or be about to be marked) as reachable. And indeed, at the moment the write occurs, this is true - the application must at a minimum have the pointer on the stack or in a register, and it is highly likely it also has it somewhere in heap memory too. However, we can’t guarantee that these copies will still exist by the time the GC gets around to looking for them - it may be that the copy which was written is the sole remaining one.

Scanning diagram 1

Another way to think about this is in terms of a time-line of events. Consider the case of an object which already has a pointer to it stored in the heap (the most common case), and assume that three objects A, B and C are laid out in such a way that the GC scans from left to right, as shown above.

Scanning diagram 2

Initially, the GC scans the first half of the heap. We assume for the purposes of this diagram that objects B and C have pointers to them from somewhere external (and thus are reachable) - thus the GC marks object C as reachable (denoted by the double outline), but it doesn’t mark object A, because the only pointer to A is held in object B, which it hasn’t got to yet.

Scanning diagram 3

Next, the application modifies the heap, loading the pointer to A from B, writing it into C, and then nulling the original pointer in B. This effectively “moves” the pointer from part of the heap that the GC hasn’t scanned yet to part that it has already scanned.

Scanning diagram 4

The GC now finishes scanning the remaining chunk of the heap, without spotting the pointer to A. At this point A is still reachable by the application (because of the pointer in C), but the GC thinks it is unreachable and thus garbage.
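
The whole sequence can be simulated in a few lines (a toy model of the diagrams, with invented names):

```python
# Simulate the diagrams above: a heap scanned left to right, with the
# application moving a pointer mid-scan. Each object is a named slot
# holding the set of objects it points to.

heap = {"C": set(), "B": {"A"}, "A": set()}
marked = set()

def scan(name):
    """Mark an object and everything it currently points to."""
    marked.add(name)
    for target in heap[name]:
        marked.add(target)

# First half of the scan: C (externally reachable) is marked, but A is
# not, because the only pointer to A lives in the not-yet-scanned B.
scan("C")

# The application interrupts: load B's pointer to A, store it into the
# already-scanned C, and null the original in B.
heap["C"].add("A")
heap["B"].discard("A")

# Second half of the scan: B is marked, but no longer points to A.
scan("B")

# A is still reachable via C, yet the GC never marked it.
garbage = set(heap) - marked
```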

The case mentioned previously where the pointer was loaded into a register and then overwritten looks basically identical to this, except that instead of the pointer being in A initially it is on the stack or in a register, which are checked at the end of the GC and thus logically on the far right of the diagram. This specific case can actually be resolved fairly simply by scanning the stack and registers both at the start and the end of the GC, but usually there’s no reason to do so because most solutions to the more common “pointer moved from one side of the heap to the other” case inherently solve this one too by catching the point where the pointer gets written to the heap.

What’s interesting here is that this pattern of a pointer “jumping” from a not-yet-scanned object to an already-scanned one is essentially the sole difficult problem to deal with in a concurrent mark/sweep GC - most others are either relatively easy to resolve in a simple, performant manner (such as the newly-created object case), or are annoying and fiddly but not actually particularly difficult on a theoretical level (such as ensuring atomic pointer writes or walking thread stacks safely).

There are a few reasonably sensible solutions to this. The first, and simplest by far, is to ensure that the GC sees a consistent snapshot of the heap from a single moment in time. This can be done trivially by stopping the application and literally memcpy()-ing the heap, although obviously in the vast majority of cases that isn’t actually much better than just stopping the application and running the GC in-place. However, if hardware support is available then a COW (copy-on-write) scheme can be used to take a fast snapshot of the application memory for inspection with only a minimal pause in execution.
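
On POSIX systems, one cheap way to get such a COW snapshot is fork(): the child process gets a copy-on-write view of the parent’s memory as it was at the moment of the fork, and can scan it while the parent keeps mutating. A sketch (the heap here is just a dict standing in for real application memory):

```python
import os

heap = {"x": 1}

# fork() gives the child a copy-on-write snapshot of the parent's
# memory: the parent can keep mutating while the child inspects the
# heap exactly as it was at the moment of the fork.
r, w = os.pipe()
pid = os.fork()
if pid == 0:                       # child: the "GC" scanning the snapshot
    os.close(r)
    os.write(w, str(heap["x"]).encode())
    os._exit(0)
else:                              # parent: the application, still running
    os.close(w)
    heap["x"] = 2                  # mutation after the snapshot
    snapshot_value = int(os.read(r, 16))
    os.waitpid(pid, 0)
```

The child reads the fork-time value even though the parent has already overwritten it; the only pause the application sees is the fork itself.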

A running theme, incidentally, is that very few “concurrent” garbage collection models are truly 100% concurrent - if nothing else, retrieving thread state (stack and registers) is very hard to do correctly on a live thread, so almost all implementations do actually stop the application threads at least briefly to do this, as well as often other small housekeeping tasks that are awkward to do in a fully-concurrent manner. They still tend to fall into the general category of “concurrent collectors”, though, because the pause in these cases is usually well below the level where it has any measurable impact on the application and doesn’t scale with the size or complexity of the heap.

A variant of this approach is to catch pointer writes specifically using a “write barrier” - a small fragment of code inserted at the point of each write (generally by the JIT). The write barrier can then either make a copy of the pointer being overwritten, or somehow mark the object in question to be re-scanned later. The former pattern corresponds broadly to the notion of “taking a snapshot of the heap at the start of the GC cycle” (because we’re preserving the old state), whilst the latter corresponds to “taking a snapshot of the heap at the end” (because we re-scan after the modification) - although in both cases there is some “leakage” because the GC may see both the new and old pointer values depending on when it scans the object. That, however, is fine - as discussed in the last article, the GC does not need to be precise as long as it is conservative, and so occasionally marking “extra” pointers as reachable does not pose a significant problem.
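
A software write barrier is conceptually just a few extra instructions wrapped around every pointer store. The sketch below shows both flavours side by side for illustration (a real collector would pick one; all names are invented):

```python
# A software write barrier, inserted (conceptually by the JIT) around
# every pointer store.

snapshot_log = []    # flavour 1: preserve the overwritten pointer
rescan_queue = []    # flavour 2: remember something to re-scan later

gc_marking = True    # set by the GC for the duration of the mark phase

def write_pointer(obj, field, new_target):
    old_target = obj.get(field)
    if gc_marking:
        if old_target is not None:
            snapshot_log.append(old_target)   # "snapshot at the start"
        if new_target is not None:
            rescan_queue.append(new_target)   # "snapshot at the end"
    obj[field] = new_target                   # the actual store

# The problematic store from the diagrams: writing A's pointer into
# the already-scanned C. Either log now lets the GC find A.
c = {}
a = {"payload": 42}
write_pointer(c, "next", a)
```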

It’s also possible to implement this approach using a hardware write barrier, which is more efficient but significantly more fiddly as you generally don’t want non-pointer writes to trigger the barrier (as this would be very, very inefficient) - thus you either need hardware that can denote the difference at an instruction level, or have to abuse address space trickery to separate the different types of write.

The “mark and re-scan” approach can be implemented in a couple of different ways. Notably, it can either mark the object that the write modified (i.e. the one containing the pointer), or it can mark the object the pointer points to. There are some minor efficiency differences between these (mainly that marking modified objects is likely to involve less work as writes are often clustered together), but the former approach has the significant downside that there is no upper bound to how many times the GC may need to re-scan - if the application is rapidly modifying one particular object, the GC can get trapped in a cycle of endlessly re-scanning it only to find that the application modifies it again. With the “marking the target object” approach, the GC only needs to actually scan the object if it wasn’t already scanned, and thus the number of possible re-scans is limited to the number of live objects in the heap (and even then in practice the chances of requiring more than one re-scan are exceptionally low, as only pointers which somehow escaped the initial scan can trigger a re-scan).

You may well wonder why it’s worth even mentioning the idea of marking the modified object if the worst-case scenario is sufficiently bad to outweigh any possible benefits in most cases. The reason for bringing it up is that some more advanced GC schemes - notably generational collectors - make use of a variant of this technique to keep track of pointers that cross heaps (with mitigations that allow them to avoid the worst-case scenario).

Other versions of this technique include marking not objects but pages in memory (and then identifying and marking/scanning the objects within that page later), or keeping a list of object pointers that have been touched by the application (again for later examination). These have different trade-offs in terms of balancing impact on application code performance against GC performance, but semantically all end up doing roughly the same thing.

So by adding write barriers, the GC can put itself in a position where it has enough information about pointer modifications caused by the application to safely accommodate them even when they occur in the middle of a scan. Whilst this does add some overhead, for games applications in particular it is generally preferable to take a constant hit on pointer writes than to have to absorb a long stall time in execution due to a non-concurrent garbage collection pass.

This is a big topic, and there are a number of areas I’ve skimmed over or omitted entirely for the sake of brevity (or, well, because I haven’t come across them myself!). Here are a few jumping off points for further details: