Tuesday, July 23, 2013

The Cachenado: Anatomy and avoidance

What is a Cachenado, you ask? Well, it's the name given (by one of my co-workers - RJ Johnson) to a scenario encountered on a recent project I was asked to help out with. Basically, it's what happens when a storm of requests comes into a cache from which the required data has been ejected; handled badly, it can lead to the application collapsing under the load.

The issue stems from cache-using code that looks something like this:

var value = GetValueFromCache(key);
if (value == null)
{
    // Cache miss - rebuild the value and store it for next time.
    value = DoWorkToGetValue();
    SaveValueInCache(key, value, timeout);
}
// use value

It's a common pattern - try to get the value from the cache; if it's not there, load it conventionally, then update the cache. Sounds simple, right? The problem occurs at a confluence of events - what happens if a large number of requests arrive just as the value in the cache expires, and DoWorkToGetValue() takes some time to return? You end up with a huge number of threads (requests) all executing DoWorkToGetValue redundantly, and even once the first one returns, stores the value in the cache, and lets new requests use that value, there are so many threads still running that it takes the server a while to recover and get back to normal throughput. If you have several items stored in the cache and they expire around the same time, this can lead to a massive number of backed-up requests all trying to fill in the cache - a Cachenado.

So, what can we do to avoid it? Luckily, there are several approaches - some better than others, but as is usually the case, there's a tradeoff between performance, simplicity, and ease of implementation.

The first, and easiest, implementation is to simply say that only one thread can be reloading the value for a given key at a time - if a thread needs the value, it's not in the cache, and another thread is already loading it, just block and wait for that thread to finish. While this makes everyone wait for a value when it expires from the cache (hurting page-load times for all of them), it avoids the possibility of a Cachenado, and is reasonably simple to implement - just keep a dictionary of key -> lock object (tied to the cache, which means the dictionary should probably be a singleton), and hold that lock while confirming the value isn't in the cache and rebuilding it.
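A minimal sketch of that idea, reusing the pseudo-functions from the snippet above (CacheLocks is a name I'm inventing here; in a real application it would live on a singleton alongside the cache):

using System.Collections.Concurrent;

// Singleton map of cache key -> lock object, tied to the cache itself.
static readonly ConcurrentDictionary<string, object> CacheLocks =
    new ConcurrentDictionary<string, object>();

var value = GetValueFromCache(key);
if (value == null)
{
    // Only one thread per key gets to rebuild; the rest block here.
    lock (CacheLocks.GetOrAdd(key, _ => new object()))
    {
        // Re-check: another thread may have rebuilt while we waited.
        value = GetValueFromCache(key);
        if (value == null)
        {
            value = DoWorkToGetValue();
            SaveValueInCache(key, value, timeout);
        }
    }
}
// use value

Note the second GetValueFromCache call inside the lock - without it, every thread that queued up on the lock would redo the expensive load as soon as the first one finished.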

Another approach is to aim a bit higher, add the concept of a stale cache item, and define a series of rules for how we intend to process cache requests:

  1. If there is no value present in the cache, the first thread to notice this loads the value and stores it in the cache. The other threads block, and get the value from the cache. (i.e. the cold-cache case)
  2. If there is a non-stale item in the cache, use it. (i.e. the warm-cache case)
  3. If there is a stale item in the cache, and no one is currently loading that item, load the item and store it in the cache.
  4. If there is a stale item in the cache, and another thread is currently loading that item, use the stale item.

One benefit to this is that you can define how long you're OK using 'stale' items, and fall back to case #1 (block until it's ready) if the item is too old to use at all. If you take this approach, there are at least a couple of possible implementation routes you can go down. One is to cache not the actual data, but an object containing the data and the time after which you consider it stale (storing it in the cache with an expiration based on how long you're OK using the stale value). Another is to cache the value normally, but register a callback for when it's removed from the cache; in that callback, add the value back to a 'stale' cache (or the original cache with a different key), and have your reads check both caches (or both cache keys).
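For the first route, the wrapper can be as simple as this (CacheEntry is a name I'm making up for illustration; we'll reuse it below):

using System;

// Wraps a cached value together with the moment it should be treated as
// stale. Stored in the cache with an expiration covering the full fresh +
// stale window, so the cache's own expiry handles the too-old-to-use case.
public class CacheEntry<T>
{
    public T Value { get; set; }
    public DateTimeOffset StaleAfter { get; set; }

    public bool IsStale
    {
        get { return DateTimeOffset.Now > StaleAfter; }
    }
}

With that in place, rule #2 is just a check of IsStale, and the entry disappearing from the cache entirely puts us back in the cold-cache case (rule #1).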

Both of the approaches so far can be retrofitted to a system at risk of a Cachenado with just a few tweaks to the places using the cache - for example, you could change the call sites to something like this:

var value = GetValueFromCache(key, () => DoWorkToGetValue(), timeout);

// use value

By supplying a delegate to call to load the data, and the timeout to use when storing it in the cache, we can move the details of the implementation of the above algorithm down into the caching module, where we can more easily iterate through several options.
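As a rough sketch of what the module side might look like for the stale-item approach - I'm assuming .NET's System.Runtime.Caching.MemoryCache as the backing store, splitting the single timeout into separate fresh and stale windows, and reusing the hypothetical CacheEntry from above; none of these names come from the actual project:

using System;
using System.Collections.Concurrent;
using System.Runtime.Caching;

public static class StaleTolerantCache
{
    // Per-key lock objects, as in the first approach.
    private static readonly ConcurrentDictionary<string, object> Locks =
        new ConcurrentDictionary<string, object>();

    // Keys that some thread is currently (re)loading.
    private static readonly ConcurrentDictionary<string, bool> Loading =
        new ConcurrentDictionary<string, bool>();

    public static T GetValueFromCache<T>(
        string key, Func<T> loader, TimeSpan freshFor, TimeSpan staleFor)
    {
        var entry = MemoryCache.Default.Get(key) as CacheEntry<T>;

        // Rule #2: a non-stale item in the cache - just use it.
        if (entry != null && !entry.IsStale)
            return entry.Value;

        // Try to become the single loader for this key.
        bool weAreLoading = Loading.TryAdd(key, true);

        // Rule #4: stale item, and another thread is already rebuilding it.
        if (!weAreLoading && entry != null)
            return entry.Value;

        // Rules #1 and #3: rebuild (or block until whoever beat us finishes).
        try
        {
            lock (Locks.GetOrAdd(key, _ => new object()))
            {
                // Re-check: another thread may have finished the rebuild.
                entry = MemoryCache.Default.Get(key) as CacheEntry<T>;
                if (entry == null || entry.IsStale)
                {
                    entry = new CacheEntry<T>
                    {
                        Value = loader(),
                        StaleAfter = DateTimeOffset.Now.Add(freshFor)
                    };
                    MemoryCache.Default.Set(
                        key, entry, DateTimeOffset.Now.Add(freshFor + staleFor));
                }
                return entry.Value;
            }
        }
        finally
        {
            if (weAreLoading)
            {
                bool ignored;
                Loading.TryRemove(key, out ignored);
            }
        }
    }
}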

So, with the second approach, we can avoid the Cachenado while still delivering good performance to most of our users (except for the one who first encounters the stale cache item and has to wait out the rebuild), and it even gives us some flexibility in terms of how long we allow stale values to be used.

The final optimization we may be able to make here would be to not even make the first user who encounters the stale item wait for it to be rebuilt. It sounds easy enough, right? Just create a task to load the value and update the cache, and fire it off. Here, though, we meet a potential problem when refactoring existing code. To this point, we'd continued to run DoWorkToGetValue in the same context it was originally designed for - on the thread processing a request from the user. Maybe this code uses HttpContext.Current, or SPContext.Current, or something else along those lines that isn't accessible from a background thread. Or maybe it accesses values from session. Or any number of other things that may be problematic from a background thread. In a new application, or an application with a limited number of cache-loading paths to check, it may be feasible to do the work to move the loads to a background thread - if you do, just make sure you also remember to do the empty-cache loads on the background thread as well. Otherwise, you'll only notice threading issues during the refresh calls, which may be few and far between for a long-term cache. The only thing worse than a threading issue is a threading issue triggered by an action that happens at random times (a cache expiration).
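If the loader is safe to run off the request thread, the rule #3 handling in the earlier sketch might become something like this (again just a sketch - Task.Run needs .NET 4.5 and a using System.Threading.Tasks; this check would sit before the try block, with the cold-cache case still falling through to the blocking load):

// Background-refresh variant of rule #3: if we have a stale value and we
// won the race to be the loader, serve the stale copy immediately and
// rebuild on a thread-pool thread. The loader must not touch per-request
// state (HttpContext.Current, session, etc.) - it no longer runs on the
// request thread.
if (weAreLoading && entry != null)
{
    var stale = entry;
    Task.Run(() =>
    {
        try
        {
            var fresh = new CacheEntry<T>
            {
                Value = loader(),
                StaleAfter = DateTimeOffset.Now.Add(freshFor)
            };
            MemoryCache.Default.Set(
                key, fresh, DateTimeOffset.Now.Add(freshFor + staleFor));
        }
        finally
        {
            bool ignored;
            Loading.TryRemove(key, out ignored);
        }
    });
    return stale.Value;
}

And per the warning above, if you go this route it's worth running the cold-cache loads through Task.Run as well (blocking on the result), so any threading problem in the loader shows up on every load rather than only during the occasional refresh.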

On this particular project, we ended up going with solution #2, as this is an in-production application and we didn't want to risk a regression by introducing background threads right now. Switching to approach #3 (background loading) is something we may consider in a future release, when more testing can be devoted to it. We may not even bother with it, though - no cache item takes more than a few seconds to load, so it was only the confluence of many requests for an item arriving just after it expired, plus the number of items that expired right around the same time, that led to the Cachenado.