Beyond picking “two of three” from CAP

Arvind Jayaprakash on March 11, 2015

Citing the CAP theorem and the infamous “pick two of three” is a soundbite that we have all been subjected to as a consequence of working on distributed systems. While the original Brewer’s conjecture gets a lot of mention, people often overlook his essay titled “CAP Twelve Years Later”. The single most important aspect highlighted in that essay is that consistency, availability and partition tolerance are actually continuous ranges of values as opposed to being boolean in nature. The second aspect to pay attention to is that real-world systems do not have a static behaviour as far as the CAP traits are concerned. We shall explore this phenomenon by trying to arrive at the design of a simplified ad-budget management system.

The Reference Problem

Digital advertising typically has advertisers who have a finite amount of money that they wish to spend on any one advertising campaign at hand. We shall use the term campaign balance to indicate the sum total of unspent money associated with a campaign at any given point. The steps involved in serving an ad that concern themselves with the campaign balance are as follows (a sketch follows the list):

  1. Arrive at a list of all ad campaigns that match a given request for an ad based on the various targeting needs of the advertiser
  2. Eliminate all campaigns whose balance is zero
  3. Pick one campaign to be considered the winner
  4. Decrement the balance of the winning campaign
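
To make the four steps concrete, here is a minimal, single-threaded sketch. The Campaign shape, the targeting check and the highest-bid selection policy are all hypothetical simplifications made for illustration, not a description of any real ad server.

```python
from dataclasses import dataclass

@dataclass
class Campaign:
    id: str
    bid: float            # amount charged per served ad, simplified
    targeting: frozenset  # attributes the advertiser wants to match

    def matches(self, request_attrs: frozenset) -> bool:
        return self.targeting <= request_attrs

def serve_ad(request_attrs, campaigns, balances):
    # Step 1: match campaigns against the request's targeting needs
    matched = [c for c in campaigns if c.matches(request_attrs)]
    # Step 2: eliminate campaigns whose balance is zero (or below)
    funded = [c for c in matched if balances[c.id] > 0]
    if not funded:
        return None
    # Step 3: pick one campaign as the winner (highest bid, for illustration)
    winner = max(funded, key=lambda c: c.bid)
    # Step 4: decrement the winning campaign's balance
    balances[winner.id] -= winner.bid
    return winner
```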

The candidate for applying CAP is centered around the quality of the value that represents the campaign balance in a high volume, distributed system. An inaccuracy in the direction of overestimation would result in us displaying ads for which we would be unable to collect revenues from the advertisers, while an underestimation results in us believing that we have run out of advertising dollars and hence falsely concluding that our sales funnel has run dry.

A lot of choices shall be explored in the remainder of the document. We shall use a scale of 0-100 to describe the three CAP attributes to help us compare our options.

Concurrency and CA on a Single Node

We shall take a minor detour into the area of concurrency before we dive into distributed systems. Steps two through four need to be executed with a database isolation level of “serializable” if we are to achieve true functional correctness; we are merely borrowing the term from database literature for the purposes of this discussion. Let us assume that the database is an embedded DB inside the application itself, so as to remove the complexities introduced by a network partition. A wait on a lock is nothing but a tradeoff wherein we are giving up availability, albeit only for a short time, in favour of consistency. The level of contention depends a lot on concurrency and on other factors such as the number of campaigns matching a given request. The key observation is that starvation, livelocks and deadlocks are all specific types of unavailability.
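
As a rough illustration of that tradeoff, the sketch below serialises steps 2 through 4 behind a single in-process lock, reusing the hypothetical Campaign and balances shapes from the earlier sketch; the time spent waiting on the lock is precisely the availability being given up.

```python
import threading

_balance_lock = threading.Lock()  # serialises steps 2-4 on this node

def serve_ad_serializable(request_attrs, campaigns, balances):
    # Step 1 needs no lock: it only reads immutable targeting data.
    matched = [c for c in campaigns if c.matches(request_attrs)]
    # Steps 2-4 run as one unit; every other request waits here,
    # trading a little availability for consistency.
    with _balance_lock:
        funded = [c for c in matched if balances[c.id] > 0]
        if not funded:
            return None
        winner = max(funded, key=lambda c: c.bid)
        balances[winner.id] -= winner.bid
        return winner
```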

C/A/P Score: 100/90/0

We can pick availability over consistency if we were to imagine the system having the equivalent of a read committed isolation level. We do run the risk of a campaign’s balance becoming negative, but this can be countered by introducing probabilistic techniques: pick a threshold value based on expected concurrency and expected auction win rate, use it to arrive at a safe low watermark, and use that watermark in place of zero in step #2 (sketched below). Such an optimistic system would be almost consistent and always available when executing on a single node.
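
A minimal sketch of the low-watermark idea, assuming (purely for illustration) that the reserve is the amount the expected number of concurrently winning requests could spend before their decrements are observed:

```python
def safe_low_watermark(expected_concurrency, expected_win_rate, avg_price):
    # Money that in-flight requests could plausibly spend before we see
    # their decrements: concurrent requests * fraction that win * price each.
    return expected_concurrency * expected_win_rate * avg_price

def is_fundable(balance, expected_concurrency, expected_win_rate, avg_price):
    # Step #2 with the watermark substituted for zero: stop considering a
    # campaign once its balance dips into the reserved amount.
    return balance > safe_low_watermark(expected_concurrency,
                                        expected_win_rate, avg_price)

# Example: 50 concurrent requests, 10% win rate, $2 average price
# => watermark of $10, so a $12 balance is still fundable but $8 is not.
assert is_fundable(12.0, 50, 0.10, 2.0)
assert not is_fundable(8.0, 50, 0.10, 2.0)
```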

C/A/P Score: 80/100/0

Multiple Nodes and Shared State

Let us now explore the situation of multiple servers handling ad requests concurrently. The embedded DB model is no longer feasible as the values would never converge across the various nodes. We shall now explore ways to maximize the correctness of steps #2 and #4 in such an environment.

A Common State Store

We can emulate the single node setup by externalizing and consolidating the database to run on a specific node, with the applications running on the remaining nodes. The optimistic system (read committed semantics) can still be implemented once we calibrate the expected win rates to a cluster level. What does change in the multi-node setup, however, is the possibility of a partition. Losing one of the application nodes results in reduced system capacity and potentially a reduction in availability; the exact impact depends on both the cluster size and utilization levels. Losing the central database, however, causes a full outage.
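
A sketch of what step #4 might look like against such a shared store. The store methods used here (get_balance, compare_and_swap) are hypothetical stand-ins for whatever the external database actually exposes; the point is simply that every app node funnels its balance reads and decrements through the one state-holding node.

```python
class InsufficientBalance(Exception):
    pass

def decrement_balance(store, campaign_id, price, low_watermark=0.0):
    # Optimistic, read-committed-style decrement against a central store.
    while True:
        current = store.get_balance(campaign_id)   # network round trip
        if current <= low_watermark:               # step #2, cluster-wide view
            raise InsufficientBalance(campaign_id)
        # Step #4: apply the decrement only if no other app node changed the
        # value in between; otherwise loop and re-read.
        if store.compare_and_swap(campaign_id,
                                  expected=current,
                                  new=current - price):
            return current - price
```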

C/A Score (all systems up): 80/100

C/A Score (single app node failure on a fully utilised 10 app node cluster): 80/90

C/A Score (shared db failure): 100/0

Scalability

While scalability doesn’t figure directly in the CAP attributes, availability serves as a proxy for scalability. The reason for moving to multiple app servers is to increase scalability. The usage of a single node for the shared state results in an inability to scale beyond a certain point, as the read-write workload on the state-maintaining node keeps increasing. Factoring this into the puzzle dramatically alters the scores even if we were to ignore node failures and partitions. Hence, we shall count the inability to scale as a form of unavailability. To be fair, we would need to update all the scores arrived at until this point, but we shall not do so as it would just be an academic exercise.

Effective C/A/P Score: 80/10/100

Multiple Nodes with Caching

The availability score in the multi-node shared DB setup is pretty low, and hence the natural temptation is to embark upon the usage of stale values, fondly known as caching, along with its alter ego: the hard problem of cache invalidation.

The availability of the central store, or lack of it to be precise, strongly affects overall system availability. One workaround is to have the campaign budget levels cached in each application server. This value can be refreshed at some frequency, say once a minute. The unavailability of the shared DB during a given refresh cycle is ignored as a case of transient failure, and we hope that the value catches up in the next cycle. Each application server can also batch the summation of its pending updates and write them back at a similar periodicity. The effective balance at any given point is still calculated using the expected auction win rate in conjunction with the time of the last cache update. We shall now look at the sources of error.
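
The sketch below puts those pieces together for one campaign on one app server. The refresh interval, the extrapolation via an expected spend rate, and the store methods (get_balance, add_spend) are all assumptions made for illustration.

```python
import time

class CachedBudget:
    def __init__(self, store, campaign_id, expected_spend_rate,
                 refresh_interval=60.0):
        self.store = store                              # shared central store
        self.campaign_id = campaign_id
        self.expected_spend_rate = expected_spend_rate  # estimated cluster spend, money/sec
        self.refresh_interval = refresh_interval
        self.cached_balance = 0.0
        self.last_refresh = 0.0
        self.pending_spend = 0.0                        # local spend not yet written back

    def refresh(self):
        # Runs once a refresh cycle; a failed cycle is treated as transient.
        try:
            self.store.add_spend(self.campaign_id, self.pending_spend)  # batched write-back
            self.pending_spend = 0.0
            self.cached_balance = self.store.get_balance(self.campaign_id)
            self.last_refresh = time.time()
        except ConnectionError:
            pass  # hope the value catches up in the next cycle

    def record_win(self, price):
        self.pending_spend += price

    def effective_balance(self):
        # Extrapolate what the rest of the cluster has probably spent since
        # the last successful refresh.
        elapsed = time.time() - self.last_refresh
        estimated_remote_spend = self.expected_spend_rate * elapsed
        return self.cached_balance - self.pending_spend - estimated_remote_spend
```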

Estimation Errors

The dominant factor in the estimation of the remaining budget in the single node cases turns out to be the concurrency level. A low enough concurrency value trivializes the impact of the expected win rate of the ad campaign when it comes to arriving at the current estimate; this happens because the system converges fairly often. The time required to converge is substantially higher in the multi-node caching scenario, and hence the overall error rates tend to be higher.

Impact of a Disconnected Node on the Remaining Nodes

The scenario of having a single application node that is disconnected from the rest of the system plays out differently in the cache-free, shared state setting when compared to the cached setting. In the former, the isolated node becomes unavailable and does not impact the consistency of the surviving cluster. The cached mode is different, as an isolated node results in delayed updates of the campaign balance. A delay in updates not only affects the correctness of the remaining balance as seen by the surviving nodes in their subsequent refresh cycle but also affects the estimation of the expected auction win rates.

The Disconnected Node’s Quandary

A disconnected app server needs to consider shutting itself down at some point for a variety of reasons. Firstly, a partition on the writeback path for updates causes estimation errors for the other nodes, as mentioned earlier. Secondly, a partition on the cache refresh path results in a longer window of extrapolation, and this induces estimation errors on the disconnected node itself, as the estimation techniques tend to rely on trending data. The safest thing for a disconnected node to do is to shut itself down, i.e. stop processing any transactions.

However, the very reason for introducing this design of operating on cached values was to overcome the need for a perpetually available state storage mechanism. One needs to spend time figuring out when a disconnected node should give up, i.e. thresholds for repeated failures (sketched below). The key takeaway from this discussion is that systems are designed to become unavailable if the inconsistency level gets out of hand.
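
One possible shape for such a give-up policy, assuming a fixed threshold of consecutive failed refresh cycles; the threshold itself is a made-up knob that would in practice be derived from how much estimation error is tolerable.

```python
MAX_CONSECUTIVE_REFRESH_FAILURES = 5   # e.g. roughly 5 minutes at a 60s refresh cycle

class DisconnectedNodeGuard:
    def __init__(self, max_failures=MAX_CONSECUTIVE_REFRESH_FAILURES):
        self.max_failures = max_failures
        self.consecutive_failures = 0

    def on_refresh_success(self):
        self.consecutive_failures = 0

    def on_refresh_failure(self):
        self.consecutive_failures += 1

    def should_serve(self):
        # Deliberately choose unavailability once the extrapolation window,
        # and hence the potential inconsistency, has grown too large.
        return self.consecutive_failures < self.max_failures
```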

C/A/P Score: 70/80/80

Conclusion

Unavailability of a system can manifest in ways besides not being able to accept any requests or process any transactions. Load shedding due to insufficient capacity and slow progress due to high contention on shared data structures are also forms of degradation in availability. Likewise, consistency is also a spectrum. Inconsistent does not automatically mean garbage values; it could mean things such as bounded errors, converging systems (a.k.a. eventually consistent), linearizable updates, and so on. Large systems tend to be built as modules, and a different degree of CAP-ness can be chosen at different levels of abstraction. Trivializing such systems down to the “pick two of three” might not be all that useful.

Side Note #1: The Intersection of Technology and Business

The ATM example from Brewer’s article brings out an interesting aspect of how technical limitations sometimes find their way back into the business rules. An ATM might bound the maximum size of a withdrawal to, say, $200 despite your bank allowing you to withdraw as much as $10,000 a day. The expected workflow is that you withdraw in batches of $200. The reason for this design is that the fundamental risk of operating in a partitioned mode, and the potential liabilities that arise out of it, are modelled in the bank’s core business. In simple terms, someone somewhere knows how to write off a loss in case of an overdraft. This risk could be borne by the bank whose debit card you hold, the institution that operates the ATM in question, or perhaps by you, the end customer. This exists as fine print in some document. The only choice to make concerns who carries the risk. Businesses dealing with real world entities often need to factor in the consequences of the CAP theorem in what they do.

Side Note #2: What We Do at InMobi

The reference problem that was framed to illustrate the nuances of the CAP theorem is a highly simplified version of what we deal with here at InMobi. For starters, the counters cannot be decremented for CPM campaigns until the ad-render completion notification beacon is received. Secondly, this post used the database isolation levels meme to convey a complex set of expectations without having to invent newer terminology; traditional RDBMS systems are unable to cope with both the frequency of updates and the highly distributed nature of the deployments. We also glossed over the details of arriving at probabilistic means of speculating how much additional campaign budget would have been spent since the last known state as seen by the applications. This is a non-trivial problem in itself. Real world experience shows us that the strategies used for million dollar campaigns need to be very different from those used for ten dollar campaigns. We could continue talking more about these challenges, or you could instead come join us to be a part of solving the problem.