CAP Theorem: The Heart of Distributed Systems

EP02 | System Design | HLD Series | #01

Sep 17, 2024

In the world of distributed systems, one fundamental principle reigns supreme: the CAP Theorem. Eric Brewer proposed it as a conjecture in 2000, Seth Gilbert and Nancy Lynch proved it formally in 2002, and it has since become a cornerstone for designing and understanding distributed databases and systems.

In this article, we'll break down the CAP Theorem, explore the three pillars it stands on, discuss the possible combinations, and review real-world examples to illustrate each case.

What is the CAP Theorem?

The CAP Theorem, also known as Brewer’s Theorem, asserts that any distributed data store can only provide two out of the following three guarantees:

Consistency (C): Every read receives the most recent write or an error.
Availability (A): Every request (read or write) receives a response, without a guarantee that it contains the most recent data.
Partition Tolerance (P): The system continues to operate despite arbitrary message loss or failure of part of the system.

According to the theorem, when a partition (network failure) occurs, a system must choose one of two options:

Remain consistent but sacrifice availability, or
Stay available at the cost of consistency.
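To make this choice concrete, here is a minimal Python sketch of a single replica that has been cut off from its peers. The `Replica` class, the `"CP"` and `"AP"` mode labels, and the `Unavailable` exception are all hypothetical names invented for this illustration; real databases make this decision through replication protocols, not a simple flag.

```python
from dataclasses import dataclass
from typing import Optional


class Unavailable(Exception):
    """Raised when a replica refuses to answer rather than risk a stale read."""


@dataclass
class Replica:
    mode: str                    # "CP" or "AP" (hypothetical labels for this sketch)
    value: Optional[str] = None  # last value this replica has seen
    partitioned: bool = False    # True when cut off from the other replicas

    def read(self) -> Optional[str]:
        if self.partitioned and self.mode == "CP":
            # Cannot confirm this is the latest write, so return an error.
            raise Unavailable("cannot confirm latest write during partition")
        # AP, or no partition: answer with whatever is stored locally.
        return self.value


def demo() -> dict:
    results = {}
    for mode in ("CP", "AP"):
        replica = Replica(mode=mode, value="balance=100")
        replica.partitioned = True  # newer writes elsewhere can no longer reach us
        try:
            results[mode] = replica.read()
        except Unavailable:
            results[mode] = "error"
    return results


print(demo())  # {'CP': 'error', 'AP': 'balance=100'}
```

The CP replica gives up availability by returning an error, while the AP replica answers with data that may be stale.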

Let’s take a deeper dive into what each of these guarantees means and how different systems handle the trade-offs.

The Three Pillars of CAP

1. Consistency

Consistency guarantees that every read will return the most recent write. This means that whenever you query the system, it reflects the same state, regardless of which node handles the query.

For example, in a banking system, if a user makes a deposit on one node, the update should be visible immediately when queried from another node.

2. Availability

Availability ensures that the system responds to every request, even if the response is not the most recent. The system will always return some data, rather than timing out or failing to respond.

For example, even if part of a system is down, a user should still be able to interact with it, perhaps getting slightly stale data in the meantime.

3. Partition Tolerance

Partition tolerance means the system continues to operate despite the failure or loss of communication between different parts of the system. In large distributed systems, network failures are inevitable, and partition tolerance ensures that the system can still function.

This is especially important for global applications where servers are distributed across multiple data centers.

The CAP Trade-offs: Combinations in Practice

Since a distributed system can only provide two out of the three guarantees, let's explore the three possible combinations:

1. CP (Consistency + Partition Tolerance)

In CP systems, consistency is favored over availability during network partitions. The system will prioritize returning the most recent data and may sacrifice availability by refusing to respond if consistency cannot be guaranteed.

Example: HBase, MongoDB (when configured with strong consistency), and, in some classifications, Redis
HBase and MongoDB can be configured to favor consistency. If a network partition occurs, the system may become unavailable to ensure that no stale data is served. For instance, in a financial system, it's often more critical to have accurate, up-to-date data, even if this means the system is temporarily unavailable during a partition.
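One common way to get this behavior is to require a majority of replicas to acknowledge each write. The sketch below is a simplified illustration under that assumption, not how HBase or MongoDB are actually implemented; `Node`, `quorum_write`, and `QuorumUnavailable` are hypothetical names.

```python
from typing import Dict, List


class QuorumUnavailable(Exception):
    """Raised when too few replicas are reachable to accept a write safely."""


class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.reachable = True           # set to False to simulate a partition
        self.data: Dict[str, str] = {}


def quorum_write(nodes: List[Node], key: str, value: str) -> int:
    """Apply the write only if a strict majority of nodes can be reached."""
    reachable = [n for n in nodes if n.reachable]
    majority = len(nodes) // 2 + 1
    if len(reachable) < majority:
        # CP choice: refuse the request instead of risking divergent state.
        raise QuorumUnavailable(
            f"reached {len(reachable)} of {len(nodes)} nodes, need {majority}"
        )
    for node in reachable:
        node.data[key] = value
    return len(reachable)


cluster = [Node("a"), Node("b"), Node("c")]

cluster[2].reachable = False
acks = quorum_write(cluster, "balance", "100")  # 2 of 3 reachable: accepted

cluster[1].reachable = False
try:
    quorum_write(cluster, "balance", "250")     # 1 of 3 reachable: refused
    refused = False
except QuorumUnavailable:
    refused = True

print(acks, refused, cluster[0].data)  # 2 True {'balance': '100'}
```

The second write is rejected outright, so the last accepted value is never overwritten by a side of the cluster that cannot reach a majority.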

2. AP (Availability + Partition Tolerance)

In AP systems, availability is maintained even at the cost of consistency. This means the system will always respond to requests, but the data it returns might not be up to date during a network partition.

Example: Cassandra, CouchDB, DynamoDB
AP systems like Cassandra prioritize availability. During a partition, Cassandra will continue to operate and serve data even if some nodes are unable to communicate with each other. However, the data may not always be consistent across nodes until the partition is resolved.
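The following Python sketch illustrates the idea: two replicas keep accepting writes while partitioned, return different answers for a while, and converge once they can talk again. The reconciliation here uses a simple last-write-wins rule on timestamps, which is an assumption chosen to keep the example short; `ApReplica` and `reconcile` are hypothetical names, and production systems use considerably more machinery.

```python
from typing import Dict, Optional, Tuple

Versioned = Tuple[int, str]  # (timestamp, value)


class ApReplica:
    def __init__(self) -> None:
        self.store: Dict[str, Versioned] = {}

    def write(self, key: str, value: str, ts: int) -> None:
        # Always accept the write locally; keep the newest version per key.
        current = self.store.get(key)
        if current is None or ts > current[0]:
            self.store[key] = (ts, value)

    def read(self, key: str) -> Optional[str]:
        entry = self.store.get(key)
        return entry[1] if entry else None


def reconcile(a: ApReplica, b: ApReplica) -> None:
    """After the partition heals, both replicas adopt the newest version."""
    for key in set(a.store) | set(b.store):
        candidates = [r.store[key] for r in (a, b) if key in r.store]
        winner = max(candidates)  # highest timestamp wins; value breaks ties
        a.store[key] = winner
        b.store[key] = winner


left, right = ApReplica(), ApReplica()

# During the partition each side accepts writes independently.
left.write("cart", "book", ts=1)
right.write("cart", "book,pen", ts=2)
during = (left.read("cart"), right.read("cart"))

# Partition heals: replicas exchange state and converge.
reconcile(left, right)
after = (left.read("cart"), right.read("cart"))

print(during)  # ('book', 'book,pen')
print(after)   # ('book,pen', 'book,pen')
```

Both replicas stayed available throughout, at the price of a window in which clients could read different values depending on which replica they reached.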

3. CA (Consistency + Availability)

CA systems prioritize consistency and availability but sacrifice partition tolerance. These systems are typically found in single-node environments or tightly coupled systems where network partitions are less of a concern.

Example: Traditional RDBMS (Single-node databases)
Databases like MySQL and PostgreSQL in a single-node setup provide both consistency and availability. However, when they are scaled to a distributed setup across multiple data centers, they cannot handle network partitions without sacrificing one of the other two properties.

Real-World Examples of the CAP Theorem in Action

Let’s walk through some real-world use cases where distributed systems have to make these trade-offs:

1. Amazon DynamoDB (AP)

Amazon DynamoDB, a NoSQL database, prioritizes availability and partition tolerance. With its default eventually consistent reads, it keeps responding during network partitions, but the data it returns might not reflect the latest changes (strongly consistent reads are available as a per-request option). This is useful for services like product catalogs, where slightly stale data is not a critical issue.

2. Google Spanner (CP)

Google Spanner is an interesting example as it strives for global consistency and partition tolerance (CP). Spanner uses advanced techniques like TrueTime, which combines GPS and atomic clocks, to ensure that even across globally distributed nodes, consistency can be maintained. However, in the event of partitions, there is a risk that the system could become unavailable to preserve this consistency.

3. Cassandra (AP)

Cassandra is a highly available and partition-tolerant database that powers large-scale applications at companies like Netflix. During network partitions, Cassandra stays available by continuing to accept reads and writes. Replicas may temporarily disagree, and they converge once the partition heals, a model known as eventual consistency.

4. MongoDB (CP or AP)

MongoDB can be configured to prioritize either consistency (CP) or availability (AP). For example, in environments where consistency is critical, the side of a partition that cannot reach a majority of the replica set stops accepting writes to avoid inconsistency (CP). On the other hand, if availability is more important, reads can be served from secondary nodes, which keep responding with potentially stale data (AP).

Beyond CAP

As distributed systems have grown in complexity, there’s been a movement to transcend the rigid trade-offs defined by CAP. Below are some of the techniques and concepts that address the challenges of modern distributed systems.

1. Eventual Consistency and Tunable Consistency

2. PACELC

3. CRDTs (Conflict-free Replicated Data Types)

4. Multi-leader Replication and Consensus Algorithms

We’ll cover each of these concepts in detail in upcoming articles, looking at what lies beyond CAP and how real-world distributed systems are evolving.

Conclusion: Choosing the Right Trade-off

The CAP Theorem reminds us that in distributed systems, trade-offs are inevitable. The choice between consistency, availability, and partition tolerance depends entirely on the system's requirements and the nature of the application.

Consistency (CP) is crucial for systems where data integrity is more important than system uptime.
Availability (AP) is favored in applications where uptime is critical, even if the data is slightly outdated.
Partition Tolerance is essential for any distributed system since network failures are a given in large-scale systems.

Understanding the CAP Theorem and its implications will help you make informed decisions when designing and selecting distributed systems, whether you're building a real-time financial system, a globally available application, or a highly scalable NoSQL database.

DevArch Digest
