3 Things You Should Know About Data Consistency, Distribution, and Replication with Apache Cassandra

DataStax
Building Real-World, Real-Time AI
7 min read · Aug 26, 2021


By Aaron Ploetz

In this post, we’ll show how data distribution works with Apache Cassandra. We’ll also share details about common issues and how to remedy them.

Apache Cassandra® is a highly available database management system. Its simple, scalable design makes it well suited to handling vast amounts of data, which is why it's used to manage databases at companies like Apple, Spotify, Facebook, and Netflix, as well as several stock exchanges. These enterprises all have their own staff of developers and engineers who can devote time to learning its somewhat complex setup and management.

Cassandra is also beneficial to smaller ventures. But when you’re just getting used to developing cloud-based apps with Cassandra, you’ll have more than enough to focus on. The last thing you need is to be slowed down by nagging questions, such as:

  • “Which node is the data on?”
  • “Is the data replicated appropriately?”
  • “How can we fix replication?”

When a query doesn't return the requested data, these questions can haunt a database engineer. But your first experiences with Cassandra don't have to be like that. To help you avoid that scenario, we'll dive into the basics and show you how data distribution and replication work with Cassandra.

Data distribution and replication in Cassandra

First, we’ll start with a simple six-node cluster:

$ nodetool status
Datacenter: MovingPictures
==========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load        Tokens  Owns   Host ID       Rack
UN  10.21.12.1  172.53 KiB  1       9.5%   cded0a67-...  R40
UN  10.21.12.2  158.39 KiB  1       48.6%  8f76afc6-...  R40
UN  10.21.12.3  93.25 KiB   1       33.7%  424eba3e-...  R40
UN  10.21.12.4  85.31 KiB   1       48.1%  60921b4f-...  R40
UN  10.21.12.5  66.01 KiB   1       42.4%  37ba560a-...  R40
UN  10.21.12.6  66.01 KiB   1       17.8%  d68c5adb-...  R40

We'll build our nodes with a single token each (num_tokens: 1), which makes it easy to see exactly where the data replicas are stored.

We recommend using a num_tokens value from 4 to 16 with Cassandra versions 3 and 4. This is reflected in Cassandra 4.0’s default setting of 16. The token allocation algorithm in Cassandra lets you distribute data more evenly with a small number of token ranges per node. If you set the number too high, you risk issues with data streaming events, which can ultimately put the cluster in an inconsistent state.
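For reference, these are the token-related settings in cassandra.yaml. The values shown mirror the Cassandra 4.0 defaults; treat this as a sketch and verify against your own configuration:

# cassandra.yaml (Cassandra 4.0)
# Number of token ranges each node owns; this tutorial uses 1 for clarity.
num_tokens: 16

# Use the token allocation algorithm to balance ranges for the local RF.
allocate_tokens_for_local_replication_factor: 3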

Given the name of our logical data center, our keyspace definition will look something like this:

> CREATE KEYSPACE spaceflight_data
  WITH replication = {'class': 'NetworkTopologyStrategy', 'MovingPictures': 3};

This will allow the keyspace to make use of our cluster topology. It will also enforce a data replication factor (RF) of 3 for our table data. We’ll use this keyspace to create a simple table:

> USE spaceflight_data;
> CREATE TABLE astronauts_by_group (
    name TEXT,
    group INT,
    flights INT,
    spaceflight_hours INT,
    PRIMARY KEY ((group), spaceflight_hours, name))
  WITH CLUSTERING ORDER BY (spaceflight_hours DESC, name ASC);
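To make the examples concrete, here's how a few of the group 16 rows referenced below could be inserted (the values are taken from the query results later in this post; the full NASA dataset would normally be loaded in bulk):

> INSERT INTO astronauts_by_group (group, name, flights, spaceflight_hours)
  VALUES (16, 'Peggy Whitson', 3, 11698);
> INSERT INTO astronauts_by_group (group, name, flights, spaceflight_hours)
  VALUES (16, 'Daniel Burbank', 3, 4512);
> INSERT INTO astronauts_by_group (group, name, flights, spaceflight_hours)
  VALUES (16, 'Laurel Clark', 1, 382);
> INSERT INTO astronauts_by_group (group, name, flights, spaceflight_hours)
  VALUES (16, 'Duane Carey', 1, 262);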

For this tutorial, we’ll use a subset of NASA astronaut data. After loading the data, let’s query this table for NASA group 16.

> SELECT group, name, flights, spaceflight_hours AS hours
  FROM astronauts_by_group WHERE group=16;

 group | name           | flights | hours
-------+----------------+---------+-------
    16 | Peggy Whitson  |       3 | 11698
    16 | Daniel Burbank |       3 |  4512
    16 | Laurel Clark   |       1 |   382
    16 | Duane Carey    |       1 |   262

So where (in our cluster) is the data for NASA group 16? Now that we have a table with data, we can adjust the above query just a little to find out:

> SELECT token(group), group, name, flights, spaceflight_hours AS hours
  FROM astronauts_by_group WHERE group=16;

 system.token(group)  | group | name           | flights | hours
----------------------+-------+----------------+---------+-------
 -5477287129830487822 |    16 | Peggy Whitson  |       3 | 11698
 -5477287129830487822 |    16 | Daniel Burbank |       3 |  4512
 -5477287129830487822 |    16 | Laurel Clark   |       1 |   382
 -5477287129830487822 |    16 | Duane Carey    |       1 |   262

The partition key value of group 16 has a (Murmur3) hashed token value of -5477287129830487822. This means that the data for group 16 was written to the node responsible for the token range containing this token. We can use the nodetool ring command to view token range assignments by node:

$ nodetool ring
Datacenter: MovingPictures
==========
Address     Rack  Status  State   Load        Owns    Token
                                                      5961443032057292144
10.21.12.3  R40   Up      Normal  100.04 KiB  74.06%  -7692208291930584689
10.21.12.5  R40   Up      Normal  90.81 KiB   50.05%  -4668378901566903199
10.21.12.6  R40   Up      Normal  109.54 KiB  43.77%  -4411907325378154170
10.21.12.1  R40   Up      Normal  85.85 KiB   25.94%  -2906767109774091443
10.21.12.2  R40   Up      Normal  152.21 KiB  49.94%  4544838327361920696
10.21.12.4  R40   Up      Normal  106.25 KiB  56.23%  5961443032057292144

In our cluster, we can see that Node Five (10.21.12.5) is responsible for the token range of -7692208291930584688 to -4668378901566903199, so the data was written there. This information can be verified with the nodetool getendpoints command with our keyspace, table, and partition key value (16) as arguments:

$ nodetool getendpoints spaceflight_data astronauts_by_group 16
10.21.12.5
10.21.12.6
10.21.12.1

Replication factor with Cassandra

At this stage, you might be wondering how the data also ended up on Node Six (10.21.12.6) and Node One (10.21.12.1). This is where the replication factor we set in the keyspace definition above comes into play. In fact, the very last part of that definition is key here:

'MovingPictures':3

That told our cluster to ensure there should be a total of three copies of each row present in the cluster. But why should they be on Node Six and Node One? When processing the cluster topology, Node Five knows which nodes own the next sequential token ranges in the ring. Those nodes are Node Six and Node One, so they're designated to hold the additional replicas of the row.

Note that other factors also influence replica placement, such as multiple data centers, rack assignments, and additional tokens per node. None of these come into play in this deliberately simple example.
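For example, if the cluster later grew to span a second data center, the replication map would simply gain another entry ('dc2' below is a hypothetical data center name used for illustration):

> ALTER KEYSPACE spaceflight_data
  WITH replication = {'class': 'NetworkTopologyStrategy',
                      'MovingPictures': 3,
                      'dc2': 3};  -- 'dc2' is a hypothetical second data center

After a change like this, the nodes in the new data center need to be populated with nodetool rebuild, and existing data should be repaired.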

Data replication failures and out-of-sync data can occur for several reasons. Sometimes nodes go down. Entire networks can go down, or drop some of the packets in flight.

Nothing can replace running a predictable, scheduled repair process for data we really care about keeping consistent. We strongly recommend using something like Cassandra Reaper to handle this. Cassandra Reaper is a simple and user-friendly web interface designed to let you repair hundreds of clusters at once. However, sometimes the repair process can be problematic or take longer to run than expected.
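If you aren't running Reaper, a scheduled or one-off repair can also be run directly with nodetool. The -pr flag restricts the repair to the primary token ranges owned by the node it runs on, so it should be executed on each node in turn:

$ nodetool repair -pr spaceflight_data
$ nodetool repair -pr spaceflight_data astronauts_by_group   # or repair just the one table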

Swift data repair after a node crash

If you're in a hurry to repair data for specific keys, you can do that directly. Let's say, for instance, that a single node crashes and is hard-down for several hours. During that outage, the cluster continues to serve read and write requests. However, once the outage exceeds the max_hint_window_in_ms setting, the chance of missed writes on the downed node is very high.
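For reference, the hint window is a cassandra.yaml setting; the value below is the shipped default of three hours, but check your own configuration:

# cassandra.yaml
# How long a coordinator will store hinted writes for an unreachable node.
max_hint_window_in_ms: 10800000   # 3 hours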

Queries for new data written for NASA group 3 work just fine during the outage. But once the crashed node comes back (without a repair being run), queries can be served by the freshly returned replica, which never received those writes, and come back empty.

> SELECT * FROM astronauts_by_group WHERE group=3;

 group | spaceflight_hours | name | flights
-------+-------------------+------+---------

(0 rows)

As a quick fix, this can be resolved by running the query at a consistency level of ALL:

> CONSISTENCY ALL
Consistency level set to ALL.

> SELECT * FROM astronauts_by_group WHERE group=3;

 group | spaceflight_hours | name            | flights
-------+-------------------+-----------------+---------
     3 |               566 | Gene Cernan     |       3
     3 |               289 | Buzz Aldrin     |       2
     3 |               266 | Michael Collins |       2

(3 rows)

Now we get the correct data for the group. But why does this work? Simply put, querying at consistency ALL has a useful side effect: it invokes a read repair 100% of the time.

In this case, the read repair process noticed that the recovered replica was missing the most recent writes, so the data was streamed from one of the up-to-date replicas to repair it. This fixes the replica permanently, and future queries for this data (at lower consistency levels) will succeed.
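To confirm the fix is permanent, you can drop back down to a lower consistency level and re-run the same query; assuming the read repair completed, the rows now come back from any replica:

> CONSISTENCY LOCAL_ONE
Consistency level set to LOCAL_ONE.

> SELECT * FROM astronauts_by_group WHERE group=3;

 group | spaceflight_hours | name            | flights
-------+-------------------+-----------------+---------
     3 |               566 | Gene Cernan     |       3
     3 |               289 | Buzz Aldrin     |       2
     3 |               266 | Michael Collins |       2

(3 rows)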

Key takeaways

  • Tools like nodetool getendpoints and the token() function can help reveal which nodes are responsible for specific data replicas.
  • Crashed nodes down longer than the max_hint_window_in_ms window should have a repair run once they are back up.
  • Forcing a read repair with queries at a consistency level ALL can be a quick way to remedy inconsistent data on specific keys.

There are many nuances with Cassandra and its data persistence mechanisms. Understanding those mechanisms, and how they behave under certain conditions, can make it easier to troubleshoot and remedy data consistency issues.

Get up to speed faster with tutorials on our DataStax Developers YouTube channel, and subscribe to our event alerts to get notified about new developer workshops. For exclusive posts on all things data, follow DataStax on Medium.

Resources

  1. Consistency when using Apache Cassandra (YouTube video)
  2. 14 Things To Do When Setting Up a New Cassandra Cluster
  3. Join our Discord: Fellowship of the (Cassandra) Rings
  4. Astra DB — Managed Apache Cassandra as a Service
  5. Getting started with GraphQL and Apache Cassandra
  6. Stargate APIs | GraphQL, REST, Document
  7. DataStax Academy
  8. DataStax Certifications
  9. DataStax Workshops
