Hinted Handoff and GC Grace Demystified

There are many knobs to turn in Apache Cassandra. Finding the right value for all of them is hard. Yet even with all values finely tuned, unexpected things happen. In this post we will see how gc_grace_seconds can break the promises of Hinted Handoff.

The gc_grace_seconds table property defines how long Cassandra keeps tombstones around. Tombstones are special values Cassandra writes instead of the actual data whenever the data is deleted or its TTL expires. Alain covered tombstones in detail in a previous post, About Deletes and Tombstones in Cassandra.

Hinted Handoff is one of the data consistency mechanisms built into Cassandra. When a node in the cluster goes down, the remaining nodes save mutations (writes) for this node locally as hints. They keep doing so for the period given by the max_hint_window_ms setting in cassandra.yaml. Once this window expires, nodes stop saving hints.

Prior to version 3.0, Cassandra stored hints in the system.hints table. In version 3.0, hint storage was redesigned, and Cassandra now stores hints in flat files on disk. Regardless of the version, hints have an expiration time. There is a slight difference in how Cassandra sets the hint expiration time depending on the version:

  • Up until 2.2, Cassandra picks the smaller of gc_grace_seconds (a table property) and cassandra.maxHintTTL (a JVM runtime argument). This value then becomes the TTL of the row in the system.hints table.
  • In 3.0 and later, Cassandra uses only the gc_grace_seconds for hint expiration.
    • This value goes into the hint file together with the hint.
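The two rules above can be sketched in a few lines of Python. This is only an illustrative model, not Cassandra's actual code: the function name and the hints_in_table flag (standing in for "version 2.2 or earlier") are made up for this example.

```python
from typing import Optional


def hint_expiration_seconds(gc_grace_seconds: int,
                            max_hint_ttl_seconds: Optional[int] = None,
                            hints_in_table: bool = False) -> int:
    """Model how long a hint stays valid, in seconds.

    hints_in_table=True  models versions up to 2.2 (hints in system.hints).
    hints_in_table=False models 3.0 and later (hints in flat files).
    Both the name and the flag are hypothetical helpers for this post.
    """
    if hints_in_table and max_hint_ttl_seconds is not None:
        # Up until 2.2: the smaller of the table's gc_grace_seconds
        # and the cassandra.maxHintTTL JVM argument.
        return min(gc_grace_seconds, max_hint_ttl_seconds)
    # 3.0 and later: only gc_grace_seconds matters.
    return gc_grace_seconds


# A table with the default 10 days of GC grace and a 1 hour maxHintTTL.
print(hint_expiration_seconds(864000, 3600, hints_in_table=True))  # 3600
print(hint_expiration_seconds(864000, 3600))                       # 864000
```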

When a node is about to replay hints, it sends out only those that have not yet expired. Moreover, the hint data itself must still be live, meaning the TTL of the data, if used, has not yet expired.
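As a rough sketch, the replay condition boils down to two checks. The function below is a simplified model with a hypothetical name; it assumes the hint is created at the same moment as the write it carries, so a single age applies to both the hint and its data.

```python
from typing import Optional


def is_replayable(age_seconds: float,
                  gc_grace_seconds: int,
                  data_ttl_seconds: Optional[int] = None) -> bool:
    """Return True if a hint of the given age would still be sent.

    Simplified model: the hint and the mutation inside it are assumed
    to have been created at the same time.
    """
    # The hint itself must not have expired.
    hint_live = age_seconds < gc_grace_seconds
    # The data inside the hint must still be live, if it has a TTL at all.
    data_live = data_ttl_seconds is None or age_seconds < data_ttl_seconds
    return hint_live and data_live


# A two hour old hint on a table with one hour of GC grace is dropped.
print(is_replayable(7200, gc_grace_seconds=3600))  # False
# With the default 10 days of GC grace, it is replayed.
print(is_replayable(7200, gc_grace_seconds=864000))  # True
```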

As a mid-post recap, we have several concepts influencing the lifetime of a hint:

  • max_hint_window_ms controlling how long to collect hints for.
  • gc_grace_seconds indicating hint expiration time.
  • Data TTL determining duration of data validity.

Next we will explore combinations of values for these settings and their impact on Hinted Handoff.

First, let’s consider only the max_hint_window_ms and gc_grace_seconds. By default, gc_grace_seconds is set to 10 days and max_hint_window_ms is 3 hours.

Default values of hint window and GC grace.

With the default settings, the first hint will expire long after Cassandra stops collecting hints. There is plenty of time (9 days and 21 hours exactly) for the unavailable nodes to come back and receive their hints.
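The 9 days and 21 hours figure is simply the gap between the end of the hint window and the expiration of the very first hint. A quick check in Python, using the default values quoted above:

```python
from datetime import timedelta

gc_grace = timedelta(days=10)         # default gc_grace_seconds (864000 s)
max_hint_window = timedelta(hours=3)  # default max_hint_window_ms (10800000 ms)

# The first hint is written when the node goes down and expires gc_grace later.
# Hint collection stops max_hint_window after the node goes down.
slack = gc_grace - max_hint_window
print(slack)  # 9 days, 21:00:00
```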

Let’s now suppose we lower the gc_grace_seconds so that it falls below max_hint_window_ms. This move is acceptable in situations when Cassandra faces too many tombstones and we want it to drop them quicker.

GC grace shorter than hint window.

GC grace shorter than hint window means the very first hints start to expire even before the hint collection stops. This breaks the guarantee of Hinted Handoff to deliver max_hint_window_ms span of missed data.
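To make the loss concrete, here is a small sketch that computes which part of the outage is still covered by unexpired hints when the node comes back. It is a simplified model under stated assumptions: the function name is hypothetical, all times are seconds since the node went down, every hint expires exactly gc_grace_seconds after it was written, and data TTL is ignored.

```python
from typing import Optional, Tuple


def deliverable_span(downtime_s: int,
                     hint_window_s: int,
                     gc_grace_s: int) -> Optional[Tuple[int, int]]:
    """Return the (start, end) offsets of writes whose hints are still live
    when the node returns after downtime_s seconds, or None if none are."""
    # Hints are only collected until the hint window closes.
    collected_until = min(downtime_s, hint_window_s)
    # A hint written at offset t expires at t + gc_grace_s, so only hints
    # written after (downtime_s - gc_grace_s) are still live on return.
    oldest_live = max(0, downtime_s - gc_grace_s)
    if oldest_live >= collected_until:
        return None
    return (oldest_live, collected_until)


# 3 hour hint window, 1 hour GC grace, node down for 2 hours:
# hints for the first hour of the outage have already expired.
print(deliverable_span(7200, 10800, 3600))    # (3600, 7200)
# Same outage with the default 10 days of GC grace: nothing is lost.
print(deliverable_span(7200, 10800, 864000))  # (0, 7200)
```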

Taking this to the extreme, we can set gc_grace_seconds to 0. In other words, we tell Cassandra to drop tombstones immediately.

GC grace set to 0.

With gc_grace_seconds set to 0, the hints expire immediately as well. Before the Hinted Handoff rewrite, Cassandra did not even store the hint. Hints expiring immediately, or not being stored in the first place, virtually disable the Hinted Handoff.

So far we have not considered data TTL. Let’s now see what adding data TTL to the mix implies for Hinted Handoff.

As long as gc_grace_seconds is bigger than data TTL, hints will always expire only after the data they contain:

GC grace bigger than data TTL.

However, this only holds while gc_grace_seconds stays higher than the data TTL. Otherwise, we end up with the following:

GC grace smaller than data TTL.

With data TTL longer than GC grace, hints expire before the data they contain. This means we once again break the reliability of Hinted Handoff because not all collected hints get replayed.

A data TTL shorter than gc_grace_seconds makes it possible to set gc_grace_seconds lower than max_hint_window_ms, because the data expires before the hints do:

Data expires before hints do.

Flipping the setup and making data TTL longer than gc_grace_seconds leads to problems once again as hints expire before their data does:

Data expires after hints do.

In this post we took a close look at how gc_grace_seconds can prevent Hinted Handoff from delivering all the hints it collected within the max_hint_window_ms period. The relationship between the data's TTL, gc_grace_seconds and max_hint_window_ms is nuanced and can be a little confusing at first glance.

When making decisions regarding these settings, keep these suggestions in mind:

  • When not using data TTL, gc_grace_seconds should be (far) longer than max_hint_window_ms.
  • When using data TTL, gc_grace_seconds should be (reasonably) larger than the smaller of max_hint_window_ms and data TTL.
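These two suggestions can be turned into a small sanity check. The helper below is a hypothetical sketch, not part of any Cassandra tooling. It only flags configurations where gc_grace_seconds is not larger than the relevant bound; how much larger counts as "far" or "reasonably" is left to your judgment.

```python
from typing import List, Optional


def check_hint_settings(gc_grace_seconds: int,
                        max_hint_window_ms: int,
                        data_ttl_seconds: Optional[int] = None) -> List[str]:
    """Return warnings for settings that contradict the suggestions above."""
    warnings = []
    hint_window_s = max_hint_window_ms / 1000  # cassandra.yaml uses milliseconds
    if data_ttl_seconds is None:
        # No data TTL: GC grace should be longer than the hint window.
        if gc_grace_seconds <= hint_window_s:
            warnings.append(
                "gc_grace_seconds is not longer than max_hint_window_ms")
    else:
        # With data TTL: GC grace should exceed the smaller of the two.
        bound = min(hint_window_s, data_ttl_seconds)
        if gc_grace_seconds <= bound:
            warnings.append(
                "gc_grace_seconds is not larger than the smaller of "
                "max_hint_window_ms and data TTL")
    return warnings


print(check_hint_settings(864000, 10800000))  # []
print(check_hint_settings(3600, 10800000))    # one warning
```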