CockroachDB is built to be fault-tolerant and to recover automatically, but sometimes disasters happen. A disaster is any event that puts your cluster at risk, and usually means your cluster is experiencing hardware failure, data failure, or has compromised security keys. Having a disaster recovery plan enables you to recover quickly, while limiting the consequences.
Hardware failure
When planning to survive hardware failures, start by determining the minimum replication factor you need based on your fault tolerance goals:
Increasing the replication factor can impact write performance in that more replicas must agree to reach quorum. For more details about the mechanics of writes and the Raft protocol, see Read and Writes Overview.
For the purposes of choosing a replication factor, disk failure is equivalent to node failure.
If you experience hardware failures in a cluster, the recovery actions you need to take will depend on the type of infrastructure and topology pattern used:
Single-region survivability planning
The table below shows the replication factor (RF) needed to achieve the listed fault tolerance goals for a single region, cloud-deployed cluster with nodes spread as evenly as possible across 3 availability zones (AZs):
See our basic production topology for configuration guidance.
Fault Tolerance Goals | 3 nodes | 5 nodes | 9 nodes |
---|---|---|---|
1 Node | RF = 3 | RF = 3 | RF = 3 |
1 AZ | RF = 3 | RF = 3 | RF = 3 |
2 Nodes | Not possible | RF = 5 | RF = 5 |
AZ + Node | Not possible | Not possible | RF = 9 |
2 AZ | Not possible | Not possible | Not possible |
To be able to survive 2+ availability zones failing, scale to a multi-region deployment.
Single-region recovery
For hardware failures in a single-region cluster, the recovery actions vary and depend on the type of infrastructure used.
For example, consider a cloud-deployed CockroachDB cluster with the following setup:
- Single-region
- 3 nodes
- A node in each availability zone (i.e., 3 AZs)
- Replication factor of 3
The table below describes what actions to take to recover from various hardware failures in this example cluster:
Failure | Availability | Consequence | Action to Take |
---|---|---|---|
1 Disk | √ | Fewer resources are available. Some data will be under-replicated until the failed nodes are marked dead. Once marked dead, data is replicated to other nodes and the cluster remains healthy. |
Restart the node with a new disk. |
1 Node | √ | If the node or AZ becomes unavailable, check the Overview dashboard on the Admin UI:
|
|
1 AZ | √ | ||
2 Nodes | X | Cluster is unavailable. | Restart 1 of the 2 nodes that are down to regain quorum. If you can’t recover at least 1 node, contact Cockroach Labs support for assistance. |
1 AZ + 1 Node | X | Cluster is unavailable. | Restart the node that is down to regain quorum. When the AZ comes back online, try restarting the node. If you can’t recover at least 1 node, contact Cockroach Labs support for assistance. |
2 AZ | X | Cluster is unavailable. | When the AZ comes back online, try restarting at least 1 of the nodes. You can also contact Cockroach Labs support for assistance. |
3 Nodes | X | Cluster is unavailable. | Restart 2 of the 3 nodes that are down to regain quorum. If you can’t recover 2 of the 3 failed nodes, contact Cockroach Labs support for assistance. |
1 Region | X | Cluster is unavailable. Potential data loss between last backup and time of outage if the region and nodes did not come back online. |
When the region comes back online, try restarting the nodes in the cluster. If the region does not come back online and nodes are lost or destroyed, try restoring the latest cluster backup into a new cluster. You can also contact Cockroach Labs support for assistance. |
When using Kubernetes, recovery actions happen automatically in many cases and no action needs to be taken.
Multi-region survivability planning
The table below shows the replication factor (RF) needed to achieve the listed fault tolerance (e.g., survive 1 failed node) for a multi-region, cloud-deployed cluster with 3 availability zones (AZ) per region and one node in each AZ:
The chart below describes the CockroachDB default behavior when locality flags are correctly set. It does not use geo-partitioning or a specific topology pattern. For a multi-region cluster in production, we do not recommend using the default behavior, as the cluster's performance will be negatively affected.
Fault Tolerance Goals | 3 Regions (9 Nodes Total) |
4 Regions (12 Nodes Total) |
5 Regions (15 Nodes Total) |
---|---|---|---|
1 Node | RF = 3 | RF = 3 | RF = 3 |
1 AZ | RF = 3 | RF = 3 | RF = 3 |
1 Region | RF = 3 | RF = 3 | RF = 3 |
2 Nodes | RF = 5 | RF = 5 | RF = 5 |
1 Region + 1 Node | RF = 9 | RF = 7 | RF = 5 |
2 Regions | Not possible | Not possible | RF = 5 |
2 Regions + 1 Node | Not possible | Not possible | RF = 15 |
Multi-region recovery
For hardware failures in a multi-region cluster, the actions taken to recover vary and depend on the type of infrastructure used.
For example, consider a cloud-deployed CockroachDB cluster with the following setup:
- 3 regions
- 3 AZs per region
- 9 nodes (1 node per AZ)
- Replication factor of 3
The table below describes what actions to take to recover from various hardware failures in this example cluster:
Failure | Availability | Consequence | Action to Take |
---|---|---|---|
1 Disk | √ | Under-replicated data. Fewer resources for workload. | Restart the node with a new disk. | 1 Node | √ | If the node or AZ becomes unavailable check the Overview dashboard on the Admin UI:
|
1 AZ | √ | ||
1 Region | √ | Check the Overview dashboard on the Admin UI. If nodes are marked Dead, decommission the nodes and add 3 new nodes in a new region. Ensure that locality flags are set correctly upon node startup. | |
2 or More Regions | X | Cluster is unavailable. Potential data loss between last backup and time of outage if the region and nodes did not come back online. |
When the regions come back online, try restarting the nodes in the cluster. If the regions do not come back online and nodes are lost or destroyed, try restoring the latest cluster backup into a new cluster. You can also contact Cockroach Labs support for assistance. |
When using Kubernetes, recovery actions happen automatically in many cases and no action needs to be taken.
Data failure
When dealing with data failure due to bad actors, rogue applications, or data corruption, domain expertise is required to identify the affected rows and determine how to remedy the situation (e.g., remove the incorrectly inserted rows, insert deleted rows, etc.). However, there are a few actions that you can take for short-term remediation:
- If you are within the garbage collection window, run differentials.
- If you have a backup file, restore to a point in time.
- If your cluster is running and you do not have a backup with the data you need, create a new backup.
- To recover from corrupted data in a database or table, restore the corrupted object.
To give yourself more time to recover and clean up the corrupted data, put your application in “read only” mode and only run AS OF SYSTEM TIME
queries from the application.
Run differentials
If you are within the garbage collection window (default is 25 hours), run AS OF SYSTEM TIME
queries and use CREATE TABLE AS … SELECT * FROM
to create comparison data and run differentials to find the offending rows to fix.
If you are outside of the garbage collection window, you will need to use a backup to run comparisons.
Restore to a point in time
- If you are a core user, use a backup that was taken with
AS OF SYSTEM TIME
to restore to a specific point. - If you are an enterprise user, use your backup file to restore to a point in time where you are certain there was no corruption. Note that the backup must have been taken with revision history.
Create a new backup
If your cluster is running, you do not have a backup that encapsulates the time you want to restore to, and the data you want to recover is still in the garbage collection window, there are two actions you can take:
- If you are a core user, trigger a backup using
AS OF SYSTEM TIME
to create a new backup that encapsulates the specific time. TheAS OF SYSTEM TIME
must be within the garbage collection window (default is 25 hours). - If you are an enterprise user, trigger a new backup
with_revision_history
and you will have a backup you can use to restore to the desired point in time within the garbage collection window (default is 25 hours).
Recover from corrupted data in a database or table
If you have corrupted data in a database or table, restore the object from a prior backup. If revision history is in the backup, you can restore from a point in time.
Instead of dropping the corrupted table or database, we recommend renaming the table or renaming the database so you have historical data to compare to later. If you drop a database, the database cannot be referenced with AS OF SYSTEM TIME
queries (see #51380 for more information), and you will need to take a backup that is backdated to the system time when the database still existed.
If the table you are restoring has foreign keys, careful consideration should be applied to make sure data integrity is maintained during the restore process.
Compromised security keys
CockroachDB maintains a secure environment for your data. However, there are bad actors who may find ways to gain access or expose important security information. In the event that this happens, there are a few things you can do to get ahead of a security issue:
- If you have changefeeds to cloud storage sinks, cancel the changefeed job and restart it with new access credentials.
- If you are using encryption at rest, rotate the store key(s).
- If you are using wire encryption / TLS, rotate your keys.
Changefeeds to cloud storage
- Cancel the changefeed job immediately and record the high water timestamp for where the changefeed was stopped.
- Remove the access keys from the identity management system of your cloud provider and replace with a new set of access keys.
- Create a new changefeed with the new access credentials using the last high water timestamp.
Encryption at rest
If you believe the user-defined store keys have been compromised, quickly attempt to rotate your store keys that are being used for your encryption at rest setup. If this key has already been compromised and the store keys were rotated by a bad actor, the cluster should be wiped if possible and restored from a prior backup.
If the compromised keys were not rotated by a bad actor, quickly attempt to rotate the store key by restarting each of the nodes with the old key and the new key. For an example on how to do this, see Encryption.
Once all of the nodes are restarted with the new key, put in a request to revoke the old key from the Certificate Authority.
CockroachDB does not allow prior store keys to be used again.
Wire Encryption / TLS
As a best practice, keys should be rotated. In the event that keys have been compromised, quickly attempt to rotate your keys. This can include rotating node certificates, client certificates, and the CA certificate.