Because of CockroachDB's multi-active availability design, you can perform a "rolling upgrade" of your CockroachDB cluster. This means that you can upgrade nodes one at a time without interrupting the cluster's overall health and operations.
Step 1. Verify that you can upgrade
To upgrade to a new version, you must first be on a production release of the previous version. The release does not need to be the latest production release of the previous version, but it must be a production release rather than a testing release (alpha/beta).
Therefore, if you are upgrading from v19.1 to v20.1, or from a testing release (alpha/beta) of v19.2 to v20.1:
First upgrade to a production release of v19.2. Be sure to complete all the steps.
Then return to this page and perform a second rolling upgrade to v20.1.
If you are upgrading from any production release of v19.2, or from any earlier v20.1 release, you do not have to go through intermediate releases; continue to step 2.
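To confirm where you are before you begin, you can check the binary version installed on each node and the cluster's active version; the connection flags below are placeholders for your own deployment:

# On each node: report the installed binary's version.
$ cockroach version

# From any node: report the cluster's active version. It should read 19.2
# before you begin an upgrade to v20.1.
$ cockroach sql --certs-dir=certs --host=<node address> -e "SHOW CLUSTER SETTING version;"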
Step 2. Prepare to upgrade
Before starting the upgrade, complete the following steps.
Check load balancing
Make sure your cluster is behind a load balancer, or your clients are configured to talk to multiple nodes. If your application communicates with a single node, stopping that node to upgrade its CockroachDB binary will cause your application to fail.
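As a quick sanity check (a sketch that assumes your load balancer fronts the SQL port; the address is a placeholder), you can open a few connections through the load balancer and confirm that they land on different nodes:

# Run this several times; the reported node_id should vary if connections
# are being spread across nodes.
$ cockroach sql --certs-dir=certs --host=<load balancer address> \
    -e "SELECT node_id FROM crdb_internal.node_build_info LIMIT 1;"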
Check cluster health
Verify the overall health of your cluster using the Admin UI. On the Cluster Overview:
Under Node Status, make sure all nodes that should be live are listed as such. If any nodes are unexpectedly listed as suspect or dead, identify why the nodes are offline and either restart them or decommission them before beginning your upgrade. If there are dead and non-decommissioned nodes in your cluster, it will not be possible to finalize the upgrade (either automatically or manually).
Under Replication Status, make sure there are 0 under-replicated and unavailable ranges. Otherwise, performing a rolling upgrade increases the risk that ranges will lose a majority of their replicas and cause cluster unavailability. Therefore, it's important to identify and resolve the cause of range under-replication and/or unavailability before beginning your upgrade.
In the Node List:
- Make sure all nodes are on the same version. If any nodes are behind, upgrade them to the cluster's current version first, and then start this process over.
- Make sure capacity and memory usage are reasonable for each node. Nodes must be able to tolerate some increase in case the new version uses more resources for your workload. Also go to Metrics > Dashboard: Hardware and make sure CPU percent is reasonable across the cluster. If there's not enough headroom on any of these metrics, consider adding nodes to your cluster before beginning your upgrade.
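In addition to the Admin UI, you can spot-check node liveness and range health from the command line; the flags below are placeholders for your deployment:

# Per-node liveness, build version, and availability.
$ cockroach node status --certs-dir=certs --host=<node address>

# Per-node range health; ranges_unavailable and ranges_underreplicated
# should both be 0 before you proceed.
$ cockroach node status --ranges --certs-dir=certs --host=<node address>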
Review breaking changes
Review the backward-incompatible changes in v20.1, and if any affect your application, make necessary changes.
Let ongoing bulk operations finish
Make sure there are no bulk imports or schema changes in progress. These are complex operations that involve coordination across nodes and can increase the potential for unexpected behavior during an upgrade.
To check for ongoing imports or schema changes, use SHOW JOBS or check the Jobs page in the Admin UI.
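For example, you can filter the SHOW JOBS output down to the jobs that matter here (a sketch; extend the list of job types if you also run backups or changefeeds during this window):

> SELECT job_id, job_type, description, fraction_completed
  FROM [SHOW JOBS]
  WHERE status = 'running'
    AND job_type IN ('SCHEMA CHANGE', 'IMPORT');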
Once all nodes are running v20.1, but before the upgrade has been finalized, any schema changes still running will stop making progress, but SHOW JOBS and the Jobs page in the Admin UI will show them as running until the upgrade has been finalized. During this time, it will not be possible to manipulate these schema changes via PAUSE JOB, RESUME JOB, or CANCEL JOB statements. Once the upgrade has been finalized, these schema changes will run to completion.
Note that this behavior is specific to upgrades from v19.2 to v20.1; it does not apply to other upgrades.
Review temporary limitations
Once all nodes are running v20.1, but before the upgrade has been finalized:
- New schema changes will be blocked and return an error, with the exception of CREATE TABLE statements without foreign key references and no-op schema change statements that use IF NOT EXISTS. Update your application or tooling to prevent disallowed schema changes during this period. Once the upgrade has been finalized, new schema changes can resume.
- GRANT and REVOKE statements will be blocked and return an error. This is because privileges are stored with table metadata, so privilege changes are treated internally as schema changes. Update your application or tooling to prevent privilege changes during this period. Once the upgrade has been finalized, changes to user privileges can resume.
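As an illustration (the table and user names here are hypothetical), the first statement below would still succeed during this window, while the next two would return an error until the upgrade is finalized:

> CREATE TABLE IF NOT EXISTS audit_log (id UUID PRIMARY KEY DEFAULT gen_random_uuid(), msg STRING);  -- allowed: no foreign key references
> ALTER TABLE audit_log ADD COLUMN created_at TIMESTAMPTZ;  -- blocked until finalization
> GRANT SELECT ON TABLE audit_log TO app_user;              -- blocked until finalization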
Note that these limitations are specific to upgrades from v19.2 to v20.1; they do not apply to other upgrades.
Step 3. Decide how the upgrade will be finalized
This step is relevant only when upgrading from v19.2.x to v20.1. For upgrades within the v20.1.x series, skip this step.
By default, after all nodes are running the new version, the upgrade process will be auto-finalized. This will enable certain features and performance improvements introduced in v20.1. However, it will no longer be possible to perform a downgrade to v19.2. In the event of a catastrophic failure or corruption, the only option will be to start a new cluster using the old binary and then restore from one of the backups created prior to performing the upgrade. For this reason, we recommend disabling auto-finalization so you can monitor the stability and performance of the upgraded cluster before finalizing the upgrade, but note that you will need to follow all of the subsequent directions, including the manual finalization in step 5:
Upgrade to v19.2, if you haven't already.
Start the cockroach sql shell against any node in the cluster.
Set the cluster.preserve_downgrade_option cluster setting:
> SET CLUSTER SETTING cluster.preserve_downgrade_option = '19.2';
It is only possible to set this setting to the current cluster version.
Features that require upgrade finalization
When upgrading from v19.2 to v20.1, certain features and performance improvements will be enabled only after finalizing the upgrade, including but not limited to:
- Primary key changes: After finalization, it will be possible to change the primary key of an existing table using the ALTER TABLE ... ALTER PRIMARY KEY statement, or using DROP CONSTRAINT and then ADD CONSTRAINT in the same transaction (see the sketch after this list).
- Additional authentication methods: After finalization, it will be possible to set the server.host_based_authentication.configuration cluster setting to trust or reject to unconditionally allow or deny matching connection attempts.
- Password for the root user: After finalization, it will be possible to use ALTER USER root WITH PASSWORD to set a password for the root user.
- Dropping indexes used by foreign keys: After finalization, it will be possible to drop an index used by a foreign key constraint if another index exists that fulfills the indexing requirements.
- Hash-sharded indexes: After finalization, it will be possible to use hash-sharded indexes to distribute sequential traffic uniformly across ranges, eliminating single-range hotspots and improving write performance on sequentially-keyed indexes. This is an experimental feature that must be enabled by setting the experimental_enable_hash_sharded_indexes session variable to on.
- CREATEROLE and NOCREATEROLE privileges: After finalization, it will be possible to allow or disallow a user or role to create, alter, or drop other roles via the CREATEROLE or NOCREATEROLE privilege.
- Nested transactions: After finalization, it will be possible to create nested transactions using SAVEPOINTs.
- TIMETZ data type: After finalization, it will be possible to use the TIMETZ data type to store a time of day with a time zone offset from UTC.
- TIME/TIMETZ and INTERVAL precision: After finalization, it will be possible to specify precision levels from 0 (seconds) to 6 (microseconds) for TIME/TIMETZ and INTERVAL values.
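For instance, after finalization the following statements become available (table, column, and index names are hypothetical):

> ALTER TABLE events ALTER PRIMARY KEY USING COLUMNS (user_id, id);

> SET experimental_enable_hash_sharded_indexes = 'on';
> CREATE INDEX events_ts_idx ON events (ts) USING HASH WITH BUCKET_COUNT = 8;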
Step 4. Perform the rolling upgrade
For each node in your cluster, complete the following steps. Be sure to upgrade only one node at a time, and wait at least one minute after a node rejoins the cluster to upgrade the next node. Simultaneously upgrading more than one node increases the risk that ranges will lose a majority of their replicas and cause cluster unavailability.
Once all nodes are running v20.1, but before the upgrade has been finalized, new schema changes will be blocked and return an error, with the exception of CREATE TABLE statements without foreign key references and no-op schema change statements that use IF NOT EXISTS. Be sure to update your application or tooling to prevent disallowed schema changes during the upgrade process.
Note that this behavior is specific to upgrades from v19.2 to v20.1; it does not apply to other upgrades.
We recommend creating scripts to perform these steps instead of performing them manually. Also, if you are running CockroachDB on Kubernetes, see our documentation on single-cluster and/or multi-cluster orchestrated deployments for upgrade guidance instead.
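As a starting point for such a script, here is a minimal per-node sketch. It assumes a Linux host where systemd manages the node under a unit named cockroachdb, the new binary has been unpacked into the current directory, and the active binary is on your PATH; adapt paths, unit names, and connection flags to your deployment:

#!/usr/bin/env bash
# Minimal per-node upgrade sketch; run on one node at a time.
set -euo pipefail

NEW_BINARY="./cockroach-v20.1.17.linux-amd64/cockroach"  # assumed unpack location
UNIT="cockroachdb"                                       # assumed systemd unit name

# Drain and stop the node (TimeoutStopSec=60 assumed in the unit file).
sudo systemctl stop "$UNIT"

# Swap the binary, keeping the old one in case a rollback is needed.
OLD="$(which cockroach)"
sudo mv "$OLD" "${OLD}_old"
sudo cp "$NEW_BINARY" "$OLD"

# Restart the node and wait until it accepts SQL connections.
sudo systemctl start "$UNIT"
cockroach sql --certs-dir=certs --host=localhost -e 'SELECT 1;'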
Drain and stop the node using one of the following methods:
- If the node was started with a process manager, gracefully stop the node by sending SIGTERM with the process manager. If the node is not shutting down after 1 minute, send SIGKILL to terminate the process. When using systemd, for example, set TimeoutStopSec=60 in your configuration template and run systemctl stop <systemd config filename> to stop the node without systemd restarting it.
- If the node was started using cockroach start and is running in the foreground, press ctrl-c in the terminal.
- If the node was started using cockroach start and the --background and --pid-file flags, run kill <pid>, where <pid> is the process ID of the node.
Note: The amount of time you should wait before sending SIGKILL can vary depending on your cluster configuration and workload, which affects how long it takes your nodes to complete a graceful shutdown. In certain edge cases, forcefully terminating the process before the node has completed shutdown can result in temporary data unavailability, latency spikes, uncertainty errors, ambiguous commit errors, or query timeouts. If you need maximum cluster availability, you can run cockroach node drain prior to node shutdown and actively monitor the draining process instead of automating it.
Verify that the process has stopped:
$ ps aux | grep cockroach
Alternatively, you can check the node's logs for the message server drained and shutdown completed.
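For example, assuming the node writes logs to the default directory under its store (adjust the path if your logs go elsewhere):

$ grep 'server drained and shutdown completed' cockroach-data/logs/cockroach.log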
Download and install the CockroachDB binary you want to use.

For macOS:
$ curl -O https://binaries.cockroachdb.com/cockroach-v20.1.17.darwin-10.9-amd64.tgz
$ tar -xzf cockroach-v20.1.17.darwin-10.9-amd64.tgz

For Linux:
$ curl -O https://binaries.cockroachdb.com/cockroach-v20.1.17.linux-amd64.tgz
$ tar -xzf cockroach-v20.1.17.linux-amd64.tgz

If you use cockroach in your $PATH, rename the outdated cockroach binary, and then move the new one into its place.

For macOS:
$ i="$(which cockroach)"; mv "$i" "$i"_old
$ cp -i cockroach-v20.1.17.darwin-10.9-amd64/cockroach /usr/local/bin/cockroach

For Linux:
$ i="$(which cockroach)"; mv "$i" "$i"_old
$ cp -i cockroach-v20.1.17.linux-amd64/cockroach /usr/local/bin/cockroach
Start the node to have it rejoin the cluster.
Warning: For maximum availability, do not wait more than a few minutes before restarting the node with the new binary. See this open issue for context.
Without a process manager like systemd, re-run the cockroach start command that you used to start the node initially, for example:
$ cockroach start \
  --certs-dir=certs \
  --advertise-addr=<node address> \
  --join=<node1 address>,<node2 address>,<node3 address>
If you are using systemd as the process manager, run this command to start the node:
$ systemctl start <systemd config filename>
Verify the node has rejoined the cluster through its output to stdout or through the Admin UI.
If you use cockroach in your $PATH, you can remove the old binary:
$ rm /usr/local/bin/cockroach_old
If you leave versioned binaries on your servers, you do not need to do anything.
After the node has rejoined the cluster, ensure that the node is ready to accept a SQL connection.
Unless there are tens of thousands of ranges on the node, it's usually sufficient to wait one minute. To be certain that the node is ready, run the following command:
$ cockroach sql -e 'select 1'
The command will automatically wait to complete until the node is ready.
Repeat these steps for the next node.
Step 5. Finish the upgrade
This step is relevant only when upgrading from v19.2.x to v20.1. For upgrades within the v20.1.x series, skip this step.
If you disabled auto-finalization in step 3, monitor the stability and performance of your cluster for as long as you require to feel comfortable with the upgrade (generally at least a day), and remember to prevent new schema changes and changes to user privileges during this period, as mentioned earlier. If during this time you decide to roll back the upgrade, repeat the rolling restart procedure with the old binary.
Once you are satisfied with the new version:
Start the cockroach sql shell against any node in the cluster.
Re-enable auto-finalization:
> RESET CLUSTER SETTING cluster.preserve_downgrade_option;
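Finalization begins as soon as this setting is reset. You can confirm that it has completed by checking the cluster version, which should then report 20.1:

> SHOW CLUSTER SETTING version;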
Update your application or tooling to re-enable schema changes that were disallowed during the upgrade process. Note that this applies only to upgrades from v19.2 to v20.1. It does not apply to other upgrades.
Troubleshooting
After the upgrade has finalized (whether manually or automatically), it is no longer possible to downgrade to the previous release. If you are experiencing problems, we therefore recommend that you:
Run the cockroach debug zip command against any node in the cluster to capture your cluster's state (an example invocation follows below).
Reach out for support from Cockroach Labs, sharing your debug zip.
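For example, to write the capture to a local file (connection flags are placeholders for your deployment):

$ cockroach debug zip ./cockroach-debug.zip --certs-dir=certs --host=<node address>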
In the event of catastrophic failure or corruption, the only option will be to start a new cluster using the old binary and then restore from one of the backups created prior to performing the upgrade.