Physical Cluster Replication Technical Overview


Physical cluster replication (PCR) automatically and continuously streams data from an active primary CockroachDB cluster to a passive standby cluster. Each cluster contains a system virtual cluster and an application virtual cluster:

  • The system virtual cluster manages the cluster's control plane and the replication of the cluster's data. Admins connect to the system virtual cluster to configure and manage the underlying CockroachDB cluster, set up PCR, create and manage a virtual cluster, and observe metrics and logs for the CockroachDB cluster and each virtual cluster.
  • Each other virtual cluster manages its own data plane. Virtual clusters contain user data and run application workloads. By default, users connect to a virtual cluster rather than to the system virtual cluster; to connect to the system virtual cluster, the connection string must be modified. When PCR is enabled, the non-system virtual cluster on both the primary and standby clusters is named main.
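As a rough illustration of what modifying the connection string can look like, the sketch below rewrites a connection URL so that it targets a named virtual cluster. It assumes the cluster is selected with a `-ccluster=<name>` value in the `options` query parameter, which is not stated on this page, so confirm it against the documentation for your version. The helper name `target_virtual_cluster` is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def target_virtual_cluster(url: str, cluster: str) -> str:
    """Return `url` with an `options=-ccluster=<cluster>` query parameter.

    Assumption: the virtual cluster is chosen through `-ccluster=<name>`
    in the `options` parameter. Any existing `options` entry is replaced,
    so other session options carried in it would be lost.
    """
    parts = urlsplit(url)
    # Keep every query parameter except a previous `options` entry.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "options"]
    query.append(("options", f"-ccluster={cluster}"))
    # safe="=-" leaves '=' unescaped so the value stays readable.
    return urlunsplit(parts._replace(query=urlencode(query, safe="=-")))


if __name__ == "__main__":
    print(target_virtual_cluster(
        "postgresql://root@localhost:26257/defaultdb?sslmode=verify-full",
        "system",
    ))
```

Passing `"main"` instead of `"system"` would target the application virtual cluster under the same assumption.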

This separation of concerns means that the replication stream can operate without affecting work happening in a virtual cluster.

Replication stream start-up sequence

Starting a physical replication stream involves two jobs, one on the standby cluster and one on the primary cluster:

  • Standby consumer job: Communicates with the primary cluster via an ordinary SQL connection and is responsible for initiating the replication stream. The consumer job ingests updates from the primary cluster producer job.
  • Primary producer job: Protects data on the primary cluster and sends updates to the standby cluster.

The stream initialization proceeds as follows:

  1. The standby's consumer job connects via its system virtual cluster to the primary cluster and starts the primary cluster's physical stream producer job.
  2. The primary cluster chooses a timestamp at which to start the physical replication stream. A protected timestamp prevents data on the primary from being garbage collected until it has been replicated to the standby.
  3. The primary cluster returns the timestamp and a job ID for the replication job.
  4. The standby cluster retrieves a list of all nodes in the primary cluster. It uses this list to distribute work across all nodes in the standby cluster.
  5. The initial scan runs on the primary and backfills all data from the primary virtual cluster as of the starting timestamp of the replication stream.
  6. Once the initial scan is complete, the primary begins streaming all changes made after the starting timestamp.
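Steps 5 and 6 can be pictured with a toy model: a backfill of the data as of the starting timestamp, followed by a stream of every later change. The sketch below is an illustration only and not how CockroachDB implements the scan. Timestamps are plain integers, and the `Version` tuple and both function names are hypothetical.

```python
from typing import Dict, List, Tuple

# One versioned write: (key, write timestamp, value).
Version = Tuple[str, int, str]


def initial_scan(versions: List[Version], start_ts: int) -> Dict[str, str]:
    """Return the newest value of each key written at or before start_ts."""
    newest: Dict[str, Tuple[int, str]] = {}
    for key, ts, value in versions:
        if ts <= start_ts and (key not in newest or ts > newest[key][0]):
            newest[key] = (ts, value)
    return {key: value for key, (_, value) in newest.items()}


def changes_after(versions: List[Version], start_ts: int) -> List[Version]:
    """Return the writes made after start_ts, ordered by timestamp."""
    return sorted((v for v in versions if v[1] > start_ts), key=lambda v: v[1])


if __name__ == "__main__":
    history = [("a", 1, "x"), ("a", 5, "y"), ("b", 3, "z"), ("b", 9, "w")]
    print(initial_scan(history, start_ts=5))   # backfill as of timestamp 5
    print(changes_after(history, start_ts=5))  # changes streamed afterwards
```

Together the two results cover every write exactly once, which is why the stream can pick up at the starting timestamp without gaps or duplicates in this model.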

Two virtualized clusters with system virtual cluster and application virtual cluster showing the directional stream.

During the replication stream

Replication happens at the byte level, which means that the job is unaware of databases, tables, row boundaries, and so on. However, when a cutover to the standby cluster is initiated, the replication job ensures that the standby cluster is in a transactionally consistent state as of a certain point in time. Beyond the application data, the job also replicates users, privileges, basic zone configuration, and schema changes.

During the job, rangefeeds periodically emit resolved timestamps, each of which marks a time at which the ingested data is known to be consistent. A resolved timestamp guarantees that no new writes will arrive with an earlier timestamp. This allows the standby cluster to move the protected timestamp forward as the replicated timestamp advances. The standby sends this information to the primary cluster, which allows garbage collection to continue as the replication stream on the standby cluster advances.
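The relationship between resolved timestamps and the protected timestamp can be sketched as follows. The sketch assumes, for illustration only, that the replicated time is the minimum resolved timestamp across all replicated spans and that the protected timestamp only ever moves forward. The span labels, integer timestamps, and function names are hypothetical.

```python
from typing import Dict


def replicated_time(resolved: Dict[str, int]) -> int:
    """Return the minimum resolved timestamp across all spans.

    Assumption: data is consistent only up to the slowest span, so the
    overall replicated time is the smallest per-span resolved timestamp.
    """
    if not resolved:
        raise ValueError("no spans have reported a resolved timestamp")
    return min(resolved.values())


def advance_protected_timestamp(current: int, resolved: Dict[str, int]) -> int:
    """Move the protected timestamp forward to the replicated time, never back."""
    return max(current, replicated_time(resolved))


if __name__ == "__main__":
    spans = {"a-f": 120, "f-m": 95, "m-z": 110}
    print(replicated_time(spans))                  # slowest span decides
    print(advance_protected_timestamp(90, spans))  # protection moves forward
```

In this model, data older than the returned protected timestamp becomes eligible for garbage collection on the primary.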

Note:

If the primary cluster does not receive replicated time information from the standby for 24 hours, it cancels the replication job. This ensures that an inactive replication job will not prevent garbage collection. The length of this window is configurable with ALTER VIRTUAL CLUSTER virtual_cluster EXPIRATION WINDOW = duration syntax.
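The expiration rule amounts to a simple comparison between the time since the last report and the expiration window. The sketch below is a minimal model of that check, assuming the 24-hour default described in the note. The function name `should_cancel` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta

# Default window taken from the note above.
DEFAULT_EXPIRATION_WINDOW = timedelta(hours=24)


def should_cancel(
    last_report: datetime,
    now: datetime,
    window: timedelta = DEFAULT_EXPIRATION_WINDOW,
) -> bool:
    """Return True once the standby has been silent for longer than `window`."""
    return now - last_report > window


if __name__ == "__main__":
    last = datetime(2024, 1, 1, 0, 0)
    print(should_cancel(last, last + timedelta(hours=23)))
    print(should_cancel(last, last + timedelta(hours=25)))
```

A longer window tolerates longer standby outages at the cost of holding back garbage collection on the primary for longer.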

Cutover and promotion process

The tracked replicated time and the advancing protected timestamp allow the replication stream to also track retained time, which is a timestamp in the past marking the earliest time to which the replication stream can cut over. Therefore, the cutover window for a replication job falls between the retained time and the replicated time.

Timeline showing how the cutover window is between the retained time and replicated time.
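The cutover window can be expressed as a range check. The sketch below treats both bounds as inclusive, which is an assumption made for illustration. Timestamps are plain integers and the function name is hypothetical.

```python
def in_cutover_window(cutover: int, retained: int, replicated: int) -> bool:
    """Return True if `cutover` lies between retained and replicated time.

    Assumption: both ends of the window are inclusive.
    """
    if retained > replicated:
        raise ValueError("retained time cannot be later than replicated time")
    return retained <= cutover <= replicated


if __name__ == "__main__":
    print(in_cutover_window(cutover=150, retained=100, replicated=200))
    print(in_cutover_window(cutover=50, retained=100, replicated=200))
```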

Replication lag is the time between the most up-to-date replicated time and the actual time. Although replication stays as close to the actual time as possible, writes made during this lag window are not yet on the standby and could be lost in a cutover.
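As a worked example of the definition, replication lag is the difference between the current time and the replicated time. The sketch below clamps the result at zero to tolerate small clock differences, which is a choice made for this illustration. The function name is hypothetical.

```python
from datetime import datetime, timedelta


def replication_lag(replicated_time: datetime, now: datetime) -> timedelta:
    """Return how far the replicated time trails the actual time."""
    # Clamp at zero so minor clock skew never yields a negative lag.
    return max(now - replicated_time, timedelta(0))


if __name__ == "__main__":
    replicated = datetime(2024, 1, 1, 12, 0, 0)
    now = datetime(2024, 1, 1, 12, 0, 5)
    print(replication_lag(replicated, now).total_seconds())
```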

For the cutover process, the standby cluster waits until it has reached the specified cutover time, which can be in the past (as early as the retained time), the LATEST timestamp, or in the future. Once that timestamp has been reached, the replication stream stops and any data in the standby cluster that is later than the cutover time is removed. Depending on how much data the standby needs to revert, this can lengthen the recovery time and affect the recovery time objective (RTO).
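The revert step can be illustrated with a toy versioned store in which every write carries a timestamp: cutting over discards every write later than the cutover time. This is a simplified model and not the actual revert mechanism. The `Version` tuple, integer timestamps, and function name are hypothetical.

```python
from typing import List, Tuple

# One versioned write: (key, write timestamp, value).
Version = Tuple[str, int, str]


def revert_to(versions: List[Version], cutover_ts: int) -> Tuple[List[Version], int]:
    """Drop writes later than cutover_ts; return the kept writes and the drop count."""
    kept = [v for v in versions if v[1] <= cutover_ts]
    return kept, len(versions) - len(kept)


if __name__ == "__main__":
    ingested = [("a", 1, "x"), ("a", 5, "y"), ("b", 3, "z"), ("b", 9, "w")]
    kept, removed = revert_to(ingested, cutover_ts=5)
    print(kept)
    print(removed)
```

The drop count stands in for the amount of revert work, which is what ties the choice of cutover time to recovery duration in this model.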

After reverting any necessary data, the standby virtual cluster is promoted and becomes available to serve traffic, and the replication job ends.

Note:

For detail on cutting back to the primary cluster following a cutover, refer to Cut back to the primary cluster.

