Transaction Retry Error Reference

Warning:
As of May 10, 2022, CockroachDB v20.2 is no longer supported. For more details, refer to the Release Support Policy.

This page lists the transaction retry error codes emitted by CockroachDB.

Transaction retry errors use the SQLSTATE error code 40001, and emit error messages with the string restart transaction. The most common transaction retry errors are also known as serialization errors. Serialization errors indicate that a transaction failed because it could not be placed into a serializable ordering among all of the currently-executing transactions.

The failure to place a transaction into a serializable ordering usually happens due to a transaction conflict with another concurrent or recent transaction that is trying to write to the same data. When multiple transactions are trying to write to the same data, this state is also known as contention. When serialization errors occur due to contention, the transaction needs to be retried by the client as described in client-side retry handling.

In less common cases, transaction retry errors are not caused by contention, but by the internal state of the CockroachDB cluster. For example, the cluster could be overloaded. In such cases, other actions may need to be taken above and beyond client-side retries.

See below for a complete list of retry error codes. For each error code, we describe:

  • Why the error is happening.
  • What to do about it.

This page is meant to provide information about specific transaction retry error codes to make troubleshooting easier. In most cases, the correct actions to take when these errors occur are:

  1. Update your app to retry on serialization errors (where SQLSTATE is 40001), as described in client-side retry handling. A minimal retry-loop sketch follows this list.

  2. Design your schema and queries to reduce contention. For more information about how contention occurs and how to avoid it, see Understanding and avoiding transaction contention. In particular, if you are able to send all of the statements in your transaction in a single batch, CockroachDB can usually automatically retry the entire transaction for you.

  3. Use historical reads with SELECT ... AS OF SYSTEM TIME.

  4. In some cases, SELECT FOR UPDATE can also be used to reduce the occurrence of serialization errors.

  5. High priority transactions are less likely to experience serialization errors than low priority transactions. Adjusting transaction priorities usually does not change how many serialization errors occur, but it can help control which transactions experience them.
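
For example, a minimal client-side retry loop might look like the following sketch. It assumes a Python client using psycopg2 and a hypothetical accounts table; the connection string, table, and column names are illustrative, so adapt them to your own application and driver.

# A minimal client-side retry loop, assuming a Python client using psycopg2
# and a hypothetical accounts table.
import random
import time

import psycopg2
import psycopg2.errorcodes

conn = psycopg2.connect("postgresql://user@localhost:26257/bank")  # illustrative DSN

def transfer_funds(conn, from_id, to_id, amount):
    # The statements that make up one logical transaction.
    with conn.cursor() as cur:
        cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s",
                    (amount, from_id))
        cur.execute("UPDATE accounts SET balance = balance + %s WHERE id = %s",
                    (amount, to_id))
    conn.commit()

def run_with_retries(conn, op, max_retries=5):
    for attempt in range(1, max_retries + 1):
        try:
            op(conn)
            return
        except psycopg2.Error as err:
            # SQLSTATE 40001 marks a retryable serialization error.
            if err.pgcode != psycopg2.errorcodes.SERIALIZATION_FAILURE:
                raise
            conn.rollback()
            # Exponential backoff with jitter gives conflicting transactions
            # time to finish before the retry.
            time.sleep((2 ** attempt) * 0.1 * (0.5 + random.random()))
    raise Exception("transaction did not succeed after {} retries".format(max_retries))

run_with_retries(conn, lambda c: transfer_funds(c, 1, 2, 100))

The important details are that only errors with SQLSTATE 40001 are retried, the transaction is rolled back before each retry, and the loop backs off between attempts.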

Note:

Your application's retry logic does not need to distinguish between the different types of serialization errors. They are listed here for reference during advanced troubleshooting.

Overview

CockroachDB always attempts to find a serializable ordering among all of the currently-executing transactions.

Whenever possible, CockroachDB will auto-retry a transaction internally without notifying the client. CockroachDB will only send a serialization error to the client when it cannot resolve the error automatically without client-side intervention.

In other words, by the time a serialization error bubbles up to the client, CockroachDB has already tried to handle the error internally, and could not.

The main reason why CockroachDB cannot auto-retry every serialization error without sending an error to the client is that the SQL language is "conversational" by design. The client can send arbitrary statements to the server during a transaction, receive some results, and then decide to issue other arbitrary statements inside the same transaction based on the server's response.

Suppose that the client is a Java application using JDBC, or an analyst typing BEGIN directly to a SQL shell. In either case, the client is free to issue a BEGIN, wait an arbitrary amount of time, and issue additional statements. Meanwhile, other transactions are being processed by the system, potentially writing to the same data.

This "conversational" design means that there is no way for the server to always retry the arbitrary statements sent so far inside an open transaction. If there are different results for any given statement than there were at an earlier point in the currently open transaction's lifetime (likely due to the operations of other, concurrently-executing transactions), CockroachDB must defer to the client to decide how to handle that situation. This is why we recommend keeping transactions as small as possible.

Error reference

RETRY_WRITE_TOO_OLD

TransactionRetryWithProtoRefreshError: ... RETRY_WRITE_TOO_OLD ...

Description:

The RETRY_WRITE_TOO_OLD error occurs when a transaction A tries to write to a row R, but another transaction B that was supposed to be serialized after A (i.e., was assigned a higher timestamp) has already written to that row R and committed. This is a common error when you have too much contention in your workload.

Action:

  1. Retry transaction A as described in client-side retry handling.
  2. Design your schema and queries to reduce contention. For more information about how contention occurs and how to avoid it, see Understanding and avoiding transaction contention. In particular, if you are able to send all of the statements in your transaction in a single batch, CockroachDB can usually automatically retry the entire transaction for you.
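
For illustration, here is one way to send an entire transaction as a single batch, assuming `conn` is an open psycopg2 connection as in the earlier retry sketch and an accounts table exists; the statements and values are illustrative. Because no intermediate results are returned to the client before COMMIT, CockroachDB can usually retry the whole batch automatically on a serialization conflict.

# Sending an entire transaction as one batch of statements.
conn.autocommit = True  # let the BEGIN/COMMIT in the batch delimit the transaction
batch = """
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
"""
with conn.cursor() as cur:
    cur.execute(batch)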

RETRY_SERIALIZABLE

TransactionRetryWithProtoRefreshError: ... RETRY_SERIALIZABLE ...

Description:

The RETRY_SERIALIZABLE error occurs in the following cases:

  1. When a transaction A has its timestamp moved forward (also known as A being "pushed") as CockroachDB attempts to find a serializable transaction ordering. Specifically, transaction A tried to write a key that transaction B had already read, and B was supposed to be serialized after A (i.e., B had a higher timestamp than A). CockroachDB will try to serialize A after B by changing A's timestamp, but it cannot do that when another transaction has subsequently written to some of the keys that A has read and returned to the client. When that happens, the RETRY_SERIALIZABLE error is signaled. For more information about how timestamp pushes work in our transaction model, see the architecture docs on the transaction layer's timestamp cache.

  2. When a high-priority transaction A does a read that runs into a write intent from another lower-priority transaction B, and some other transaction C writes to a key that B has already read. Transaction B will get this error when it tries to commit, because A has already read some of the data touched by B and returned results to the client, and C has written data previously read by B.

  3. When a transaction A is forced to refresh (i.e., change its timestamp) due to hitting the maximum closed timestamp interval (closed timestamps enable Follower Reads and Change Data Capture (CDC)). This can happen when transaction A is a long-running transaction, and there is a write by another transaction to data that A has already read. If this is the cause of the error, the solution is to increase the kv.closed_timestamp.target_duration setting to a higher value. Unfortunately, there is no indication from this error code that a too-low closed timestamp setting is the issue. Therefore, you may need to rule out cases 1 and 2 (or experiment with increasing the closed timestamp interval, if that is possible for your application - see the note below).

Action:

  1. If you encounter case 1 or 2 above, the solution is to:

    1. Retry transaction A as described in client-side retry handling.
    2. Design your schema and queries to reduce contention. For more information about how contention occurs and how to avoid it, see Understanding and avoiding transaction contention. In particular, if you are able to send all of the statements in your transaction in a single batch, CockroachDB can usually automatically retry the entire transaction for you.
    3. Use historical reads with SELECT ... AS OF SYSTEM TIME.
  2. If you encounter case 3 above, the solution is to:

    1. Increase the kv.closed_timestamp.target_duration setting to a higher value. As described in the note below, this will impact the freshness of data available via Follower Reads and CDC changefeeds.
    2. Retry transaction A as described in client-side retry handling.
    3. Design your schema and queries to reduce contention. For more information about how contention occurs and how to avoid it, see Understanding and avoiding transaction contention. In particular, if you are able to send all of the statements in your transaction in a single batch, CockroachDB can usually automatically retry the entire transaction for you.
    4. Use historical reads with SELECT ... AS OF SYSTEM TIME.
Note:

If you increase the kv.closed_timestamp.target_duration setting, it means that you are increasing the amount of time by which the data available in Follower Reads and CDC changefeeds lags behind the current state of the cluster. In other words, there is a trade-off here: if you absolutely must execute long-running transactions that execute concurrently with other transactions that are writing to the same data, you may have to settle for longer delays on Follower Reads and/or CDC to avoid frequent serialization errors. The anomaly that would be exhibited if these transactions were not retried is called write skew.
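
If you do decide to raise kv.closed_timestamp.target_duration for case 3, the following sketch shows one way to inspect and change the setting. It assumes `conn` is an open psycopg2 connection and a SQL user with permission to change cluster settings; the 10s value is illustrative only, not a recommendation.

# Inspecting and raising the closed timestamp target duration.
conn.autocommit = True  # cluster settings are changed outside an explicit transaction
with conn.cursor() as cur:
    cur.execute("SHOW CLUSTER SETTING kv.closed_timestamp.target_duration")
    print("current target duration:", cur.fetchone()[0])
    cur.execute("SET CLUSTER SETTING kv.closed_timestamp.target_duration = '10s'")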

RETRY_ASYNC_WRITE_FAILURE

TransactionRetryWithProtoRefreshError: ... RETRY_ASYNC_WRITE_FAILURE ...

Description:

The RETRY_ASYNC_WRITE_FAILURE error occurs when a problem with your cluster's operation arises at the moment of a previous write in the transaction, causing CockroachDB to fail to replicate one of the transaction's writes. For example, this can happen if a network partition cuts off access to some nodes in your cluster.

Action:

  1. Retry the transaction as described in client-side retry handling. This is worth doing because the problem with the cluster is likely to be transient.
  2. Investigate the problems with your cluster. For cluster troubleshooting information, see Troubleshoot Cluster Setup.

ReadWithinUncertaintyInterval

TransactionRetryWithProtoRefreshError: ReadWithinUncertaintyIntervalError:
        read at time 1591009232.376925064,0 encountered previous write with future timestamp 1591009232.493830170,0 within uncertainty interval `t <= 1591009232.587671686,0`;
        observed timestamps: [{1 1591009232.587671686,0} {5 1591009232.376925064,0}]

Description:

The ReadWithinUncertaintyIntervalError can occur when two transactions which start on different gateway nodes attempt to operate on the same data at close to the same time, and one of the operations is a write. The uncertainty comes from the fact that we cannot tell which one started first - the clocks on the two gateway nodes may not be perfectly in sync.

For example, if the clock on node A is ahead of the clock on node B, a transaction started on node A may be able to commit a write with a timestamp that is still in the "future" from the perspective of node B. A later transaction that starts on node B should be able to see the earlier write from node A, even if B's clock has not caught up to A. The "read within uncertainty interval" occurs if we discover this situation in the middle of a transaction, when it is too late for the database to handle it automatically. When node B's transaction retries, it will unambiguously occur after the transaction from node A.

Note:

This behavior is non-deterministic: it depends on which node is the leaseholder of the underlying data range. It’s generally a sign of contention. Uncertainty errors are always possible with near-realtime reads under contention.

Action:

The solution is to do one of the following:

  1. Be prepared to retry on uncertainty (and other) errors, as described in client-side retry handling.
  2. Use historical reads with SELECT ... AS OF SYSTEM TIME.
  3. Design your schema and queries to reduce contention. For more information about how contention occurs and how to avoid it, see Understanding and avoiding transaction contention. In particular, if you are able to send all of the statements in your transaction in a single batch, CockroachDB can usually automatically retry the entire transaction for you.
  4. If you trust your clocks, you can try lowering the --max-offset option to cockroach start, which provides an upper limit on how long a transaction can continue to restart due to uncertainty.
Note:

Uncertainty errors are a form of transaction conflict. For more information about transaction conflicts, see Transaction conflicts.
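
As an illustration of option 2 above, the following sketch runs a historical read with AS OF SYSTEM TIME. It assumes `conn` is an open psycopg2 connection and a hypothetical orders table; the 10-second staleness is illustrative. Because the read happens at a fixed timestamp in the recent past, it does not conflict with concurrent writes and will not hit uncertainty errors.

# A historical (slightly stale) read using AS OF SYSTEM TIME.
conn.autocommit = True  # run as an implicit, single-statement transaction
customer_id = 7  # illustrative value
with conn.cursor() as cur:
    cur.execute(
        "SELECT id, status FROM orders AS OF SYSTEM TIME '-10s' "
        "WHERE customer_id = %s",
        (customer_id,))
    for row in cur.fetchall():
        print(row)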

RETRY_COMMIT_DEADLINE_EXCEEDED

TransactionRetryWithProtoRefreshError: TransactionPushError: transaction deadline exceeded ...

Description:

The RETRY_COMMIT_DEADLINE_EXCEEDED error means that the transaction timed out due to being pushed by other concurrent transactions. This error is most likely to happen to long-running transactions. The conditions that trigger this error are very similar to the conditions that lead to a RETRY_SERIALIZABLE error, except that a transaction that hits this error got pushed for several minutes without hitting any of the conditions that trigger a RETRY_SERIALIZABLE error. In other words, the conditions that trigger this error are a subset of those that trigger RETRY_SERIALIZABLE, with the added condition that the transaction ran for too long (several minutes).

Note:

Read-only transactions do not get pushed, so they do not run into this error.

This error occurs in the cases described below.

  1. When a transaction A has its timestamp moved forward (also known as A being "pushed") as CockroachDB attempts to find a serializable transaction ordering. Specifically, transaction A tried to write a key that transaction B had already read. B was supposed to be serialized after A (i.e., B had a higher timestamp than A). CockroachDB will try to serialize A after B by changing A's timestamp.

  2. When a high-priority transaction A does a read that runs into a write intent from another lower-priority transaction B. Transaction B may get this error when it tries to commit, because A has already read some of the data touched by B and returned results to the client.

  3. When a transaction A is forced to refresh (change its timestamp) due to hitting the maximum closed timestamp interval (closed timestamps enable Follower Reads and Change Data Capture (CDC)). This can happen when transaction A is a long-running transaction, and there is a write by another transaction to data that A has already read.

Action:

  1. The RETRY_COMMIT_DEADLINE_EXCEEDED error is one case where the standard advice to add a retry loop to your application may not be advisable. A transaction that runs for long enough to get pushed beyond its deadline is quite likely to fail again on retry for the same reasons. Therefore, the best thing to do in this case is to shrink the running time of your transactions so they complete more quickly and do not hit the deadline.
  2. If you encounter case 3 above, you can increase the kv.closed_timestamp.target_duration setting to a higher value. Unfortunately, there is no indication from this error code that a too-low closed timestamp setting is the issue. Therefore, you may need to rule out cases 1 and 2 (or experiment with increasing the closed timestamp interval, if that is possible for your application - see the note below).
Note:

If you increase the kv.closed_timestamp.target_duration setting, it means that you are increasing the amount of time by which the data available in Follower Reads and CDC changefeeds lags behind the current state of the cluster. In other words, there is a trade-off here: if you absolutely must execute long-running transactions that execute concurrently with other transactions that are writing to the same data, you may have to settle for longer delays on Follower Reads and/or CDC to avoid frequent serialization errors. The anomaly that would be exhibited if these transactions were not retried is called write skew.
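
One common way to shrink transaction running time is to break a large write into many small transactions, each of which commits quickly. The following is a hedged sketch of that pattern, assuming `conn` is an open psycopg2 connection and a hypothetical events table; the batch size is illustrative.

# Processing a large update in small, separately committed batches so that
# no single transaction runs long enough to be pushed past its deadline.
BATCH = 1000
while True:
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE events SET archived = true "
            "WHERE id IN (SELECT id FROM events WHERE archived = false LIMIT %s)",
            (BATCH,))
        updated = cur.rowcount
    conn.commit()  # each small transaction commits quickly
    if updated < BATCH:
        break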

ABORT_REASON_ABORTED_RECORD_FOUND

TransactionRetryWithProtoRefreshError:TransactionAbortedError(ABORT_REASON_ABORTED_RECORD_FOUND) ...

Description:

The ABORT_REASON_ABORTED_RECORD_FOUND error means that the client application is trying to use a transaction that has been aborted. This happens in one of the following cases:

  • Write-write conflict: Another high-priority transaction B encountered a write intent by our transaction A, and tried to push A's timestamp.
  • Cluster overload: B thinks that A's transaction coordinator node is dead, because the coordinator node hasn't heartbeated the transaction record for a few seconds.
  • Deadlock: Some transaction B is trying to acquire conflicting locks in reverse order from transaction A.

Action:

If you are encountering deadlocks:

  • Avoid producing deadlocks in your application by making sure that transactions acquire locks in the same order. A sketch of this pattern follows at the end of this section.

If you are using only default transaction priorities:

  • This error means your cluster has problems. You are likely overloading it. Investigate the source of the overload, and do something about it. For more information, see Node liveness issues.

If you are using high- or low-priority transactions:

  • This is expected behavior: transactions running at a lower priority can be aborted by conflicting higher-priority transactions. Retry the transaction as described in client-side retry handling.
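
The following sketch illustrates the deadlock-avoidance advice above: acquire locks in a fixed order (here, ascending primary key) so that two transactions touching the same rows cannot wait on each other in a cycle. It assumes `conn` is an open psycopg2 connection and a hypothetical accounts table; SELECT FOR UPDATE is used to take the row locks up front.

# Avoiding deadlocks by always locking rows in the same order.
def transfer(conn, from_id, to_id, amount):
    first, second = sorted([from_id, to_id])
    with conn.cursor() as cur:
        # Lock both rows in ascending id order before writing.
        cur.execute("SELECT balance FROM accounts WHERE id = %s FOR UPDATE", (first,))
        cur.execute("SELECT balance FROM accounts WHERE id = %s FOR UPDATE", (second,))
        cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s",
                    (amount, from_id))
        cur.execute("UPDATE accounts SET balance = balance + %s WHERE id = %s",
                    (amount, to_id))
    conn.commit()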

ABORT_REASON_CLIENT_REJECT

TransactionRetryWithProtoRefreshError:TransactionAbortedError(ABORT_REASON_CLIENT_REJECT) ...

Description:

The ABORT_REASON_CLIENT_REJECT error is caused by the same conditions as the ABORT_REASON_ABORTED_RECORD_FOUND error, and requires the same actions. The errors are fundamentally the same, except that they are discovered at different points in the process.

ABORT_REASON_PUSHER_ABORTED

TransactionRetryWithProtoRefreshError:TransactionAbortedError(ABORT_REASON_PUSHER_ABORTED) ...

Description:

The ABORT_REASON_PUSHER_ABORTED error is caused by the same conditions as the ABORT_REASON_ABORTED_RECORD_FOUND error, and requires the same actions. The errors are fundamentally the same, except that they are discovered at different points in the process.

ABORT_REASON_ABORT_SPAN

TransactionRetryWithProtoRefreshError:TransactionAbortedError(ABORT_REASON_ABORT_SPAN) ...

Description:

The ABORT_REASON_ABORT_SPAN error is caused by the same conditions as the ABORT_REASON_ABORTED_RECORD_FOUND error, and requires the same actions. The errors are fundamentally the same, except that they are discovered at different points in the process.

ABORT_REASON_NEW_LEASE_PREVENTS_TXN

TransactionRetryWithProtoRefreshError:TransactionAbortedError(ABORT_REASON_NEW_LEASE_PREVENTS_TXN) ...

Description:

The ABORT_REASON_NEW_LEASE_PREVENTS_TXN error occurs because the timestamp cache will not allow transaction A to create a transaction record. A new lease wipes the timestamp cache, so this could mean the leaseholder was moved and the duration of transaction A was unlucky enough to happen across a lease acquisition. In other words, leaseholders got shuffled out from underneath transaction A (due to no fault of the client application or schema design), and now it has to be retried.

Action:

Retry transaction A as described in client-side retry handling.

ABORT_REASON_TIMESTAMP_CACHE_REJECTED

TransactionRetryWithProtoRefreshError:TransactionAbortedError(ABORT_REASON_TIMESTAMP_CACHE_REJECTED) ...

Description:

The ABORT_REASON_TIMESTAMP_CACHE_REJECTED error occurs when the timestamp cache will not allow transaction A to create a transaction record. This can happen due to a range merge happening in the background, or because the timestamp cache is an in-memory cache, and has outgrown its memory limit (about 64 MB).

Action:

Retry transaction A as described in client-side retry handling.
