This page provides best practices for optimizing query performance in CockroachDB.
DML best practices
Use multi-row statements instead of multiple single-row statements
For INSERT
, UPSERT
, and DELETE
statements, a single multi-row statement is faster than multiple single-row statements. Whenever possible, use multi-row statements for DML queries instead of multiple single-row statements.
For more information, see:
Use UPSERT
instead of INSERT ON CONFLICT
on tables with no secondary indexes
When inserting or updating columns on a table that does not have secondary indexes, Cockroach Labs recommends using an UPSERT
statement instead of INSERT ON CONFLICT DO UPDATE
. Whereas INSERT ON CONFLICT
always performs a read, the UPSERT
statement writes without reading, making it faster. This may be useful if you are using a simple SQL table of two columns to simulate direct KV access.
If the table has a secondary index, there is no performance difference between UPSERT
and INSERT ON CONFLICT
. However, INSERT
without an ON CONFLICT
clause may not scan the table for existing values. This can provide a performance improvement over UPSERT
.
Bulk-insert best practices
Use multi-row INSERT
statements for bulk-inserts into existing tables
To bulk-insert data into an existing table, batch multiple rows in one multi-row INSERT
statement. Experimentally determine the optimal batch size for your application by monitoring the performance for different batch sizes (10 rows, 100 rows, 1000 rows). Do not include bulk INSERT
statements within an explicit transaction.
You can also use the IMPORT INTO
statement to bulk-insert CSV data into an existing table.
For more information, see Insert Multiple Rows.
Large multi-row INSERT
queries can lead to long-running transactions that result in transaction retry errors. If a multi-row INSERT
query results in an error code 40001
with the message transaction deadline exceeded
, we recommend breaking up the query up into smaller batches of rows.
Use IMPORT
instead of INSERT
for bulk-inserts into new tables
To bulk-insert data into a brand new table, the IMPORT
statement performs better than INSERT
.
Bulk-delete best practices
Use TRUNCATE
instead of DELETE
to delete all rows in a table
The TRUNCATE
statement removes all rows from a table by dropping the table and recreating a new table with the same name. This performs better than using DELETE
, which performs multiple transactions to delete all rows.
Use batch deletes to delete a large number of rows
To delete a large number of rows, we recommend iteratively deleting batches of rows until all of the unwanted rows are deleted. For an example, see Bulk-delete Data.
Batch delete "expired" data
CockroachDB has support for Time to Live ("TTL") expiration on table rows, also known as Row-Level TTL. Row-Level TTL is a mechanism whereby rows from a table are considered "expired" and can be automatically deleted once those rows have been stored longer than a specified expiration time.
For more information, see Batch delete expired data with Row-Level TTL.
Assign column families
A column family is a group of columns in a table that is stored as a single key-value pair in the underlying key-value store.
When a table is created, all columns are stored as a single column family. This default approach ensures efficient key-value storage and performance in most cases. However, when frequently updated columns are grouped with seldom updated columns, the seldom updated columns are nonetheless rewritten on every update. Especially when the seldom updated columns are large, it's therefore more performant to assign them to a distinct column family.
Unique ID best practices
The best practices for generating unique IDs in a distributed database like CockroachDB are very different than for a legacy single-node database. Traditional approaches for generating unique IDs for legacy single-node databases include:
- Using the
SERIAL
pseudo-type for a column to generate random unique IDs. This can result in a performance bottleneck because IDs generated temporally near each other have similar values and are located physically near each other in a table's storage. - Generating monotonically increasing
INT
IDs by using transactions with roundtripSELECT
s, e.g.,INSERT INTO tbl (id, …) VALUES ((SELECT max(id)+1 FROM tbl), …)
. This has a very high performance cost since it makes allINSERT
transactions wait for their turn to insert the next ID. You should only do this if your application really does require strict ID ordering. In some cases, using change data capture (CDC) can help avoid the requirement for strict ID ordering. If you can avoid the requirement for strict ID ordering, you can use one of the higher-performance ID strategies outlined in the following sections.
The preceding approaches are likely to create hot spots for both reads and writes in CockroachDB. We discourage indexing on sequential keys. If a table must be indexed on sequential keys, use hash-sharded indexes. Hash-sharded indexes distribute sequential traffic uniformly across ranges, eliminating single-range hot spots and improving write performance on sequentially-keyed indexes at a small cost to read performance.
To create unique and non-sequential IDs, we recommend the following approaches (listed in order from best to worst performance):
Approach | Pros | Cons |
---|---|---|
1. Use multi-column primary keys | Potentially fastest, if done right | Complex, requires up-front design and testing to ensure performance |
2. Use functions to generate unique IDs | Good performance; spreads load well; easy choice | May leave some performance on the table; requires other columns to be useful in queries |
3. Use INSERT with the RETURNING clause |
Easy to query against; familiar design | Slower performance than the other options; higher chance of transaction contention |
Use multi-column primary keys
A well-designed multi-column primary key can yield even better performance than a UUID primary key, but it requires more up-front schema design work. To get the best performance, ensure that any monotonically increasing field is located after the first column of the primary key. When done right, such a composite primary key should result in:
- Enough randomness in your primary key to spread the table data / query load relatively evenly across the cluster, which will avoid hot spots. By "enough randomness" we mean that the prefix of the primary key should be relatively uniformly distributed over its domain. Its domain should have at least as many elements as you have nodes.
- A monotonically increasing column that is part of the primary key (and thus indexed) which is also useful in your queries.
For example, consider a social media website. Social media posts are written by users, and on login the user's last 10 posts are displayed. A good choice for a primary key might be (username, post_timestamp)
. For example:
> CREATE TABLE posts (
username STRING,
post_timestamp TIMESTAMP,
post_id INT,
post_content STRING,
CONSTRAINT posts_pk PRIMARY KEY(username, post_timestamp)
);
This would make the following query efficient.
> SELECT * FROM posts
WHERE username = 'alyssa'
ORDER BY post_timestamp DESC
LIMIT 10;
username | post_timestamp | post_id | post_content
+----------+---------------------------+---------+--------------+
alyssa | 2019-07-31 18:01:00+00:00 | ... | ...
alyssa | 2019-07-30 10:22:00+00:00 | ... | ...
alyssa | 2019-07-30 09:12:00+00:00 | ... | ...
alyssa | 2019-07-29 13:48:00+00:00 | ... | ...
alyssa | 2019-07-29 13:47:00+00:00 | ... | ...
alyssa | 2019-07-29 13:46:00+00:00 | ... | ...
alyssa | 2019-07-29 13:43:00+00:00 | ... | ...
...
Time: 924µs
To see why, let's look at the EXPLAIN
output. It shows that the query is fast because it does a point lookup on the indexed column username
(as shown by the line spans | /"alyssa"-...
). Furthermore, the column post_timestamp
is already in an index, and sorted (since it's a monotonically increasing part of the primary key).
> EXPLAIN (VERBOSE)
SELECT * FROM posts
WHERE username = 'alyssa'
ORDER BY post_timestamp DESC
LIMIT 10;
info
----------------------------------------------------------------
distribution: local
vectorized: true
• revscan
columns: (username, post_timestamp, post_id, post_content)
ordering: -post_timestamp
estimated row count: 10 (missing stats)
table: posts@posts_pk
spans: /"alyssa"-/"alyssa"/PrefixEnd
limit: 10
(10 rows)
Time: 1ms total (execution 1ms / network 0ms)
Note that the above query also follows the indexing best practice of indexing all columns in the WHERE
clause.
Use functions to generate unique IDs
To auto-generate unique row identifiers, you can use the gen_random_uuid()
, uuid_v4()
, or unique_rowid()
functions.
To use the UUID
column with the gen_random_uuid()
function as the default value:
CREATE TABLE users (
id UUID NOT NULL DEFAULT gen_random_uuid(),
city STRING NOT NULL,
name STRING NULL,
address STRING NULL,
credit_card STRING NULL,
CONSTRAINT "primary" PRIMARY KEY (city ASC, id ASC),
FAMILY "primary" (id, city, name, address, credit_card)
);
INSERT INTO users (name, city) VALUES ('Petee', 'new york'), ('Eric', 'seattle'), ('Dan', 'seattle');
SELECT * FROM users;
id | city | name | address | credit_card
+--------------------------------------+----------+-------+---------+-------------+
cf8ee4e2-cd74-449a-b6e6-a0fb2017baa4 | new york | Petee | NULL | NULL
2382564e-702f-42d9-a139-b6df535ae00a | seattle | Eric | NULL | NULL
7d27e40b-263a-4891-b29b-d59135e55650 | seattle | Dan | NULL | NULL
(3 rows)
Alternatively, you can use the BYTES
column with the uuid_v4()
function as the default value:
CREATE TABLE users2 (
id BYTES DEFAULT uuid_v4(),
city STRING NOT NULL,
name STRING NULL,
address STRING NULL,
credit_card STRING NULL,
CONSTRAINT "primary" PRIMARY KEY (city ASC, id ASC),
FAMILY "primary" (id, city, name, address, credit_card)
);
INSERT INTO users2 (name, city) VALUES ('Anna', 'new york'), ('Jonah', 'seattle'), ('Terry', 'chicago');
SELECT * FROM users;
id | city | name | address | credit_card
+------------------------------------------------+----------+-------+---------+-------------+
4\244\277\323/\261M\007\213\275*\0060\346\025z | chicago | Terry | NULL | NULL
\273*t=u.F\010\274f/}\313\332\373a | new york | Anna | NULL | NULL
\004\\\364nP\024L)\252\364\222r$\274O0 | seattle | Jonah | NULL | NULL
(3 rows)
In either case, generated IDs will be 128-bit, sufficiently large to generate unique values. Once the table grows beyond a single key-value range's default size, new IDs will be scattered across all of the table's ranges and, therefore, likely across different nodes. This means that multiple nodes will share in the load.
This approach has the disadvantage of creating a primary key that may not be useful in a query directly, which can require a join with another table or a secondary index.
If it is important for generated IDs to be stored in the same key-value range, you can use an integer type with the unique_rowid()
function as the default value, either explicitly or via the SERIAL
pseudo-type:
CREATE TABLE users3 (
id INT DEFAULT unique_rowid(),
city STRING NOT NULL,
name STRING NULL,
address STRING NULL,
credit_card STRING NULL,
CONSTRAINT "primary" PRIMARY KEY (city ASC, id ASC),
FAMILY "primary" (id, city, name, address, credit_card)
);
INSERT INTO users3 (name, city) VALUES ('Blake', 'chicago'), ('Hannah', 'seattle'), ('Bobby', 'seattle');
SELECT * FROM users3;
id | city | name | address | credit_card
+--------------------+---------+--------+---------+-------------+
469048192112197633 | chicago | Blake | NULL | NULL
469048192112263169 | seattle | Hannah | NULL | NULL
469048192112295937 | seattle | Bobby | NULL | NULL
(3 rows)
Upon insert or upsert, the unique_rowid()
function generates a default value from the timestamp and ID of the node executing the insert. Such time-ordered values are likely to be globally unique except in cases where a very large number of IDs (100,000+) are generated per node per second. Also, there can be gaps and the order is not completely guaranteed.
To understand the differences between the UUID
and unique_rowid()
options, see the SQL FAQs. For further background on UUIDs, see What is a UUID, and Why Should You Care?.
Use INSERT
with the RETURNING
clause to generate unique IDs
If something prevents you from using multi-column primary keys or UUID
s to generate unique IDs, you might resort to using INSERT
s with SELECT
s to return IDs. Instead, use the RETURNING
clause with the INSERT
statement as shown below for improved performance.
Generate monotonically-increasing unique IDs
Suppose the table schema is as follows:
> CREATE TABLE X (
ID1 INT,
ID2 INT,
ID3 INT DEFAULT 1,
PRIMARY KEY (ID1,ID2)
);
The common approach would be to use a transaction with an INSERT
followed by a SELECT
:
> BEGIN;
> INSERT INTO X VALUES (1,1,1)
ON CONFLICT (ID1,ID2)
DO UPDATE SET ID3=X.ID3+1;
> SELECT * FROM X WHERE ID1=1 AND ID2=1;
> COMMIT;
However, the performance best practice is to use a RETURNING
clause with INSERT
instead of the transaction:
> INSERT INTO X VALUES (1,1,1),(2,2,2),(3,3,3)
ON CONFLICT (ID1,ID2)
DO UPDATE SET ID3=X.ID3 + 1
RETURNING ID1,ID2,ID3;
Generate random unique IDs
Suppose the table schema is as follows:
> CREATE TABLE X (
ID1 INT,
ID2 INT,
ID3 INT DEFAULT unique_rowid(),
PRIMARY KEY (ID1,ID2)
);
The common approach to generate random Unique IDs is a transaction using a SELECT
statement:
> BEGIN;
> INSERT INTO X VALUES (1,1);
> SELECT * FROM X WHERE ID1=1 AND ID2=1;
> COMMIT;
However, the performance best practice is to use a RETURNING
clause with INSERT
instead of the transaction:
> INSERT INTO X VALUES (1,1),(2,2),(3,3)
RETURNING ID1,ID2,ID3;
Secondary index best practices
See Secondary Index Best Practices.
Join best practices
See Join Performance Best Practices.
Subquery best practices
See Subquery Performance Best Practices.
Authorization best practices
See Authorization Best Practices.
Table scan best practices
Avoid SELECT *
for large tables
For large tables, avoid table scans (that is, reading the entire table data) whenever possible. Instead, define the required fields in a SELECT
statement.
For example, suppose the table schema is as follows:
> CREATE TABLE accounts (
id INT,
customer STRING,
address STRING,
balance INT
nominee STRING
);
Now if we want to find the account balances of all customers, an inefficient table scan would be:
> SELECT * FROM ACCOUNTS;
This query retrieves all data stored in the table. A more efficient query would be:
> SELECT CUSTOMER, BALANCE FROM ACCOUNTS;
This query returns the account balances of the customers.
Avoid SELECT DISTINCT
for large tables
SELECT DISTINCT
allows you to obtain unique entries from a query by removing duplicate entries. However, SELECT DISTINCT
is computationally expensive. As a performance best practice, use SELECT
with the WHERE
clause instead.
Use secondary indexes to optimize queries
See Statement Tuning with EXPLAIN
.
Use AS OF SYSTEM TIME
to decrease conflicts with long-running queries
If you have long-running queries (such as analytics queries that perform full table scans) that can tolerate slightly out-of-date reads, consider using the ... AS OF SYSTEM TIME
clause. Using this, your query returns data as it appeared at a distinct point in the past and will not cause conflicts with other concurrent transactions, which can increase your application's performance.
However, because AS OF SYSTEM TIME
returns historical data, your reads might be stale.
Prevent the optimizer from planning full scans
To avoid overloading production clusters, there are several ways to prevent the cost-based-optimizer from generating query plans with full table and index scans.
Use index hints to prevent full scans on tables
To prevent the optimizer from planning a full scan for a specific table, specify the
NO_FULL_SCAN
index hint. For example:SELECT * FROM table_name@{NO_FULL_SCAN};
To prevent a full scan of a partial index for a specific table, you must specify
NO_FULL_SCAN
in combination with the index name usingFORCE_INDEX
. For example:SELECT * FROM table_name@{FORCE_INDEX=index_name,NO_FULL_SCAN} WHERE b > 0;
This forces a constrained scan of the partial index. If a constrained scan of the partial index is not possible, an error is returned.
Disallow query plans that use full scans
When the disallow_full_table_scans
session setting is enabled, the optimizer will not plan full table or index scans on "large" tables (i.e., those with more rows than large_full_scan_rows
).
At the cluster level, set
disallow_full_table_scans
for some or all users and roles. For example:ALTER ROLE ALL SET disallow_full_table_scans = true;
At the application level, add
disallow_full_table_scans
to the connection string using theoptions
parameter.
If you disable full scans, you can set the large_full_scan_rows
session variable to specify the maximum table size allowed for a full scan. If no alternative plan is possible, the optimizer will return an error.
If you disable full scans, and you provide an index hint, the optimizer will try to avoid a full scan while also respecting the index hint. If this is not possible, the optimizer will return an error. If you do not provide an index hint and it is not possible to avoid a full scan, the optimizer will return an error, the full scan will be logged, and the sql.guardrails.full_scan_rejected.count
metric will be updated.
Disallow query plans that scan more than a number of rows
When the transaction_rows_read_err
session setting is enabled, the optimizer will not create query plans with scans that exceed the specified row limit. See Disallow transactions from reading or writing many rows.
Disallow transactions from reading or writing many rows
When the
transaction_rows_read_err
session setting is enabled, transactions that read more than the specified number of rows will fail. In addition, the optimizer will not create query plans with scans that exceed the specified row limit. For example, to set a default value for all users at the cluster level:ALTER ROLE ALL SET transaction_rows_read_err = 1000;
When the
transaction_rows_written_err
session setting is enabled, transactions that write more than the specified number of rows will fail. For example, to set a default value for all users at the cluster level:ALTER ROLE ALL SET transaction_rows_written_err = 1000;
To assess the impact of configuring these session settings, use the corresponding session settings transaction_rows_read_log
and transaction_rows_written_log
to log transactions that read or write the specified number of rows. Transactions are logged to the SQL_PERF
channel.
Transaction contention
Transaction contention occurs when the following three conditions are met:
- There are multiple concurrent transactions or statements (sent by multiple clients connected simultaneously to a single CockroachDB cluster).
- They operate on table rows with the same index key values (either on primary keys or secondary indexes).
- At least one of the transactions holds a write intent or exclusive locking read on the data.
By default under SERIALIZABLE
isolation, transactions that operate on the same index key values (specifically, that operate on the same column family for a given index key) are strictly serialized to obey transaction isolation semantics. To maintain this isolation, writing transactions "lock" rows to prevent interactions with concurrent transactions.
Locking reads issued with SELECT ... FOR UPDATE
perform a similar function by placing an exclusive lock on rows, which can cause contention for both SERIALIZABLE
and READ COMMITTED
transactions.
When transactions are experiencing contention, you may observe:
- Delays in query completion. This occurs when multiple transactions are trying to write to the same "locked" data at the same time, making a transaction unable to complete. This is also known as lock contention.
- Transaction retries performed automatically by CockroachDB. This occurs if a transaction cannot be placed into a serializable ordering among all of the currently-executing transactions. This is also called a serializability conflict.
- Transaction retry errors, which are emitted to your client when an automatic retry is not possible or fails. Your application must address transaction retry errors with client-side retry handling.
- Cluster hot spots.
To mitigate these effects, reduce the causes of transaction contention and reduce hot spots. For further background on transaction contention, see What is Database Contention, and Why Should You Care?.
Reduce transaction contention
You can reduce the causes of transaction contention:
Limit the number of affected rows by following optimizing queries (e.g., avoiding full scans, creating secondary indexes, etc.). Not only will transactions run faster, lock fewer rows, and hold locks for a shorter duration, but the chances of read invalidation when the transaction's timestamp is pushed, due to a conflicting write, are decreased because of a smaller read set (i.e., a smaller number of rows read).
Break down larger transactions (e.g., bulk deletes) into smaller ones to have transactions hold locks for a shorter duration. For example, use common table expressions to group multiple clauses together in a single SQL statement. This will also decrease the likelihood of pushed timestamps. For instance, as the size of writes (number of rows written) decreases, the chances of the transaction's timestamp getting bumped by concurrent reads decreases.
Use
SELECT FOR UPDATE
to aggressively lock rows that will later be updated in the transaction. Updates must operate on the most recent version of a row, so a concurrent write to the row will cause a retry error (RETRY_WRITE_TOO_OLD
). Locking early in the transaction forces concurrent writers to block until the transaction is finished, which prevents the retry error. Note that this locks the rows for the duration of the transaction; whether this is tenable will depend on your workload. For more information, see When and why to useSELECT FOR UPDATE
in CockroachDB.Use historical reads (
SELECT ... AS OF SYSTEM TIME
), preferably bounded staleness reads or exact staleness with follower reads when possible to reduce conflicts with other writes. This reduces the likelihood ofRETRY_SERIALIZABLE
errors as fewer writes will happen at the historical timestamp. More specifically, writes' timestamps are less likely to be pushed by historical reads as they would when the read has a higher priority level. Note that if theAS OF SYSTEM TIME
value is below the closed timestamp, the read cannot be invalidated.When replacing values in a row, use
UPSERT
and specify values for all columns in the inserted rows. This will usually have the best performance under contention, compared to combinations ofSELECT
,INSERT
, andUPDATE
.If applicable to your workload, assign column families and separate columns that are frequently read and written into separate columns. Transactions will operate on disjoint column families and reduce the likelihood of conflicts.
As a last resort, consider adjusting the closed timestamp interval using the
kv.closed_timestamp.target_duration
cluster setting to reduce the likelihood of long-running write transactions having their timestamps pushed. This setting should be carefully adjusted if no other mitigations are available because there can be downstream implications (e.g., historical reads, change data capture feeds, statistics collection, handling zone configurations, etc.). For example, a transaction A is forced to refresh (i.e., change its timestamp) due to hitting the maximum closed timestamp interval (closed timestamps enable Follower Reads and Change Data Capture (CDC)). This can happen when transaction A is a long-running transaction, and there is a write by another transaction to data that A has already read.
If you increase the kv.closed_timestamp.target_duration
setting, it means that you are increasing the amount of time by which the data available in Follower Reads and CDC changefeeds lags behind the current state of the cluster. In other words, there is a trade-off here: if you absolutely must execute long-running transactions that execute concurrently with other transactions that are writing to the same data, you may have to settle for longer delays on Follower Reads and/or CDC to avoid frequent serialization errors. The anomaly that would be exhibited if these transactions were not retried is called write skew.
Improve transaction performance by sizing and configuring the cluster
To maximize transaction performance, you'll need to maximize the performance of a single range. To achieve this, you can apply multiple strategies:
- Minimize the network distance between the replicas of a range, possibly using zone configs and partitioning, or the newer Multi-region SQL capabilities.
- Use the fastest storage devices available.
- If the contending transactions operate on different keys within the same range, add more CPU power (more cores) per node. However, if the transactions all operate on the same key, this may not provide an improvement.
Hot spots
A hot spot is any location on the cluster receiving significantly more requests than another. Hot spots are a symptom of resource contention and can create problems as requests increase, including excessive transaction contention.
Hot spots occur when an imbalanced workload access pattern causes significantly more reads and writes on a subset of data. For example:
- Transactions operate on the same range but different index keys. These operations are limited by the overall hardware capacity of the range leaseholder node.
- A range is indexed on a column of data that is sequential in nature (e.g., an ordered sequence, or a series of increasing, non-repeating
TIMESTAMP
s), such that all incoming writes to the range will be the last (or first) item in the index and appended to the end of the range. Because the system is unable to find a split point in the range that evenly divides the traffic, the range cannot benefit from load-based splitting. This creates a hot spot at the single range.
Read hot spots can occur if you perform lots of scans of a portion of a table index or a single key.
Reduce hot spots
Use index keys with a random distribution of values, so that transactions over different rows are more likely to operate on separate data ranges. See the SQL FAQs on row IDs for suggestions.
Place parts of the records that are modified by different transactions in different tables. That is, increase normalization. However, there are benefits and drawbacks to increasing normalization.
Benefits:
- Allows separate transactions to modify related underlying data without causing contention.
- Can improve performance for read-heavy workloads.
Drawbacks:
- More complex data model.
- Increases the chance of data inconsistency.
- Increases data redundancy.
- Can degrade performance for write-heavy workloads.
If the application strictly requires operating on very few different index keys, consider using
ALTER ... SPLIT AT
so that each index key can be served by a separate group of nodes in the cluster.If you are working with a table that must be indexed on sequential keys, consider using hash-sharded indexes. For details about the mechanics and performance improvements of hash-sharded indexes in CockroachDB, see the blog post Hash Sharded Indexes Unlock Linear Scaling for Sequential Workloads. As part of this, we recommend doing thorough performance testing with and without hash-sharded indexes to see which works best for your application.
To avoid read hot spots:
- Increase data distribution, which will allow for more ranges. The hot spot exists because the data being accessed is all co-located in one range.
Increase load balancing across more nodes in the same range. Most transactional reads must go to the leaseholder in CockroachDB, which means that opportunities for load balancing over replicas are minimal.
However, the following features do permit load balancing over replicas:
- Global tables.
- Follower reads (both the bounded staleness and the exact staleness kinds).
In these cases, more replicas will help, up to the number of nodes in the cluster.
For a demo on hot spot reduction, watch the following video:
See also
- If you aren't sure whether SQL query performance needs to be improved on your cluster, see Identify slow queries.
- For deployment and data location techniques to minimize network latency in multi-region clusters, see Topology Patterns.
- To read more about SQL best practices, see our SQL Performance Best Practices blog post.