SQL Performance Best Practices

Warning:
As of April 12, 2019, CockroachDB v1.1 is no longer supported. For more details, refer to the Release Support Policy.

This page provides best practices for optimizing SQL performance in CockroachDB.

Multi-Row DML Best Practices

Use Multi-Row DML instead of Multiple Single-Row DMLs

For INSERT, UPSERT, and DELETE statements, a single multi-row DML is faster than multiple single-row DMLs. Whenever possible, use multi-row DML instead of multiple single-row DMLs.
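
For example, assuming a hypothetical accounts table, the following three single-row DELETEs:

> DELETE FROM accounts WHERE id = 1;

> DELETE FROM accounts WHERE id = 2;

> DELETE FROM accounts WHERE id = 3;

are better expressed as a single multi-row DELETE:

> DELETE FROM accounts WHERE id IN (1, 2, 3);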

Use TRUNCATE instead of DELETE to Delete All Rows in a Table

The TRUNCATE statement removes all rows from a table by dropping the table and recreating a new table with the same name. This performs better than using DELETE, which performs multiple transactions to delete all rows.
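
For example, to remove all rows from a hypothetical accounts table, prefer:

> TRUNCATE accounts;

over:

> DELETE FROM accounts;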

Bulk Insert Best Practices

Use Multi-Row INSERT Statements for Bulk Inserts into Existing Tables

To bulk-insert data into an existing table, batch multiple rows in one multi-row INSERT statement and do not include the INSERT statements within a transaction. Experimentally determine the optimal batch size for your application by monitoring the performance for different batch sizes (10 rows, 100 rows, 1000 rows). For more information, see Insert Multiple Rows.
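
For example, a sketch of one batched INSERT into a hypothetical accounts table (determine real batch sizes experimentally, as described above):

> INSERT INTO accounts (id, customer) VALUES
    (1, 'a'),
    (2, 'b'),
    (3, 'c');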

Use IMPORT instead of INSERT for Bulk Inserts into New Tables

To bulk-insert data into a brand new table, the (experimental) IMPORT statement performs better than INSERT.
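
A minimal sketch of an IMPORT, assuming a CSV file reachable by the cluster; the file URL and the temp scratch location are placeholders, and the exact options required may vary by version:

> IMPORT TABLE customers (id INT PRIMARY KEY, name STRING)
    CSV DATA ('azure://acme-co/customers.csv')
    WITH temp = 'azure://acme-co/temp/';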

Execute Statements in Parallel

CockroachDB supports parallel execution of independent INSERT, UPDATE, UPSERT, and DELETE statements within a single transaction. Executing statements in parallel helps reduce aggregate latency and improve performance. To execute statements in parallel, append the RETURNING NOTHING clause to the statements in a transaction. For more information, see Parallel Statement Execution.
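
For example, in this sketch the two INSERTs into hypothetical tables return no result rows and can therefore be executed in parallel within the transaction:

> BEGIN;

> INSERT INTO orders (id) VALUES (1) RETURNING NOTHING;

> INSERT INTO order_items (order_id) VALUES (1) RETURNING NOTHING;

> COMMIT;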

Assign Column Families

A column family is a group of columns in a table that is stored as a single key-value pair in the underlying key-value store.

When a table is created, all columns are stored as a single column family. This default approach ensures efficient key-value storage and performance in most cases. However, when frequently updated columns are grouped with seldom updated columns, the seldom updated columns are nonetheless rewritten on every update. Especially when the seldom updated columns are large, it's therefore more performant to assign them to a distinct column family.
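
For example, a sketch that assigns a large, seldom-updated column to its own family (the table and column names are hypothetical):

> CREATE TABLE users (
    id INT PRIMARY KEY,
    last_seen TIMESTAMP,
    bio STRING,
    FAMILY frequent (id, last_seen),
    FAMILY rare (bio)
    );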

Interleave Tables

Interleaving tables improves query performance by optimizing the key-value structure of closely related tables, attempting to keep data on the same key-value range if it is likely to be read and written together. This is particularly helpful if the tables are frequently joined on the columns that define the interleaving relationship.
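
For example, a sketch that interleaves a hypothetical orders table into its parent customers table; the child's primary key must be prefixed by the parent's primary key:

> CREATE TABLE customers (id INT PRIMARY KEY, name STRING);

> CREATE TABLE orders (
    customer_id INT,
    id INT,
    total DECIMAL,
    PRIMARY KEY (customer_id, id)
    ) INTERLEAVE IN PARENT customers (customer_id);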

Unique ID Best Practices

The common approaches for generating unique IDs are the following:

  • Monotonically increase INT IDs by using transactions with roundtrip SELECTs
  • Use the SERIAL data type to generate unique IDs

The first approach does not take advantage of the parallelization possible in a distributed database like CockroachDB. The bottleneck with the second approach is that IDs generated close together in time have similar values and are therefore located physically near each other in the table's storage. This can cause a hotspot for reads and writes in a table.

The best practice in CockroachDB is to generate unique IDs using the UUID type, which generates random unique IDs in parallel, thus improving performance.

Use UUID to Generate Unique IDs

To auto-generate unique row IDs, use the UUID data type with the gen_random_uuid() function as the default value:

> CREATE TABLE t1 (id UUID PRIMARY KEY DEFAULT gen_random_uuid(), name STRING);

> INSERT INTO t1 (name) VALUES ('a'), ('b'), ('c');

> SELECT * FROM t1;
+--------------------------------------+------+
|                  id                  | name |
+--------------------------------------+------+
| 60853a85-681d-4620-9677-946bbfdc8fbc | c    |
| 77c9bc2e-76a5-4ebc-80c3-7ad3159466a1 | b    |
| bd3a56e1-c75e-476c-b221-0da9d74d66eb | a    |
+--------------------------------------+------+
(3 rows)

Alternatively, you can use the BYTES data type with the uuid_v4() function as the default value instead:

> CREATE TABLE t2 (id BYTES PRIMARY KEY DEFAULT uuid_v4(), name STRING);

> INSERT INTO t2 (name) VALUES ('a'), ('b'), ('c');

> SELECT * FROM t2;
+---------------------------------------------------+------+
|                        id                         | name |
+---------------------------------------------------+------+
| "\x9b\x10\xdc\x11\x9a\x9cGB\xbd\x8d\t\x8c\xf6@vP" | a    |
| "\xd9s\xd7\x13\n_L*\xb0\x87c\xb6d\xe1\xd8@"       | c    |
| "\uac74\x1dd@B\x97\xac\x04N&\x9eBg\x86"           | b    |
+---------------------------------------------------+------+
(3 rows)

In either case, generated IDs will be 128-bit, large enough for there to be virtually no chance of generating non-unique values. Also, once the table grows beyond a single key-value range (more than 64MB by default), new IDs will be scattered across all of the table's ranges and, therefore, likely across different nodes. This means that multiple nodes will share in the load.

If it's important for generated IDs to be stored in the same key-value range, you can use the SERIAL data type, which is an alias for INT with the unique_rowid() function as the default value:

> CREATE TABLE t3 (id SERIAL PRIMARY KEY, name STRING);

> INSERT INTO t3 (name) VALUES ('a'), ('b'), ('c');

> SELECT * FROM t3;
+--------------------+------+
|         id         | name |
+--------------------+------+
| 293807573840855041 | a    |
| 293807573840887809 | b    |
| 293807573840920577 | c    |
+--------------------+------+
(3 rows)

On insert, the unique_rowid() function generates a default value from the timestamp and ID of the node executing the insert. Such time-ordered values are likely to be globally unique except in cases where a very large number of IDs (100,000+) are generated per node per second.

Use INSERT with the RETURNING Clause to Generate Unique IDs

If something prevents you from using UUID to generate unique IDs, you might resort to using INSERTs with SELECTs to return IDs. Instead, use the RETURNING clause with the INSERT statement for improved performance.

Generate Monotonically-Increasing Unique IDs

Suppose the table schema is as follows:

> CREATE TABLE X (
    ID1 INT,
    ID2 INT,
    ID3 INT DEFAULT 1,
    PRIMARY KEY (ID1,ID2)
    );

The common approach would be to use a transaction with an INSERT followed by a SELECT:

> BEGIN;

> INSERT INTO X VALUES (1,1,1)
    ON CONFLICT (ID1,ID2)
    DO UPDATE SET ID3=X.ID3+1;

> SELECT * FROM X WHERE ID1=1 AND ID2=1;

> COMMIT;

However, the performance best practice is to use a RETURNING clause with INSERT instead of the transaction:

> INSERT INTO X VALUES (1,1,1),(2,2,2),(3,3,3)
    ON CONFLICT (ID1,ID2)
    DO UPDATE SET ID3=X.ID3 + 1
    RETURNING ID1,ID2,ID3;

Generate Random Unique IDs

Suppose the table schema is as follows:

> CREATE TABLE X (
    ID1 INT,
    ID2 INT,
    ID3 SERIAL,
    PRIMARY KEY (ID1,ID2)
    );

The common approach to generating random unique IDs is a transaction with an INSERT followed by a SELECT to read back the generated ID:

> BEGIN;

> INSERT INTO X VALUES (1,1);

> SELECT * FROM X WHERE ID1=1 AND ID2=1;

> COMMIT;

However, the performance best practice is to use a RETURNING clause with INSERT instead of the transaction:

> INSERT INTO X VALUES (1,1),(2,2),(3,3)
    RETURNING ID1,ID2,ID3;

Indexes Best Practices

Use Secondary Indexes

You can use secondary indexes to improve the performance of queries using columns not in a table's primary key. You can create them:

  • At the same time as the table with the INDEX clause of CREATE TABLE. In addition to explicitly defined indexes, CockroachDB automatically creates secondary indexes for columns with the Unique constraint.
  • For existing tables with CREATE INDEX.
  • By applying the Unique constraint to columns with ALTER TABLE, which automatically creates an index of the constrained columns.

To create the most useful secondary indexes, check out our best practices.
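
For example, to add a hypothetical secondary index on a frequently filtered column of an existing table:

> CREATE INDEX ON accounts (customer);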

Use Indexes for Faster JOINs

CockroachDB supports both merge joins and hash joins. CockroachDB uses merge joins whenever possible because they are computationally cheaper and require less memory than hash joins. However, merge joins are possible only when the tables being joined are indexed on the relevant columns; when this condition is not met, CockroachDB resorts to the slower hash joins.

Why are merge joins faster than hash joins?

Merge joins are computationally less expensive and do not require additional memory. They are performed on the indexed columns of two tables as follows:

  • CockroachDB takes one row from each table and compares the values of the join columns.
  • If the values are equal, CockroachDB returns the rows.
  • If the values are not equal, CockroachDB discards the row with the lower value and repeats the process with the next row from that table, until all rows are processed.

In contrast, hash joins are computationally expensive and require additional memory. They are performed on two tables as follows:

  • CockroachDB creates an in-memory hash table on the smaller table.
  • CockroachDB then uses the hash table and scans the larger table to find matching rows from the smaller table.

Why create indexes to perform merge joins?

A merge join requires both tables to be indexed on the merge columns. If this condition is not met, CockroachDB resorts to the slower hash joins. So when joining two tables, first create indexes on the join columns of both tables and then run the JOIN.
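
A sketch of that order of operations, using hypothetical tables joined on customer columns:

> CREATE INDEX ON orders (customer_id);

> SELECT c.name, o.total
    FROM customers AS c
    JOIN orders AS o ON c.id = o.customer_id;

Here customers.id is already indexed as the primary key, so only orders.customer_id needs a new index for the merge join.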

Also note that merge joins can be used only with distributed query processing.

Drop Unused Indexes

Though indexes improve read performance, they incur an overhead for every write. In some cases, like the use cases discussed above, the tradeoff is worth it. However, if an index is unused, it slows down DML operations. Therefore, drop unused indexes whenever possible.
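
For example, to drop a hypothetical unused index:

> DROP INDEX accounts@accounts_nominee_idx;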

Table Scans Best Practices

Avoid SELECT * for Large Tables

For large tables, avoid table scans (that is, reading the entire table) whenever possible. Instead, specify only the required columns in the SELECT statement.

Example

Suppose the table schema is as follows:

> CREATE TABLE accounts (
    id INT,
    customer STRING,
    address STRING,
    balance INT,
    nominee STRING
    );

Now if we want to find the account balances of all customers, an inefficient table scan would be:

> SELECT * FROM accounts;

This query retrieves all data stored in the table. A more efficient query would be:

> SELECT customer, balance FROM accounts;

This query returns the account balances of the customers.

Avoid SELECT DISTINCT for Large Tables

SELECT DISTINCT allows you to obtain unique entries from a query by removing duplicate entries. However, SELECT DISTINCT is computationally expensive on large tables because every result row must be processed to remove duplicates. As a performance best practice, use SELECT with a WHERE clause that narrows the result instead.
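
For example, rather than de-duplicating results drawn from an entire hypothetical accounts table:

> SELECT DISTINCT customer FROM accounts;

narrow the query to the rows of interest so that little or no de-duplication work is needed:

> SELECT customer FROM accounts WHERE id BETWEEN 1 AND 100;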

