This page summarizes how to configure and monitor your cluster to prevent issues commonly encountered with:
CPU
In our sizing and production guidance, 1 vCPU is considered equivalent to 1 core in the underlying hardware platform.
Issues with CPU most commonly arise when there is insufficient CPU to support the scale of the workload.
CPU planning
Provision enough CPU to support your operational and workload concurrency requirements:
Category | Recommendations |
---|---|
CPU |
|
CPU monitoring
Monitor possible signs of CPU starvation:
Parameter | Description |
---|---|
Service latency | The time between when the cluster receives a query and finishes executing the query. |
CPU usage | The CPU consumption by the CockroachDB node process. |
Workload concurrency | The number of SQL statements being executed on the cluster at the same time. |
LSM health | The health of the persistent stores. |
Node health | The operational status of the nodes. |
Service latency
Degradation in SQL response time is the most common symptom of CPU starvation. It can also be a symptom of insufficient disk I/O.
- The Service Latency: SQL Statements, 99th percentile and Service Latency: SQL Statements, 90th percentile graphs on the SQL dashboard show the time in nanoseconds between when the cluster receives a query and finishes executing the query. This time does not include returning results to the client.
If latencies are consistently high, check for:
- High CPU usage.
- An I/O bottleneck.
CPU usage
Compaction on the storage layer uses CPU to run concurrent worker threads.
The CPU Percent graph on the Hardware and Overload dashboards shows the CPU consumption by the CockroachDB process, and excludes other processes on the node.
Expected values for a healthy cluster: CPU utilized by CockroachDB should not persistently exceed 80%. Because this metric does not reflect system CPU usage, values above 80% suggest that actual CPU utilization is nearing 100%.
If CPU usage is high, check whether workload concurrency is exceeding CPU resources.
Workload concurrency
The number of concurrent active SQL statements should be proportionate to your provisioned CPU.
The SQL Statements graph on the Overview and SQL dashboards shows the 10-second average of
SELECT
,UPDATE
,INSERT
, andDELETE
statements being executed per second on the cluster or node. The latest QPS value for the cluster is also displayed with the Queries per second counter on the Metrics page.Expected values for a healthy cluster: At any time, the total number of connections actively executing SQL statements should not exceed 4 times the number of vCPUs in the cluster. You can find them in the Active Executions view in the DB Console or Cloud Console. You can find the number of open connections in the DB Console or Cloud Console. For more details on configuring connection pools, see Size connection pools.
If workload concurrency exceeds CPU resources, you will observe:
- High CPU usage.
- Degradation in SQL response time.
- Over time, an unhealthy LSM and cluster instability.
To prevent issues with workload concurrency, provision sufficient CPU and use connection pooling for the workload.
LSM health
Issues at the storage layer, including an inverted LSM and high read amplification, can be observed when compaction falls behind due to insufficient CPU or excessively high recovery and rebalance rates.
The LSM L0 Health graph on the Overload dashboard shows the health of the persistent stores, which are implemented as log-structured merge (LSM) trees. Level 0 is the highest level of the LSM tree and consists of files containing the latest data written to the Pebble storage engine. For more information about LSM levels and how LSMs work, see Log-structured Merge-trees.
Expected values for a healthy cluster: The number of L0 files should not be in the high thousands. High values indicate heavy write load that is causing accumulation of files in level 0. These files are not being compacted quickly enough to lower levels, resulting in a misshapen LSM.
Note:An unhealthy LSM can be caused by other factors, including under-provisioned storage. To correlate this symptom with CPU starvation, check for high CPU usage and excessive workload concurrency.
The Read Amplification graph on the Storage Dashboard shows the average number of disk reads per logical SQL statement, also known as the read amplification factor.
Expected values for a healthy cluster: Read amplification factor should be in the single digits. A value exceeding 50 for 1 hour strongly suggests that the LSM tree has an unhealthy shape.
The
STORAGE
logging channel indicates an unhealthy LSM with the following:- Frequent
compaction
status messages. - High-read-amplification warnings, e.g.,
sstables (read amplification = 54)
.
- Frequent
If compaction has fallen behind and caused an inverted LSM, throttle your workload concurrency to allow compaction to catch up and restore a healthy LSM shape. The total number of workload connections across all connection pools should not exceed 4 times the number of vCPUs in the cluster by a large amount. If a node is severely impacted, you can start a new node and then decommission the problematic node.
Node health
If issues at the storage layer remain unresolved, affected nodes will miss their liveness heartbeats, causing the cluster to lose nodes and eventually become unresponsive.
The Node status on the Cluster Overview page indicates whether nodes are online (
LIVE
) or have crashed (SUSPECT
orDEAD
).The
/health
endpoint of the Cluster API returns a500
error when a node is unhealthy.A Prometheus alert can notify when a node has been down for 15 minutes or more.
If nodes have shut down, this can also be caused by insufficient storage capacity.
If the cluster becomes unavailable, the DB Console and Cluster API will also become unavailable. You can continue to monitor the cluster via the Prometheus endpoint and logs.
Memory
CockroachDB is resilient to node crashes. However, frequent node restarts caused by out-of-memory (OOM) crashes can impact cluster stability and performance.
Memory planning
Provision enough memory and allocate an appropriate portion for data caching:
Provision at least 4 GiB of RAM per vCPU.
Disable Linux memory swapping. Over-allocating memory on production machines can lead to unexpected performance issues when pages have to be read back into memory.
The default value for
--cache
is 128 MiB. For production deployments, set--cache
to25%
or higher. To determine appropriate settings for--cache
and--max-sql-memory
, use the following formula:(2 * --max-sql-memory) + --cache <= 80% of system RAM
To help guard against OOM events, CockroachDB sets a soft memory limit using mechanisms in Go. Depending on your hardware and workload, you may not need to manually tune --max-sql-memory
.
Test the configuration with a reasonable workload before deploying it to production.
On startup, if CockroachDB detects that --max-sql-memory
or --cache
are set too aggressively, a warning is logged.
For additional memory recommendations, refer to Recommended Production Settings: Memory and Recommended Production Setting: Cache and SQL memory size.
Memory monitoring
Monitor memory usage and node behavior for OOM errors:
Metric or event | Description |
---|---|
Node process restarts | Nodes restarting after crashes |
SQL memory usage | The memory allocated to the SQL layer |
Database memory usage | The memory in use by CockroachDB processes |
Node process restarts
CockroachDB attempts to restart nodes after they crash. Nodes that frequently restart following an abrupt process exit may point to an underlying memory issue.
The Node status on the Cluster Overview page indicates whether nodes are online (
LIVE
) or have crashed (SUSPECT
orDEAD
).When deploying on Kubernetes, the
kubectl get pods
output contains aRESTARTS
column that tracks the number of restarts for each CockroachDB pod.The
OPS
logging channel will record anode_restart
event whenever a node rejoins the cluster after being offline.A Prometheus alert can notify when a node has restarted more than once in the last 10 minutes.
Verify OOM errors
If you observe nodes frequently restarting, confirm that the crashes are caused by OOM errors:
Monitor
dmesg
to determine if a node crashed because it ran out of memory:$ sudo dmesg | grep -iC 3 "cockroach"
The following output indicates the node crashed due to insufficient memory:
$ host kernel: Out of Memory: Killed process <process_id> (cockroach).
When deploying on Kubernetes, run
kubectl logs {pod-name}
and look for OOM errors in the log messages.
To prevent OOM crashes, provision sufficient memory. If all CockroachDB machines are provisioned and configured correctly, either run the CockroachDB process on another node with sufficient memory, or reduce the memory allocated to CockroachDB.
If you confirm that nodes are crashing due to OOM errors, also check whether SQL queries may be responsible.
SQL memory usage
An untuned SQL query can consume significant resources and impact the performance of other workloads.
The SQL Memory graph on the SQL dashboard shows the current amount of memory in KiB allocated to the SQL layer.
Expected values for a healthy cluster: This value should not exceed the
--max-sql-memory
size. A healthy threshold is 75% of allocated--max-sql-memory
.The "active query dump", enabled by default with the
diagnostics.active_query_dumps.enabled
cluster setting, is a record of anonymized active queries that is written to disk when a node is detected to be under memory pressure.You can use the active query dump to correlate specific queries to OOM crashes. Active query dumps have the filename
activequeryprof.{date-and-time}.csv
and are found in theheap_profiler
directory in the configured logging directory. They are also included when runningcockroach debug zip
.A
SHOW STATEMENTS
statement can identify long-running queries on the cluster that may be consuming excessive memory.A
memory budget exceeded
error in the logs indicates that--max-sql-memory
, the memory allocated to the SQL layer, was exceeded by the operation referenced in the error. For guidance on resolving this issue, see Common Errors.
If you find queries that are consuming too much memory, cancel the queries to free up memory usage. For information on optimizing query performance, see SQL Performance Best Practices.
We strongly recommend adding size limits to all indexed columns, which includes columns in primary keys.
Values exceeding 1 MiB can lead to storage layer write amplification and cause significant performance degradation or even crashes due to OOMs (out of memory errors).
To add a size limit using CREATE TABLE
:
CREATE TABLE name (first STRING(100), last STRING(100));
To add a size limit using ALTER TABLE ... ALTER COLUMN
:
SET enable_experimental_alter_column_type_general = true;
ALTER TABLE name ALTER first TYPE STRING(99);
Database memory usage
CockroachDB memory usage includes both accounted memory, such as the amount allocated to --cache
and --max-sql-memory
; and unaccounted memory, such as uncollected Go garbage and process overhead.
The Memory Usage graph on the Runtime dashboard shows the total memory in use by CockroachDB processes. The RSS (resident set size) metric represents actual CockroachDB memory usage from the OS/Linux/pod point of view. The Go and CGo metrics represent memory allocation and total usage from a CockroachDB point of view.
Expected values for a healthy cluster: Go Allocated will depend on workload but should not exceed
--max-sql-memory
by more than 100%. CGo Allocated should not exceed the--cache
size and CGo Total should not exceed the--cache
size by more than 15%.
For more context on acceptable memory usage, see Suspected memory leak.
Storage and disk I/O
The cluster will underperform if storage is not provisioned or configured correctly. This can lead to further issues such as disk stalls and node shutdown.
Storage and disk planning
Provision enough storage capacity for CockroachDB data, and configure your volumes to maximize disk I/O:
Category | Recommendations |
---|---|
Storage |
|
Disk I/O |
|
Storage and disk monitoring
Monitor storage capacity and disk performance:
Metric or event | Description |
---|---|
Storage capacity | The available and used disk capacity in the CockroachDB store. |
Disk IOPS | The I/O requests per second. |
Node heartbeat latency | The time between node liveness heartbeats. |
Command commit latency | The speed at which Raft commands are being committed by nodes in the cluster. |
Storage capacity
CockroachDB requires disk space in order to accept writes and report node liveness. When a node runs out of disk space, it shuts down and cannot be restarted until space is freed up.
The Capacity graph on the Overview and Storage dashboards shows the available and used disk capacity in the CockroachDB store.
Expected values for a healthy cluster: Used capacity should not persistently exceed 80% of the total capacity.
A Prometheus alert can notify when a node has less than 15% of free space remaining.
Ensure that you provision sufficient storage. If storage is correctly provisioned and is running low, CockroachDB automatically creates an emergency ballast file that can free up space. For details, see Disks filling up.
For instructions on how to free up disk space as quickly as possible after deleting data, see How can I free up disk space quickly?
Disk IOPS
Insufficient disk I/O can cause poor SQL performance and potentially disk stalls.
The Disk Ops In Progress graph on the Hardware dashboard shows the number of disk reads and writes in queue.
Expected values for a healthy cluster: This value should be 0 or single-digit values for short periods of time. If the values persist in double digits, you may have an I/O bottleneck.
The Linux tool
iostat
(part ofsysstat
) can be used to monitor IOPS. In the device status output,avgqu-sz
corresponds to the Disk Ops In Progress metric. If service times persist in double digits on any node, this means that your storage device is saturated and is likely under-provisioned or misconfigured.
Ensure that you properly configure storage to prevent I/O bottlenecks. Afterward, if service times consistently exceed 1-5 ms, you can add more devices or expand the cluster to reduce the disk latency.
With insufficient disk I/O, you may also see:
- Degradation in SQL response time.
- An unhealthy LSM.
Node heartbeat latency
Because each node needs to update a liveness record on disk, maxing out disk bandwidth can cause liveness heartbeats to be missed.
The Node Heartbeat Latency: 99th percentile and Node Heartbeat Latency: 90th percentile graphs on the Distributed Dashboard show the time elapsed between node liveness heartbeats.
Expected values for a healthy cluster: Less than 100ms in addition to the network latency between nodes in the cluster.
Command commit latency
The Command Commit Latency: 50th percentile and Command Commit Latency: 99th percentile graphs on the Storage dashboard show how quickly Raft commands are being committed by nodes in the cluster. This is a good signal of I/O load.
Expected values for a healthy cluster: On SSDs (strongly recommended), this should be between 1 and 100 milliseconds. On HDDs, this should be no more than 1 second.