If you're having trouble starting or scaling your cluster, this page will help you troubleshoot the issue.
To use this guide, it's important to understand some of CockroachDB's terminology:
- A **cluster** acts as a single logical database, but is actually made up of many cooperating nodes.
- **Nodes** are single instances of the `cockroach` binary running on a machine. It's possible (though atypical) to have multiple nodes running on a single machine.
Cannot run a single-node CockroachDB cluster
Try running:
$ cockroach start-single-node --insecure --logtostderr
If the process exits prematurely, check for the following:
An existing storage directory
When starting a node, the directory you choose to store the data in also contains metadata identifying the cluster the data came from. This causes conflicts when you've already started a node on the server, have quit `cockroach`, and then tried to start another cluster using the same directory. Because the existing directory's cluster ID doesn't match the new cluster ID, the node cannot start.
Solution: Disassociate the node from the existing directory where you've stored CockroachDB data. For example, you can do either of the following:
Choose a different directory to store the CockroachDB data:
$ cockroach start-single-node --store=<new directory> --insecure
Remove the existing directory and start the node again:
$ rm -r cockroach-data/
$ cockroach start-single-node --insecure --logtostderr
Toolchain incompatibility
The components of the toolchain might have some incompatibilities that need to be resolved. For example, there was previously an incompatibility between Xcode 8.3 and Go 1.8 that caused any Go binaries created with that toolchain combination to crash immediately.
Incompatible CPU
If the `cockroach` process had exit status `132 (SIGILL)`, it attempted to use an instruction that is not supported by your CPU. Non-release builds of CockroachDB may not be able to run on older hardware platforms than the one used to build them. Release builds should run on any x86-64 CPU.
Default ports already in use
Other services may be running on port 26257 or 8080 (CockroachDB's default `--listen-addr` port and `--http-addr` port, respectively). You can either stop those services or start your node with different ports, specified in the `--listen-addr` and `--http-addr` flags.

If you change the port, you will need to include the `--port=<specified port>` flag in each subsequent `cockroach` command or change the `COCKROACH_PORT` environment variable.
Single-node networking issues
Networking issues might prevent the node from communicating with itself on its hostname. You can control the hostname CockroachDB uses with the `--listen-addr` flag.

If you change the host, you will need to include `--host=<specified host>` in each subsequent `cockroach` command.
CockroachDB process hangs when trying to start a node in the background
See Why is my process hanging when I try to start it in the background?
Cannot run SQL statements using built-in SQL client
If the CockroachDB node appeared to start successfully, in a separate terminal run:
$ cockroach sql --insecure -e "show databases"
You should see a list of the built-in databases:
database_name
+---------------+
defaultdb
postgres
system
(3 rows)
If you're not seeing the output above, check for the following:
- A `connection refused` error, which indicates you have not included some flag that you used to start the node. We have additional troubleshooting steps for this error here.
- The node crashed. To ascertain if the node crashed, run `ps | grep cockroach` to look for the `cockroach` process. If you cannot locate the `cockroach` process (i.e., it crashed), file an issue, including the logs from your node and any errors you received.
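A plain `ps | grep cockroach` can match the `grep` command itself. As a small sketch, the bracket trick keeps the search process out of its own results (the process name is the only assumption):

```shell
# The [c] bracket trick keeps the grep command itself out of the results,
# so an empty match really means no cockroach process is running.
if ps aux | grep -q "[c]ockroach"; then
  echo "cockroach process found"
else
  echo "no cockroach process found; the node may have crashed"
fi
```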
Cannot run a multi-node CockroachDB cluster on the same machine
Running multiple nodes on a single host is useful for testing out CockroachDB, but it's not recommended for production deployments. To run a physically distributed cluster in production, see Manual Deployment or Orchestrated Deployment. Also be sure to review the Production Checklist.
If you are trying to run all nodes on the same machine, you might get the following errors:
Store directory already exists
ERROR: could not cleanup temporary directories from record file: could not lock temporary directory /Users/amruta/go/src/github.com/cockroachdb/cockroach/cockroach-data/cockroach-temp301343769, may still be in use: IO error: While lock file: /Users/amruta/go/src/github.com/cockroachdb/cockroach/cockroach-data/cockroach-temp301343769/TEMP_DIR.LOCK: Resource temporarily unavailable
Explanation: When starting a new node on the same machine, the directory you choose to store the data in also contains metadata identifying the cluster the data came from. This causes conflicts when you've already started a node on the server and then tried to start another cluster using the same directory.
Solution: Choose a different directory to store the CockroachDB data.
Port already in use
ERROR: cockroach server exited with error: consider changing the port via --listen-addr: listen tcp 127.0.0.1:26257: bind: address already in use
Solution: Change the `--port` and `--http-port` flags for each new node that you want to run on the same machine.
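For example, a second node on the same machine can be given its own store directory and ports (a sketch; the directory name and port numbers are arbitrary choices, and `--join` should point at an already-running node):

```shell
$ cockroach start --insecure --store=node2 \
--listen-addr=localhost:26258 --http-addr=localhost:8081 \
--join=localhost:26257
```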
Cannot join a node to an existing CockroachDB cluster
Store directory already exists
When joining a node to a cluster, you might receive one of the following errors:
no resolvers found; use --join to specify a connected node
node belongs to cluster {"cluster hash"} but is attempting to connect to a gossip network for cluster {"another cluster hash"}
Explanation: When starting a node, the directory you choose to store the data in also contains metadata identifying the cluster the data came from. This causes conflicts when you've already started a node on the server, have quit the `cockroach` process, and then tried to join another cluster. Because the existing directory's cluster ID doesn't match the new cluster ID, the node cannot join it.
Solution: Disassociate the node from the existing directory where you've stored CockroachDB data. For example, you can do either of the following:
Choose a different directory to store the CockroachDB data:
$ cockroach start --store=<new directory> --join=<cluster host> <other flags>
Remove the existing directory and start a node joining the cluster again:
$ rm -r cockroach-data/
$ cockroach start --join=<cluster host>:26257 <other flags>
Incorrect `--join` address
If you try to add another node to the cluster, but the `--join` address does not point at any of the existing nodes, the process will never complete, and you'll see a continuous stream of warnings like this:
W180817 17:01:56.506968 886 vendor/google.golang.org/grpc/clientconn.go:942 Failed to dial localhost:20000: grpc: the connection is closing; please retry.
W180817 17:01:56.510430 914 vendor/google.golang.org/grpc/clientconn.go:1293 grpc: addrConn.createTransport failed to connect to {localhost:20000 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:20000: connect: connection refused". Reconnecting…
Explanation: These warnings tell you that the node cannot establish a connection with the address specified in the `--join` flag. Without a connection to the cluster, the node cannot join.
Solution: To successfully join the node to the cluster, start the node again, but this time include a correct `--join` address.
Client connection issues
If a client cannot connect to the cluster, check basic network connectivity (`ping`), port connectivity (`telnet`), and certificate validity.
Networking issues
Most networking-related issues are caused by one of two problems:
- Firewall rules, which require your network administrator to investigate
- Inaccessible hostnames on your nodes, which can be controlled with the `--listen-addr` and `--advertise-addr` flags on `cockroach start`
Solution:

To check your networking setup:

1. Use `ping`. Every machine you are using as a CockroachDB node should be able to ping every other machine, using the hostnames or IP addresses used in the `--join` flags (and the `--advertise-host` flag if you are using it).
2. If the machines are all pingable, check if you can connect to the appropriate ports. With your CockroachDB nodes still running, log in to each node and use `telnet` or `nc` to verify machine-to-machine connectivity on the desired port. For instance, if you are running CockroachDB on the default port of 26257, run either:
   - `telnet <other-machine> 26257`
   - `nc <other-machine> 26257`

Both `telnet` and `nc` will exit immediately if a connection cannot be established. If you are running in a firewalled environment, the firewall might be blocking traffic to the desired ports even though it is letting ping packets through.
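If neither `telnet` nor `nc` is installed on a machine, bash's built-in `/dev/tcp` pseudo-device can perform the same connectivity check. A small sketch (the hostname and port below are placeholders for your own nodes):

```shell
check_port() {
  # Succeed if a TCP connection to host $1, port $2 can be established.
  # Uses bash's /dev/tcp pseudo-device, so no extra tools are needed.
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# Substitute the other node's hostname and your CockroachDB port.
if check_port localhost 26257; then
  echo "localhost:26257 is reachable"
else
  echo "localhost:26257 is not reachable"
fi
```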
To efficiently troubleshoot the issue, it's important to understand where and why it's occurring. We recommend checking the following network-related issues:
- By default, CockroachDB advertises itself to other nodes using its hostname. If your environment doesn't support DNS or the hostname is not resolvable, your nodes cannot connect to one another. In these cases, you can:
  - Change the hostname each node uses to advertise itself with `--advertise-addr`
  - Set `--listen-addr=<node's IP address>` if the IP is a valid interface on the machine
- Every node in the cluster should be able to ping each other node on the hostnames or IP addresses you use in the `--join`, `--listen-addr`, or `--advertise-addr` flags.
- Every node should be able to connect to other nodes on the port you're using for CockroachDB (26257 by default) through `telnet` or `nc`:
  - `telnet <other node host> 26257`
  - `nc <other node host> 26257`

Again, firewalls or hostname issues can cause any of these steps to fail.
Network partition
If the DB Console lists any dead nodes on the Cluster Overview page, then you might have a network partition.
Explanation: A network partition prevents nodes from communicating with each other in one or both directions. This can be due to a configuration problem with the network, such as when allowlisted IP addresses or hostnames change after a node is torn down and rebuilt. In a symmetric partition, node communication is broken in both directions. In an asymmetric partition, node communication works in one direction but not the other.
The effect of a network partition depends on which nodes are partitioned, where the ranges are located, and to a large extent, whether localities are defined. If localities are not defined, a partition that cuts off more than (n-1)/2 nodes will cause data unavailability.
Solution:
To identify a network partition:
- Access the Network Latency page of the DB Console.
- In the Latencies table, check for nodes with no connections. This indicates that a node cannot communicate with another node, and might indicate a network partition.
Authentication issues
Missing certificate
If you try to add a node to a secure cluster without providing the node's security certificate, you will get the following error:
problem with CA certificate: not found
*
* ERROR: cannot load certificates.
* Check your certificate settings, set --certs-dir, or use --insecure for insecure clusters.
*
* problem with CA certificate: not found
*
Failed running "start"
Explanation: The error tells you that because the cluster is secure, it requires the new node to provide its security certificate in order to join.
Solution: To successfully join the node to the cluster, start the node again, but this time include the `--certs-dir` flag.
Certificate expiration
If you're running a secure cluster, be sure to monitor your certificate expiration. If one of the inter-node certificates expires, nodes will no longer be able to communicate, which can look like a network partition.
To check the certificate expiration date:
- Access the DB Console.
- Click the gear icon on the left-hand navigation bar to access the Advanced Debugging page.
- Scroll down to the Even More Advanced Debugging section. Click All Nodes. The Node Diagnostics page appears. Click the certificates for each node and check the expiration date for each certificate in the Valid Until field.
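Expiration dates can also be checked from the command line with `openssl`. As a runnable sketch, this generates a throwaway self-signed certificate to demonstrate; in practice, point `-in` at your actual node or CA certificate:

```shell
# Create a throwaway 30-day certificate purely for demonstration.
openssl req -x509 -newkey rsa:2048 -keyout /tmp/demo.key -out /tmp/demo.crt \
  -days 30 -nodes -subj "/CN=demo" 2>/dev/null

# Print the expiration date (the "Valid Until" equivalent).
openssl x509 -noout -enddate -in /tmp/demo.crt

# Exit non-zero if the certificate expires within 7 days (604800 seconds).
openssl x509 -checkend 604800 -noout -in /tmp/demo.crt
```

The `-checkend` form is convenient in a cron job that alerts before inter-node certificates lapse.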
Client password not set
While connecting to a secure cluster as a user, CockroachDB first checks if the client certificate exists in the `cert` directory. If the client certificate doesn't exist, it prompts for a password. If the password is not set and you press Enter, the connection attempt fails, and the following error is printed to `stderr`:
Error: pq: invalid password
Failed running "sql"
Solution: To successfully connect to the cluster, you must first either generate a client certificate or create a password for the user.
Cannot create new connections to cluster for up to 40 seconds after a node dies
When a node dies abruptly and/or loses its network connection to the cluster, the following behavior can occur:
- For a period of up to 40 seconds, clients trying to connect with username and password authentication cannot create new connections to any of the remaining nodes in the cluster.
- Applications start timing out when trying to connect to the cluster during this window.
The reason this happens is as follows:
- Username and password information is stored in a system range.
- Since all system ranges are located near the beginning of the keyspace, the system range containing the username/password info can sometimes be colocated with another system range that is used to determine node liveness.
- If the username/password info and the node liveness record are stored together as described above, it can take extra time for the lease on this range to be transferred to another node. Normally, lease transfers take about 10 seconds, but in this case it may require multiple rounds of consensus to determine that the node in question is actually dead (the node liveness record check may be retried several times before failing).
For more information about how lease transfers work when a node dies, see How leases are transferred from a dead node.
The solution is to add connection retry logic to your application.
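As a sketch of such retry logic at the shell level (the same pattern applies inside application code; the `cockroach sql` usage in the comment is a hypothetical example):

```shell
retry() {
  # Run "$@" up to 3 times, sleeping 1 second between attempts.
  max_attempts=3
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep 1
  done
}

# Hypothetical usage: keep retrying a SQL ping until the cluster accepts
# new connections again (connection flags depend on your deployment):
#   retry cockroach sql --insecure -e "SELECT 1"
retry true && echo "command succeeded"
```

Production client drivers and connection pools usually offer equivalent retry and backoff settings, which are preferable to shelling out.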
Clock sync issues
Node clocks are not properly synchronized
See the following FAQs:
- What happens when node clocks are not properly synchronized?
- How can I tell how well node clocks are synchronized?
Capacity planning issues
Following are some of the possible issues you might have while planning capacity for your cluster:
- Running CPU at close to 100% utilization with a high run queue will result in poor performance.
- Running RAM at close to 100% utilization triggers Linux OOM and/or swapping that will result in poor performance or stability issues.
- Running storage at 100% capacity causes writes to fail, which in turn can cause various processes to stop.
- Running storage at 100% read/write utilization causes poor service times.
- Running the network at 100% utilization causes poor response times between the database and clients.
Solution: Access the DB Console and navigate to Metrics > Hardware dashboard to monitor the following metrics:
First, check that adequate capacity was available for the following components during the incident:
| Type | Time Series | What to look for |
|---|---|---|
| RAM capacity | Memory Usage | Any non-zero value |
| CPU capacity | CPU Percent | Consistent non-zero values |
| Network capacity | Network Bytes Received / Network Bytes Sent | Any non-zero value |
Next, check for near-out-of-capacity conditions:
| Type | Time Series | What to look for |
|---|---|---|
| RAM capacity | Memory Usage | Consistently more than 80% |
| CPU capacity | CPU Percent | Consistently less than 20% idle (i.e., more than 80% busy) |
| Network capacity | Network Bytes Received / Network Bytes Sent | Consistently more than 50% of capacity for both |
Storage issues
Disks filling up
Like any database system, CockroachDB will no longer be able to accept writes if it runs out of disk space. Additionally, a CockroachDB node needs a small amount of disk space (a few GBs to be safe) to perform basic maintenance functionality. For more information about this issue, see:
- Why is memory usage increasing despite lack of traffic?
- Why is disk usage increasing despite lack of writes?
- Can I reduce or disable the storage of timeseries data?
Memory issues
Suspected memory leak
A CockroachDB node will grow to consume all of the memory allocated for its cache. The default cache size is ¼ of physical memory, which can be substantial depending on your machine configuration. This growth will occur even if your cluster is otherwise idle, due to the internal metrics that a CockroachDB cluster tracks. See the `--cache` flag in `cockroach start`.
CockroachDB memory usage has 3 components:
- Go allocated memory: Memory allocated by the Go runtime to support query processing and various caches maintained in Go by CockroachDB. These caches are generally small in comparison to the storage engine cache size. If Go allocated memory is larger than a few hundred megabytes, something concerning is going on.
- CGo allocated memory: Memory allocated by the C/C++ libraries linked into CockroachDB and primarily concerns the block caches for the storage engine (this is true for both Pebble (the default engine) and RocksDB). This is the “cache” mentioned in the note above. The size of CGo allocated memory is usually very close to the configured cache size.
- Overhead: The process resident set size minus Go/CGo allocated memory.
If Go allocated memory is larger than a few hundred megabytes, you might have encountered a memory leak. Go comes with a built-in heap profiler which is already enabled on your CockroachDB process. See this excellent blog post on profiling Go programs.
Solution: To determine Go/CGo allocated memory:
Navigate to Metrics > Runtime dashboard, and check the Memory Usage graph.
On hovering over the graph, the values for the following metrics are displayed:
| Metric | Description |
|---|---|
| RSS | Total memory in use by CockroachDB. |
| Go Allocated | Memory allocated by the Go layer. |
| Go Total | Total memory managed by the Go layer. |
| CGo Allocated | Memory allocated by the C layer. |
| CGo Total | Total memory managed by the C layer. |
- If CGo allocated memory is larger than the configured `cache` size, file an issue.
- If the resident set size (RSS) minus Go/CGo total memory is larger than 100 megabytes, file an issue.
Node crashes because of insufficient memory
Often when a node exits without a trace or without logging any error message, it is because the operating system stopped it suddenly due to low memory. So if you're seeing node crashes where the logs just end abruptly, it's probably because the node is running out of memory. On most Unix systems, you can verify whether the `cockroach` process was stopped because the node ran out of memory by running:
$ sudo dmesg | grep -iC 3 "cockroach"
If the command returns the following message, then you know the node crashed due to insufficient memory:
host kernel: Out of Memory: Killed process <process_id> (cockroach).
To rectify the issue, you can either run the `cockroach` process on a node with sufficient memory, or reduce its memory usage.
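As a sketch of what a positive `dmesg` hit looks like, the same filter can be exercised on a sample kernel log line (the PID below is made up):

```shell
# Exercise the filter on a sample line; in practice, replace the echo
# with: sudo dmesg
echo "host kernel: Out of Memory: Killed process 12345 (cockroach)." \
  | grep -iC 3 "cockroach"
```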
Decommissioning issues
Decommissioning process hangs indefinitely
Explanation: Before decommissioning a node, you need to make sure other nodes are available to take over the range replicas from the node. If no other nodes are available, the decommission process will hang indefinitely.
Solution: Confirm that there are enough nodes with sufficient storage space to take over the replicas from the node you want to remove.
Decommissioned nodes displayed in UI forever
By design, decommissioned nodes are displayed in the DB Console forever. We retain the list of decommissioned nodes for the following reasons:
- Decommissioning is not entirely free, so showing those decommissioned nodes in the UI reminds you of the baggage your cluster will have to carry around forever.
- It also explains to future administrators why your node IDs have gaps (e.g., why the nodes are numbered n1, n2, and n8).
You can follow the discussion here: https://github.com/cockroachdb/cockroach/issues/24636
Replication issues
DB Console shows under-replicated/unavailable ranges
When a CockroachDB node dies (or is partitioned), the under-replicated range count will briefly spike while the system recovers.
Explanation: CockroachDB uses consensus replication and requires a quorum of the replicas to be available in order to allow both writes and reads to the range. The number of failures that can be tolerated is equal to (Replication factor - 1)/2. Thus CockroachDB requires a majority of the replicas (more than half) to be available to achieve quorum. For example, with 3x replication, one failure can be tolerated; with 5x replication, two failures, and so on.
Under-replicated Ranges: When a cluster is first initialized, the few default starting ranges will only have a single replica, but as soon as other nodes are available, they will replicate to them until they've reached their desired replication factor. If a range does not have enough replicas, the range is said to be "under-replicated".
Unavailable Ranges: If a majority of a range's replicas are on nodes that are unavailable, then the entire range is unavailable and will be unable to process queries.
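The fault-tolerance arithmetic above can be sketched directly (replication factors 3, 5, and 7 as examples):

```shell
# Failures tolerated = (replication factor - 1) / 2, using integer division.
for rf in 3 5 7; do
  echo "replication factor $rf tolerates $(( (rf - 1) / 2 )) failed node(s)"
done
```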
Solution:
To identify under-replicated/unavailable ranges:
On the Cluster Overview page, check the Replication Status. If the Under-replicated ranges or Unavailable ranges count is non-zero, then you have under-replicated or unavailable ranges in your cluster.
Check for a network partition: Click the gear icon on the left-hand navigation bar to access the Advanced Debugging page. On the Advanced Debugging page, click Network Latency. In the Latencies table, check if any cells are marked as "X". If so, it means those nodes cannot communicate with each other, which might indicate a network partition. If there's no partition, and there's still no up-replication after 5 minutes, then file an issue.
Add nodes to the cluster:
On the DB Console’s Cluster Overview page, check if any nodes are down. If the number of nodes down is less than (n-1)/2, then that is most probably the cause of the under-replicated/unavailable ranges. Add nodes to the cluster such that the cluster has the required number of nodes to replicate ranges properly.
If you still see under-replicated/unavailable ranges on the Cluster Overview page, investigate further:
- Access the DB Console
- Click the gear icon on the left-hand navigation bar to access the Advanced Debugging page.
- Click Problem Ranges.
- In the Connections table, identify the node with the under-replicated/unavailable ranges and click the node ID in the Node column.
- To view the Range Report for a range, click on the range number in the Under-replicated (or slow) table or Unavailable table.
- On the Range Report page, scroll down to the Simulated Allocator Output section. The table contains an error message which explains the reason for the under-replicated range. Follow the guidance in the message to resolve the issue. If you need help understanding the error or the guidance, file an issue. Please be sure to include the full range report and error message when you submit the issue.
Node liveness issues
"Node liveness" refers to whether a node in your cluster has been determined to be "dead" or "alive" by the rest of the cluster. This is achieved using checks that ensure that each node connected to the cluster is updating its liveness record. This information is shared with the rest of the cluster using an internal gossip protocol.
Common reasons for node liveness issues include:
- Heavy I/O load on the node. Because each node needs to update a liveness record on disk, maxing out disk bandwidth can cause liveness heartbeats to be missed. See also: Capacity planning issues.
- Outright I/O failure due to a disk stall. This will cause node liveness issues for the same reasons as listed above.
- Any Networking issues with the node.
For more information about how node liveness works, see the architecture documentation on the replication layer.
The DB Console provides several ways to check for node liveness issues in your cluster:
Check node heartbeat latency
To check node heartbeat latency:
In the DB Console, select the Metrics tab from the left-hand side of the page.
From the metrics page, select Dashboard: Distributed from the dropdown at the top of the page.
Scroll down the metrics page to find the Node Heartbeat Latency: 99th percentile and Node Heartbeat Latency: 90th percentile graphs.
Expected values for a healthy cluster: Less than 100ms in addition to the network latency between nodes in the cluster.
Check node liveness record last update
To see when a node last updated its liveness record:
Go to the Node Diagnostics page of the DB Console, which lives at:
https://yourcluster.yourdomain/#/reports/nodes
On the Node Diagnostics page, you will see a table listing information about the nodes in your cluster. To see when a node last updated its liveness record, check the Updated at field at the bottom of that node's column.
Expected values for a healthy cluster: When you load this page, the Updated at field should be within 4.5 seconds of the current time. If it's higher than that, you will see errors in the logs.
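As a sketch of the freshness check, the age of a hypothetical Updated at value can be computed with GNU `date` (both timestamps below are made-up example values):

```shell
# Made-up example values: an "Updated at" reading and the current time.
updated_at="2024-01-01 00:00:10 UTC"
now="2024-01-01 00:00:12 UTC"

# Age in seconds; more than ~4.5s suggests missed liveness heartbeats.
age=$(( $(date -u -d "$now" +%s) - $(date -u -d "$updated_at" +%s) ))
echo "liveness record is ${age}s old"
```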
Check command commit latency
A good signal of I/O load is the Command Commit Latency in the Storage section of the dashboards. This dashboard measures how quickly Raft commands are being committed by nodes in the cluster.
To view command commit latency:
In the DB Console, select the Metrics tab from the left-hand side of the page.
From the Metrics page, select Dashboard: Storage from the dropdown at the top of the page.
Scroll down the metrics page to find the Command Commit Latency: 90th percentile and Command Commit Latency: 99th percentile graphs.
Expected values for a healthy cluster: On SSDs, this should be between 1 and 100 milliseconds. On HDDs, this should be no more than 1 second. Note that we strongly recommend running CockroachDB on SSDs.
Impact of node failure is greater than 10 seconds
When the cluster needs to access a range on a leaseholder node that is dead, that range's lease must be transferred to a healthy node. In theory, this process should take no more than 9 seconds for liveness expiration plus the cost of several network roundtrips.
In production, lease transfer upon node failure can take longer than expected. In v20.2, this is observed in the following scenarios:
The leaseholder node for the liveness range fails. The liveness range is a system range that stores the liveness record for each node on the cluster. If a node fails and is also the leaseholder for the liveness range, operations cannot proceed until the liveness range is transferred to a new leaseholder and the liveness record is made available to other nodes. This can cause momentary cluster unavailability.
The leaseholder node for `system.users` fails. The cluster accesses the `system.users` table when authenticating a user. A customer attempting to log in to the cluster will encounter a delay while waiting for the `system.users` table to find a new leaseholder. In v21.2.0 and later, there is no login delay because the authentication data is cached.

A node comes in and out of liveness, due to network failures or overload. A "flapping" node can repeatedly lose and acquire leases in between its intermittent heartbeats, causing wide-ranging unavailability. In v21.1.6 and later, a node must successfully heartbeat for 30 seconds after failing a heartbeat before it is eligible to acquire leases.
Network or DNS issues cause connection issues between nodes. If there is no live server for the IP address or DNS lookup, connection attempts to a node will not return an immediate error, but will hang until timing out. This can cause unavailability and prevent a speedy movement of leases and recovery. In v20.2.18, v21.1.12, and later, CockroachDB avoids contacting unresponsive nodes or DNS during certain performance-critical operations, and the connection issue should generally resolve in 10-30 seconds. However, an attempt to contact an unresponsive node could still occur in other scenarios that are not yet addressed.
A node's disk stalls. A disk stall on a node can cause write operations to stall indefinitely, causes the node's heartbeats to fail since the storage engine cannot write to disk as part of the heartbeat, and may cause read requests to fail if they are waiting for a conflicting write to complete. Lease acquisition from this node can stall indefinitely until the node is shut down or recovered. Pebble detects most stalls and will terminate the `cockroach` process after 60 seconds, but there are gaps in its detection. In v21.2.13, v22.1.2, and later, each lease acquisition attempt on an unresponsive node times out after 6 seconds. However, CockroachDB can still appear to stall as these timeouts are occurring.

Otherwise unresponsive nodes. Internal deadlock due to faulty code, resource exhaustion, OS/hardware issues, and other arbitrary failures can make a node unresponsive. This can cause leases to become stuck in certain cases, such as when a response from the previous leaseholder is needed in order to move the lease.
Solution: If you are experiencing intermittent network or connectivity issues, first shut down the affected nodes temporarily so that nodes phasing in and out do not cause disruption.
If a node has become unresponsive without returning an error, shut down the node so that network requests immediately become hard errors rather than stalling.
If you are running a version of CockroachDB that is affected by an issue described here, upgrade to a version that contains the fix for the issue, as described in the preceding list.
Check for under-replicated or unavailable data
To see if any data is under-replicated or unavailable in your cluster, use the `system.replication_stats` report as described in Replication Reports.
Check for replication zone constraint violations
To see if any of your cluster's data placement constraints are being violated, use the `system.replication_constraint_stats` report as described in Replication Reports.
Check for critical localities
To see which of your localities (if any) are critical, use the `system.replication_critical_localities` report as described in Replication Reports. A locality is "critical" for a range if all of the nodes in that locality becoming unreachable would cause the range to become unavailable. In other words, the locality contains a majority of the range's replicas.
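For example, any of these reports can be queried from the built-in SQL shell (a sketch; the connection flags depend on your deployment):

```shell
$ cockroach sql --insecure -e "SELECT * FROM system.replication_critical_localities"
```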
Something else?
If we do not have a solution here, you can try using our other support resources, including: