
A cockroachdb service failed to startup after mupdate to c8f8332bc #7221

Closed
leftwo opened this issue Dec 10, 2024 · 6 comments

leftwo (Contributor) commented Dec 10, 2024

Dogfood rack was mupdated to omicron commit c8f8332

After the mupdate, everything came online except for one cockroachdb zone on sled 17:

BRM42220017 # svcs -xZ
svc:/oxide/cockroachdb:default (CockroachDB)
  Zone: oxz_cockroachdb_3237a532-acaa-4ebe-bf11-dde794fea739
 State: maintenance since Tue Dec 10 16:33:26 2024
Reason: Restarting too quickly.
   See: http://illumos.org/msg/SMF-8000-L5
   See: /pool/ext/ae56280b-17ce-4266-8573-e1da9db6c6bb/crypt/zone/oxz_cockroachdb_3237a532-acaa-4ebe-bf11-dde794fea739/root/var/svc/log/oxide-cockroachdb:default.log
Impact: This service is not running.

The log does not provide much info other than that the processes exited and were restarted:

[ Dec 10 16:31:15 Enabled. ]                                                                                       
[ Dec 10 16:31:15 Rereading configuration. ]
[ Dec 10 16:31:16 Rereading configuration. ]                                                                       
[ Dec 10 16:31:28 Executing start method ("/opt/oxide/lib/svc/manifest/cockroachdb.sh"). ]
+ set -o errexit                          
+ set -o pipefail                                                                                                  
+ . /lib/svc/share/smf_include.sh                                                                                  
++ SMF_EXIT_OK=0                                                                                                                                                                                                                      
++ SMF_EXIT_NODAEMON=94        
++ SMF_EXIT_ERR_FATAL=95                           
++ SMF_EXIT_ERR_CONFIG=96        
++ SMF_EXIT_MON_DEGRADE=97     
++ SMF_EXIT_MON_OFFLINE=98 
++ SMF_EXIT_ERR_NOSMF=99                                                                                           
++ SMF_EXIT_ERR_PERM=100                        
++ svcprop -c -p config/listen_addr svc:/oxide/cockroachdb:default
+ LISTEN_ADDR='[fd00:1122:3344:109::3]:32221'                                                                      
++ svcprop -c -p config/store svc:/oxide/cockroachdb:default
+ DATASTORE=/data                                                                                                  
++ /opt/oxide/internal-dns-cli/bin/dnswait cockroach
++ head -n 5     
++ tr '\n' ,                     
note: configured to log to "/dev/stderr"
16:31:28.240Z INFO dnswait: using system configuration
+ JOIN_ADDRS=3237a532-acaa-4ebe-bf11-dde794fea739.host.control-plane.oxide.internal.:32221,4c3ef132-ec83-4b1b-9574-7c7d3035f9e9.host.control-plane.oxide.internal.:32221,8bbea076-ff60-4330-8302-383e18140ef3.host.control-plane.oxide
.internal.:32221,a3628a56-6f85-43b5-be50-71d8f0e04877.host.control-plane.oxide.internal.:32221,e86845b5-eabd-49f5-9a10-6dfef9066209.host.control-plane.oxide.internal.:32221,
+ [[ -z 3237a532-acaa-4ebe-bf11-dde794fea739.host.control-plane.oxide.internal.:32221,4c3ef132-ec83-4b1b-9574-7c7d3035f9e9.host.control-plane.oxide.internal.:32221,8bbea076-ff60-4330-8302-383e18140ef3.host.control-plane.oxide.inte
rnal.:32221,a3628a56-6f85-43b5-be50-71d8f0e04877.host.control-plane.oxide.internal.:32221,e86845b5-eabd-49f5-9a10-6dfef9066209.host.control-plane.oxide.internal.:32221, ]]
+ args=('--insecure' '--listen-addr' "$LISTEN_ADDR" '--http-addr' '127.0.0.1:8080' '--store' "$DATASTORE" '--join' "$JOIN_ADDRS")
+ exec /opt/oxide/cockroachdb/bin/cockroach start --insecure --listen-addr '[fd00:1122:3344:109::3]:32221' --http-addr 127.0.0.1:8080 --store /data --join 3237a532-acaa-4ebe-bf11-dde794fea739.host.control-plane.oxide.internal.:322
21,4c3ef132-ec83-4b1b-9574-7c7d3035f9e9.host.control-plane.oxide.internal.:32221,8bbea076-ff60-4330-8302-383e18140ef3.host.control-plane.oxide.internal.:32221,a3628a56-6f85-43b5-be50-71d8f0e04877.host.control-plane.oxide.internal.
:32221,e86845b5-eabd-49f5-9a10-6dfef9066209.host.control-plane.oxide.internal.:32221,
[ Dec 10 16:31:28 Method "start" exited with status 0. ]
*                
* WARNING: ALL SECURITY CONTROLS HAVE BEEN DISABLED!
*                                                
* This mode is intended for non-production testing only.
*                                       
* In this mode:                                       
* - Your cluster is open to any client that can access fd00:1122:3344:109::3.                                                                                                                                                         
* - Intruders with access to your machine or network can observe client-server traffic.                                                                                                                                               
* - Intruders can log in without password and read or write any data in the cluster.                                                                                                                                                  
* - Intruders can consume all your server's resources and cause unavailability.                                                                                                                                                       
*                                                                                                                                                                                                                                     
*                                                                                                                  
* INFO: To start a secure server without mandating TLS for clients,                                                                                                                                                                   
* consider --accept-sql-without-tls instead. For other options, see:                                                                                                                                                                  
*                                                                                                                  
* - https://go.crdb.dev/issue-v/53404/v22.1
* - https://www.cockroachlabs.com/docs/v22.1/secure-a-cluster.html
*
CockroachDB node starting at 2024-12-10 16:31:59.237331244 +0000 UTC (took 30.1s)
build:               OSS v22.1.22-27-g76e176e260 @ 2024/10/23 21:38:21 (go1.17.13)
webui:               http://127.0.0.1:8080
sql:                 postgresql://root@[fd00:1122:3344:109::3]:32221/defaultdb?sslmode=disable
sql (JDBC):          jdbc:postgresql://[fd00:1122:3344:109::3]:32221/defaultdb?sslmode=disable&user=root
RPC client flags:    /opt/oxide/cockroachdb/bin/cockroach <client cmd> --host=[fd00:1122:3344:109::3]:32221 --insecure
logs:                /data/logs
temp dir:            /data/cockroach-temp3047379273
external I/O path:   /data/extern
store[0]:            path=/data
storage engine:      pebble
clusterID:           2a348c29-7ccb-4d77-9afd-f1e37b9abb40 
status:              restarted pre-existing node
nodeID:              1
[ Dec 10 16:33:25 Stopping because all processes in service exited. ]
[ Dec 10 16:33:25 Executing stop method (:kill). ]
[ Dec 10 16:33:25 Executing start method ("/opt/oxide/lib/svc/manifest/cockroachdb.sh"). ]
+ set -o errexit
+ set -o pipefail
+ . /lib/svc/share/smf_include.sh
++ SMF_EXIT_OK=0
++ SMF_EXIT_NODAEMON=94
++ SMF_EXIT_ERR_FATAL=95
++ SMF_EXIT_ERR_CONFIG=96
++ SMF_EXIT_MON_DEGRADE=97
++ SMF_EXIT_MON_OFFLINE=98
++ SMF_EXIT_ERR_NOSMF=99
++ SMF_EXIT_ERR_PERM=100
++ svcprop -c -p config/listen_addr svc:/oxide/cockroachdb:default
+ LISTEN_ADDR='[fd00:1122:3344:109::3]:32221'
++ svcprop -c -p config/store svc:/oxide/cockroachdb:default
+ DATASTORE=/data
++ ++ ++ head -n 5
/opt/oxide/internal-dns-cli/bin/dnswait cockroach
tr '\n' ,
note: configured to log to "/dev/stderr"
16:33:25.957Z INFO dnswait: using system configuration
+ JOIN_ADDRS=3237a532-acaa-4ebe-bf11-dde794fea739.host.control-plane.oxide.internal.:32221,4c3ef132-ec83-4b1b-9574-7c7d3035f9e9.host.control-plane.oxide.internal.:32221,8bbea076-ff60-4330-8302-383e18140ef3.host.control-plane.oxide
.internal.:32221,a3628a56-6f85-43b5-be50-71d8f0e04877.host.control-plane.oxide.internal.:32221,e86845b5-eabd-49f5-9a10-6dfef9066209.host.control-plane.oxide.internal.:32221,
+ [[ -z 3237a532-acaa-4ebe-bf11-dde794fea739.host.control-plane.oxide.internal.:32221,4c3ef132-ec83-4b1b-9574-7c7d3035f9e9.host.control-plane.oxide.internal.:32221,8bbea076-ff60-4330-8302-383e18140ef3.host.control-plane.oxide.inte
rnal.:32221,a3628a56-6f85-43b5-be50-71d8f0e04877.host.control-plane.oxide.internal.:32221,e86845b5-eabd-49f5-9a10-6dfef9066209.host.control-plane.oxide.internal.:32221, ]]
+ args=('--insecure' '--listen-addr' "$LISTEN_ADDR" '--http-addr' '127.0.0.1:8080' '--store' "$DATASTORE" '--join' "$JOIN_ADDRS")
+ [ Dec 10 16:33:26 Method "start" exited with status 0. ]
exec /opt/oxide/cockroachdb/bin/cockroach start --insecure --listen-addr '[fd00:1122:3344:109::3]:32221' --http-addr 127.0.0.1:8080 --store /data --join 3237a532-acaa-4ebe-bf11-dde794fea739.host.control-plane.oxide.internal.:32221
,4c3ef132-ec83-4b1b-9574-7c7d3035f9e9.host.control-plane.oxide.internal.:32221,8bbea076-ff60-4330-8302-383e18140ef3.host.control-plane.oxide.internal.:32221,a3628a56-6f85-43b5-be50-71d8f0e04877.host.control-plane.oxide.internal.:3
2221,e86845b5-eabd-49f5-9a10-6dfef9066209.host.control-plane.oxide.internal.:32221,

No core files were found in the expected places.

morlandi7 added this to the 12 milestone Dec 10, 2024
davepacheco (Collaborator) commented Dec 10, 2024

tl;dr: the ZFS pool for this disk is out of space because it's full of Crucible regions whose reservations add up to almost the entire disk. Issues uncovered here:

  • Most importantly: Nexus should not choose to put new regions on disks that are this full. Filed "Nexus overprovisions disks with control plane zones on them" (#7225).
  • Even with that fixed, we probably want to better isolate components on the same system from ENOSPC ("want better runtime isolation of sled components from ENOSPC", #7227). Some ideas:
    • We could use quotas and reservations on control plane zones to ensure both that they have the space they need (the reservation) and that they don't use too much (the quota); a rough sketch follows this list.
    • We may want to do this for any non-zone filesystems in the GZ as well.
    • Details are tricky because the right values for production systems are different from those for, say, a4x2.
  • oxlog seems to be missing some log files in its output. This was a mistake in how I was postprocessing the output, so the only issue here is "oxlog: want a way to list N most recent log files" (#6946).
  • It would be really nice if there were a reliable way to identify whether ENOSPC had ever happened, either in a zone or on a ZFS dataset (e.g., a kstat or ZFS dataset property). Filed illumos#16966.
  • CockroachDB is not using a ballast file on illumos.
  • (what am I missing?)
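
As a rough sketch of the quota/reservation idea (the dataset name is taken from this bug; the sizes are placeholders, not recommendations):

# Hypothetical: give the CockroachDB zone's dataset a guaranteed floor
# (reservation) and a ceiling (quota). One property per invocation, for
# portability with illumos zfs.
zfs set reservation=2G oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb/crypt/zone/oxz_cockroachdb_3237a532-acaa-4ebe-bf11-dde794fea739
zfs set quota=20G oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb/crypt/zone/oxz_cockroachdb_3237a532-acaa-4ebe-bf11-dde794fea739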

I'll add some notes about how we got here shortly.

iliana (Contributor) commented Dec 10, 2024

On the ballast: I found that CockroachDB created the ballast (it lives at /data/auxiliary/EMERGENCY_BALLAST and is 1 GiB), but it did not drop the ballast. I am not entirely sure what happened here; I don't yet understand the logic for when CockroachDB drops the ballast.
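
For what it's worth, CockroachDB's docs describe the automatic ballast as a file that's safe to delete by hand to reclaim space in an emergency. Assuming the path above, that would look like:

# Confirm the automatic ballast exists and check its size:
ls -lh /data/auxiliary/EMERGENCY_BALLAST
# Delete it to reclaim ~1 GiB so the node can restart and space can be
# cleaned up; CockroachDB should recreate it once space is available again.
rm /data/auxiliary/EMERGENCY_BALLAST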

davepacheco (Collaborator) commented

Debugging process:

It's surprising that the log file pasted above does not include a message from SMF saying that it was putting the service into maintenance. It usually logs something about all processes in the service exiting, or something dumping core, or the like. The last entry in the log file, from 2024-12-10T16:33:26Z, says the start method exited with status 0 (success), and the svcs -xv output shows that the service went into maintenance in that same second.

I went looking for notes in other log files: /var/adm/messages, /var/log/syslog*, and /var/svc/log/svc.startd.log are all empty files.

I don't know what made me think of this, but I noticed that the system was almost out of space in /var:

root@oxz_cockroachdb_3237a532:~# df -h /var
Filesystem             Size   Used  Available Capacity  Mounted on
oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb/crypt/zone/oxz_cockroachdb_3237a532-acaa-4ebe-bf11-dde794fea739
                     649.21M 632.11M     17.10M    98%    /

@leftwo noticed that's also true of the "/data" dataset:

root@oxz_cockroachdb_3237a532:/data/logs# df -h .
Filesystem             Size   Used  Available Capacity  Mounted on
oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb/crypt/cockroachdb
                      2.82T 10.88G     17.15M   100%    /data

This made me suspect the ZFS pool itself was out of space. When a dataset has no reservation or quota, its available space is usually whatever is left in the pool. Even for a dataset with a reservation, the available space reflects what's left in the pool when that is less than the dataset's unused reservation. So when multiple datasets show almost no space available, the pool's free space is likely the limiting factor.
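
A quick way to check the pool-level numbers directly (pool name taken from the df output above):

# How much free space does the pool itself report?
zpool list oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb
zfs list -o name,used,avail oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb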

Here's the space stats in a different, working CockroachDB zone:

root@oxz_cockroachdb_4c3ef132:~# df -h /var /data
Filesystem             Size   Used  Available Capacity  Mounted on
oxp_b358fb1e-f52a-4a63-9aab-170225509b37/crypt/zone/oxz_cockroachdb_4c3ef132-ec83-4b1b-9574-7c7d3035f9e9
                     671.27G 632.09M    670.65G     1%    /
oxp_b358fb1e-f52a-4a63-9aab-170225509b37/crypt/cockroachdb
                      2.82T  9.57G    670.65G     2%    /data

A similar amount of space is used, but much more is available. That's consistent with the broken node being low on pool space.

Which pool is it?

BRM42220017 # zfs list | grep 3237a
oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb/crypt/zone/oxz_cockroachdb_3237a532-acaa-4ebe-bf11-dde794fea739       632M  15.2M      632M  /pool/ext/ae56280b-17ce-4266-8573-e1da9db6c6bb/crypt/zone/oxz_cockroachdb_3237a532-acaa-4ebe-bf11-dde794fea739

So it's pool oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb.

It is indeed almost out of space. Almost all of the space is used by Crucible regions:

BRM42220017 # zfs list -r oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb
NAME                                                                                                           USED  AVAIL     REFER  MOUNTPOINT
oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb                                                                      2.82T  14.8M       24K  /oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb
oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb/crucible                                                             2.79T  14.8M     2.28M  /data
oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb/crucible/regions                                                     2.79T  14.8M      106K  /data/regions
...

The Crucible regions all have a reservation and quota, but none of the other filesystems has a reservation and most don't have a quota either:

BRM42220017 # zfs list -r  -o name,reservation,quota oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb
NAME                                                                                                          RESERV  QUOTA
oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb                                                                        none   none
oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb/crucible                                                               none   none
oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb/crucible/regions                                                       none   none
oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb/crucible/regions/00520721-6c9a-4187-8704-800d19a4eb81                 12.5G    30G
oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb/crucible/regions/00c2cfa4-8adc-4351-9d08-3bcfde5629ba                 1.25G     3G
oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb/crucible/regions/0279be35-a793-4846-bcdf-734533252f5a                 1.25G     3G
oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb/crucible/regions/06074035-c0cf-42ec-88e8-3dbfb47419f9                    5G    12G
...
oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb/crucible/regions/fe08a14b-12ea-4875-8799-55dfcf4f0557                 1.25G     3G
oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb/crucible/regions/fea94176-a7f3-4e8e-a8e6-befaab39f039                 1.25G     3G
oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb/crypt                                                                  none   none
oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb/crypt/cockroachdb                                                      none   none
oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb/crypt/debug                                                            none   100G
oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb/crypt/zone                                                             none   none
oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb/crypt/zone/oxz_cockroachdb_3237a532-acaa-4ebe-bf11-dde794fea739        none   none
oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb/crypt/zone/oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508           none   none
oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb/crypt/zone/oxz_propolis-server_d68e025e-aff4-4389-8d9a-c08991331bfc    none   none
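
To check that the region reservations really do add up to nearly the whole 2.82T pool, they can be totaled from zfs's parseable output (one way to do it; -Hp prints exact byte counts with no header):

zfs list -Hp -o reservation -r \
    oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb/crucible/regions |
    awk '{ total += $1 } END { printf "%.2f TiB\n", total / 2^40 }'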

This basically took us to the summary in my previous comment.

leftwo (Contributor, Author) commented Dec 10, 2024

Space was freed up on the ZFS filesystem, and afterward restarting the service brought things back to a working state.
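
For reference, since the service was in maintenance, bringing it back after freeing space is the standard SMF procedure (the exact commands used on the sled weren't captured here):

# From inside the zone: clear the maintenance state so SMF re-runs the
# start method, then confirm the service came up healthy.
svcadm clear svc:/oxide/cockroachdb:default
svcs -xv svc:/oxide/cockroachdb:default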

I verified it by pointing omdb at this specific instance of cockroachdb and was able to get a response:

root@oxz_switch0:~# omdb db --db-url postgresql://root@[fd00:1122:3344:109::3]:32221/omicron?sslmode=disable disks info fee6c746-e6f9-4a20-a52e-39a410d69298
note: using database URL postgresql://root@[fd00:1122:3344:109::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (116.0.0)
HOST_SERIAL DISK_NAME   INSTANCE_NAME PROPOLIS_ZONE VOLUME_ID                            DISK_STATE 
-           host41-disk -             -             4f70e772-6c4b-43e5-b4d6-4b66d86d9cfc detached   
HOST_SERIAL REGION                               ZONE                                              PHYSICAL_DISK                        
BRM42220031 6bea908b-231a-4814-920d-333808bf842f oxz_crucible_2b34cd1d-ea7d-41a1-82b9-75550fdf6eb0 a83bdc49-daf2-4248-b238-fda4239305de 
BRM42220016 216415f3-474b-48ce-8355-dc500f6b9653 oxz_crucible_d4d9acc8-3e0b-4fab-a0a2-d21920fabd7e ef01ef55-e9cf-4aa2-a2d8-d59ed1065e13 
BRM42220017 2132a1ea-e37f-4b00-9c0c-14acd2954ae6 oxz_crucible_f71344eb-f7e2-439d-82a0-9941e6868fb6 d8cb744b-bf35-4939-8e2e-b853eba63617 

VCR from volume ID 4f70e772-6c4b-43e5-b4d6-4b66d86d9cfc
ID                                   BS  SUB_VOLUMES READ_ONLY_PARENT 
fee6c746-e6f9-4a20-a52e-39a410d69298 512 1           false            

SUB VOLUME 0
    ID                                   BS  BPE    EC  GEN READ_ONLY 
    fee6c746-e6f9-4a20-a52e-39a410d69298 512 131072 160 2   false     
    [fd00:1122:3344:109::9]:19000
    [fd00:1122:3344:10a::6]:19001
    [fd00:1122:3344:102::d]:19001

leftwo (Contributor, Author) commented Dec 10, 2024

Filed this issue for cockroachdb logs living outside the law: #7230

leftwo (Contributor, Author) commented Dec 11, 2024

With the discovery that the filesystem was out of space, and service resuming once space was freed, I think we can close this issue.

leftwo closed this as completed Dec 11, 2024