
instance stuck in starting, propolis zone fails to come up #7257

Open · leftwo opened this issue Dec 15, 2024 · 3 comments


leftwo commented Dec 15, 2024

On dogfood (rack2), after a mupdate to 020fde1, I have an instance that is stuck in starting.

The propolis zone responsible for it is on sled 23 (BRM42220016).
That zone has either started and then failed in some way, or has never fully started:

BRM42220016 # zlogin oxz_propolis-server_f4019957-5d9d-45a0-97d8-e2c9c523a3e7
[Connected to zone 'oxz_propolis-server_f4019957-5d9d-45a0-97d8-e2c9c523a3e7' pts/3]
Last login: Sun Dec 15 02:50:00 on pts/3
The illumos Project     helios-2.0.23049        December 2024
root@oxz_propolis:~# svcs -x
svcs: Could not bind to repository server: repository server unavailable.  Exiting.
root@oxz_propolis:~# ps -ef 
     UID   PID  PPID   C    STIME TTY         TIME CMD
  netadm  9033  8945   0 01:15:01 ?           0:00 /lib/inet/ipmgmtd
    root  8962  8945   0 01:14:59 ?           0:00 /sbin/init
    root  9082  9081   0        - ?           0:00 <defunct>
    root  9237  9236   0        - ?           0:00 <defunct>
    root  8945  8945   0 01:14:59 ?           0:00 zsched
    root  9236  9079   0 01:15:27 ?           0:00 sulogin
    root  9079  8945   0 01:15:06 ?           0:00 /lib/svc/bin/svc.startd
    root  9081  9079   0 01:15:06 console     0:00 sulogin
  netcfg  9026  8945   0 01:15:01 ?           0:00 /lib/inet/netcfgd
    root 15039 15014   0 16:23:34 pts/3       0:00 ps -ef
    root 15013  8945   0 16:23:26 pts/3       0:00 /usr/bin/login -z global -f root
    root 15014 15013   0 16:23:26 pts/3       0:00 -bash
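
Note that svc.configd is absent from that process list while sulogin is running on the console, which is typically what svc.startd falls back to when it cannot start the repository daemon. The same state can be confirmed from the global zone without logging in; a rough sketch, assuming the standard illumos zone/proc tools on helios:

BRM42220016 # zoneadm list -cv | grep f4019957
BRM42220016 # pgrep -l -z oxz_propolis-server_f4019957-5d9d-45a0-97d8-e2c9c523a3e7
BRM42220016 # pgrep -lf -z oxz_propolis-server_f4019957-5d9d-45a0-97d8-e2c9c523a3e7 svc.configd

An empty result from the last pgrep would line up with the zone being stuck at sulogin rather than the propolis server itself crashing.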

leftwo commented Dec 15, 2024

This same system has a whole pile of core files from the crucible agent:

BRM42220016 # ls -lh /pool/*/*/crypt/debug/core\.*
-bash: /usr/bin/ls: Arg list too long
BRM42220016 # pwd 
/pool/ext/9005671f-3d90-4ed1-be15-ad65b9a65bd5/crypt/debug
BRM42220016 # ls core.oxz_crucible_a3ef7eba-c08e-48ef-ae7a-89e2fcb49b66.crucible-agent.* | wc -l
    6087

So, more than just propolis is having trouble.
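
Since the glob blows past the argv limit, find is an easy way to count the agent cores across all of the debug datasets, and pstack on one of them would give a first hint at what the agent is dying on. A sketch only; the name pattern is illustrative and based on the example above:

BRM42220016 # find /pool/ext/*/crypt/debug -name 'core.oxz_crucible_*.crucible-agent.*' | wc -l
BRM42220016 # find /pool/ext/*/crypt/debug -name 'core.oxz_crucible_*.crucible-agent.*' | head -1 | xargs pstack

(The shell only has to expand the dataset roots here, so the argument list stays small.)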


citrus-it commented Dec 15, 2024

I took a quick look and it appears that svc.configd failed to come up successfully: it exited with a "database corrupt" error. That error is reported for any unknown/unhandled error coming back from the sqlite2 backend.

root@oxz_propolis:/var/svc/log# mdb -p `pgrep startd`
Loading modules: [ svc.startd libumem.so.1 libuutil.so.1 libnvpair.so.1 ]
> $C
080477a8 zfs.171.65538.3420`__waitid+7(0, 2414, 80477c0, 3, 3, 0)
08047848 zfs.171.65538.3420`waitpid+0x45(2414, 0, 0)
08047878 fork_sulogin+0x78(0, 807b918)
080478e8 emi_is_disabled+0x3a()
08047d18 fork_emi+0x25()
08047d38 startup+0x214()
08047e48 main+0x204(feef1807, fef68704)
08047e88 _start_crt+0x9a(1, 8047eb8, fefceafc, 0, 0, 0)
08047eac _start+0x1a(1, 8047f6c, 0, 8047f84, 8047fa4, 8047fbc)
> ::walk thread
1
2
> 2::findstack -v
stack pointer for thread 2: feceea58
[ feceea58 zfs.171.65538.3420`__waitid+7() ]
  feceeaf8 zfs.171.65538.3420`waitpid+0x45(2379, 0, 0)
  feceeb28 fork_sulogin+0x78(0, 807baac)
  feceef58 fork_configd+0x45(6800)
  feceefc8 fork_configd_thread+0x326(1d1e)
  feceefe8 zfs.171.65538.3420`_thrp_setup+0x7f(febe0240)
  feceeff8 zfs.171.65538.3420`_lwp_start(febe0240, 0, 0, 0, 0, 0)
> 807baac/s
0x807baac:      svc.configd exited with database corrupt error after initialization of the repository
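
One more breadcrumb from that stack, assuming the 6800 passed to fork_configd is the raw wait status that waitpid returned for svc.configd: decoding it gives the child's exit code, which would narrow down which configd exit path was hit (I haven't confirmed the mapping from that code back to the specific error here):

root@oxz_propolis:~# printf '%d\n' $(( (0x6800 >> 8) & 0xff ))
104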

The repository database itself looks intact:

BRM42220016 # cp /etc/svc/repository-boot /tmp/test.db
BRM42220016 # /lib/svc/bin/sqlite /tmp/test.db .dump > /tmp/test.dump
BRM42220016 # wc -l /tmp/test.dump
   20839 /tmp/test.dump

BRM42220016 # cp /etc/svc/repository.db /tmp/test.db
BRM42220016 # /lib/svc/bin/sqlite /tmp/test.db .dump > /tmp/test.dump
BRM42220016 # wc -l /tmp/test.dump
   21471 /tmp/test.dump
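
Besides dumping, the bundled sqlite2 binary should be able to run an integrity check on those copies, which might surface a more specific complaint than a clean dump does; this assumes the SQLite 2.8-era build behind /lib/svc/bin/sqlite supports the integrity_check pragma:

BRM42220016 # /lib/svc/bin/sqlite /tmp/test.db 'PRAGMA integrity_check;'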

I'm not sure if there's any way, at this point, to recover the underlying sqlite2 reason that svc.configd exited. If there isn't, then we should at least open an issue to make it observable in the future.

@morlandi7

Possibly related to #7209
