
instance stuck in starting, propolis zone fails to come up #7257

Open · leftwo opened this issue Dec 15, 2024 · 3 comments


leftwo commented Dec 15, 2024

On dogfood (rack2), after a mupdate to 020fde1, I have an instance that is stuck in starting.

The propolis zone responsible for it is on sled 23 (BRM42220016).
That zone has either started and then failed in some way, or has never fully started:

BRM42220016 # zlogin oxz_propolis-server_f4019957-5d9d-45a0-97d8-e2c9c523a3e7
[Connected to zone 'oxz_propolis-server_f4019957-5d9d-45a0-97d8-e2c9c523a3e7' pts/3]
Last login: Sun Dec 15 02:50:00 on pts/3
The illumos Project     helios-2.0.23049        December 2024
root@oxz_propolis:~# svcs -x
svcs: Could not bind to repository server: repository server unavailable.  Exiting.
root@oxz_propolis:~# ps -ef 
     UID   PID  PPID   C    STIME TTY         TIME CMD
  netadm  9033  8945   0 01:15:01 ?           0:00 /lib/inet/ipmgmtd
    root  8962  8945   0 01:14:59 ?           0:00 /sbin/init
    root  9082  9081   0        - ?           0:00 <defunct>
    root  9237  9236   0        - ?           0:00 <defunct>
    root  8945  8945   0 01:14:59 ?           0:00 zsched
    root  9236  9079   0 01:15:27 ?           0:00 sulogin
    root  9079  8945   0 01:15:06 ?           0:00 /lib/svc/bin/svc.startd
    root  9081  9079   0 01:15:06 console     0:00 sulogin
  netcfg  9026  8945   0 01:15:01 ?           0:00 /lib/inet/netcfgd
    root 15039 15014   0 16:23:34 pts/3       0:00 ps -ef
    root 15013  8945   0 16:23:26 pts/3       0:00 /usr/bin/login -z global -f root
    root 15014 15013   0 16:23:26 pts/3       0:00 -bash
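
Note that svc.configd is absent from that process list while sulogin is running on the console, which is typically what svc.startd falls back to when it cannot start the repository daemon. The same state can be confirmed from the global zone without logging in; a rough sketch, assuming the standard illumos zone/proc tools on helios:

BRM42220016 # zoneadm list -cv | grep f4019957
BRM42220016 # pgrep -l -z oxz_propolis-server_f4019957-5d9d-45a0-97d8-e2c9c523a3e7
BRM42220016 # pgrep -lf -z oxz_propolis-server_f4019957-5d9d-45a0-97d8-e2c9c523a3e7 svc.configd

An empty result from the last pgrep would line up with the zone being stuck at sulogin rather than the propolis server itself crashing.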

leftwo commented Dec 15, 2024

This same system has a whole pile of core files from the crucible agent:

BRM42220016 # ls -lh /pool/*/*/crypt/debug/core\.*
-bash: /usr/bin/ls: Arg list too long
BRM42220016 # pwd 
/pool/ext/9005671f-3d90-4ed1-be15-ad65b9a65bd5/crypt/debug
BRM42220016 # ls core.oxz_crucible_a3ef7eba-c08e-48ef-ae7a-89e2fcb49b66.crucible-agent.* | wc -l
    6087

So, more than just propolis is having trouble.
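
Since the glob blows past the argv limit, find is an easy way to count the agent cores across all of the debug datasets, and pstack on one of them would give a first hint at what the agent is dying on. A sketch only; the name pattern is illustrative and based on the example above:

BRM42220016 # find /pool/ext/*/crypt/debug -name 'core.oxz_crucible_*.crucible-agent.*' | wc -l
BRM42220016 # find /pool/ext/*/crypt/debug -name 'core.oxz_crucible_*.crucible-agent.*' | head -1 | xargs pstack

(The shell only has to expand the dataset roots here, so the argument list stays small.)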


citrus-it commented Dec 15, 2024

I took a quick look and it appears that svc.configd failed to come up successfully: it exited with a "database corrupt" error. That error is reported for any unknown/unhandled error coming back from the sqlite2 backend.

root@oxz_propolis:/var/svc/log# mdb -p `pgrep startd`
Loading modules: [ svc.startd libumem.so.1 libuutil.so.1 libnvpair.so.1 ]
> $C
080477a8 zfs.171.65538.3420`__waitid+7(0, 2414, 80477c0, 3, 3, 0)
08047848 zfs.171.65538.3420`waitpid+0x45(2414, 0, 0)
08047878 fork_sulogin+0x78(0, 807b918)
080478e8 emi_is_disabled+0x3a()
08047d18 fork_emi+0x25()
08047d38 startup+0x214()
08047e48 main+0x204(feef1807, fef68704)
08047e88 _start_crt+0x9a(1, 8047eb8, fefceafc, 0, 0, 0)
08047eac _start+0x1a(1, 8047f6c, 0, 8047f84, 8047fa4, 8047fbc)
> ::walk thread
1
2
> 2::findstack -v
stack pointer for thread 2: feceea58
[ feceea58 zfs.171.65538.3420`__waitid+7() ]
  feceeaf8 zfs.171.65538.3420`waitpid+0x45(2379, 0, 0)
  feceeb28 fork_sulogin+0x78(0, 807baac)
  feceef58 fork_configd+0x45(6800)
  feceefc8 fork_configd_thread+0x326(1d1e)
  feceefe8 zfs.171.65538.3420`_thrp_setup+0x7f(febe0240)
  feceeff8 zfs.171.65538.3420`_lwp_start(febe0240, 0, 0, 0, 0, 0)
> 807baac/s
0x807baac:      svc.configd exited with database corrupt error after initialization of the repository
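
One more breadcrumb from that stack, assuming the 6800 passed to fork_configd is the raw wait status that waitpid returned for svc.configd: decoding it gives the child's exit code, which would narrow down which configd exit path was hit (I haven't confirmed the mapping from that code back to the specific error here):

root@oxz_propolis:~# printf '%d\n' $(( (0x6800 >> 8) & 0xff ))
104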

The repository database itself looks intact:

BRM42220016 # cp /etc/svc/repository-boot /tmp/test.db
BRM42220016 # /lib/svc/bin/sqlite /tmp/test.db .dump > /tmp/test.dump
BRM42220016 # wc -l /tmp/test.dump
   20839 /tmp/test.dump

BRM42220016 # cp /etc/svc/repository.db /tmp/test.db
BRM42220016 # /lib/svc/bin/sqlite /tmp/test.db .dump > /tmp/test.dump
BRM42220016 # wc -l /tmp/test.dump
   21471 /tmp/test.dump
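
Besides dumping, the bundled sqlite2 binary should be able to run an integrity check on those copies, which might surface a more specific complaint than a clean dump does; this assumes the SQLite 2.8-era build behind /lib/svc/bin/sqlite supports the integrity_check pragma:

BRM42220016 # /lib/svc/bin/sqlite /tmp/test.db 'PRAGMA integrity_check;'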

I'm not sure if there's any way, at this point, to recover the underlying sqlite2 reason that svc.configd exited. If there isn't, then we should at least open an issue to make it observable in the future.

@morlandi7

Possibly related to #7209
