A cockroachdb service failed to start up after mupdate to c8f8332bc #7221
tl;dr: the ZFS pool for this disk is out of space because it's full of Crucible regions whose reservations add up to almost the entire disk. Issues uncovered here:
I'll add some notes about how we got here shortly.
On the ballast: I found that CockroachDB created the ballast (it lives at …).
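For context, a hedged sketch of why that matters, assuming the usual CockroachDB behavior: the automatically-created ballast exists so it can be deleted to free space in a disk-full emergency and recreated once things recover. The path below is a placeholder, not the location from this zone:

```sh
# Emergency recovery sketch: delete the ballast to reclaim space immediately.
# (Placeholder path; the real location is whatever CockroachDB created above.)
rm /data/cockroach/auxiliary/EMERGENCY_BALLAST
# Once space pressure is resolved, a ballast can be recreated manually.
cockroach debug ballast /data/cockroach/auxiliary/EMERGENCY_BALLAST --size=1GB
```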
Debugging process: it's surprising that the log file pasted above does not include a message from SMF saying that it was putting the service into maintenance. It usually logs something about all processes in the service exiting, or something dumping core, or the like. The last entry in the log file is from 2024-12-10T16:13:26Z, saying the start method exited with status 0 (success). I went looking for notes in other log files: /var/adm/messages, /var/log/syslog*, and /var/svc/log/svc.startd.log are all empty files. I don't know what made me think of this, but I noticed that the system was almost out of space in /var:
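For anyone retracing this, the checks amount to roughly the following (illustrative commands, not the output that was pasted here):

```sh
# Is /var (and therefore its pool) nearly full?
df -h /var
# The other usual places a reason might have been logged.
ls -l /var/adm/messages /var/log/syslog* /var/svc/log/svc.startd.log
# svcs -xv would normally say why a service went into maintenance, had SMF
# recorded a reason.
svcs -xv
```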
@leftwo noticed that's also true of the "/data" dataset:
This made me suspect the ZFS pool is out of space. When a dataset has no reservation or quota, its available space is usually whatever is left in the pool. The available space is also what's left in the pool if that number is less than the zone's unused reservation, if it has a reservation. So when multiple datasets have almost no space available, that suggests the pool's available space is the limiting factor. Here are the space stats in a different, working CockroachDB zone:
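The comparison is easy to make with `zfs list` inside each zone (a sketch; nothing zone-specific assumed):

```sh
# used vs. avail for the zone's datasets, plus any reservation/quota that
# would explain the numbers.
zfs list -o name,used,avail,refer,reservation,quota
df -h /data /var
```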
Similar amount of space used, but much more space available. That's consistent with being low on pool space on the broken node. Which pool is it?
So it's that pool. It is indeed almost out of space. Almost all of the space is used by Crucible regions:
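The same picture can be reconstructed with a few commands (the pool name and mountpoint below are placeholders):

```sh
# Which pool backs the zone's datasets, and how full is it?
zfs list -o name,used,avail,mountpoint /data
zpool list oxp_EXAMPLE
# Largest consumers within the pool, by reservation (parseable output so the
# numeric sort works).
zfs list -Hp -r -o name,used,reservation oxp_EXAMPLE | sort -rn -k3 | head -20
```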
The Crucible regions all have a reservation and quota, but none of the other filesystems has a reservation and most don't have a quota either:
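A quick way to confirm the reservation/quota picture across the whole pool (placeholder pool name again):

```sh
# Reservation and quota for every dataset in the pool; the Crucible region
# datasets should be the only ones showing values other than "none".
zfs get -r -o name,property,value reservation,quota oxp_EXAMPLE
```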
This basically took us to the summary in my previous comment.
Space was freed up on the ZFS filesystem, and afterward a restart of the service resulted in things working again. I verified it by pointing omdb at this specific instance of cockroachdb and was able to get a response:
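Roughly what the recovery and verification looked like; the FMRI, address, and omdb subcommand below are illustrative guesses, not copied from the system:

```sh
# In the cockroachdb zone: clear any maintenance state and restart the service.
svcadm clear svc:/oxide/cockroachdb:default      # FMRI is a guess
svcadm restart svc:/oxide/cockroachdb:default
# Point omdb at this specific CockroachDB node instead of letting it discover
# one via DNS; any query returning data confirms the node is serving again.
omdb db --db-url 'postgresql://root@[fd00:1122:3344:109::3]:32221/omicron' sleds
```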
Filed this issue for cockroachdb logs living outside the law.
With the discovery that the filesystem was out of space, and the resumption of service after freeing up space, I think we can close this issue.
Dogfood rack was mupdated to omicron commit c8f8332
After the mupdate, everything came online except for one cockroachdb zone on sled 17:
The log does not provide much info other than that processes exited and then restarted:
No core files were found in the expected places.
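The sort of checks involved there, from the global zone (the zone name and log path are placeholders/guesses):

```sh
# SMF view of the failing zone's services.
zlogin oxz_cockroachdb_EXAMPLE svcs -xv
# The service's own log (path guessed from the usual SMF naming scheme).
zlogin oxz_cockroachdb_EXAMPLE tail -n 50 /var/svc/log/oxide-cockroachdb:default.log
# coreadm shows where core files would have landed if anything dumped core.
zlogin oxz_cockroachdb_EXAMPLE coreadm
```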