-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dataset UUID conflicts (blueprint vs sled agent) #7265
Comments
#7266 is my proposed fix here, but it has drawbacks. However, my hope is that:
Unfortunately, it does come with some downsides:
|
I balked at it initially, but I'm starting to come around to @davepacheco's suggestion to change RSS to assign dataset UUIDs to match zone UUIDs, which works around this from the other direction: it makes sled-agent's incorrect choice incidentally correct. I think we could recover from this in the blueprint system in the long run (by checking for equal IDs and generating new dataset UUIDs, once we know sled-agent won't clobber them). Happy to chat more tomorrow. |
What about systems with blueprints that have datasets UUIDs != zone UUIDs? Are we making the assumption that these don't exist, in this proposal? EDIT: ah, nevermind, clearly misread:
So presumably, we would just make the blueprint system "handle all cases" - it might have dataset UUIDs == zone UUIDs, or it might not |
Just to confirm, I am seeing this again on madrid after installing 020fde1 and double-checking that I did clean slate properly. The symptoms are the same as we observed on dublin. The correct dataset ID for this particular Crucible dataset is
And blueprint execution fails in the same way:
|
I was assuming that, yeah. I thought prior to R12 there weren't any systems that had blueprints with dataset UUIDs? |
Hmm, I think I'm missing something. The dataset that omicron/sled-agent/src/sled_agent.rs Line 961 in 4a4d6ca
lands in the OmicronZoneTypeExt trait where it only returns Is there a separate path in sled-agent that ensures transient datasets exist? |
Yes. There is a different mechanism for picking the transient dataset: omicron/sled-agent/src/services.rs Lines 3698 to 3711 in 4a4d6ca
|
Closing the loop a little for history's sake: we had a meeting about this, and are planning on going through the #7266 route (do not overwrite the dataset UUID, and keep checking it, but stop setting it to the zone ID in the |
- Avoids overwriting the value of "dataset UUID" when creating datasets from `omicron_zones_ensure`. Instead, don't set any dataset UUID, which lets subsequent calls to `datasets_ensure` set the right value here. Fixes #7265
Context
In the limit, we have the following goals:
However, for backward compatibility, the following code exists in
omicron_zones_ensure
:https://github.com/oxidecomputer/omicron/blob/main/sled-agent/src/sled_agent.rs#L965-L970
The Sled Agent, during zone creation, "ensures" that necessary datasets exist. This is needed for backwards compatibility, because we do not know that all deployed systems will have transient datasets that get created prior to their zones launching. (We have a PR ready to cut this code out -- see: #7160 -- but it hasn't merged yet because systems exist in-field with old blueprints that do not know about datasets. That's why this PR was actually reverted in #7157 -- we still want it, but we need to ensure all systems are ready to forgo this "dataset ensuring within zone ensure" call).
Issue
The issue we're seeing arises from a conflict of "source-of-truth" for dataset UUIDs.
UUID::new_v4()
callomicron_zones_ensure
, within https://github.com/oxidecomputer/omicron/blob/main/sled-agent/src/sled_agent.rs#L965-L970, the dataset UUID is set to "whatever the zone UUID is".This can manifest errors in subtle ways. @iliana observed this in a new rack, where the "dataset ensure" step of blueprint execution failed.
With the code block here:
https://github.com/oxidecomputer/omicron/blob/main/sled-agent/src/sled_agent.rs#L965-L970
We effectively override the "dataset UUID" value for all datasets to the "zone UUID" value whenever we call
omicron_zones_ensure
.However, whenever we try to execute the blueprint -- and call
datasets_ensure
-- we do not use the zone UUID, and instead, rely on the value explicitly in the blueprint. Unfortunately, if this is not the zone UUID, it throws the error @iliana saw: "dataset ... exists with a different uuid".The text was updated successfully, but these errors were encountered: