500 error caused by CRDB TransactionRetryWithProtoRefreshError during concurrent disk deletion #7208
I'm surprised that these aren't being retried:
One thing that would explain it: are you seeing that these are being retried 10 times? (See omicron/nexus/db-queries/src/transaction_retry.rs, lines 159 to 169, at 37c7f18.)
I haven't snooped the CRDB queries, but this likely has to do with the slow disk deletion issue. It's conceivable that the number of retries has been exhausted for some of the disks when multiple of them are being deleted. The slow query I referenced in that ticket is unfortunately hitting the same rows in the
We should automatically be producing metrics for oximeter when transactions are retried. The schema is defined in: https://github.com/oxidecomputer/omicron/blob/1922c8ecb8a667c9b1efa876f75d72a7437b3c77/oximeter/oximeter/schema/database-transaction.toml. This should include both the name of the transaction and how many times it has been retried.
Following https://github.com/oxidecomputer/omicron/blob/main/docs/clickhouse-debugging.adoc, I'm taking a look at this for dogfood. From the clickhouse zone, I ran the following query:
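The query is roughly of this shape (the table and column names below are illustrative stand-ins, not the exact oximeter ClickHouse schema):

```sql
-- Sketch only: "transaction_retry_samples", "transaction_name", and "attempts"
-- are illustrative names, not the actual oximeter ClickHouse tables.
SELECT
    transaction_name,
    attempts,
    timestamp
FROM transaction_retry_samples
WHERE attempts >= 5
ORDER BY timestamp ASC;
```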
This is basically showing all transactions that take "five or more attempts to finish", ordered by "when they happened". Here are the results:
Notably:
It is, yes. The unit is specified in the TOML file. Also, these data are all available via
To make sure conversations from chats are getting recorded here: looking for indices under contention (https://www.cockroachlabs.com/docs/v22.1/crdb-internal#view-the-tables-indexes-with-the-most-time-under-contention) returns:
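For reference, the query on that docs page is approximately the following; it lists the indexes that have accumulated the most contention:

```sql
-- Approximate form of the query from the linked CockroachDB docs page.
SELECT *
FROM crdb_internal.cluster_contended_indexes;
```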
I'm going to try to summarize some of our observations here. Other folks on the call can feel free to respond here, or edit this message directly.

Initial Observations
Questions
Further Investigation
Hypotheses

Why are there so many statements happening within this transaction? Which transaction is it?

We suspect this giant transaction is happening within
Many of these sagas are being triggered by region replacement background tasks that were recently added:
Why is this work happening over and over again, instead of completing successfully?
Testing

This theory is contingent on the ~40 volume deep disk taking a long time, and the region replacement work being stuck attempting to access Crucible on an expunged device. @leftwo is going to go through the process of deleting the disks/snapshots in that long chain, which had volumes related to the unreachable crucibles. Although fixing #7209 should give us the same result, this will help identify whether or not the "very deep VCR record" is actually to blame for the extended contention.
After the long chain of disk/snapshot deletion, a single disk deletion takes about 30s, which is an improvement (from >70s) but is still much longer than before. So there is still something we need to track down here. One thing I noticed is that the query used for selecting region replacement candidates takes a lot longer on rack2 compared to rack3 (the latter has >2x the datasets / pools / physical disks):
I ran some simple select statements against each of the tables involved and
whereas the rack2 one doesn't use the
I wonder if implementing the first index recommendation, i.e.,
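One way to check whether a candidate index actually gets picked up is to compare the plans on both racks with EXPLAIN ANALYZE. The query below is only a hypothetical stand-in for the real candidate-selection query, and the column names are assumptions about the schema:

```sql
-- Hypothetical stand-in for the region replacement candidate query;
-- run on both racks and compare which indexes each plan chooses.
EXPLAIN ANALYZE
SELECT dataset.id, dataset.pool_id, dataset.size_used
FROM dataset
JOIN zpool ON zpool.id = dataset.pool_id
WHERE dataset.time_deleted IS NULL
ORDER BY dataset.size_used ASC
LIMIT 10;
```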
A few notes, though I apologize as I haven't had a chance to listen to the call!
Note that when booting an instance, propolis will perform this scrub for every Volume it activates: https://github.com/oxidecomputer/propolis/blob/220a6f367c18f2452dbc4fa9086f3fe73b961739/lib/propolis/src/block/crucible.rs#L104. It's true that we don't do this for all unattached disks though.
I agree with the hypothesis stated here:
The thing that's stuck in a loop in #7209 is the region snapshot replacement start saga, and the volume being deleted when that saga unwinds is hard deleted, which doesn't invoke the volume delete sub saga (which would invoke the soft delete).
@askfongjojo: I'm trying to poke at the datasets table. I think that's just one of many tables potentially modified during region replacement (we change 'size used' during allocation/deallocations), which happens to get caught in the contention. I ran

My suspicion is that this is because the table is also being locked by a long running transaction that's occasionally popping up. As of December 6th, 10:15 AM PST, I'm still seeing multi-thousand-line-long transactions appearing in CockroachDB via

Here's
This still appears related to region snapshotting - I believe that query comes from

This
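For reference, one way to surface those long, statement-heavy transactions from the CockroachDB SQL shell (exact columns vary by version) is something like:

```sql
-- Open transactions ranked by how many statements they have executed so far.
SELECT node_id, application_name, num_stmts, num_retries
FROM crdb_internal.cluster_transactions
ORDER BY num_stmts DESC
LIMIT 10;
```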
I tried to pull up volumes that still have expensive VCRs, with:
This still looks like a ~40 layer deep VCR to me. Unsure if this is the data being operated upon, but it still exists.
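A query of roughly this shape is enough to rank volumes by VCR size; it assumes the VCR JSON is stored in the volume table's data column, and is a sketch rather than the exact query used above:

```sql
-- Sketch: rank non-deleted volumes by the size of their serialized VCR.
-- Assumes the VCR JSON lives in volume.data as text.
SELECT id, octet_length(data) AS vcr_bytes
FROM volume
WHERE time_deleted IS NULL
ORDER BY vcr_bytes DESC
LIMIT 10;
```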
Looking into this, we discovered the following at omicron/nexus/db-queries/src/db/datastore/volume.rs, lines 3722 to 3726 (at d01b2ee):
This scans every volume in the database and performs invariant checks on each one. This function (guarded by the feature
#5799 modified the `cargo build` command line omicron-package runs. Previously it built up a list of packages to be built using the `-p` flag; that PR changed it to use `--bin`. The goal was to build only the binaries that are necessary for shipping; this avoids building sled-agent-sim during releng, for instance. We did not realize it at the time, but this invited the specter of rust-lang/cargo#8157 to wreak havoc; namely:

- Without `--package`, Cargo uses the `default-members` key of the workspace Cargo.toml to determine which packages to build. `--bin` does not cause the same thing to happen; saying `--bin` does _not_ imply `--package [the package that the bin belongs to]`.
- `omicron-dev` belongs to `default-members` and has a normal dependency on `nexus-test-utils`, which enables the `"testing"` feature of `nexus-db-queries`.

#7208 is a known result of this problem, but there might be more. Fortunately the solution seems fairly easy, without reverting the relevant changes from #5799: use _both_ `--package` and `--bin`. With this change, the `"testing"` feature is no longer shown in the `cargo build --unit-graph` output, and `nm target/release/nexus | demangle | grep validate_volume_invar` no longer shows any matching testing-only symbols.
On omicron commit 37c7f18, I ran into 4 disk deletion failures out of 20 instances when using Terraform (with default concurrency) to destroy instances and disks. The same TF plan used to work without error on up to 120 instances/disks consistently.