Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pantry crashed when processing a region with mismatched information #6353

Closed
askfongjojo opened this issue Aug 15, 2024 · 15 comments · Fixed by #7046
Closed

Pantry crashed when processing a region with mismatched information #6353

askfongjojo opened this issue Aug 15, 2024 · 15 comments · Fixed by #7046
Assignees
Milestone

Comments

@askfongjojo
Copy link

This issue was seen on rack2 after it was updated to omicron commit 4a6be3a40f6c9f876ff53308532f6c37f030b13f. Here are the log lines at the time pantry crashed:

angela@jeeves /staff/core/omicron-pantry-crash $ tail -15 oxide-crucible-pantry\:default.log.1723665490 | looker
19:58:02.159Z INFO crucible-pantry (datafile): client task is exiting
     = downstairs
    client = 2
    session_id = 4b49aac8-6f89-466b-8df4-ce378ea59121
19:58:09.762Z INFO crucible-pantry (datafile): connecting to [fd00:1122:3344:10a::7]:19010
     = downstairs
    client = 0
    session_id = ba4bf6b6-8a2b-4d5e-b861-8ce61b8119a7
19:58:09.762Z INFO crucible-pantry (datafile): ds_connection connected from Ok([fd00:1122:3344:105::4]:41371)
     = downstairs
    client = 0
    session_id = ba4bf6b6-8a2b-4d5e-b861-8ce61b8119a7
19:58:09.762Z INFO crucible-pantry (datafile): ds_transition from New to WaitActive
     = downstairs
    client = 0
    session_id = ba4bf6b6-8a2b-4d5e-b861-8ce61b8119a7
19:58:09.762Z INFO crucible-pantry (datafile): [0] Transition from New to WaitActive
     = downstairs
    client = 0
    session_id = ba4bf6b6-8a2b-4d5e-b861-8ce61b8119a7
19:58:09.763Z INFO crucible-pantry (datafile): downstairs client at Some([fd00:1122:3344:10a::7]:19010) has region UUID 6bf93010-3719-41bf-bc51-a03cd4442b89
     = downstairs
    client = 0
    session_id = ba4bf6b6-8a2b-4d5e-b861-8ce61b8119a7
19:58:09.763Z WARN crucible-pantry (datafile): Replace e03d2ffc-792f-4b2f-b1df-7a477b9ea19b with 6bf93010-3719-41bf-bc51-a03cd4442b89 before active
     = downstairs
    client = 0
    session_id = ba4bf6b6-8a2b-4d5e-b861-8ce61b8119a7
thread 'tokio-runtime-worker' panicked at upstairs/src/client.rs:1981:25:
[0] New downstairs region info mismatch: Received(RegionDefinition { block_size: 512, extent_size: Block { value: 131072, shift: 9 }, extent_count: 16, uuid: 79160ed2-b164-444f-853a-3d83242b79d9, encrypted: true, database_read_version: 1, database_write_version: 1 }) vs. RegionDefinition { block_size: 512, extent_size: Block { value: 131072, shift: 9 }, extent_count: 160, uuid: 6bf93010-3719-41bf-bc51-a03cd4442b89, encrypted: true, database_read_version: 1, database_write_version: 1 }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[ Aug 14 19:58:10 Stopping because all processes in service exited. ]
[ Aug 14 19:58:10 Executing start method ("/opt/oxide/pantry/bin/crucible-pantry run -l [fd00:1122:3344:105::4]:17000"). ]
note: configured to log to "/dev/stdout"
19:58:10.415Z INFO crucible-pantry (dropshot): listening
    local_addr = [fd00:1122:3344:105::4]:17000
19:58:10.428Z INFO crucible-pantry: listen IP: [fd00:1122:3344:105::4]:17000

The core file is available in catacomb under /staff/dock/rack2/mupdate-20240815/cores.

@askfongjojo
Copy link
Author

Just an extra data point: I happened to be testing sled expungement (details in this ticket). I saw that the regions expunged, including a snapshot, have been successfully replaced. So it would look like pantry was able to work on other regions despite the crash? I haven't tracked down the exact timing of when the replacements happened.

I found only one non-empty pantry log file in the past 24 hours. It has been uploaded to catacomb:/staff/core/omicron-6353.

@askfongjojo
Copy link
Author

The region actually has something to do with a region replaced in the sled I expunged. Here are some nexus log lines at a later time about this region:

20:03:31.119Z INFO 20b100d0-84c3-4119-aa9b-0c632b0b6a3a (ServerContext): saga create
    dag = {"end_node":9,"graph":{"edge_property":"directed","edges":[[0,1,null],[1,2,null],[2,3,null],[3,4,null],[4,5,null],[5,6,null],[6,7,null],[8,0,null],[7,9,null]],"node_holes":[],"nodes":[{"Action":{"action_name":"common.uuid_generate","label":"GenerateSagaId","name":"saga_id"}},{"Action":{"action_name":"common.uuid_generate","label":"GenerateJobId","name":"job_id"}},{"Action":{"action_name":"region_replacement_drive.set_saga_id","label":"SetSagaId","name":"unused_1"}},{"Action":{"action_name":"region_replacement_drive.drive_region_replacement_check","label":"DriveRegionReplacementCheck","name":"check"}},{"Action":{"action_name":"region_replacement_drive.drive_region_replacement_prepare","label":"DriveRegionReplacementPrepare","name":"prepare"}},{"Action":{"action_name":"region_replacement_drive.drive_region_replacement_execute","label":"DriveRegionReplacementExecute","name":"execute"}},{"Action":{"action_name":"region_replacement_drive.drive_region_replacement_commit","label":"DriveRegionReplacementCommit","name":"commit"}},{"Action":{"action_name":"region_replacement_drive.finish_saga","label":"FinishSaga","name":"unused_2"}},{"Start":{"params":{"request":{"id":"1ed5e8e1-1d98-4420-bef0-135cba510238","new_region_id":"e03d2ffc-792f-4b2f-b1df-7a477b9ea19b","old_region_id":"2bf499e2-a4f4-4e2e-9977-875f4914a7ce","old_region_volume_id":"0a416f49-71c0-4268-b901-d175dcb41384","operating_saga_id":null,"replacement_state":"Running","request_time":"2024-08-14T19:43:13.227734Z","volume_id":"be547a21-6638-4ee9-85a8-6b5de4ff5c75"},"serialized_authn":{"kind":{"Authenticated":[{"actor":{"UserBuiltin":{"user_builtin_id":"001de000-05e4-4000-8000-000000000002"}}},null]}}}}},"End"]},"saga_name":"region-replacement-drive","start_node":8}
    file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/steno-0.4.1/src/sec.rs:1146
    saga_id = 84ed4c5f-01ed-45ec-8b20-17dc0d90cd40
    saga_name = region-replacement-drive
    sec_id = 20b100d0-84c3-4119-aa9b-0c632b0b6a3a
20:03:31.119Z INFO 20b100d0-84c3-4119-aa9b-0c632b0b6a3a (ServerContext): creating saga
    file = nexus/db-queries/src/db/sec_store.rs:50
    saga_id = 84ed4c5f-01ed-45ec-8b20-17dc0d90cd40
    saga_name = region-replacement-drive
20:03:31.125Z INFO 20b100d0-84c3-4119-aa9b-0c632b0b6a3a (ServerContext): starting saga
    file = nexus/src/app/saga.rs:310
    saga_id = 84ed4c5f-01ed-45ec-8b20-17dc0d90cd40
    saga_name = region-replacement-drive
20:03:31.125Z INFO 20b100d0-84c3-4119-aa9b-0c632b0b6a3a (ServerContext): saga start
    file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/steno-0.4.1/src/sec.rs:967
    saga_id = 84ed4c5f-01ed-45ec-8b20-17dc0d90cd40
    saga_name = region-replacement-drive
    sec_id = 20b100d0-84c3-4119-aa9b-0c632b0b6a3a
20:03:31.125Z INFO 20b100d0-84c3-4119-aa9b-0c632b0b6a3a (ServerContext): 1ed5e8e1-1d98-4420-bef0-135cba510238: drive saga started ok
    background_task = region_replacement_driver
    file = nexus/src/app/background/tasks/region_replacement_driver.rs:135
20:03:31.125Z DEBG 20b100d0-84c3-4119-aa9b-0c632b0b6a3a (ServerContext): roles
    background_task = region_replacement_driver
    roles = RoleSet { roles: {} }
20:03:31.125Z DEBG 20b100d0-84c3-4119-aa9b-0c632b0b6a3a (ServerContext): authorize result
    action = Query
    actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000002, .. })
    background_task = region_replacement_driver
    resource = Database
    result = Ok(())
20:03:31.127Z INFO 20b100d0-84c3-4119-aa9b-0c632b0b6a3a (ServerContext): preparing saga
    file = nexus/src/app/saga.rs:212
    saga_id = 6809a370-e9b7-49d7-9d9d-b23bb0c7bca4
    saga_name = region-replacement-drive

Apparently it's a replacement of the region 2bf499e2-a4f4-4e2e-9977-875f4914a7ce on the expunged sled. It's unclear why pantry thinks it needs to replace a newly created region.

@leftwo
Copy link
Contributor

leftwo commented Aug 15, 2024

Just an extra data point: I happened to be testing sled expungement (details in this ticket). I saw that the regions expunged, including a snapshot, have been successfully replaced.

The code to replace a snapshot is not in yet, so if it was reported that a snapshot was "fixed" it is not telling the truth here.

@leftwo
Copy link
Contributor

leftwo commented Aug 16, 2024

Looking at dogfood now, OMDB shows region replacements in progress (for some time now):

root@oxz_switch0:~# omdb db region-replacement list
Region replacement requests
ID                                   REQUEST_TIME                   REPLACEMENT_STATE
07a289e6-af50-42f1-8888-062229547888 2024-08-14 19:43:13.318246 UTC Running
08c449a6-25c1-46f8-b070-9223cb924cb0 2024-08-14 19:43:13.245003 UTC Complete
1ed5e8e1-1d98-4420-bef0-135cba510238 2024-08-14 19:43:13.227734 UTC Running
20811f9d-f6ed-40fe-8403-305edc4b3e8c 2024-08-14 19:43:13.291419 UTC Complete
52bd1034-2a77-4a17-8871-77e535b509d5 2024-08-14 19:43:13.259033 UTC Complete
5c43d936-d5c4-4d4e-b1ca-890aabeeae87 2024-08-14 19:43:13.357786 UTC Requested
915f8726-a882-4960-9e5c-dfe359f54911 2024-08-14 19:43:13.279 UTC    Complete
a4e697b6-5f02-4f75-ad41-f5bc3e906f0b 2024-08-14 19:43:13.344742 UTC Complete
c2b5ef9d-a89a-46c8-b57e-1eebb5487804 2024-08-14 19:43:13.331527 UTC Complete
f0e009b5-1575-42f0-bca5-7df6db846ef0 2024-08-14 19:43:13.304655 UTC Complete
root@oxz_switch0:~# date
Thu Aug 15 23:48:06 UTC 2024

From the pantry, we have the line:

9:58:09.763Z WARN crucible-pantry (datafile): Replace e03d2ffc-792f-4b2f-b1df-7a477b9ea19b with 6bf93010-3719-41bf-bc51-a03cd4442b89 before active

Of all the region-replacement tasks, none are replacing e03d2ffc with 6bf93010.
The closest we have is, what looks like a prior swap for ID: 1ed5e8e1-1d98-4420-bef0-135cba510238

      started: 2024-08-14 19:43:13.227734 UTC
        state: Running
old region id: 2bf499e2-a4f4-4e2e-9977-875f4914a7ce
new region id: Some(e03d2ffc-792f-4b2f-b1df-7a477b9ea19b)

Repair notifications
TIME                           REPAIR_ID                            REPAIR_TYPE    UPSTAIRS_ID                          SESSION_ID                           NOTIFICATION_TYPE
2024-08-14 19:43:26.616434 UTC dbe5eac5-3324-4769-b045-9c6841d8a22b Reconciliation 9da060c6-a336-4849-82c9-2be5da6bcb3d ba4bf6b6-8a2b-4d5e-b861-8ce61b8119a7 Started
2024-08-14 19:43:29.615747 UTC dbe5eac5-3324-4769-b045-9c6841d8a22b Reconciliation 9da060c6-a336-4849-82c9-2be5da6bcb3d ba4bf6b6-8a2b-4d5e-b861-8ce61b8119a7 Failed

progress:        ███░░░░ [     19/     36]

Repair steps
TIME                           STEP_TYPE DETAILS
2024-08-14 19:43:26.504837 UTC Pantry    address Some(fd00:1122:3344:105::4):Some(SqlU16(17000)) job Some(118f3992-99bb-4a48-a594-b50771a8a696)

No region with 6bf93010 shows up in the region-repair list.

@leftwo
Copy link
Contributor

leftwo commented Aug 16, 2024

With the three regions I'm tracking so far:
e03d2ffc-792f-4b2f-b1df-7a477b9ea19b,
6bf93010-3719-41bf-bc51-a03cd4442b89,
2bf499e2-a4f4-4e2e-9977-875f4914a7ce

Looking for any of our regions, I only find one:

root@oxz_switch0:~# omdb db region list 2> /dev/null | egrep "id|6bf93010"                                           
 id                                   | dataset_id                           | volume_id                            | block_size | blocks_per_extent | extent_count | read_only 
root@oxz_switch0:~#                                     
root@oxz_switch0:~# omdb db region list 2> /dev/null | egrep "id|2bf499e2"                                           
 id                                   | dataset_id                           | volume_id                            | block_size | blocks_per_extent | extent_count | read_only 
 2bf499e2-a4f4-4e2e-9977-875f4914a7ce | 3ec0e848-39de-495e-be0e-88241e11d0fb | 0a416f49-71c0-4268-b901-d175dcb41384 | 512        | 131072            | 16           | false     
root@oxz_switch0:~#                                                                                                                                                                                                                     
root@oxz_switch0:~# omdb db region list 2> /dev/null | egrep "id|e03d2ffc"                                           
 id                                   | dataset_id                           | volume_id                            | block_size | blocks_per_extent | extent_count | read_only 

This could make sense I think if the saga failed for a replacement, the new region would be deleted when things unwound.
For the region I do see in the regions output, it reports to be part of this volume: 0a416f49-71c0-4268-b901-d175dcb41384:

root@oxz_switch0:~# omdb db volumes info 0a416f49-71c0-4268-b901-d175dcb41384 2> /dev/null
ID                                   CREATED                        MODIFIED                       DELETED 
0a416f49-71c0-4268-b901-d175dcb41384 2024-08-14 19:43:22.478364 UTC 2024-08-14 19:43:22.478364 UTC NULL    
ID                                   BS SUB_VOLUMES READ_ONLY_PARENT 
0a416f49-71c0-4268-b901-d175dcb41384 0  1           false            

SUB VOLUME 0
    ID                                   BS BPE EC GEN READ_ONLY 
    0a416f49-71c0-4268-b901-d175dcb41384 0  0   0  0   false     
    [fd00:1122:3344:124::22]:19000

That looks a lot like a temporary volume used during replacement.

@jmpesp
Copy link
Contributor

jmpesp commented Aug 16, 2024

Having a look now

@leftwo
Copy link
Contributor

leftwo commented Aug 16, 2024

Some rough notes James collected during a debugging session with myself, @jmpesp, @iximeow, and @papertigers :

TODO:

  • issues for timeout when snapshot was expunged in omdb db validate-region-snapshots
  • pin in volume references failures - orthogonal to pantry crash
  • pin in 'region snapshot was deleted, please remove its record' output
  • region replacement should try again if operation fails
  • Make searching all nexus logs for a specific ID easier
  • omdb db sagas show also dump start params
  • omdb command for upstairs repair notifications

do some validate checks:

root@oxz_switch0:~# omdb db validate validate-volume-references
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:109::3]:32221,[fd00:1122:3344:105::3]:32221,[fd00:1122:3344:10b::3]:32221,[fd00:1122:3344:107::3]:32221,[fd00:1122:3344:108::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (87.0.0)
 dataset_id                            region_id                             snapshot_id                           error                                               
 c6fde82d-8dae-4ef0-b557-6c3d094d9454  131b0034-74f0-45de-b2e8-44c1a7c34a60  98f93626-f2ac-409f-9371-5666755cf1e3  record has 0 volume references when it should be 1!

seems orthogonal?

root@[fd00:1122:3344:105::3]:32221/omicron> select * from region where dataset_id in (select id from dataset where pool_id in (select id from zpool where sled_id = '1efda86b-caef-489f-9792-589d7677e59a'));
                   id                  |         time_created          |         time_modified         |              dataset_id              |              volume_id               | block_size | blocks_per_extent | extent_count | port  | read_only
---------------------------------------+-------------------------------+-------------------------------+--------------------------------------+--------------------------------------+------------+-------------------+--------------+-------+------------
  ff29fe1d-1b09-486a-ab17-eddcd351921c | 2024-04-25 04:23:09.743311+00 | 2024-04-25 04:23:09.743311+00 | eb71bb55-37fb-4fc4-bdee-a5382a480271 | e2647e08-fdf0-4e8c-961d-a9b87730be2d |        512 |            131072 |           64 | 19000 |   false
  2bf499e2-a4f4-4e2e-9977-875f4914a7ce | 2024-04-25 03:01:58.144942+00 | 2024-04-25 03:01:58.144942+00 | 3ec0e848-39de-495e-be0e-88241e11d0fb | 0a416f49-71c0-4268-b901-d175dcb41384 |        512 |            131072 |           16 | 19000 |   false
  a649771a-179a-4dcb-8a70-02c29ea4de82 | 2024-05-01 19:40:19.140847+00 | 2024-05-01 19:40:19.140847+00 | 09f14045-df78-447a-b7d8-217e0ca8ee09 | fd601a67-e348-4457-b8f1-803d889c5d6f |       2048 |             32768 |           16 | 19002 |   false
(3 rows)c
root@[fd00:1122:3344:105::3]:32221/omicron> select * from region_snapshot where dataset_id in (select id from dataset where pool_id in (select id from zpool where sled_id = '1efda86b-caef-489f-9792-589d7677e59a'));
               dataset_id              |              region_id               |             snapshot_id              |         snapshot_addr          | volume_references | deleting
---------------------------------------+--------------------------------------+--------------------------------------+--------------------------------+-------------------+-----------
  09f14045-df78-447a-b7d8-217e0ca8ee09 | a649771a-179a-4dcb-8a70-02c29ea4de82 | 09d5d8f1-e39a-44dd-addc-d36ab935e592 | [fd00:1122:3344:124::24]:19003 |                 1 |  false
  3ec0e848-39de-495e-be0e-88241e11d0fb | 2bf499e2-a4f4-4e2e-9977-875f4914a7ce | 39618f4c-0040-4a8b-9faf-0391a560d960 | [fd00:1122:3344:124::22]:19001 |                 1 |  false
  eb71bb55-37fb-4fc4-bdee-a5382a480271 | ff29fe1d-1b09-486a-ab17-eddcd351921c | 59bdabdd-aa9a-47b5-b263-f39ce6a420d0 | [fd00:1122:3344:124::27]:19001 |                 2 |  false
(3 rows)

== issue: new region id is gone!

omdb db region-replacement info 07a289e6-af50-42f1-8888-062229547888

      started: 2024-08-14 19:43:13.318246 UTC
        state: Running
old region id: a649771a-179a-4dcb-8a70-02c29ea4de82
new region id: Some(a79e2116-dc81-45ae-af24-46a732e187ef)

root@[fd00:1122:3344:105::3]:32221/omicron> select * from region where id in ('a649771a-179a-4dcb-8a70-02c29ea4de82', 'a79e2116-dc81-45ae-af24-46a732e187ef');
                   id                  |         time_created          |         time_modified         |              dataset_id              |              volume_id               | block_size | blocks_per_extent | extent_count | port  | read_only
---------------------------------------+-------------------------------+-------------------------------+--------------------------------------+--------------------------------------+------------+-------------------+--------------+-------+------------
  a649771a-179a-4dcb-8a70-02c29ea4de82 | 2024-05-01 19:40:19.140847+00 | 2024-05-01 19:40:19.140847+00 | 09f14045-df78-447a-b7d8-217e0ca8ee09 | fd601a67-e348-4457-b8f1-803d889c5d6f |       2048 |             32768 |           16 | 19002 |   false
(1 row)

region-replacement-start (fe446f50-e43b-46cb-86b5-1b1651c10edd) saga shows it was ensured ok:

fe446f50-e43b-46cb-86b5-1bfe446f50-e43b-46cb-86b5-1b1651c10edd | 2024-08-14T19:43:20.646Z |          |   6: region_replacement_start.new_region_ensure                 | succeeded  | "ensured_dataset_and_region" => [{"identity":{"id":"46d1afcc-cc3f-4b17-aafc-054dd4862d15","time_created":"2023-08-30T18:59:10.758515Z","time_modified":"2023-08-30T18:59:10.758515Z"},"ip":"fd00:1122:3344:106::5","kind":"Crucible","pool_id":"7f778610-7328-4554-98f6-b17f74f551c7","port":32345,"rcgen":1,"size_used":386547056640,"time_deleted":null},{"block_size":2048,"encrypted":true,"extent_count":16,"extent_size":32768,"id":"a79e2116-dc81-45ae-af24-46a732e187ef","port_number":19013,"read_only":false,"state":"created"}]
1651c10edd | 2024-08-14T19:43:20.646Z |          |   6: region_replacement_start.new_region_ensure                 | succeeded  | "ensured_dataset_and_region" => [{"identity":{"id":"46d1afcc-cc3f-4b17-aafc-054dd4862d15","time_created":"2023-08-30T18:59:10.758515Z","time_modified":"2023-08-30T18:59:10.758515Z"},"ip":"fd00:1122:3344:106::5","kind":"Crucible","pool_id":"7f778610-7328-4554-98f6-b17f74f551c7","port":32345,"rcgen":1,"size_used":386547056640,"time_deleted":null},{"block_size":2048,"encrypted":true,"extent_count":16,"extent_size":32768,"id":"a79e2116-dc81-45ae-af24-46a732e187ef","port_number":19013,"read_only":false,"state":"created"}]

but it was deleted right after?!

root@oxz_switch0:~# omdb crucible-agent regions list --crucible-agent-url 'http://[fd00:1122:3344:106::5]:32345' | grep a79e2116-dc81-45ae-af24-46a732e187ef
a79e2116-dc81-45ae-af24-46a732e187ef destroyed 2048       32768       16           19013

which agent?

BRM42220006 # /opt/oxide/oxlog/oxlog logs --archived oxz_crucible_46d1afcc-cc3f-4b17-aafc-054dd4862d15 | grep agent | xargs grep -l 'a79e2116-dc81-45ae-af24-46a732e187ef'
/pool/ext/7f778610-7328-4554-98f6-b17f74f551c7/crypt/debug/oxz_crucible_46d1afcc-cc3f-4b17-aafc-054dd4862d15/oxide-crucible-agent:default.log.1723677495

create logs:

19:43:20.461Z INFO crucible-agent (datafile): region a79e2116-dc81-45ae-af24-46a732e187ef state: Requested -> Created
19:43:20.465Z INFO crucible-agent (worker): applying SMF actions post create...
19:43:20.476Z INFO crucible-agent (worker): creating missing downstairs instance downstairs-a79e2116-dc81-45ae-af24-46a732e187ef
19:43:20.485Z INFO crucible-agent (worker): ok, have svc:/oxide/crucible/downstairs:downstairs-a79e2116-dc81-45ae-af24-46a732e187ef
19:43:20.503Z INFO crucible-agent (worker): creating config property group
19:43:20.505Z INFO crucible-agent (worker): reconfiguring svc:/oxide/crucible/downstairs:downstairs-a79e2116-dc81-45ae-af24-46a732e187ef
19:43:20.505Z INFO crucible-agent (worker): ensure directory SCF_TYPE_ASTRING /data/regions/a79e2116-dc81-45ae-af24-46a732e187ef
19:43:20.505Z INFO crucible-agent (worker): ensure port SCF_TYPE_COUNT 19013
19:43:20.505Z INFO crucible-agent (worker): ensure address SCF_TYPE_ASTRING fd00:1122:3344:106::5
19:43:20.505Z INFO crucible-agent (worker): commit
19:43:20.507Z INFO crucible-agent (worker): ok!
19:43:20.522Z INFO crucible-agent (worker): SMF ok!
19:43:20.640Z INFO crucible-agent (dropshot): request completed
    latency_us = 117
    local_addr = [fd00:1122:3344:106::5]:32345
    method = POST
    remote_addr = [fd00:1122:3344:102::3]:45318
    req_id = 5402aa46-f0ee-47cc-90e7-6d5053cc3f89
    response_code = 200
    uri = /crucible/0/regions

something requested delete 1 second later!

19:43:21.953Z INFO crucible-agent (datafile): path is "/data/regions/a79e2116-dc81-45ae-af24-46a732e187ef"
19:43:21.971Z INFO crucible-agent (datafile): dataset is oxp_7f778610-7328-4554-98f6-b17f74f551c7/crucible/regions/a79e2116-dc81-45ae-af24
-46a732e187ef
19:43:21.986Z INFO crucible-agent (datafile): region a79e2116-dc81-45ae-af24-46a732e187ef state: Created -> Tombstoned
19:43:21.989Z INFO crucible-agent (dropshot): request completed
    latency_us = 36552
    local_addr = [fd00:1122:3344:106::5]:32345
    method = DELETE
    remote_addr = [fd00:1122:3344:104::3]:62088
    req_id = f6eecb21-e2a8-4777-bce3-41641c940213
    response_code = 204
    uri = /crucible/0/regions/a79e2116-dc81-45ae-af24-46a732e187ef

zpool history:

BRM42220006 # zpool history | grep a79e2116
2024-08-14.19:43:20 zfs create -o reservation=1342177280 -o quota=3221225472 oxp_7f778610-7328-4554-98f6-b17f74f551c7/crucible/regions/a79e2116-dc81-45ae-af24-46a732e187ef
2024-08-14.19:43:27 zfs destroy oxp_7f778610-7328-4554-98f6-b17f74f551c7/crucible/regions/a79e2116-dc81-45ae-af24-46a732e187ef

what called that request id?

BRM42220057 # /tmp/rg -i '.*delete.*a79e2116-dc81-45ae-af24-46a732e187ef' /pool/ext/*/crypt/debug/*nexus*/oxide-nexus:default.log.1723* 
/pool/ext/2ec2a731-3340-4777-b1bb-4a906c598174/crypt/debug/oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a/oxide-nexus:default.log.1723665601
75432:{"msg":"Transient(region not deleted yet), trying again in 109.763131ms","v":0,"name":"nexus","level":30,"time":"2024-08-14T19:43:21.99480811Z","hostname":"oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a","pid":11429,"saga_id":"51743c7a-32b9-4296-811d-84db45d2ce6a","saga_name":"region-replacement-finish","component":"SagaExecutor","component":"nexus","component":"ServerContext","name":"20b100d0-84c3-4119-aa9b-0c632b0b6a3a","file":"nexus/src/app/crucible.rs:827","region_id":"a79e2116-dc81-45ae-af24-46a732e187ef"}
75448:{"msg":"Transient(region not deleted yet), trying again in 76.296423ms","v":0,"name":"nexus","level":30,"time":"2024-08-14T19:43:22.111636406Z","hostname":"oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a","pid":11429,"saga_id":"51743c7a-32b9-4296-811d-84db45d2ce6a","saga_name":"region-replacement-finish","component":"SagaExecutor","component":"nexus","component":"ServerContext","name":"20b100d0-84c3-4119-aa9b-0c632b0b6a3a","file":"nexus/src/app/crucible.rs:827","region_id":"a79e2116-dc81-45ae-af24-46a732e187ef"}
75495:{"msg":"Transient(region not deleted yet), trying again in 190.63163ms","v":0,"name":"nexus","level":30,"time":"2024-08-14T19:43:22.196813874Z","hostname":"oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a","pid":11429,"saga_id":"51743c7a-32b9-4296-811d-84db45d2ce6a","saga_name":"region-replacement-finish","component":"SagaExecutor","component":"nexus","component":"ServerContext","name":"20b100d0-84c3-4119-aa9b-0c632b0b6a3a","file":"nexus/src/app/crucible.rs:827","region_id":"a79e2116-dc81-45ae-af24-46a732e187ef"}
75518:{"msg":"region deleted","v":0,"name":"nexus","level":30,"time":"2024-08-14T19:43:22.392886785Z","hostname":"oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a","pid":11429,"saga_id":"51743c7a-32b9-4296-811d-84db45d2ce6a","saga_name":"region-replacement-finish","component":"SagaExecutor","component":"nexus","component":"ServerContext","name":"20b100d0-84c3-4119-aa9b-0c632b0b6a3a","file":"nexus/src/app/crucible.rs:812","region_id":"a79e2116-dc81-45ae-af24-46a732e187ef"}

why is region-replacement-finish happening here?!

51743c7a-32b9-4296-811d-84db45d2ce6a | 20b100d0-84c3-4119-aa9b-0c632b0b6a3a | 2024-08-14T19:43:21.246Z | region-replacement-finish | SagaCachedState(Done) | {"region_volume_id":"dc299258-4710-47b0-8614-5489037b804e","request":{"id":"f0e009b5-1575-42f0-bca5-7df6db846ef0","new_region_id":"f56ebcaf-3277-412f-a106-aeac7bda011e","old_region_id":"8de54b9a-29e6-4c1f-aed6-4863541e3949","old_region_volume_id":"dc299258-4710-47b0-8614-5489037b804e","operating_saga_id":null,"replacement_state":"ReplacementDone","request_time":"2024-08-14T19:43:13.304655Z","volume_id":"5693a347-1e82-4898-a172-78ebb9b24fc3"},"serialized_authn":{"kind":{"Authenticated":[{"actor":{"UserBuiltin":{"user_builtin_id":"001de000-05e4-4000-8000-000000000002"}}},null]}}}

this replacement request id is completely unrelated to the replacement we started looking at

but in the volume delete sub saga in delete_freed_crucible_regions, it's picking up the region we just created:

51743c7a-32b9-4296-811d-84db45d2ce6a | 2024-08-14 19:43:21.374464 UTC |   9: volume_delete.delete_freed_crucible_regions     | started    | 
51743c7a-32b9-4296-811d-84db45d2ce6a | 2024-08-14 19:43:22.451587 UTC |   9: volume_delete.delete_freed_crucible_regions     | succeeded  | 

(these timestamps line up with the crucible agent request log)

so why?!

wtf

let's look at the first drive saga:

d9eedea1-4fd9-4db8-869a-1227eca19acd | 20b100d0-84c3-4119-aa9b-0c632b0b6a3a | 2024-08-14T19:43:21.030Z | region-replacement-drive       | SagaCachedState(Done) | {"request":{"id":"07a289e6-af50-42f1-8888-062229547888","new_region_id":"a79e2116-dc81-45ae-af24-46a732e187ef","old_region_id":"a649771a-179a-4dcb-8a70-02c29ea4de82","old_region_volume_id":"fd601a67-e348-4457-b8f1-803d889c5d6f","operating_saga_id":null,"replacement_state":"Running","request_time":"2024-08-14T19:43:13.318246Z","volume_id":"4cb47e24-db75-4c4b-abdc-5ba4ccfbdb5e"},"serialized_authn":{"kind":{"Authenticated":[{"actor":{"UserBuiltin":{"user_builtin_id":"001de000-05e4-4000-8000-000000000002"}}},null]}}}

this first drive saga overlaps with the other region replacement finish saga that deleted the new region

but that doesn't matter, it shouldn't be deleting the region anyway

@askfongjojo
Copy link
Author

With the three regions I'm tracking so far: e03d2ffc-792f-4b2f-b1df-7a477b9ea19b, 6bf93010-3719-41bf-bc51-a03cd4442b89, 2bf499e2-a4f4-4e2e-9977-875f4914a7ce

Looking for any of our regions, I only find one:

I forgot about passing the fetch-limit previously and got confused. You might need to re-check in case this matters:

root@oxz_switch0:/tmp# ./omdbv2 db --fetch-limit 2000 regions list 2> /dev/null | egrep "id|6bf93010" 
 id                                   | dataset_id                           | volume_id                            | block_size | blocks_per_extent | extent_count 
 6bf93010-3719-41bf-bc51-a03cd4442b89 | fcb75972-836b-4f55-ba21-9722832cf5c2 | cfd7e1d7-4e7f-40d7-a35a-3a5a4446bf57 | 512        | 131072            | 160          

# this one is still not found
root@oxz_switch0:/tmp# ./omdbv2 db --fetch-limit 2000 regions list 2> /dev/null | egrep "id|e03d2ffc"
 id                                   | dataset_id                           | volume_id                            | block_size | blocks_per_extent | extent_count 

@leftwo
Copy link
Contributor

leftwo commented Aug 21, 2024

I looked into the failed/stuck repair at id: 1ed5e8e1-1d98-4420-bef0-135cba510238
The TL;DR is this repair suffers from the same issue as the repair we tracked down
during the live debugging session, where another saga has deleted a region out from under us.

Here are the details:

root@oxz_switch0:~# omdb db region-replacement info 1ed5e8e1-1d98-4420-bef0-135cba510238 2> /dev/null
      started: 2024-08-14 19:43:13.227734 UTC
        state: Running
old region id: 2bf499e2-a4f4-4e2e-9977-875f4914a7ce
new region id: Some(e03d2ffc-792f-4b2f-b1df-7a477b9ea19b)

Repair notifications
TIME                           REPAIR_ID                            REPAIR_TYPE    UPSTAIRS_ID                          SESSION_ID                           NOTIFICA
TION_TYPE                      
2024-08-14 19:43:26.616434 UTC dbe5eac5-3324-4769-b045-9c6841d8a22b Reconciliation 9da060c6-a336-4849-82c9-2be5da6bcb3d ba4bf6b6-8a2b-4d5e-b861-8ce61b8119a7 Started
2024-08-14 19:43:29.615747 UTC dbe5eac5-3324-4769-b045-9c6841d8a22b Reconciliation 9da060c6-a336-4849-82c9-2be5da6bcb3d ba4bf6b6-8a2b-4d5e-b861-8ce61b8119a7 Failed

progress:        ██████████████████████████████████████████████████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ [    
 19/     36]     

Repair steps
TIME                           STEP_TYPE DETAILS
2024-08-14 19:43:26.504837 UTC Pantry    address Some(fd00:1122:3344:105::4):Some(SqlU16(17000)) job Some(118f3992-99bb-4a48-a594-b50771a8a696)

This failed repair happened on the pantry fd00:1122:3344:105::4
Sled 9 has this pantry:

 9  BRM44220005        ok: oxz_crucible_pantry_375296e5-0a23-466c-b605-4204080f8103
underlay0/sled6   static   ok           fd00:1122:3344:105::1/64

On sled 9, looking for pantry logs:

/opt/oxide/oxlog/oxlog logs oxz_crucible_pantry_375296e5-0a23-466c-b605-4204080f8103 --archived

Then, searching them with:

BRM44220005 # grep e03d2ffc /pool/ext/*/crypt/debug/oxz_crucible_pantry_3*/oxide-crucible-pantry*

Just get the /pool/ext path that matches what we know we have, then put in the zone name and path to the log file.

for id in $(ls /pool/ext/*/crypt/debug/oxz_crucible_pantry_3*/oxide-crucible-pantry* | awk -F/ '{print $4}' | sort -n | uniq); do grep e03d2ffc /pool/ext/${id}/crypt
/debug/oxz_crucible_pantry_375296e5-0a23-466c-b605-4204080f8103/oxide-crucible-pantry:default.log.17*; done

That gives us our log as:
/pool/ext/17eff217-f0b1-4353-b133-0f68bbd5ceaa/crypt/debug/oxz_crucible_pantry_375296e5-0a23-466c-b605-4204080f8103/oxide-crucible-pantry:default.log.1723665490

So, the regions we care about are:
old region id: 2bf499e2-a4f4-4e2e-9977-875f4914a7ce
new region id: e03d2ffc-792f-4b2f-b1df-7a477b9ea19b

The old region ID won't exist in the pantry logs because we are starting up the pantry
to replace it, and the pantry will just have the new region ID in the VCR and
will rely on reconciliation to repair it.

Here is the log from the pantry starting with our "new" region:

19:43:26.607Z INFO crucible-pantry (datafile): downstairs client at Some([fd00:1122:3344:10a::7]:19010) has region UUID e03d2ffc-792f-4b2f-b1df-7a477b9ea19b

Pantry starts up and starts reconcile.
But, downstairs 0 goes away:

19:43:29.615Z WARN crucible-pantry (datafile): downstairs disconnected
     = downstairs
    client = 0
    session_id = ba4bf6b6-8a2b-4d5e-b861-8ce61b8119a7

This aborts the reconcile:

19:43:29.615Z WARN crucible-pantry (datafile): downstairs task for 0 stopped due to Finished
    session_id = ba4bf6b6-8a2b-4d5e-b861-8ce61b8119a7
19:43:29.615Z INFO crucible-pantry (datafile): Gone missing, transition from Reconcile to Disconnected
     = downstairs
    client = 0
    session_id = ba4bf6b6-8a2b-4d5e-b861-8ce61b8119a7
19:43:29.615Z INFO crucible-pantry (datafile): Disconnected -> New
     = downstairs
    client = 0
    session_id = ba4bf6b6-8a2b-4d5e-b861-8ce61b8119a7
19:43:29.615Z WARN crucible-pantry (datafile): aborting reconciliation
     = downstairs
    session_id = ba4bf6b6-8a2b-4d5e-b861-8ce61b8119a7

The downstairs client is at fd00:1122:3344:10a::7
Let's find the sled and logs and the situation over in that crucible reigon

10a:: is on 23:

23  BRM42220016        ok: underlay0/sled6   static   ok           fd00:1122:3344:10a::1/64

that IP is this zone:

oxz_crucible_fcb75972-836b-4f55-ba21-9722832cf5c2
oxControlService3/omicron6 static ok    fd00:1122:3344:10a::7/64

So, over on 23, in the global zone with zfs looking for region e03d2ffc
We can see it was created, then destroyed.

BRM42220016 # zpool history | grep e03d2ffc
2024-08-14.19:43:22 zfs create -o reservation=1342177280 -o quota=3221225472 oxp_9005671f-3d90-4ed1-be15-ad65b9a65bd5/crucible/regions/e03d2ffc-792f-4b2f-b1df-7a477b9ea19b
2024-08-14.19:43:34 zfs destroy oxp_9005671f-3d90-4ed1-be15-ad65b9a65bd5/crucible/regions/e03d2ffc-792f-4b2f-b1df-7a477b9ea19b

Inside the crucible zone that has the agent responsible for this, we look
in the crucible_region.json and we have recorded the region as destroyed:

    "e03d2ffc-792f-4b2f-b1df-7a477b9ea19b": {
      "id": "e03d2ffc-792f-4b2f-b1df-7a477b9ea19b",
      "state": "destroyed",
      "block_size": 512,
      "extent_size": 131072,
      "extent_count": 16,
      "encrypted": true,
      "port_number": 19010,
      "cert_pem": null,
      "key_pem": null,
      "root_pem": null,
      "source": null,
      "read_only": false
    },

For the region, we can find the SMF logs from it where it has started up and is
in the process of reconciliation when it is shut down:
From BRM42220016-oxz_crucible_fcb75972-836b-4f55-ba21-9722832cf5c2-oxide-crucible-downstairs:downstairs-e03d2ffc-792f-4b2f-b1df-7a477b9ea19b.log.1723664609

19:43:29.197Z INFO crucible: Copy files from "/data/regions/e03d2ffc-792f-4b2f-b1df-7a477b9ea19b/00/000/005.replace" in "/data/regions/e03d2ffc-792f-4b2f-b1df-7a477b9ea19b/00/000"
19:43:29.356Z INFO crucible: Move directory  "/data/regions/e03d2ffc-792f-4b2f-b1df-7a477b9ea19b/00/000/005.replace" to "/data/regions/e03d2ffc-792f-4b2f-b1df-7a477b9ea19b/00/000/005.completed"
19:43:29.359Z INFO crucible: Created copy dir "/data/regions/e03d2ffc-792f-4b2f-b1df-7a477b9ea19b/00/000/008.copy"
19:43:29.359Z INFO crucible: eid:8 Found repair files: ["008"]
[ Aug 14 19:43:29 Stopping because service disabled. ]
[ Aug 14 19:43:29 Executing stop method (:kill). ]

This is the same time the upstairs sees it disconnect.

The agent on 23 in the crucible zone creates the region here:

19:43:22.163Z INFO crucible-agent (worker): Region size:1073741824 reservation:1342177280 quota:3221225472
19:43:22.186Z INFO crucible-agent (worker): zfs set reservation of 1342177280 for oxp_9005671f-3d90-4ed1-be15-ad65b9a65bd5/crucible/regions/e03d2ffc-792f-4b2f-b1df-7a477b9ea19b
19:43:22.186Z INFO crucible-agent (worker): zfs set quota of 3221225472 for oxp_9005671f-3d90-4ed1-be15-ad65b9a65bd5/crucible/regions/e03d2ffc-792f-4b2f-b1df-7a477b9ea19b
19:43:22.280Z INFO crucible-agent (worker): creating region Region { id: RegionId("e03d2ffc-792f-4b2f-b1df-7a477b9ea19b"), state: Requested, block_size: 512, extent_size: 131072, extent_count: 16, encrypted: true, port_number: 19010, cert_pem: None, key_pem: None, root_pem: None, source: None, read_only: false } at "/data/regions/e03d2ffc-792f-4b2f-b1df-7a477b9ea19b"
    region = e03d2ffc-792f-4b2f-b1df-7a477b9ea19b
19:43:22.286Z INFO crucible-agent (worker): downstairs create with: env -i "/opt/oxide/crucible/bin/crucible-downstairs" "create" "--uuid" "e03d2ffc-792f-4b2f-b1df-7a477b9ea19b" "--data" "/data/regions/e03d2ffc-792f-4b2f-b1df-7a477b9ea19b" "--block-size" "512" "--extent-size" "131072" "--extent-count" "16" "--encrypted"
    region = e03d2ffc-792f-4b2f-b1df-7a477b9ea19b
19:43:22.344Z INFO crucible-agent (worker): region files created ok
    region = e03d2ffc-792f-4b2f-b1df-7a477b9ea19b
19:43:22.344Z INFO crucible-agent (datafile): region e03d2ffc-792f-4b2f-b1df-7a477b9ea19b state: Requested -> Created
19:43:22.348Z INFO crucible-agent (worker): applying SMF actions post create...
19:43:22.358Z INFO crucible-agent (worker): creating missing downstairs instance downstairs-e03d2ffc-792f-4b2f-b1df-7a477b9ea19b
19:43:22.374Z INFO crucible-agent (worker): ok, have svc:/oxide/crucible/downstairs:downstairs-e03d2ffc-792f-4b2f-b1df-7a477b9ea19b
19:43:22.390Z INFO crucible-agent (dropshot): request completed

Then, the agent is requested to delete it, 7 seconds later:

19:43:29.556Z INFO crucible-agent (dropshot): accepted connection
    local_addr = [fd00:1122:3344:10a::7]:32345
    remote_addr = [fd00:1122:3344:103::3]:59722
19:43:29.560Z INFO crucible-agent (dropshot): request completed
    latency_us = 3168
    local_addr = [fd00:1122:3344:10a::7]:32345
    method = GET
    remote_addr = [fd00:1122:3344:103::3]:59722
    req_id = 67e1e597-3bb9-4006-bf7c-f7f38f5aaf5f
    response_code = 200
    uri = /crucible/0/regions/e03d2ffc-792f-4b2f-b1df-7a477b9ea19b
19:43:29.564Z INFO crucible-agent (datafile): path is "/data/regions/e03d2ffc-792f-4b2f-b1df-7a477b9ea19b"
19:43:29.582Z INFO crucible-agent (datafile): dataset is oxp_9005671f-3d90-4ed1-be15-ad65b9a65bd5/crucible/regions/e03d2ffc-792f-4b2f-b1df-7a477b9ea19b
19:43:29.597Z INFO crucible-agent (datafile): region e03d2ffc-792f-4b2f-b1df-7a477b9ea19b state: Created -> Tombstoned
19:43:29.600Z INFO crucible-agent (worker): applying SMF actions before removal...
19:43:29.600Z INFO crucible-agent (dropshot): request completed

From that agent request, we can see where the delete came from:
remote_addr = [fd00:1122:3344:103::3]:59722

Which is, over on sled 8 in the nexus zone there:

 8  BRM44220011        ok: underlay0/sled6   static   ok           fd00:1122:3344:103::1/64
root@oxz_nexus_65a11c18:~# ipadm | grep fd00
oxControlService11/omicron6 static ok   fd00:1122:3344:103::3/64

Now looking in the Nexus log:

/pool/ext/cf940e15-dbc5-481b-866a-4de4b018898e/crypt/debug/oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7/oxide-nexus:default.log.1723665601

here is the region-replacement-drive saga starting:

19:43:29.323Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): preparing saga
    file = nexus/src/app/saga.rs:212
    saga_id = e6216174-35d3-452d-aef7-a173474b2d77
    saga_name = region-replacement-drive
19:43:29.323Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): saga create
    dag = {"end_node":9,"graph":{"edge_property":"directed","edges":[[0,1,null],[1,2,null],[2,3,null],[3,4,null],[4,5,null],[5,6,null],[6,7,null],[8,0,null],[7,9,null]],"node_holes":[],"nodes":[{"Action":{"action_name":"common.uuid_generate","label":"GenerateSagaId","name":"saga_id"}},{"Action":{"action_name":"common.uuid_generate","label":"GenerateJobId","name":"job_id"}},{"Action":{"action_name":"region_replacement_drive.set_saga_id","label":"SetSagaId","name":"unused_1"}},{"Action":{"action_name":"region_replacement_drive.drive_region_replacement_check","label":"DriveRegionReplacementCheck","name":"check"}},{"Action":{"action_name":"region_replacement_drive.drive_region_replacement_prepare","label":"DriveRegionReplacementPrepare","name":"prepare"}},{"Action":{"action_name":"region_replacement_drive.drive_region_replacement_execute","label":"DriveRegionReplacementExecute","name":"execute"}},{"Action":{"action_name":"region_replacement_drive.drive_region_replacement_commit","label":"DriveRegionReplacementCommit","name":"commit"}},{"Action":{"action_name":"region_replacement_drive.finish_saga","label":"FinishSaga","name":"unused_2"}},{"Start":{"params":{"request":{"id":"1ed5e8e1-1d98-4420-bef0-135cba510238","new_region_id":"e03d2ffc-792f-4b2f-b1df-7a477b9ea19b","old_region_id":"2bf499e2-a4f4-4e2e-9977-875f4914a7ce","old_region_volume_id":"0a416f49-71c0-4268-b901-d175dcb41384","operating_saga_id":null,"replacement_state":"Running","request_time":"2024-08-14T19:43:13.227734Z","volume_id":"be547a21-6638-4ee9-85a8-6b5de4ff5c75"},"serialized_authn":{"kind":{"Authenticated":[{"actor":{"UserBuiltin":{"user_builtin_id":"001de000-05e4-4000-8000-000000000002"}}},null]}}}}},"End"]},"saga_name":"region-replacement-drive","start_node":8}
    file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/steno-0.4.1/src/sec.rs:1146
    saga_id = e6216174-35d3-452d-aef7-a173474b2d77
    saga_name = region-replacement-drive
    sec_id = 65a11c18-7f59-41ac-b9e7-680627f996e7

Starting a little later, we have this saga:

BRM44220011 # grep 534b4387-d638-4abb-a9a1-ef42cf0ff8e9 /pool/ext/cf940e15-dbc5-481b-866a-4de4b018898e/crypt/debug/oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7/oxide-nexus:default.log.1723665601 | looker
19:43:29.430Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): preparing saga
    file = nexus/src/app/saga.rs:212
    saga_id = 534b4387-d638-4abb-a9a1-ef42cf0ff8e9
    saga_name = region-replacement-finish
19:43:29.430Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): saga create
    dag = {"end_node":14,"graph":{"edge_property":"directed","edges":[[0,1,null],[1,2,null],[2,3,null],[4,5,null],[4,6,null],[6,7,null],[5,7,null],[7,8,null],[8,9,null],[9,10,null],[3,4,null],[10,11,null],[11,12,null],[13,0,null],[12,14,null]],"node_holes":[],"nodes":[{"Action":{"action_name":"common.uuid_generate","label":"GenerateSagaId","name":"saga_id"}},{"Action":{"action_name":"region_replacement_finish.set_saga_id","label":"SetSagaId","name":"unused_1"}},{"Constant":{"name":"params_for_volume_delete_subsaga","value":{"serialized_authn":{"kind":{"Authenticated":[{"actor":{"UserBuiltin":{"user_builtin_id":"001de000-05e4-4000-8000-000000000002"}}},null]}},"volume_id":"81c16b0d-2607-4635-83fd-3b041cfc5723"}}},{"SubsagaStart":{"params_node_name":"params_for_volume_delete_subsaga","saga_name":"volume-delete"}},{"Action":{"action_name":"volume_delete.decrease_crucible_resource_count","label":"DecreaseCrucibleResourceCount","name":"crucible_resources_to_delete"}},{"Action":{"action_name":"volume_delete.delete_crucible_regions","label":"DeleteCrucibleRegions","name":"no_result_1"}},{"Action":{"action_name":"volume_delete.delete_crucible_running_snapshots","label":"DeleteCrucibleRunningSnapshots","name":"no_result_2"}},{"Action":{"action_name":"volume_delete.delete_crucible_snapshots","label":"DeleteCrucibleSnapshots","name":"no_result_3"}},{"Action":{"action_name":"volume_delete.delete_crucible_snapshot_records","label":"DeleteCrucibleSnapshotRecords","name":"no_result_4"}},{"Action":{"action_name":"volume_delete.delete_freed_crucible_regions","label":"DeleteFreedCrucibleRegions","name":"no_result_5"}},{"Action":{"action_name":"volume_delete.hard_delete_volume_record","label":"HardDeleteVolumeRecord","name":"final_no_result"}},{"SubsagaEnd":{"name":"volume_delete_subsaga_no_result"}},{"Action":{"action_name":"region_replacement_finish.update_request_record","label":"UpdateRequestRecord","name":"unused_2"}},{"Start":{"params":{"region_volume_id":"81c16b0d-2607-4635-83fd-3b041cfc5723","request":{"id":"915f8726-a882-4960-9e5c-dfe359f54911","new_region_id":"db8899d6-3f2f-4fd2-975e-8c51c0e7f91f","old_region_id":"75f88f27-eaac-42c7-a9cd-492dfac0223c","old_region_volume_id":"81c16b0d-2607-4635-83fd-3b041cfc5723","operating_saga_id":null,"replacement_state":"ReplacementDone","request_time":"2024-08-14T19:43:13.279Z","volume_id":"de0c5bd3-0dbf-44fa-bc2f-f0a5ddb4bcf3"},"serialized_authn":{"kind":{"Authenticated":[{"actor":{"UserBuiltin":{"user_builtin_id":"001de000-05e4-4000-8000-000000000002"}}},null]}}}}},"End"]},"saga_name":"region-replacement-finish","start_node":13}
    file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/steno-0.4.1/src/sec.rs:1146
    saga_id = 534b4387-d638-4abb-a9a1-ef42cf0ff8e9
    saga_name = region-replacement-finish
    sec_id = 65a11c18-7f59-41ac-b9e7-680627f996e7

This later saga is part of the region-replace 915f8726-a882-4960-9e5c-dfe359f54911
This later saga ends up deleting our region e03d2ffc-792f-4b2f-b1df-7a477b9ea19b which was
our original replacement region.
The times line up for when this saga deleted out region and when it goes away on the other side.
We can see from the Nexus log of this other saga ID: 534b4387, it makes the call to delete
our original replacement saga e03d2ffc.

19:43:29.516Z DEBG 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): recording saga event
    event_type = Started
    node_id = 9
    saga_id = 534b4387-d638-4abb-a9a1-ef42cf0ff8e9
19:43:29.541Z ERRO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): region_get saw Gone
    dataset_id = 09f14045-df78-447a-b7d8-217e0ca8ee09
    file = nexus/src/app/crucible.rs:482
    region_id = 75f88f27-eaac-42c7-a9cd-492dfac0223c
    saga_id = 534b4387-d638-4abb-a9a1-ef42cf0ff8e9
    saga_name = region-replacement-finish
19:43:29.541Z WARN 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): dataset is gone
    dataset_id = 09f14045-df78-447a-b7d8-217e0ca8ee09
    file = nexus/src/app/crucible.rs:753
    saga_id = 534b4387-d638-4abb-a9a1-ef42cf0ff8e9
    saga_name = region-replacement-finish
19:43:29.605Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): Transient(region not deleted yet), trying again in 70.191213ms
    file = nexus/src/app/crucible.rs:827
    region_id = e03d2ffc-792f-4b2f-b1df-7a477b9ea19b
    saga_id = 534b4387-d638-4abb-a9a1-ef42cf0ff8e9
    saga_name = region-replacement-finish
19:43:29.680Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): Transient(region not deleted yet), trying again in 78.531439ms
    file = nexus/src/app/crucible.rs:827
    region_id = e03d2ffc-792f-4b2f-b1df-7a477b9ea19b
    saga_id = 534b4387-d638-4abb-a9a1-ef42cf0ff8e9
    saga_name = region-replacement-finish
19:43:29.763Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): Transient(region not deleted yet), trying again in 75.801798ms
    file = nexus/src/app/crucible.rs:827
    region_id = e03d2ffc-792f-4b2f-b1df-7a477b9ea19b
    saga_id = 534b4387-d638-4abb-a9a1-ef42cf0ff8e9
    saga_name = region-replacement-finish
19:43:29.845Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): region deleted
    file = nexus/src/app/crucible.rs:812
    region_id = e03d2ffc-792f-4b2f-b1df-7a477b9ea19b
    saga_id = 534b4387-d638-4abb-a9a1-ef42cf0ff8e9
    saga_name = region-replacement-finish

That sagas info:

root@oxz_switch0:~# /tmp/omdb-saga db sagas show 534b4387-d638-4abb-a9a1-ef42cf0ff8e9
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:109::3]:32221,[fd00:1122:3344:105::3]:32221,[fd00:1122:3344:10b::3]:32221,[fd00:1122:3344:107::3]:32221,[fd00:1122:3344:108::3]:32221/omicron?sslmode=disable
WARN: found schema version 87.0.0, expected 7.0.0
It's possible the database is running a version that's different from what this
tool understands.  This may result in errors or incorrect output.
 id                                   | time_created                   | name                      | state
--------------------------------------+--------------------------------+---------------------------+-----------------------
 534b4387-d638-4abb-a9a1-ef42cf0ff8e9 | 2024-08-14 19:43:29.430944 UTC | region-replacement-finish | SagaCachedState(Done)

                             saga id | event time                     | node id                                              | event type | data
------------------------------------ | ------------------------------ | ---------------------------------------------------- | ---------- | ---
534b4387-d638-4abb-a9a1-ef42cf0ff8e9 | 2024-08-14 19:43:29.438874 UTC |  13: start                                           | started    |
534b4387-d638-4abb-a9a1-ef42cf0ff8e9 | 2024-08-14 19:43:29.444750 UTC |  13: start                                           | succeeded  |
534b4387-d638-4abb-a9a1-ef42cf0ff8e9 | 2024-08-14 19:43:29.447494 UTC |   0: common.uuid_generate                            | started    |
534b4387-d638-4abb-a9a1-ef42cf0ff8e9 | 2024-08-14 19:43:29.450439 UTC |   0: common.uuid_generate                            | succeeded  | "71694aeb-0921-487f-a6e4-814e4c9a2ab0"
534b4387-d638-4abb-a9a1-ef42cf0ff8e9 | 2024-08-14 19:43:29.453443 UTC |   1: region_replacement_finish.set_saga_id           | started    |
534b4387-d638-4abb-a9a1-ef42cf0ff8e9 | 2024-08-14 19:43:29.461379 UTC |   1: region_replacement_finish.set_saga_id           | succeeded  |
534b4387-d638-4abb-a9a1-ef42cf0ff8e9 | 2024-08-14 19:43:29.464073 UTC |   2: params_for_volume_delete_subsaga                | started    |
534b4387-d638-4abb-a9a1-ef42cf0ff8e9 | 2024-08-14 19:43:29.466701 UTC |   2: params_for_volume_delete_subsaga                | succeeded  | {"serialized_authn":{"kind":{"Authenticated":[{"actor":{"UserBuiltin":{"user_builtin_id":"001de000-05e4-4000-8000-000000000002"}}},null]}},"volume_id":"81c16b0d-2607-4635-83fd-3b041cfc5723"}534b4387-d638-4abb-a9a1-ef42cf0ff8e9 | 2024-08-14 19:43:29.469902 UTC |   3: subsaga start volume-delete                     | started    |
534b4387-d638-4abb-a9a1-ef42cf0ff8e9 | 2024-08-14 19:43:29.472859 UTC |   3: subsaga start volume-delete                     | succeeded  |
534b4387-d638-4abb-a9a1-ef42cf0ff8e9 | 2024-08-14 19:43:29.475834 UTC |   4: volume_delete.decrease_crucible_resource_count  | started    |
534b4387-d638-4abb-a9a1-ef42cf0ff8e9 | 2024-08-14 19:43:29.488791 UTC |   4: volume_delete.decrease_crucible_resource_count  | succeeded  | {"V3":{"region_snapshots":[],"regions":[]}}
534b4387-d638-4abb-a9a1-ef42cf0ff8e9 | 2024-08-14 19:43:29.491580 UTC |   5: volume_delete.delete_crucible_regions           | started    |
534b4387-d638-4abb-a9a1-ef42cf0ff8e9 | 2024-08-14 19:43:29.494396 UTC |   6: volume_delete.delete_crucible_running_snapshots | started    |
534b4387-d638-4abb-a9a1-ef42cf0ff8e9 | 2024-08-14 19:43:29.497713 UTC |   5: volume_delete.delete_crucible_regions           | succeeded  |
534b4387-d638-4abb-a9a1-ef42cf0ff8e9 | 2024-08-14 19:43:29.500805 UTC |   6: volume_delete.delete_crucible_running_snapshots | succeeded  |
534b4387-d638-4abb-a9a1-ef42cf0ff8e9 | 2024-08-14 19:43:29.503598 UTC |   7: volume_delete.delete_crucible_snapshots         | started    |
534b4387-d638-4abb-a9a1-ef42cf0ff8e9 | 2024-08-14 19:43:29.507112 UTC |   7: volume_delete.delete_crucible_snapshots         | succeeded  |
534b4387-d638-4abb-a9a1-ef42cf0ff8e9 | 2024-08-14 19:43:29.509817 UTC |   8: volume_delete.delete_crucible_snapshot_records  | started    |
534b4387-d638-4abb-a9a1-ef42cf0ff8e9 | 2024-08-14 19:43:29.512997 UTC |   8: volume_delete.delete_crucible_snapshot_records  | succeeded  |
534b4387-d638-4abb-a9a1-ef42cf0ff8e9 | 2024-08-14 19:43:29.516050 UTC |   9: volume_delete.delete_freed_crucible_regions     | started    |
534b4387-d638-4abb-a9a1-ef42cf0ff8e9 | 2024-08-14 19:43:29.857005 UTC |   9: volume_delete.delete_freed_crucible_regions     | succeeded  |
534b4387-d638-4abb-a9a1-ef42cf0ff8e9 | 2024-08-14 19:43:29.859874 UTC |  10: volume_delete.hard_delete_volume_record         | started    |
534b4387-d638-4abb-a9a1-ef42cf0ff8e9 | 2024-08-14 19:43:29.866813 UTC |  10: volume_delete.hard_delete_volume_record         | succeeded  |
534b4387-d638-4abb-a9a1-ef42cf0ff8e9 | 2024-08-14 19:43:29.869767 UTC |  11: subsaga end volume_delete_subsaga_no_result     | started    |
534b4387-d638-4abb-a9a1-ef42cf0ff8e9 | 2024-08-14 19:43:29.872595 UTC |  11: subsaga end volume_delete_subsaga_no_result     | succeeded  |
534b4387-d638-4abb-a9a1-ef42cf0ff8e9 | 2024-08-14 19:43:29.875640 UTC |  12: region_replacement_finish.update_request_record | started    |
534b4387-d638-4abb-a9a1-ef42cf0ff8e9 | 2024-08-14 19:43:29.889185 UTC |  12: region_replacement_finish.update_request_record | succeeded  |
534b4387-d638-4abb-a9a1-ef42cf0ff8e9 | 2024-08-14 19:43:29.891913 UTC |  14: end                                             | started    |
534b4387-d638-4abb-a9a1-ef42cf0ff8e9 | 2024-08-14 19:43:29.894629 UTC |  14: end                                             | succeeded  |

At the end of all this, we are left with why did this other region-replacement-finish saga decide that our region was okay to be deleted when we were still using it?

@askfongjojo askfongjojo added this to the 10 milestone Aug 22, 2024
@morlandi7 morlandi7 modified the milestones: 10, 11 Aug 26, 2024
@leftwo
Copy link
Contributor

leftwo commented Aug 26, 2024

The pantry panic here is a 2nd order issue, let me explain how it happaned.

A pantry was started up to handle a region replacement.
After the pantry was running (and this is the bug) another repair job incorrectly deleted the new region
that this pantry was repairing to.
The pantry, non the wiser, noticed the downstairs went away and then just waited for it to come back.

meanwhile...
eventually..

A new downstairs region was created for some other purpose but it re-used the same IP:Port that
the old pantry was still trying to repair from. When this new downstairs showed up, the pantry took
one look at it and decided it did not match what it expected and paniced.

jmpesp added a commit to jmpesp/omicron that referenced this issue Sep 19, 2024
Create new enum types and return those to give more information to
callers:

- `create_region_snapshot_replacement_step` and
  `insert_region_snapshot_replacement_step` now return
  `InsertRegionSnapshotReplacementStepResult`

- `volume_replace_region` and `volume_replace_snapshot` now return
  `VolumeReplaceResult`

Notably, `VolumeReplaceResult::ExistingVolumeDeleted` replaces the
previous error type `TargetVolumeDeleted`, and is not an error, allowing
the caller to take action of the existing volume was deleted.

This commit was peeled off work in progress to address oxidecomputer#6353.
jmpesp added a commit to jmpesp/omicron that referenced this issue Sep 19, 2024
Create new enum types and return those to give more information to
callers:

- `create_region_snapshot_replacement_step` and
  `insert_region_snapshot_replacement_step` now return
  `InsertRegionSnapshotReplacementStepResult`

- `volume_replace_region` and `volume_replace_snapshot` now return
  `VolumeReplaceResult`

Notably, `VolumeReplaceResult::ExistingVolumeDeleted` replaces the
previous error type `TargetVolumeDeleted`, and is not an error, allowing
the caller to take action of the existing volume was deleted.

This commit was peeled off work in progress to address oxidecomputer#6353.
@jmpesp
Copy link
Contributor

jmpesp commented Sep 19, 2024

The underlying issue that I've been addressing since this was opened is that region replacement and region snapshot replacement do not currently account for soft-deleted or hard-deleted volumes. Originally I thought I would have to deal with this by having three new related sagas that would transition each of the related objects into a new state to represent that the associated volume was deleted, but this week have tried to incorporate checks into the existing sagas instead, and this seems to work without a whole lot of extra effort. I'm going to take this approach unless I find a fundamental flaw during testing.

I've been peeling off this work in progress in order to have smaller PRs to review. Here's what have been opened so far related to this work:

jmpesp added a commit that referenced this issue Sep 20, 2024
Create new enum types and return those to give more information to
callers:

- `create_region_snapshot_replacement_step` and
`insert_region_snapshot_replacement_step` now return
`InsertRegionSnapshotReplacementStepResult`

- `volume_replace_region` and `volume_replace_snapshot` now return
`VolumeReplaceResult`

Notably, `VolumeReplaceResult::ExistingVolumeDeleted` replaces the
previous error type `TargetVolumeDeleted`, and is not an error, allowing
the caller to take action of the existing volume was deleted.

This commit was peeled off work in progress to address #6353.
hawkw pushed a commit that referenced this issue Sep 21, 2024
Create new enum types and return those to give more information to
callers:

- `create_region_snapshot_replacement_step` and
`insert_region_snapshot_replacement_step` now return
`InsertRegionSnapshotReplacementStepResult`

- `volume_replace_region` and `volume_replace_snapshot` now return
`VolumeReplaceResult`

Notably, `VolumeReplaceResult::ExistingVolumeDeleted` replaces the
previous error type `TargetVolumeDeleted`, and is not an error, allowing
the caller to take action of the existing volume was deleted.

This commit was peeled off work in progress to address #6353.
jmpesp added a commit to jmpesp/omicron that referenced this issue Sep 24, 2024
Volumes can be deleted at any time, but the tasks and sagas that perform
region replacement did not account for this. This commit adds checks in
a few places for if a volume is soft-deleted or hard-deleted, and bails
out of any affected region replacement accordingly:

- If the replacement request is in the Requested state and the volume
  was seen to be soft-deleted or hard-deleted in the "region
  replacement" background task, then transition the region replacement
  request to Complete

- If the replacement request is in the Running state, and the volume was
  seen to be soft-deleted or hard-deleted in the region replacement
  drive saga, then skip any operations on that volume in that saga and
  allow that saga to transition the region replacement request to
  ReplacementDone. Later the rest of the region replacement machinery
  will transition the request to Complete and clean up resources as
  appropriate.

Testing this required fleshing out the simulated Crucible Pantry with
support for the new endpoints that the region replacement drive saga
queries. Full parity is left for future work, and the endpoints required
were left in but commented out.

This commit was peeled off work in progress to address oxidecomputer#6353.
@jmpesp
Copy link
Contributor

jmpesp commented Sep 24, 2024

The underlying issue that I've been addressing since this was opened is that region replacement and region snapshot replacement do not currently account for soft-deleted or hard-deleted volumes.

Fix for deleted volumes during region replacement addresses this problem for region replacement.

jmpesp added a commit that referenced this issue Oct 2, 2024
Volumes can be deleted at any time, but the tasks and sagas that perform
region replacement did not account for this. This commit adds checks in
a few places for if a volume is soft-deleted or hard-deleted, and bails
out of any affected region replacement accordingly:

- If the replacement request is in the Requested state and the volume
was seen to be soft-deleted or hard-deleted in the "region replacement"
background task, then transition the region replacement request to
Complete

- If the replacement request is in the Running state, and the volume was
seen to be soft-deleted or hard-deleted in the region replacement drive
saga, then skip any operations on that volume in that saga and allow
that saga to transition the region replacement request to
ReplacementDone. Later the rest of the region replacement machinery will
transition the request to Complete and clean up resources as
appropriate.

Testing this required fleshing out the simulated Crucible Pantry with
support for the new endpoints that the region replacement drive saga
queries. Full parity is left for future work, and the endpoints required
were left in but commented out.

This commit was peeled off work in progress to address #6353.
@jmpesp
Copy link
Contributor

jmpesp commented Oct 2, 2024

Last week I discovered that read-only regions do not have proper reference counting like region snapshots do. This is only a problem if region snapshots are replaced through an expungement or by a manual request, as that's the only thing that currently creates read-only regions. I've confirmed that dogfood is not affected by this: there are no on-going region snapshot replacements, nor any read-only regions that have already been allocated.

I've opened #6728 to disable region snapshot replacement until I can fix this.

hawkw pushed a commit that referenced this issue Oct 2, 2024
Volumes can be deleted at any time, but the tasks and sagas that perform
region replacement did not account for this. This commit adds checks in
a few places for if a volume is soft-deleted or hard-deleted, and bails
out of any affected region replacement accordingly:

- If the replacement request is in the Requested state and the volume
was seen to be soft-deleted or hard-deleted in the "region replacement"
background task, then transition the region replacement request to
Complete

- If the replacement request is in the Running state, and the volume was
seen to be soft-deleted or hard-deleted in the region replacement drive
saga, then skip any operations on that volume in that saga and allow
that saga to transition the region replacement request to
ReplacementDone. Later the rest of the region replacement machinery will
transition the request to Complete and clean up resources as
appropriate.

Testing this required fleshing out the simulated Crucible Pantry with
support for the new endpoints that the region replacement drive saga
queries. Full parity is left for future work, and the endpoints required
were left in but commented out.

This commit was peeled off work in progress to address #6353.
@jmpesp
Copy link
Contributor

jmpesp commented Oct 9, 2024

#6805 adds reference counting for read-only regions.

@morlandi7 morlandi7 modified the milestones: 11, 12 Oct 11, 2024
jmpesp added a commit to jmpesp/omicron that referenced this issue Nov 12, 2024
Volumes can be deleted at any time, but the tasks and sagas that perform
region snapshot replacement did not account for this. This commit adds
checks in a few places for if a volume is soft-deleted or hard-deleted,
and bails out of any affected region snapshot replacement accordingly:

- if a volume that has the region snapshot being replaced is
  soft-deleted, then skip making a region snapshot replacement step for
  it

- if a region snapshot replacement step has the volume deleted after the
  step was created, transition it directly to the VolumeDeleted state

- if a region snapshot replacement step has the volume deleted during
  the saga invocation, then skip notifying any Upstairs and allow the
  saga to transition the request to Complete, where then associated
  clean up can proceed

An interesting race condition emerged during unit testing: the read-only
region allocated to replace a region snapshot would be swapped into the
snapshot volume, but would be susceptible to being deleted by the user,
and therefore unable to be swapped into other volumes that have that
snapshot volume as a read-only parent.

This requires an additional volume that used that read-only region in
order to bump the reference count associated with that region, so that
the user cannot delete it before it was used to replace all other uses
of the region snapshot it was meant to replace.

This additional volume's lifetime lives as long as the region snapshot
replacement, and therefore needs to be deleted when the region snapshot
replacement is finished. This required a new region snapshot replacement
finish saga, which required a new "Completing" state to perform the same
type of state based lock on the replacement request done for all the
other sagas.

Testing also revealed that there were scenarios where
`find_deleted_volume_regions` would return volumes for hard-deletion
prematurely. The function now returns a struct instead of a list of
tuples, and in that struct, regions freed for deletion are now distinct
from volumes freed for deletion. Volumes are now only returned for
hard-deletion when all associated read/write regions have been (or are
going to be) deleted.

Fixes oxidecomputer#6353
jmpesp added a commit to jmpesp/omicron that referenced this issue Nov 12, 2024
Volumes can be deleted at any time, but the tasks and sagas that perform
region snapshot replacement did not account for this. This commit adds
checks in a few places for if a volume is soft-deleted or hard-deleted,
and bails out of any affected region snapshot replacement accordingly:

- if a volume that has the region snapshot being replaced is
  soft-deleted, then skip making a region snapshot replacement step for
  it

- if a region snapshot replacement step has the volume deleted after the
  step was created, transition it directly to the VolumeDeleted state

- if a region snapshot replacement step has the volume deleted during
  the saga invocation, then skip notifying any Upstairs and allow the
  saga to transition the request to Complete, where then associated
  clean up can proceed

An interesting race condition emerged during unit testing: the read-only
region allocated to replace a region snapshot would be swapped into the
snapshot volume, but would be susceptible to being deleted by the user,
and therefore unable to be swapped into other volumes that have that
snapshot volume as a read-only parent.

This requires an additional volume that used that read-only region in
order to bump the reference count associated with that region, so that
the user cannot delete it before it was used to replace all other uses
of the region snapshot it was meant to replace.

This additional volume's lifetime lives as long as the region snapshot
replacement, and therefore needs to be deleted when the region snapshot
replacement is finished. This required a new region snapshot replacement
finish saga, which required a new "Completing" state to perform the same
type of state based lock on the replacement request done for all the
other sagas.

Testing also revealed that there were scenarios where
`find_deleted_volume_regions` would return volumes for hard-deletion
prematurely. The function now returns a struct instead of a list of
tuples, and in that struct, regions freed for deletion are now distinct
from volumes freed for deletion. Volumes are now only returned for
hard-deletion when all associated read/write regions have been (or are
going to be) deleted.

Fixes oxidecomputer#6353
@jmpesp
Copy link
Contributor

jmpesp commented Nov 12, 2024

Finally, Handle region snapshot replacement volume deletes will address the problem of deleted volumes for all of region snapshot replacement, closing this issue.

jmpesp added a commit that referenced this issue Dec 19, 2024
Volumes can be deleted at any time, but the tasks and sagas that perform
region snapshot replacement did not account for this. This commit adds
checks in a few places for if a volume is soft-deleted or hard-deleted,
and bails out of any affected region snapshot replacement accordingly:

- if a volume that has the region snapshot being replaced is
soft-deleted, then skip making a region snapshot replacement step for it

- if a region snapshot replacement step has the volume deleted after the
step was created, transition it directly to the VolumeDeleted state

- if a region snapshot replacement step has the volume deleted during
the saga invocation, then skip notifying any Upstairs and allow the saga
to transition the request to Complete, where then associated clean up
can proceed

An interesting race condition emerged during unit testing: the read-only
region allocated to replace a region snapshot would be swapped into the
snapshot volume, but would be susceptible to being deleted by the user,
and therefore unable to be swapped into other volumes that have that
snapshot volume as a read-only parent.

This requires an additional volume that used that read-only region in
order to bump the reference count associated with that region, so that
the user cannot delete it before it was used to replace all other uses
of the region snapshot it was meant to replace.

This additional volume's lifetime lives as long as the region snapshot
replacement, and therefore needs to be deleted when the region snapshot
replacement is finished. This required a new region snapshot replacement
finish saga, which required a new "Completing" state to perform the same
type of state based lock on the replacement request done for all the
other sagas.

Testing also revealed that there were scenarios where
`find_deleted_volume_regions` would return volumes for hard-deletion
prematurely. The function now returns a struct instead of a list of
tuples, and in that struct, regions freed for deletion are now distinct
from volumes freed for deletion. Volumes are now only returned for
hard-deletion when all associated read/write regions have been (or are
going to be) deleted.

Fixes #6353
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants