
topology_random_failures: deselect more cases which can cause #21534 #22044

Merged

Conversation

enaydanov
Contributor

@enaydanov commented Dec 24, 2024

There are many CI failures (repros of #21534) which are caused by the stop_after_setting_mode_to_normal_raft_topology and stop_before_becoming_raft_voter error injections in combination with some cluster events.

We need to deselect them for now to make CI more stable. The first batch was deselected in #21658.

Also, as a separate commit, add handling for the topology state rollback caused by the stop_before_streaming or stop_after_updating_cdc_generation error injections.

See also #21872 and #21957.

@enaydanov
Contributor Author

Started a BYO build with 1000 repeats of topology_random_failures: https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/2648/

@mykaul
Contributor

mykaul commented Dec 24, 2024

  • Please add 'Refs' so we know what issues it references.
  • I think (but not sure) that we need to test the failures in a debug run also.

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 topology_random_failures/test_random_failures
✅ - Docker Test
✅ - Offline-installer Artifact Tests
✅ - Unit Tests

Build Details:

  • Duration: 4 hr 47 min
  • Builder: spider8.cloudius-systems.com

…s sleep

The node hangs and the coordinator just rolls back the topology state. This is different from
`stop_after_sending_join_node_request` and `stop_after_bootstrapping_initial_raft_configuration`,
because in those cases the coordinator is not able to start the topology change at all and
the message in the coordinator's log is different.

Error injections handled:
  - `stop_after_updating_cdc_generation`
  - `stop_before_streaming`

In fact, it can be any cluster event which lasts more than 30s.
More cases were found which can trigger the same `local_is_initialized()` assertion
during the node's bootstrap.
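
Below is a minimal sketch of the handling these commit messages describe: watch the coordinator's log for a rollback message while the injected error keeps the bootstrapping node hanging. This is not the PR's actual test code; the helper name, the log path argument, and the default rollback pattern are assumptions for illustration only.

```python
# Hypothetical illustration -- the real test framework has its own
# log-watching helpers; this only sketches the "wait for the coordinator
# to roll the topology state back" idea from the commit message above.
import asyncio
import re


async def wait_for_topology_rollback(log_path: str,
                                     pattern: str = r"rollback|rolling back",
                                     timeout: float = 60.0) -> str:
    """Poll the coordinator's log until a line matching `pattern` appears."""
    regex = re.compile(pattern, re.IGNORECASE)
    deadline = asyncio.get_running_loop().time() + timeout
    while asyncio.get_running_loop().time() < deadline:
        with open(log_path, encoding="utf-8", errors="replace") as log:
            for line in log:
                if regex.search(line):
                    return line.strip()
        await asyncio.sleep(1.0)
    raise TimeoutError(f"no rollback message matching {pattern!r} in {log_path}")
```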
@enaydanov force-pushed the disable-some-random-failures branch from c529bd8 to 5992e8b on December 25, 2024 06:39
@enaydanov
Contributor Author

* I think (but not sure) that we need to test the failures in a debug run also.

@mykaul We run topology_random_failures in debug only.

@enaydanov
Contributor Author

enaydanov commented Dec 25, 2024

Started a BYO build with 1000 repeats of topology_random_failures: https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/2648/

Got 3/1000 failures. One of them was for stop_after_updating_cdc_generation. Also, I was able to reproduce failures for stop_after_updating_cdc_generation-restart_non_coordinator_node locally -- it's a node hang, so I added this error injection to the first commit in this PR (check the coordinator's log for a rollback message).

Another 1000-runs batch for updated PR: https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/2649/

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 topology_random_failures/test_random_failures
✅ - Docker Test
✅ - Offline-installer Artifact Tests
✅ - Unit Tests

Build Details:

  • Duration: 4 hr 42 min
  • Builder: spider8.cloudius-systems.com

@enaydanov
Contributor Author

Another 1000-runs batch for updated PR: https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/2649/

Success!

@mykaul
Contributor

mykaul commented Dec 26, 2024

@kostja - please review.

@mykaul
Contributor

mykaul commented Dec 26, 2024

@scylladb/scylla-maint - please review for merge - these failures are killing our CI stability. This is an intermediate step before fixing the real issue, just to reduce noise in our CI.

Contributor

@nyh left a comment


Seeing that both @kostja and @mykaul asked for this patch to go in, I'll merge this. I would much have preferred to see a patch that fixes a bug rather than a patch that hides a bunch of tests.

@@ -67,6 +67,14 @@ def add_deselected_metadata(fn: Callable[P, T]) -> Callable[P, T]:
# >>> await anext(cluster_event, None)


@deselect_for(
Contributor


By the way, I didn't understand what this "deselect_for" does. Why are we using it and not @xfail or @skip on specific tests?

Contributor Author


The logic behind this is the following: the full matrix of tests is generated by the pytest_generate_tests hook. Then we pick one random test from the matrix, and we want to run something useful, not a skip or xfail. So this decorator puts hints for the hook to remove (deselect) tests from the matrix.
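
A simplified sketch of the pattern described above, with hypothetical names and signatures rather than the actual implementation in test/topology_random_failures: the decorator only records which (error injection, cluster event) combinations to drop, and pytest_generate_tests filters them out before parametrizing, so the deselected combinations never become test items at all.

```python
# Hypothetical illustration of the deselect-before-parametrize pattern;
# the constants, decorator signature, and fixture names are assumptions.
ALL_ERROR_INJECTIONS = ["stop_before_streaming", "stop_after_updating_cdc_generation"]
ALL_CLUSTER_EVENTS = ["restart_non_coordinator_node", "no_event"]


def deselect_for(error_injection: str, cluster_event: str):
    """Record a combination that should never appear in the generated matrix."""
    def decorator(fn):
        fn._deselected = getattr(fn, "_deselected", set()) | {(error_injection, cluster_event)}
        return fn
    return decorator


def pytest_generate_tests(metafunc):
    # Build the full matrix, then drop the hinted combinations before
    # parametrizing, so deselected cases are never even collected.
    if {"error_injection", "cluster_event"} <= set(metafunc.fixturenames):
        deselected = getattr(metafunc.function, "_deselected", set())
        matrix = [(inj, ev)
                  for inj in ALL_ERROR_INJECTIONS
                  for ev in ALL_CLUSTER_EVENTS
                  if (inj, ev) not in deselected]
        metafunc.parametrize(("error_injection", "cluster_event"), matrix)
```

With something like this in place, a test decorated with @deselect_for(...) simply never yields the deselected parameter combinations, so picking one random case from the matrix always lands on something expected to run usefully, which is why plain skip or xfail markers would not fit here.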

Labels
area/raft, backport/none (Backport is not required), promoted-to-master, symptom/ci stability (Issues that failed in ScyllaDB CI - tests and framework), tests/test.py
6 participants