topology_random_failures: deselect more cases which can cause #21534 #22044
Started a BYO build with 1000 repeats of topology_random_failures: https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/2648/
🟢 CI State: SUCCESS✅ - Build Build Details:
68b19db to c529bd8
…s sleep

The node is hanging and the coordinator just rolls back the topology state. This differs from `stop_after_sending_join_node_request` and `stop_after_bootstrapping_initial_raft_configuration`, because in those cases the coordinator is not able to start the topology change at all and the message in the coordinator's log is different.

Error injections handled:
- `stop_after_updating_cdc_generation`
- `stop_before_streaming`

Actually, the same can happen with any cluster event which lasts more than 30s.
More cases found which can cause the same 'local_is_initialized()' assertion during the node's bootstrap.
c529bd8 to 5992e8b
@mykaul We run topology_random_failures in debug only.
Got 3/1000 failures. One of them for Another 1000-runs batch for updated PR: https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/2649/
🟢 CI State: SUCCESS✅ - Build Build Details:
Success!
@kostja - please review.
@scylladb/scylla-maint - please review for merge - it kills our CI stability. This is an intermediate step before fixing the real issue, just to reduce noise in our CI.
@@ -67,6 +67,14 @@ def add_deselected_metadata(fn: Callable[P, T]) -> Callable[P, T]:
# >>> await anext(cluster_event, None)

@deselect_for(
By the way, I didn't understand what this `deselect_for` does. Why are we using it and not @xfail or @skip on specific tests?
The logic behind this is as follows: the full matrix of tests is generated by the pytest_generate_tests hook. Then we pick one random test from the matrix, and we want it to run something useful, not a skip or an xfail. So this decorator leaves hints for the hook to remove (deselect) tests from the matrix before the random pick.
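To illustrate the idea, here is a minimal standalone sketch. It is not the code from this PR: the signature of `deselect_for`, the `deselected` attribute, the helpers `build_matrix` and `pick_one`, and the cluster event names are all assumptions made for the example. It only shows why deselecting before the random pick differs from marking cases as skip or xfail.

```python
import itertools
import random
from typing import Callable, Optional

Case = tuple[str, str]  # (error injection name, cluster event name)


def deselect_for(predicate: Callable[[str], bool], reason: str):
    """Hypothetical decorator: attach a deselection hint to an error injection."""
    def decorator(fn):
        hints = getattr(fn, "deselected", [])
        fn.deselected = hints + [(predicate, reason)]
        return fn
    return decorator


# The event names below are made up; only the injection names come from the PR.
@deselect_for(lambda event: event in {"add_node", "remove_node"}, reason="can cause #21534")
def stop_before_becoming_raft_voter():
    ...


def stop_before_streaming():
    ...


def build_matrix(injections, events) -> list[Case]:
    """Build the full injection x event matrix, dropping hinted combinations."""
    matrix = []
    for fn, event in itertools.product(injections, events):
        hints = getattr(fn, "deselected", [])
        if any(predicate(event) for predicate, _reason in hints):
            continue  # deselected: never becomes a candidate for the random pick
        matrix.append((fn.__name__, event))
    return matrix


def pick_one(matrix: list[Case], seed: Optional[int] = None) -> Case:
    """Pick the single case that will actually run."""
    return random.Random(seed).choice(matrix)


EVENTS = ["add_node", "remove_node", "sleep_20s"]
MATRIX = build_matrix([stop_before_becoming_raft_voter, stop_before_streaming], EVENTS)
```

With a skip or xfail marker, the random pick could still land on a marked case and the run would do nothing useful. Filtering the matrix first guarantees that whatever `pick_one(MATRIX)` returns is a case we actually want to execute.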
There are many CI failures (repros of #21534) which are caused by the `stop_after_setting_mode_to_normal_raft_topology` and `stop_before_becoming_raft_voter` error injections in combination with some cluster events.

We need to deselect them for now to make CI more stable. The first batch was deselected in #21658.

Also, add handling of the topology state rollback caused by the `stop_before_streaming` or `stop_after_updating_cdc_generation` error injections, as a separate commit.

See also #21872 and #21957