
topology_random_failures: deselect more cases which can cause #21534 #22044

Merged

Conversation

enaydanov
Contributor

@enaydanov commented Dec 24, 2024

There are many CI failures (repros of #21534) which are caused by the stop_after_setting_mode_to_normal_raft_topology and stop_before_becoming_raft_voter error injections in combination with some cluster events.

We need to deselect them for now to make CI more stable. The first batch was deselected in #21658.

Also, as a separate commit, add handling for the topology state rollback caused by the stop_before_streaming or stop_after_updating_cdc_generation error injections.

See also #21872 and #21957.

@enaydanov
Contributor Author

Started a BYO build with 1000 repeats of topology_random_failures: https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/2648/

@mykaul
Contributor

mykaul commented Dec 24, 2024

  • Please add 'Refs' so we know what issues it references.
  • I think (but not sure) that we need to test the failures in a debug run also.

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 topology_random_failures/test_random_failures
✅ - Docker Test
✅ - Offline-installer Artifact Tests
✅ - Unit Tests

Build Details:

  • Duration: 4 hr 47 min
  • Builder: spider8.cloudius-systems.com

…s sleep

The node hangs and the coordinator just rolls back the topology state. This is different from
`stop_after_sending_join_node_request` and `stop_after_bootstrapping_initial_raft_configuration`,
because in those cases the coordinator is not able to start the topology change at all and
the message in the coordinator's log is different.

Error injections handled:
  - `stop_after_updating_cdc_generation`
  - `stop_before_streaming`

In fact, it can be any cluster event which lasts more than 30s.
More cases were found which can trigger the same `local_is_initialized()` assertion
during the node's bootstrap.
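
Below is a minimal sketch of the handling these commit messages describe: watch the coordinator's log for a rollback message while the injected error keeps the bootstrapping node hanging. This is not the PR's actual test code; the helper name, the log path argument, and the default rollback pattern are assumptions for illustration only.

```python
# Hypothetical illustration -- the real test framework has its own
# log-watching helpers; this only sketches the "wait for the coordinator
# to roll the topology state back" idea from the commit message above.
import asyncio
import re


async def wait_for_topology_rollback(log_path: str,
                                     pattern: str = r"rollback|rolling back",
                                     timeout: float = 60.0) -> str:
    """Poll the coordinator's log until a line matching `pattern` appears."""
    regex = re.compile(pattern, re.IGNORECASE)
    deadline = asyncio.get_running_loop().time() + timeout
    while asyncio.get_running_loop().time() < deadline:
        with open(log_path, encoding="utf-8", errors="replace") as log:
            for line in log:
                if regex.search(line):
                    return line.strip()
        await asyncio.sleep(1.0)
    raise TimeoutError(f"no rollback message matching {pattern!r} in {log_path}")
```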
@enaydanov force-pushed the disable-some-random-failures branch from c529bd8 to 5992e8b on December 25, 2024 06:39
@enaydanov
Contributor Author

* I think (but not sure) that we need to test the failures in a debug run also.

@mykaul We run topology_random_failures in debug only.

@enaydanov
Contributor Author

enaydanov commented Dec 25, 2024

Started a BYO build with 1000 repeats of topology_random_failures: https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/2648/

Got 3/1000 failures. One of them was for stop_after_updating_cdc_generation. Also, I was able to reproduce failures for stop_after_updating_cdc_generation-restart_non_coordinator_node locally -- it's a node hang, so I added this error injection to the first commit in this PR (check the coordinator's log for a rollback message).

Another 1000-runs batch for updated PR: https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/2649/

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 topology_random_failures/test_random_failures
✅ - Docker Test
✅ - Offline-installer Artifact Tests
✅ - Unit Tests

Build Details:

  • Duration: 4 hr 42 min
  • Builder: spider8.cloudius-systems.com

@enaydanov
Contributor Author

Another 1000-runs batch for updated PR: https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/2649/

Success!

@mykaul
Contributor

mykaul commented Dec 26, 2024

@kostja - please review.

@mykaul
Contributor

mykaul commented Dec 26, 2024

@scylladb/scylla-maint - please review for merge - these failures are killing our CI stability. This is an intermediate step before fixing the real issue, just to reduce noise in our CI.

Contributor

@nyh left a comment


Seeing that both @kostja and @mykaul asked for this patch to go in, I'll merge this. I would much have preferred to see a patch that fixes a bug rather than a patch that hides a bunch of tests.

@@ -67,6 +67,14 @@ def add_deselected_metadata(fn: Callable[P, T]) -> Callable[P, T]:
# >>> await anext(cluster_event, None)


@deselect_for(
Contributor


By the way, I didn't understand what this "deselect_for" does. Why are we using it and not @xfail or @skip on specific tests?

Contributor Author


The logic behind this is the following: the full matrix of tests is generated by the pytest_generate_tests hook. Then we pick one random test from the matrix, and we want to run something useful, not a skip or xfail. So this decorator puts hints for the hook to remove (deselect) tests from the matrix.
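
A simplified sketch of the pattern described above, with hypothetical names and signatures rather than the actual implementation in test/topology_random_failures: the decorator only records which (error injection, cluster event) combinations to drop, and pytest_generate_tests filters them out before parametrizing, so the deselected combinations never become test items at all.

```python
# Hypothetical illustration of the deselect-before-parametrize pattern;
# the constants, decorator signature, and fixture names are assumptions.
ALL_ERROR_INJECTIONS = ["stop_before_streaming", "stop_after_updating_cdc_generation"]
ALL_CLUSTER_EVENTS = ["restart_non_coordinator_node", "no_event"]


def deselect_for(error_injection: str, cluster_event: str):
    """Record a combination that should never appear in the generated matrix."""
    def decorator(fn):
        fn._deselected = getattr(fn, "_deselected", set()) | {(error_injection, cluster_event)}
        return fn
    return decorator


def pytest_generate_tests(metafunc):
    # Build the full matrix, then drop the hinted combinations before
    # parametrizing, so deselected cases are never even collected.
    if {"error_injection", "cluster_event"} <= set(metafunc.fixturenames):
        deselected = getattr(metafunc.function, "_deselected", set())
        matrix = [(inj, ev)
                  for inj in ALL_ERROR_INJECTIONS
                  for ev in ALL_CLUSTER_EVENTS
                  if (inj, ev) not in deselected]
        metafunc.parametrize(("error_injection", "cluster_event"), matrix)
```

With something like this in place, a test decorated with @deselect_for(...) simply never yields the deselected parameter combinations, so picking one random case from the matrix always lands on something expected to run usefully, which is why plain skip or xfail markers would not fit here.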

Labels
area/raft, backport/none (Backport is not required), promoted-to-master, symptom/ci stability (Issues that failed in ScyllaDB CI - tests and framework), tests/test.py
6 participants