
Fixed TaskGroup and CancelScope exit issues on asyncio #774

Merged: 33 commits from fix-695 into master on Sep 21, 2024

Conversation

agronholm
Owner

Changes

This fixes cancel scope exit behavior on asyncio when an outer cancel scope has been cancelled, as well as the constant, fruitless cancellation attempts against a task that is waiting in TaskGroup.__aexit__().

Fixes #695.
Fixes #698.

Checklist

If this is a user-facing code change, like a bugfix or a new feature, please ensure that
you've fulfilled the following conditions (where applicable):

  • You've added tests (in tests/) which would fail without your patch
  • You've updated the documentation (in docs/, in case of behavior changes or new
    features)
  • You've added a new changelog entry (in docs/versionhistory.rst).

If this is a trivial change, like a typo fix or a code reformatting, then you can ignore
these instructions.

Updating the changelog

If there are no entries after the last release, use **UNRELEASED** as the version.
If, say, your patch fixes issue #123, the entry should look like this:

* Fix big bad boo-boo in task groups (`#123 <https://github.com/agronholm/anyio/issues/123>`_; PR by @yourgithubaccount)

If there's no issue linked, just link to your pull request instead by updating the
changelog after you've created the PR.

@agronholm agronholm changed the title Fix 695 Fixed TaskGroup and CancelScope exit issues on asyncio Aug 29, 2024
@agronholm agronholm requested a review from Zac-HD August 29, 2024 14:41
Collaborator

@gschaffner gschaffner left a comment

This one has been melting my mind a bit, sorry it has taken a while.

  • Consider adding a regression test for the case written up in #698 (Different cancel scope behaviour on asyncio vs Trio): cac85cc (test_cancelled_raises_beyond_origin).

    • Another suggested case to add: 56cb8b2.
  • Consider adding a regression test for a case closely related to the one in #698: 73f1ab7 (test_deadline_based_checkpoint). Note that the failure mode of test_deadline_based_checkpoint on master is different from the failure mode of test_cancelled_raises_beyond_origin on master.

    One reason I like this test case is that I found Trio's behavior in the #698 case ("a level cancellation exception can keep propagating beyond the level that spawned the exception") surprising. (It violated my assumptions about Trio's level cancellation, and as pointed out in https://trio.discourse.group/t/do-cancelled-exceptions-know-which-block-they-belong-to/508, it seems to violate the first paragraph of Trio's "cancellation semantics" docs too.)

    The nice thing about this test, IMO, is that if you accept the premise that

    with CancelScope(deadline=-math.inf) as inner_scope:
        await sleep_forever()

    is a valid way to implement checkpoint(), then the strange behavior follows directly from that (mostly). (I say "mostly" because in the test_deadline_based_checkpoint case, the Cancelled was legitimately triggered by either scope (I think it is undefined which scope is responsible for triggering it). The test_cancelled_raises_beyond_origin case is stranger, because at the time of coro.throw(Cancelled()), it is unambiguous which scope is responsible for throwing the Cancelled, but the exception still propagates up beyond that scope.)

  • Consider adding a regression test for a case closely related to #698: c1a4a85 (test_cancelled_scope_based_checkpoint). This is very similar to test_deadline_based_checkpoint, except that (IIUC) deadline_based_checkpoint was a valid checkpoint() implementation even before python-trio/trio#860 (Delay deciding which cancel scope a Cancelled exception belongs to), whereas cancelled_scope_based_checkpoint only became equivalent to checkpoint() with python-trio/trio#860.

  • A bug: CancelScope.cancelled_caught is wrong in some cases. (I did not notice this until a lot later, but for the sake of the review being more organized, I've retroactively added tests for this to the four tests suggested above, which fail on this branch right now.)

  • A longer comment regarding #698:

    • AnyIO currently uses task.cancel(msg) to associate a level-CancelledError with the cancel scope that made the corresponding task.cancel(msg="cancelled by scope {origin}") call (the level-CancelledError's "origin" scope).

      These associations are what #698 is all about. Trio does not associate each Cancelled with a particular "origin" scope (python-trio/trio#860) [1]. The asyncio backend does make such associations, however. Thinking about it from Trio's perspective, this seems strange: why would AsyncIOBackend associate each level-CancelledError with a particular cancel scope? AsyncIOBackend needs to behave like Trio, so it needs to behave as if it doesn't make such associations. On asyncio, are these associations somehow necessary in order to produce the correct, association-free behavior?

    • Something you mentioned: When Python 3.8 was supported, AnyIO could not rely on cancellation messages. It could do so in sys.version_info blocks, but so long as 3.8 was supported, the overall strategy of the CancelScope and TaskGroup code was restricted to not rely much on cancellation messages being available. IIUC, you dropped 3.8 ahead of EoL specifically so that AnyIO can always rely on cancellation messages in _uncancel.

      I had the thought that it's possible that _uncancel is not the only avenue that the msg support opens up. AnyIO now has the (new?) ability to differentiate between a level-CancelledError and a native-CancelledError. That could possibly be useful beyond _uncancel.

    So, one way to state #698 is: AnyIO needs to behave as if it's not "tagging" a get_cancelled_exc_class() at .throw-time with an origin scope. Maybe a good way to do that would be to stop tagging each level-CancelledError with an origin scope in the first place? Is that even possible? (AnyIO does have the new(?) ability mentioned above, so perhaps that could help.)

    While trying to learn/understand both python-trio/trio#860 and this PR, I ended up playing with the code a lot, and at some point I realized I had basically attempted to do the above. It seems to work. (At least, it passes all tests, but it could easily still be broken around uncancellation or something.) Link: c1a4a85...d802a81. I did not find the cancelled_caught misbehavior until later, but this implementation passes those four suggested tests as well.

    Do you think that the idea of relying on msg like this could be useful? (It may be that this particular implementation is better to use only as a toy model for the sake of me building understanding, but the msg idea may still be independently useful.)

    (There are also a few things I found valuable about this implementation of CancelScope.__exit__ & TaskGroup.__aexit__ that I think are worth pointing out. One is that it is very feasible to cross-reference with the equivalent code in the Trio project. While I was trying to get the behavior right, being able to compare line-by-line with the equivalent code in Trio was very helpful for finding and fixing cases that I had missed or misunderstood. My mental model of cancel scopes also maps much more directly onto this implementation, line by line, so I find it much easier to reason about. I would be curious whether it is this way for others as well. However, this implementation is more verbose (whether that's good or bad depends on the use case), and the processing cost during __(a)exit__ is likely a bit higher because of the .split calls (which Trio does too) and because of the _parent_cancelled() call.

    I point out these implementation details because I think the CancelScope.__exit__ & TaskGroup.__aexit__ implementations on master and in this PR are very difficult to read through when auditing for correctness, and to change when a bug is found [2]. I would be curious to know whether you and others feel the same. My thought is that perhaps these things could be made less difficult if the implementation aligned its structure more directly with the mental model it implements.)
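    The msg-based level/native distinction discussed above can be sketched with plain asyncio. This is only a toy illustration of the idea, not AnyIO's actual code; the tag format and names here are made up:

    ```python
    import asyncio

    # Hypothetical tag format; AnyIO's real message format may differ.
    SCOPE_MSG_PREFIX = "cancelled by scope "

    async def worker() -> str:
        try:
            await asyncio.sleep(3600)
        except asyncio.CancelledError as exc:
            # The message passed to Task.cancel(msg) shows up in exc.args,
            # so a tagged "level" cancellation can be told apart from a
            # native (untagged) one.
            if exc.args and str(exc.args[0]).startswith(SCOPE_MSG_PREFIX):
                return "level"
            return "native"
        return "never cancelled"

    async def main() -> tuple[str, str]:
        t1 = asyncio.create_task(worker())
        await asyncio.sleep(0)  # let the worker reach its await point
        t1.cancel(SCOPE_MSG_PREFIX + "deadbeef")  # tagged "level" cancellation
        tagged = await t1

        t2 = asyncio.create_task(worker())
        await asyncio.sleep(0)
        t2.cancel()  # no message: looks like a native cancellation
        untagged = await t2
        return tagged, untagged

    print(asyncio.run(main()))  # -> ('level', 'native')
    ```

    Whether relying on this is robust in the presence of third-party code that also calls task.cancel() with its own messages is a separate question.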

Footnotes

  [1] This is part of Trio's documented behavior; it's in the changelog (although some other parts of Trio's docs got out of sync and contradict this).

  [2] I still don't fully understand __exit__'s swallowing logic in this PR :/

@gschaffner
Collaborator

I added two commits to my toy implementation to fix uncancellation: c1a4a85...cbc8c2c (new changes only: d802a81...cbc8c2c)

@agronholm
Owner Author

My takeaways here:

  • Semantics should match between asyncio and Trio, and in the current state of this PR, they don't
  • The cancellation message: I'd prefer it to include the ID of the cancel scope where it originated, for easier debugging, but maybe we shouldn't use that ID to decide whether to swallow the exception in CancelScope.__exit__()

@agronholm
Owner Author

What I have some difficulties understanding is this:

async def test_cancelled_raises_beyond_origin() -> None:
    with CancelScope() as outer_scope:
        with CancelScope() as inner_scope:
            inner_scope.cancel()
            try:
                await checkpoint()
            finally:
                outer_scope.cancel()

            pytest.fail("checkpoint should have raised")

    assert not inner_scope.cancelled_caught
    assert outer_scope.cancelled_caught

By what logic should outer_scope set the cancelled_caught flag? This is what Trio's documentation says about it:

Records whether this scope caught a Cancelled exception. This requires two things: (1) the with block exited with a Cancelled exception, and (2) this scope is the one that was responsible for triggering this Cancelled exception.

Here, the inner scope should not have it set because it doesn't catch the CancelledError, but the outer scope does. However, the outer scope did not trigger the CancelledError – it was the inner scope that did so. The outer scope was explicitly cancelled, but the CancelledError was passed through by the inner scope, so it doesn't belong to the outer scope.

Perhaps the logic is that, given that both cancel scopes were cancelled, it's the outer scope that's wholly responsible for the cancellation? And if the inner scope was shielded, it would be doing its own cancellation? But in that case, what would be the logic for outer_scope.cancelled_caught being False, if outer_scope was explicitly cancelled and inner_scope would raise a new CancelledError on exit?

I made two separate versions of this test case for clarity, and added comments reflecting how I think Trio's logic goes:

async def test_cancelled_raises_beyond_origin_unshielded() -> None:
    with CancelScope() as outer_scope:
        with CancelScope() as inner_scope:
            inner_scope.cancel()
            try:
                await checkpoint()
            finally:
                outer_scope.cancel()

            pytest.fail("checkpoint should have raised")

    # Here, the outer scope is responsible for the cancellation, so the inner scope
    # won't catch the cancellation exception, but the outer scope will
    assert not inner_scope.cancelled_caught
    assert outer_scope.cancelled_caught


async def test_cancelled_raises_beyond_origin_shielded() -> None:
    with CancelScope() as outer_scope:
        with CancelScope(shield=True) as inner_scope:
            inner_scope.cancel()
            try:
                await checkpoint()
            finally:
                outer_scope.cancel()

            pytest.fail("checkpoint should have raised")

    # Here, the inner scope is the one responsible for cancellation, and given that the
    # outer scope was also cancelled, it is not considered to have "caught" the
    # cancellation, even though it swallows it, because the inner scope triggered it
    assert inner_scope.cancelled_caught
    assert not outer_scope.cancelled_caught

Could you validate or correct my assumptions here?

@agronholm
Owner Author

After a day of debugging, I realized that I had gotten one thing wrong with my asyncio cancel scope fix: apparently when exiting a cancel scope, it should only reraise a cancellation exception if the inner scope was not shielded. I don't know how I could've deduced that from anything...

@agronholm
Owner Author

It would seem that my solution of raising a new CancelledError was wrong to begin with. If an unshielded Trio cancel scope is exited with a cancellation error and an eligible enclosing scope is in a cancelled state, it just passes through the same exception instance.
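The "same exception instance" behavior is easy to see in miniature: a context manager that declines to swallow an exception simply returns a falsy value from __exit__, and the original exception object keeps propagating unchanged. A minimal sketch (a toy, not anyio's actual CancelScope):

```python
class PassthroughScope:
    """Toy context manager that lets exceptions pass through untouched."""

    def __enter__(self) -> "PassthroughScope":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Returning False means "don't swallow": Python keeps propagating
        # the very same exception object, not a copy or a new instance.
        return False

original = ValueError("boom")
try:
    with PassthroughScope():
        raise original
except ValueError as caught:
    assert caught is original  # identical object: identity is preserved
```

Raising a fresh CancelledError instead would, among other things, discard the original traceback and any attached context.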

@agronholm
Owner Author

I've refactored the asyncio CancelScope implementation to resemble the Trio implementation much more closely now, hopefully making them easier to compare. I also fixed another bug that wasn't tested for before – empty task groups not yielding on exit on asyncio if there were no child tasks.
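For reference, "yielding on exit" means that exiting the (empty) task group must still act as a checkpoint, i.e. give other scheduled tasks a chance to run. In plain asyncio terms, the property being guaranteed looks like this (a sketch of the property, not AnyIO's implementation):

```python
import asyncio

events: list[str] = []

async def other_task() -> None:
    events.append("other ran")

async def main() -> None:
    task = asyncio.create_task(other_task())
    # A checkpoint must yield control to the event loop at least once;
    # asyncio.sleep(0) is the minimal way to do that. Without such a
    # yield, other_task would never get scheduled before we move on.
    await asyncio.sleep(0)
    events.append("after checkpoint")
    await task

asyncio.run(main())
print(events)  # -> ['other ran', 'after checkpoint']
```

A task group exit that never yields would starve sibling tasks in exactly this way.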

@agronholm agronholm requested a review from gschaffner September 8, 2024 10:57
@Zac-HD
Collaborator

Zac-HD commented Sep 10, 2024

I've still got a long review backlog after vacation + illness, sorry. Skimming this briefly, it might be easier to review if we separated out the drop-py-38 changes from the taskgroup and cancelscope changes.

@agronholm
Owner Author

I've still got a long review backlog after vacation + illness, sorry. Skimming this briefly, it might be easier to review if we separated out the drop-py-38 changes from the taskgroup and cancelscope changes.

@gschaffner is reviewing this now, and I trust his insights in the matter. I wasn't going to drop 3.8 support, but it was a hard requirement to fix these issues.

@gschaffner
Collaborator

gschaffner commented Sep 12, 2024

By what logic should outer_scope set the cancelled_caught flag? This is what Trio's documentation says about it:

Records whether this scope caught a Cancelled exception. This requires two things: (1) the with block exited with a Cancelled exception, and (2) this scope is the one that was responsible for triggering this Cancelled exception.

Trio's documentation is slightly wrong here. As Arthur pointed out on Discourse, Trio's docs weren't updated (in various locations) when the behavior was changed in python-trio/trio#901. Reading through that PR and the corresponding issue, I think it's clear that Trio's cancelled_caught docs would be more correct to say e.g.

Records whether this scope caught a Cancelled exception. This requires two things: (1) the with block exited with a Cancelled exception, and (2) no cancelled parent scope is visible to this scope's body when this scope exits.

(cc @arthur-tacca: Have you thought about raising a GH issue with Trio about the various places this is wrong in the documentation? I agree with your findings on Discourse and it seems like the docs should presumably get fixed upstream.)

Perhaps the logic is that, given that both cancel scopes were cancelled, it's the outer scope that's wholly responsible for the cancellation? [...]

[...]

[...]
    # Here, the outer scope is responsible for the cancellation, so the inner scope
    # won't catch the cancellation exception, but the outer scope will
[...]
    # Here, the inner scope is the one responsible for cancellation, and given that the
    # outer scope was also cancelled, it is not considered to have "caught" the
    # cancellation, even though it swallows it, because the inner scope triggered it
[...]

Could you validate or correct my assumptions here?

I believe one wrong assumption here is that Trio is using "which scope is responsible for [catching] this cancellation?" logic.

  • AFAIU Trio doesn't have any "which scope is responsible for this cancellation?" logic, at all. Instead Trio's logic is "am I the only scope that is responsible for this cancellation?". I.e. "can I (an __exit__ing scope) see another scope that would be required to swallow this cancellation exception if I didn't swallow it?".

  • More concretely, "which scope is responsible for this cancellation?" logic certainly cannot explain Trio's behavior in the general case, because there are cases where "which scope is responsible for this cancellation" doesn't have a single answer. E.g.:

    deadline = anyio.current_time() + 1
    with CancelScope(deadline=deadline) as outer_scope:
        with CancelScope(deadline=deadline) as inner_scope:
            await anyio.sleep(2)

    Here, the scopes have the same exact deadline. But, "which is responsible for the cancellation"? I think one would have to admit that either

    • Both scopes are equally responsible for triggering the cancellation.

      But this doesn't make sense given the documentation of cancelled_caught.

    • One scope is responsible for triggering the cancellation, but it's undefined which one. (I.e., it's left as an implementation detail determined by the order in which the Trio loop's deadlines get processed.)

      But this doesn't make sense given the documentation added in python-trio/trio#901 (Delay deciding which cancel scope a Cancelled exception belongs to).

@gschaffner
Collaborator

gschaffner commented Sep 12, 2024

After a day of debugging, I realized that I had gotten one thing wrong with my asyncio cancel scope fix: apparently when exiting a cancel scope, it should only reraise a cancellation exception if the inner scope was not shielded. I don't know how I could've deduced that from anything...

It would seem that my solution of raising a new CancelledError was wrong to begin with. If an unshielded Trio cancel scope is exited with a cancellation error and an eligible enclosing scope is in a cancelled state, it just passes through the same exception instance.

Yeah. A comment that I would recommend reading if you haven't yet is python-trio/trio#860 (comment). For me at least, this is the clearest explanation of the behavior that I have found (although YMMV of course).

The documentation of the behavior in the docs is quite buried and the docs are subtly wrong about it in various places (self-inconsistent) though :/

Comment on lines +771 to 772
if not self.cancel_scope._effectively_cancelled:
self.cancel_scope.cancel()
Collaborator

@gschaffner gschaffner Sep 12, 2024

I believe that this is incorrect. (However, this is not a regression introduced here; it was already incorrect on master too, but I noticed it during this review. See #787.)

Do you want to fix #787 in this PR alongside the other bugs? If so, I think suggested changes would be:

(Suggested changes on src/anyio/_backends/_asyncio.py and tests/test_taskgroups.py; outdated, now resolved.)
Comment on lines +558 to +560
task.cancel(f"Cancelled by cancel scope {id(origin):x}")
if task is origin._host_task:
origin._cancel_calls += 1
Collaborator

@gschaffner gschaffner Sep 19, 2024

Here's a case that's failing still:

    async def test_cancelled_scope_based_checkpoint(self) -> None:
        """Regression test closely related to #698."""
        with CancelScope() as outer_scope:
            outer_scope.cancel()

            try:
                # The following three lines are a way to implement a checkpoint
                # function. See also https://github.com/python-trio/trio/issues/860.
                with CancelScope() as inner_scope:
                    inner_scope.cancel()
                    await sleep_forever()
            finally:
                assert cast(asyncio.Task, asyncio.current_task()).cancelling()

            pytest.fail("checkpoint should have raised")

        assert not cast(asyncio.Task, asyncio.current_task()).cancelling()

I believe this case should pass. In my toy implementation I fixed it via #774 (comment).

Collaborator

@gschaffner gschaffner Sep 21, 2024

A note, in case it's needed in the future: it was decided not to deal with this right now and to see whether it causes issues. The discussion was at:

#774 (comment)

#774 (comment)

https://matrix.to/#/!JfFIjeKHlqEVmAsxYP:gitter.im/$lwgNmFfGbPHRcsIcXgkc83VK1in4JH6_QwD1sp4iZHc?via=gitter.im

(Further review comments on docs/versionhistory.rst and src/anyio/_backends/_asyncio.py; outdated, now resolved.)
@agronholm agronholm merged commit 01a37c6 into master Sep 21, 2024
16 checks passed
@agronholm agronholm deleted the fix-695 branch September 21, 2024 10:16
@gschaffner
Collaborator

For organization's sake/future reference, here's an enumeration of the potential upstream follow-ups:

Successfully merging this pull request may close these issues.

  • Different cancel scope behaviour on asyncio vs Trio
  • 100% CPU load after cancel