
Fixed TaskGroup and CancelScope exit issues on asyncio #774

Merged: 33 commits from fix-695 into master on Sep 21, 2024

Conversation

agronholm
Owner

Changes

This fixes cancel scope exit behavior on asyncio when an outer cancel scope has been cancelled, as well as the constant, fruitless cancellation attempts against a task that is waiting in TaskGroup.__aexit__().

Fixes #695.
Fixes #698.

Checklist

If this is a user-facing code change, like a bugfix or a new feature, please ensure that
you've fulfilled the following conditions (where applicable):

  • You've added tests (in tests/) which would fail without your patch
  • You've updated the documentation (in docs/, in case of behavior changes or new
    features)
  • You've added a new changelog entry (in docs/versionhistory.rst).

If this is a trivial change, like a typo fix or a code reformatting, then you can ignore
these instructions.

Updating the changelog

If there are no entries after the last release, use **UNRELEASED** as the version.
If, say, your patch fixes issue #123, the entry should look like this:

* Fix big bad boo-boo in task groups (`#123 <https://github.com/agronholm/anyio/issues/123>`_; PR by @yourgithubaccount)

If there's no issue linked, just link to your pull request instead by updating the
changelog after you've created the PR.

@agronholm agronholm changed the title Fix 695 Fixed TaskGroup and CancelScope exit issues on asyncio Aug 29, 2024
@agronholm agronholm requested a review from Zac-HD August 29, 2024 14:41
Collaborator

@gschaffner gschaffner left a comment

This one has been melting my mind a bit, sorry it has taken a while.

  • Consider adding a regression test for the case written up in #698 (Different cancel scope behaviour on asyncio vs Trio): cac85cc (test_cancelled_raises_beyond_origin).

    • Another suggested case to add: 56cb8b2.
  • Consider adding a regression test for a case closely related to the one in #698: 73f1ab7 (test_deadline_based_checkpoint). Note that the failure mode of test_deadline_based_checkpoint on master is different from the failure mode of test_cancelled_raises_beyond_origin on master.

    One reason I like this test case is that I found Trio's behavior in the #698 case ("a level cancellation exception can keep propagating beyond the level that spawned the exception") surprising. (It violated my assumptions about Trio's level cancellation, and as pointed out in https://trio.discourse.group/t/do-cancelled-exceptions-know-which-block-they-belong-to/508, it seems to violate the first paragraph of Trio's "cancellation semantics" docs too.)

    The nice thing about this test, IMO, is that if you accept the premise that

    with CancelScope(deadline=-math.inf) as inner_scope:
        await sleep_forever()

    is a valid way to implement checkpoint(), then the strange behavior follows directly from that (mostly). (I say "mostly" because in the test_deadline_based_checkpoint case, the Cancelled was legitimately triggered by either scope (I think it is undefined which scope is responsible for triggering it). The test_cancelled_raises_beyond_origin case is stranger, because at the time of coro.throw(Cancelled()), it is unambiguous which scope is responsible for throwing the Cancelled, but the exception still propagates up beyond that scope.)

  • Consider adding a regression test for a case closely related to #698: c1a4a85 (test_cancelled_scope_based_checkpoint). This is very similar to test_deadline_based_checkpoint, except that (IIUC) deadline_based_checkpoint was a valid checkpoint() implementation even before python-trio/trio#860 (Delay deciding which cancel scope a Cancelled exception belongs to), whereas cancelled_scope_based_checkpoint only became equivalent to checkpoint() with python-trio/trio#860.

  • A bug: CancelScope.cancelled_caught is wrong in some cases. (I did not notice this until a lot later, but for the sake of the review being more organized, I've retroactively added tests for this to the four tests suggested above, which fail on this branch right now.)

  • A longer comment regarding #698:

    • AnyIO currently uses task.cancel(msg) to associate a level-CancelledError with the cancel scope that made the corresponding task.cancel(msg="cancelled by scope {origin}") call (the level-CancelledError's "origin" scope).

      These associations are what #698 is all about. Trio does not associate each Cancelled with a particular "origin" scope (python-trio/trio#860) [1]. The asyncio backend does make such associations, however. Thinking about it from Trio's perspective, this seems strange: why would AsyncIOBackend associate each level-CancelledError with a particular cancel scope? AsyncIOBackend needs to behave like Trio, so it needs to behave as if it doesn't make such associations. On asyncio, are these associations somehow necessary in order to produce the correct, association-free behavior?

    • Something you mentioned: When Python 3.8 was supported, AnyIO could not rely on cancellation messages. It could do so in sys.version_info blocks, but so long as 3.8 was supported, the overall strategy of the CancelScope and TaskGroup code was restricted to not rely much on cancellation messages being available. IIUC, you dropped 3.8 ahead of EoL specifically so that AnyIO can always rely on cancellation messages in _uncancel.

      I had the thought that it's possible that _uncancel is not the only avenue that the msg support opens up. AnyIO now has the (new?) ability to differentiate between a level-CancelledError and a native-CancelledError. That could possibly be useful beyond _uncancel.

    So, one way to state #698 is: AnyIO needs to behave as if it's not "tagging" a get_cancelled_exc_class() at .throw-time with an origin scope. Maybe a good way to do that would be to stop tagging each level-CancelledError with an origin scope in the first place? Is that even possible? (AnyIO does have the new(?) ability mentioned above, so perhaps that could help.)

    While trying to learn/understand both python-trio/trio#860 and this PR, I ended up playing with the code a lot, and at some point I realized I had basically attempted to do the above. It seems to work. (At least, it passes all tests, but it could easily still be broken around uncancellation or something.) Link: c1a4a85...d802a81. I did not find the cancelled_caught misbehavior until later, but this implementation passes those four suggested tests as well.

    Do you think that the idea of relying on msg like this could be useful? (It may be that this particular implementation is better to use only as a toy model for the sake of me building understanding, but the msg idea may still be independently useful.)

    (There are also a few things I found valuable about this implementation of CancelScope.__exit__ & TaskGroup.__aexit__ that I think are worth pointing out. One is that it is very feasible to cross-reference with the equivalent code in the Trio project. While I was trying to get the behavior right, being able to compare line-by-line with the equivalent code in Trio was very helpful for finding and fixing cases that I had missed or misunderstood. My mental model of cancel scopes also maps much more directly onto this implementation, line by line, so I find it much easier to reason about. I would be curious whether it is this way for others as well. However, this implementation is more verbose (whether that's good or bad depends on the use case), and the processing cost during __(a)exit__ is likely a bit higher because of the .split calls (which Trio does too) and because of the _parent_cancelled() call.

    I point out these implementation details because I think the CancelScope.__exit__ & TaskGroup.__aexit__ implementations on master and in this PR are very difficult to read through when auditing for correctness, and to change when a bug is found [2]. I would be curious to know whether you and others feel the same. My thought is that perhaps these things could be made less difficult if the implementation aligned its structure more directly with the mental model it implements.)
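    The msg-based level/native distinction discussed above can be sketched with plain asyncio. This is only a toy illustration of the idea, not AnyIO's actual code; the tag format and names here are made up:

    ```python
    import asyncio

    # Hypothetical tag format; AnyIO's real message format may differ.
    SCOPE_MSG_PREFIX = "cancelled by scope "

    async def worker() -> str:
        try:
            await asyncio.sleep(3600)
        except asyncio.CancelledError as exc:
            # The message passed to Task.cancel(msg) shows up in exc.args,
            # so a tagged "level" cancellation can be told apart from a
            # native (untagged) one.
            if exc.args and str(exc.args[0]).startswith(SCOPE_MSG_PREFIX):
                return "level"
            return "native"
        return "never cancelled"

    async def main() -> tuple[str, str]:
        t1 = asyncio.create_task(worker())
        await asyncio.sleep(0)  # let the worker reach its await point
        t1.cancel(SCOPE_MSG_PREFIX + "deadbeef")  # tagged "level" cancellation
        tagged = await t1

        t2 = asyncio.create_task(worker())
        await asyncio.sleep(0)
        t2.cancel()  # no message: looks like a native cancellation
        untagged = await t2
        return tagged, untagged

    print(asyncio.run(main()))  # -> ('level', 'native')
    ```

    Whether relying on this is robust in the presence of third-party code that also calls task.cancel() with its own messages is a separate question.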

Footnotes

  [1] This is part of Trio's documented behavior; it's in the changelog (although some other parts of Trio's docs got out of sync and contradict this).

  [2] I still don't fully understand __exit__'s swallowing logic in this PR :/

@gschaffner
Collaborator

I added two commits to my toy implementation to fix uncancellation: c1a4a85...cbc8c2c (new changes only: d802a81...cbc8c2c)

@agronholm
Owner Author

My takeaways here:

  • Semantics should match between asyncio and Trio, and in the current state of this PR, they don't
  • The cancellation message: I'd prefer it to include the ID of the cancel scope where it originated, for easier debugging, but maybe we shouldn't use that ID to decide whether to swallow the exception in CancelScope.__exit__()

@agronholm
Owner Author

What I have some difficulties understanding is this:

async def test_cancelled_raises_beyond_origin() -> None:
    with CancelScope() as outer_scope:
        with CancelScope() as inner_scope:
            inner_scope.cancel()
            try:
                await checkpoint()
            finally:
                outer_scope.cancel()

            pytest.fail("checkpoint should have raised")

    assert not inner_scope.cancelled_caught
    assert outer_scope.cancelled_caught

By what logic should outer_scope set the cancelled_caught flag? This is what Trio's documentation says about it:

Records whether this scope caught a Cancelled exception. This requires two things: (1) the with block exited with a Cancelled exception, and (2) this scope is the one that was responsible for triggering this Cancelled exception.

Here, the inner scope should not have it set because it doesn't catch the CancelledError, but the outer scope does. However, the outer scope did not trigger the CancelledError – it was the inner scope that did so. The outer scope was explicitly cancelled, but the CancelledError was passed through by the inner scope, so it doesn't belong to the outer scope.

Perhaps the logic is that, given that both cancel scopes were cancelled, it's the outer scope that's wholly responsible for the cancellation? And if the inner scope was shielded, it would be doing its own cancellation? But in that case, what would be the logic for outer_scope.cancelled_caught being False, if outer_scope was explicitly cancelled and inner_scope would raise a new CancelledError on exit?

I made two separate versions of this test case for clarity, and added comments reflecting how I think Trio's logic goes:

async def test_cancelled_raises_beyond_origin_unshielded() -> None:
    with CancelScope() as outer_scope:
        with CancelScope() as inner_scope:
            inner_scope.cancel()
            try:
                await checkpoint()
            finally:
                outer_scope.cancel()

            pytest.fail("checkpoint should have raised")

    # Here, the outer scope is responsible for the cancellation, so the inner scope
    # won't catch the cancellation exception, but the outer scope will
    assert not inner_scope.cancelled_caught
    assert outer_scope.cancelled_caught


async def test_cancelled_raises_beyond_origin_shielded() -> None:
    with CancelScope() as outer_scope:
        with CancelScope(shield=True) as inner_scope:
            inner_scope.cancel()
            try:
                await checkpoint()
            finally:
                outer_scope.cancel()

            pytest.fail("checkpoint should have raised")

    # Here, the inner scope is the one responsible for cancellation, and given that the
    # outer scope was also cancelled, it is not considered to have "caught" the
    # cancellation, even though it swallows it, because the inner scope triggered it
    assert inner_scope.cancelled_caught
    assert not outer_scope.cancelled_caught

Could you validate or correct my assumptions here?

@agronholm
Owner Author

After a day of debugging, I realized that I had gotten one thing wrong with my asyncio cancel scope fix: apparently when exiting a cancel scope, it should only reraise a cancellation exception if the inner scope was not shielded. I don't know how I could've deduced that from anything...

@agronholm
Owner Author

It would seem that my solution of raising a new CancelledError was wrong to begin with. If an unshielded Trio cancel scope is exited with a cancellation error and an eligible enclosing scope is in a cancelled state, it just passes through the same exception instance.
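The "same exception instance" behavior is easy to see in miniature: a context manager that declines to swallow an exception simply returns a falsy value from __exit__, and the original exception object keeps propagating unchanged. A minimal sketch (a toy, not anyio's actual CancelScope):

```python
class PassthroughScope:
    """Toy context manager that lets exceptions pass through untouched."""

    def __enter__(self) -> "PassthroughScope":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Returning False means "don't swallow": Python keeps propagating
        # the very same exception object, not a copy or a new instance.
        return False

original = ValueError("boom")
try:
    with PassthroughScope():
        raise original
except ValueError as caught:
    assert caught is original  # identical object: identity is preserved
```

Raising a fresh CancelledError instead would, among other things, discard the original traceback and any attached context.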

@agronholm
Owner Author

I've refactored the asyncio CancelScope implementation to resemble the Trio implementation much more closely now, hopefully making them easier to compare. I also fixed another bug that wasn't tested for before – empty task groups not yielding on exit on asyncio if there were no child tasks.
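For reference, "yielding on exit" means that exiting the (empty) task group must still act as a checkpoint, i.e. give other scheduled tasks a chance to run. In plain asyncio terms, the property being guaranteed looks like this (a sketch of the property, not AnyIO's implementation):

```python
import asyncio

events: list[str] = []

async def other_task() -> None:
    events.append("other ran")

async def main() -> None:
    task = asyncio.create_task(other_task())
    # A checkpoint must yield control to the event loop at least once;
    # asyncio.sleep(0) is the minimal way to do that. Without such a
    # yield, other_task would never get scheduled before we move on.
    await asyncio.sleep(0)
    events.append("after checkpoint")
    await task

asyncio.run(main())
print(events)  # -> ['other ran', 'after checkpoint']
```

A task group exit that never yields would starve sibling tasks in exactly this way.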

@agronholm agronholm requested a review from gschaffner September 8, 2024 10:57
@Zac-HD
Collaborator

Zac-HD commented Sep 10, 2024

I've still got a long review backlog after vacation + illness, sorry. Skimming this briefly, it might be easier to review if we separated out the drop-py-38 changes from the taskgroup and cancelscope changes.

@agronholm
Owner Author

I've still got a long review backlog after vacation + illness, sorry. Skimming this briefly, it might be easier to review if we separated out the drop-py-38 changes from the taskgroup and cancelscope changes.

@gschaffner is reviewing this now, and I trust his insights in the matter. I wasn't going to drop 3.8 support, but it was a hard requirement to fix these issues.

@gschaffner
Collaborator

gschaffner commented Sep 12, 2024

By what logic should outer_scope set the cancelled_caught flag? This is what Trio's documentation says about it:

Records whether this scope caught a Cancelled exception. This requires two things: (1) the with block exited with a Cancelled exception, and (2) this scope is the one that was responsible for triggering this Cancelled exception.

Trio's documentation is slightly wrong here. As Arthur pointed out on Discourse, Trio's docs weren't updated (in various locations) when the behavior was changed in python-trio/trio#901. Reading through that PR and the corresponding issue, I think it's clear that Trio's cancelled_caught docs would be more correct to say e.g.

Records whether this scope caught a Cancelled exception. This requires two things: (1) the with block exited with a Cancelled exception, and (2) no cancelled parent scope is visible to this scope's body when this scope exits.

(cc @arthur-tacca: Have you thought about raising a GH issue with Trio about the various places this is wrong in the documentation? I agree with your findings on Discourse and it seems like the docs should presumably get fixed upstream.)

Perhaps the logic is that, given that both cancel scopes were cancelled, it's the outer scope that's wholly responsible for the cancellation? [...]

[...]

[...]
    # Here, the outer scope is responsible for the cancellation, so the inner scope
    # won't catch the cancellation exception, but the outer scope will
[...]
    # Here, the inner scope is the one responsible for cancellation, and given that the
    # outer scope was also cancelled, it is not considered to have "caught" the
    # cancellation, even though it swallows it, because the inner scope triggered it
[...]

Could you validate or correct my assumptions here?

I believe one wrong assumption here is that Trio is using "which scope is responsible for [catching] this cancellation?" logic.

  • AFAIU Trio doesn't have any "which scope is responsible for this cancellation?" logic, at all. Instead Trio's logic is "am I the only scope that is responsible for this cancellation?". I.e. "can I (an __exit__ing scope) see another scope that would be required to swallow this cancellation exception if I didn't swallow it?".

  • More concretely, "which scope is responsible for this cancellation?" logic certainly cannot explain Trio's behavior in the general case, because there are cases where "which scope is responsible for this cancellation" doesn't have a single answer. E.g.:

    deadline = anyio.current_time() + 1
    with CancelScope(deadline=deadline) as outer_scope:
        with CancelScope(deadline=deadline) as inner_scope:
            await anyio.sleep(2)

    Here, the scopes have the same exact deadline. But, "which is responsible for the cancellation"? I think one would have to admit that either

    • Both scopes are equally responsible for triggering the cancellation.

      But this doesn't make sense given the documentation of cancelled_caught.

    • One scope is responsible for triggering the cancellation, but it's undefined which one. (I.e., it's left as an implementation detail determined by the order in which the Trio loop's deadlines get processed.)

      But this doesn't make sense given the documentation added in python-trio/trio#901 (Delay deciding which cancel scope a Cancelled exception belongs to).

@gschaffner
Collaborator

gschaffner commented Sep 12, 2024

After a day of debugging, I realized that I had gotten one thing wrong with my asyncio cancel scope fix: apparently when exiting a cancel scope, it should only reraise a cancellation exception if the inner scope was not shielded. I don't know how I could've deduced that from anything...

It would seem that my solution of raising a new CancelledError was wrong to begin with. If an unshielded Trio cancel scope is exited with a cancellation error and an eligible enclosing scope is in a cancelled state, it just passes through the same exception instance.

Yeah. A comment that I would recommend reading if you haven't yet is python-trio/trio#860 (comment). For me at least, this is the clearest explanation of the behavior that I have found (although YMMV of course).

The documentation of the behavior in the docs is quite buried and the docs are subtly wrong about it in various places (self-inconsistent) though :/

Comment on lines +771 to 772
if not self.cancel_scope._effectively_cancelled:
self.cancel_scope.cancel()
Collaborator

@gschaffner gschaffner Sep 12, 2024

I believe that this is incorrect. (However, this is not a regression introduced here; it was already incorrect on master too, but I noticed it during this review. See #787.)

Do you want to fix #787 in this PR alongside the other bugs? If so, I think suggested changes would be:

(Suggested changes on src/anyio/_backends/_asyncio.py and tests/test_taskgroups.py; outdated, now resolved.)
Comment on lines +558 to +560
task.cancel(f"Cancelled by cancel scope {id(origin):x}")
if task is origin._host_task:
origin._cancel_calls += 1
Collaborator

@gschaffner gschaffner Sep 19, 2024

Here's a case that's failing still:

    async def test_cancelled_scope_based_checkpoint(self) -> None:
        """Regression test closely related to #698."""
        with CancelScope() as outer_scope:
            outer_scope.cancel()

            try:
                # The following three lines are a way to implement a checkpoint
                # function. See also https://github.com/python-trio/trio/issues/860.
                with CancelScope() as inner_scope:
                    inner_scope.cancel()
                    await sleep_forever()
            finally:
                assert cast(asyncio.Task, asyncio.current_task()).cancelling()

            pytest.fail("checkpoint should have raised")

        assert not cast(asyncio.Task, asyncio.current_task()).cancelling()

I believe this case should pass. In my toy implementation I fixed it via #774 (comment).

Collaborator

@gschaffner gschaffner Sep 21, 2024

A note, in case it's needed in the future: it was decided not to deal with this right now and to see whether it causes issues. The discussion was at:

#774 (comment)

#774 (comment)

https://matrix.to/#/!JfFIjeKHlqEVmAsxYP:gitter.im/$lwgNmFfGbPHRcsIcXgkc83VK1in4JH6_QwD1sp4iZHc?via=gitter.im

(Further review comments on docs/versionhistory.rst and src/anyio/_backends/_asyncio.py; outdated, now resolved.)
@agronholm agronholm merged commit 01a37c6 into master Sep 21, 2024
16 checks passed
@agronholm agronholm deleted the fix-695 branch September 21, 2024 10:16
@gschaffner
Collaborator

For organization's sake/future reference, here's an enumeration of the potential upstream follow-ups:

Successfully merging this pull request may close these issues.

  • Different cancel scope behaviour on asyncio vs Trio
  • 100% CPU load after cancel