
What's the deal with partial cancellation #889

Open
njsmith opened this issue Jan 28, 2019 · 8 comments

@njsmith
Member

njsmith commented Jan 28, 2019

Recently we were discussing what happens if a subprocess call gets cancelled, but you want to at least find out what the subprocess said before it got killed. And @oremanj wrote:

I appreciate the simplicity of Trio's stance that "a cancelled operation didn't happen", but it doesn't necessarily compose very well -- if an operation is built out of multiple other underlying operations that can't readily be rolled back, either the "cancelled = didn't happen" rule has to break or the entire higher-level operation has to be uncancellable once started. I don't think we want to propose the latter, so maybe we should think about a user-friendly way to talk about the circumstances in which the rule gets bent?

It's a fair point! The "cancelled operation didn't happen" thing was only ever supposed to apply to low-level, primitive operations. In that context, it's a pretty important rule, because without it you can't ever hope to build anything sensible on top. But it's never made any sense for higher-level operations (i.e., the ones that working programmers are actually interacting with 99.99% of the time). Of course, at the time the initial docs were being written, I was struggling to figure out how to get the primitive operations to work at all, and there were no higher-level operations. So that rule probably gets more prominence than it should :-). But things have changed and we should have a better story here.

Recently in a discussion of how to talk about cancellation in the docs, @smurfix wrote:

The result of cancelling something is either (a) the called code didn't do anything, raising a Cancelled exception, or (b) the called code did what it was supposed to do, returning its result normally. Of course there's also the possibility of (c) the called code got part way through and left whatever it tried to accomplish in an inconsistent state.

It's probably out of Trio's scope to signal that state to the caller; there should be an attribute "is this object still usable", and/or the object should raise an InconsistentStateError when it's used again. We might want to document that as best practice, and maybe add that exception to Trio as a sensible default for trio-using libraries to raise.

So that's one idea for how Trio could provide concrete advice to users about how to work with partial cancellation.
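For concreteness, @smurfix's suggestion could be sketched in plain Python along these lines. (This is just a sketch: `InconsistentStateError` and `GuardedSender` are hypothetical names, not existing Trio APIs, and a real version would wrap async stream methods rather than a plain callable.)

```python
class InconsistentStateError(Exception):
    """Hypothetical exception: the object was left mid-operation by a
    cancellation, so its state is unknown."""


class GuardedSender:
    """Sketch of the "is this object still usable" idea: wrap a send
    function and poison the wrapper if a send is interrupted partway."""

    def __init__(self, send_fn):
        self._send_fn = send_fn
        self._broken = False

    def send_all(self, data):
        if self._broken:
            raise InconsistentStateError(
                "stream state unknown after an interrupted send"
            )
        try:
            return self._send_fn(data)
        except BaseException:
            # Any exception escaping mid-send (including a cancellation)
            # may have left a partial write behind, so refuse further use.
            self._broken = True
            raise
```

The point of catching `BaseException` rather than `Exception` is that `trio.Cancelled` deliberately doesn't inherit from `Exception`, so a cancellation would otherwise slip past the guard.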

I don't have any organized thoughts here, so I'm just going to dump a bunch of unorganized ones.


There were two concrete proposals that @oremanj made in the subprocess discussion (unless there were more and I'm forgetting some :-)):

  • Add timeout and deadline arguments to trio.run_process. These would have a similar effect to wrapping a cancel scope around run_process, except that if the timeout expires, then run_process wouldn't raise Cancelled, it would raise CalledProcessError, which would be a special exception with attributes recording whatever partial output, return code, etc., we got from the process.

    The downside of this is that it's extremely specific to subprocesses, which feels weird. The problem is really "what do you do if an operation times out and you want partial results?" – I actually have no idea what makes subprocesses special here, as compared to, I don't know, calling some docker API or something. So a solution that's specific to subprocesses doesn't feel natural. OTOH it would work, and maybe there's some reason that people need partial results from subprocesses a lot, and don't in other cases, so something simple and specific is fine.

  • Give run_process special (optional) semantics: if a Cancelled exception materializes while it's running, it would automatically replace it with CalledProcessError.

    This is a really intriguing idea, but makes me uncomfortable because we have no idea where that Cancelled is coming from – in particular, we don't know whether the code that was going to process the partial results is also cancelled, or not.

I don't actually know why @oremanj is so eager to get at partial results in this case; I gather he has some use case where he needs this feature, but I don't know what it is.


Another notorious example where cancellation loses information in an important way is Stream.send_all. Right now, if send_all gets cancelled, you effectively have to throw away that stream and give up, because you have no idea what data you have or haven't sent.

It wasn't always like this: originally, if send_all was cancelled, there was a hack where we'd attach an attribute to the Cancelled exception recording how many bytes we'd sent, and a sufficiently clever caller could potentially use that to reconstruct the state of the stream.

Then I added SSLStream and it quickly became clear that this design was no good. There are two major issues:

  1. exceptions may start out in some nice well-defined operation like SocketStream.send_all, but they propagate. That's what exceptions do! Right across abstraction boundaries. So, for example, if you called SSLStream.send_all, and it called SocketStream.send_all, then if you weren't careful then you could get an exception out of SSLStream.send_all that has metadata attached saying how many bytes SocketStream.send_all sent, which is catastrophically misleading.

  2. SSLStream actually has some pretty complicated internal state, because, well, you know. Cryptography. In particular, cancellation is very different: with something like SocketStream, if send_all is cancelled in the middle, the situation is pretty simple: you sent the first N bytes, but not the rest. With SSLStream, though, send_all immediately commits to sending all the bytes, before it sends any of them. So if it gets cancelled, we're in this weird state where it's sent some of the bytes, it's committed to sending the rest, but it hasn't yet. Oh, and we don't even know how many user-level bytes have actually been transmitted in a way that the other side can read them. (Like, we might know we sent 500 bytes on the underlying socket, but maybe 100 of those are protocol framing, and then the last 50 are actual application data, but application data that the other side can't decrypt until we send another 50 bytes to complete that frame... it's really messy.) There just is no useful way to communicate the state of an SSLStream after send_all is cancelled, no matter what metadata we attach to what exceptions.

So, instead, we've been going ahead with the rule that once a send_all is cancelled, your stream is doomed. We haven't done anything to detect this and e.g. raise an error if you try calling send_all again after a cancelled send_all, like in @smurfix's suggestion.... maybe we should?

And then as a consequence, for downstream users like trio-websocket, what we've been converging on is basically the rule that only one task should "own" a Stream for sending at a time – if you want a stream to survive sends from multiple tasks, then you create a background task that handles the send_all calls, and the other tasks send stuff to that task over some kind of channel. As @mehaase recently pointed out in #328 (comment), we might want to start documenting this more thoroughly? (#328 is generally relevant to these issues – it's ostensibly about send_all and locking, but really it's about sharing a stream between multiple tasks, and cancellation turns out to be a major consideration there.)

This does seem to be working out pretty well. So I guess the moral is that at least in this area, "partial results" just aren't an important case to think about. All the cases we care about are either "leaves the state inconsistent" or "atomic", and you can build the latter on top of the former (!) by using a background task + a channel, b/c the channel's send operation is atomic.


Some of this comment also feels relevant, especially the bit about "what does cancellation mean" near the end: #147 (comment)

@oremanj
Member

oremanj commented Jan 29, 2019

I don't actually know why @oremanj is so eager to get at partial results in this case; I gather he has some use case where he needs this feature, but I don't know what it is.

The use case is pretty common, I think:

  • run a subprocess with a timeout
  • if the subprocess is killed because the timeout expired, log something that will help someone track down what was unusually slow

For many subprocesses, their output is a good indicator of what they were doing when they were killed.

I agree there's nothing subprocess-specific about the "right" solution to this problem; I wanted to solve it in a subprocess-specific way mostly because I wasn't sure how to solve it generally.

Here's a sketch of a potential more-general solution: we could have things like SocketStream.send_all and run_process and so on support a save_partial_results=<cancel scope> kwarg. Then they would still propagate Cancelled, but would save whatever useful partial results they had on that cancel scope, so you could write something like:

with move_on_after(5) as scope:
    await trio.run_process(["something"], save_partial_results=scope)
if scope.cancelled_caught:
    log_error("process timed out:\n" + scope.partial_results[trio.run_process].stdout)

It could also be a different object; I just figure the cancel scope is handy and available. It could also be done implicitly (without having to be requested), and ambiguities resolved using an appropriate dictionary key. But maybe this is unwarranted complexity for an unproven problem.

@njsmith
Member Author

njsmith commented Feb 13, 2019

Here's a sketch of a potential more-general solution: we could have things like SocketStream.send_all and run_process and so on support a save_partial_results=<cancel scope> kwarg

HMMM. I think this may be on to something!

Using a cancel scope seems a bit ad hoc, like you say. What's the range of options? Let's take run_process as our example:

  • pass in an empty CompletedProcess object that the function fills in with partial results?
  • pass in some kind of container, like an empty dict or list or object, and we assign to a key/append something/assign to an attribute?
  • same, but create a dedicated container for this, like trio.PartialResults, that just acts like a mutable cell?
  • add a slot on CancelScope to hold partial results, basically making it our PartialResults object?

At a first glance, to me the empty CompletedProcess and the trio.PartialResults seem like the nicer options among these four. Does CompletedProcess have a public constructor?

@oremanj
Member

oremanj commented Feb 13, 2019

subprocess.CompletedProcess can't be constructed without at least some info:

>>> subprocess.CompletedProcess()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __init__() missing 2 required positional arguments: 'args' and 'returncode'
>>> 

I like the trio.PartialResults idea.

@njsmith
Member Author

njsmith commented Feb 13, 2019

On further thought, it should be more like trio.PartialResult (no s), because it only holds one value :-)

Do we have any examples of places where we want to use this besides trio.run_process? We could do it for SocketStream.send_all, but I'm not sure if it's worth bothering, given that it can't be supported by SSLStream.send_all or Stream.send_all.
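The mutable-cell idea could look roughly like this (a sketch only: trio.PartialResult does not exist, and all the names here are made up for illustration):

```python
class PartialResult:
    """Sketch of the proposed trio.PartialResult: a one-value mutable
    cell that an operation fills in with whatever it managed to complete
    before a cancellation propagates."""

    _UNSET = object()

    def __init__(self):
        self._value = PartialResult._UNSET

    def set(self, value):
        # Called by the cancelled operation before re-raising Cancelled.
        self._value = value

    @property
    def is_set(self):
        return self._value is not PartialResult._UNSET

    def unwrap(self):
        if not self.is_set:
            raise ValueError("no partial result was recorded")
        return self._value
```

An operation like a hypothetical `run_process(..., partial_result=cell)` could then call `cell.set(...)` with whatever output it had collected before letting Cancelled propagate, and the caller would check `cell.is_set` after the cancel scope exits.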

@jab
Member

jab commented May 8, 2019

(For .receive_some, if you time out before downloading the requested number of bytes, isn't a PartialResult useful for recovering what was downloaded? Sorry if I'm missing something obvious!)

@njsmith
Member Author

njsmith commented May 9, 2019

@jab Stream.receive_some(N) means: receive some bytes – definitely no more than N. So it effectively already supports partial results: if they asked for up-to-10, and you have 5, you can just return 5 – that's a normal successful return, no need for an exception at all.

(Important context: Trio generally follows the rule that if an operation completed successfully, then it doesn't raise Cancelled. The intuition is, if you asked for a timeout of 30 seconds, and the operation completed successfully after 30.01 seconds, before our timeout logic had a chance to kick in and stop it... then oh well, we might as well call it a success; no point in throwing away the successful result.)

@jab
Member

jab commented May 10, 2019

Thanks @njsmith! So only in the "0 bytes downloaded" case would receive_some raise Cancelled after a timeout – makes sense.

@GalaxySnail
Contributor

GalaxySnail commented Dec 4, 2021

Another notorious example where cancellation loses information in an important way is Stream.send_all. Right now, if send_all gets cancelled, you effectively have to throw away that stream and give up, because you have no idea what data you have or haven't sent.

That's right. Cancellation on FdStream.send_all is really a problem. For example:

# pipes.py

import os
import trio


async def main():
    r, w = os.pipe()
    b = b"a" * (65536 * 2)

    async with trio.lowlevel.FdStream(r) as rstream:
        async with trio.lowlevel.FdStream(w) as wstream:
            with trio.move_on_after(1):
                await wstream.send_all(b)

        async for data in rstream:
            print(len(data))


if __name__ == "__main__":
    trio.run(main)

$ python pipes.py
65536
$

And another example, with a subprocess:

# proc.py

import subprocess
from functools import partial
import trio
# trio at 356db30e901fcde82b8fd0acdd3c109ca61e2156 (2021.7.14) or later


async def main():
    async with trio.open_nursery() as nursery:
        proc = await nursery.start(partial(
            trio.run_process,
            ["/bin/cat"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
        ))

        b = b"a" * (65536 * 16)

        async with proc.stdin:
            with trio.move_on_after(1):
                await proc.stdin.send_all(b)

        async with proc.stdout:
            async for data in proc.stdout:
                print(len(data))


if __name__ == "__main__":
    trio.run(main)

$ python proc.py
65536
65536
65536
$ 

A similar thing happens with SocketStream too:

# sockets.py

from functools import partial
import trio
from trio.testing import open_stream_to_socket_listener


async def handler(event, cancel_scope, stream):
    await event.wait()

    async for data in stream:
        print(len(data))

    cancel_scope.cancel()


async def main():
    b = b"a" * (65536 * 16)
    event = trio.Event()

    async with trio.open_nursery() as nursery:
        listeners = await nursery.start(
            trio.serve_tcp,
            partial(handler, event, nursery.cancel_scope),
            0
        )

        async with await open_stream_to_socket_listener(listeners[0]) as stream:
            with trio.move_on_after(1):
                await stream.send_all(b)

        event.set()


if __name__ == "__main__":
    trio.run(main)

$ python sockets.py
65536
32687
$ 

So, for this case, send_all is difficult to use:

async for data in recv_stream:
    with trio.move_on_after(0.1):
        await send_stream.send_all(data)
    # update the progress bar every 0.1s
    progress_bar.update()
    # We don't know how many bytes were sent, so we can't even retry.

In my opinion, a new PartialSendStream.send_some may be helpful:

class PartialSendStream(SendStream):

    @abstractmethod
    async def send_some(self, data: ReadableBuffer) -> int: ...

    @abstractmethod
    def send_some_nowait(self, data: ReadableBuffer) -> int: ...

send_some is like POSIX write: it returns the number of bytes actually sent. But the key is that it bends the cancellation semantics: when send_some gets cancelled, if nothing was sent, it raises Cancelled as usual; otherwise, some bytes were already sent, and it immediately returns the number of bytes sent, without raising any exception. Since cancellation is level-triggered, I think that's not a big problem: if the caller calls it repeatedly, everything will be fine. But if send_some is the last checkpoint in a function, the caller has the responsibility to call trio.lowlevel.checkpoint_if_cancelled to preserve the cancellation semantics.

For example:

async for data in recv_stream:
    data = memoryview(data)
    remaining = len(data)
    while remaining:
        with trio.move_on_after(0.1) as cancel_scope:
            sent = await partial_send_stream.send_some(data[-remaining:])
            # consume its return value before the next checkpoint
            remaining -= sent
            await do_some_other_checkpoints()
        progress_bar.update()

send_some_nowait is like posix write with the O_NONBLOCK flag set. It isn't a checkpoint, just like trio.MemorySendChannel.send_nowait.

Moreover, I think a PartialSendStream.wait_send_some_wont_block method might be useful, but I'm not sure it can be implemented portably.

class PartialSendStream(SendStream):

    @abstractmethod
    async def wait_send_some_wont_block(self, at_least: int) -> None: ...

It's like SendStream.wait_send_all_might_not_block, but for send_some. After it returns, a send_some call with no more than at_least bytes will definitely not block. If at_least is too large (for example, larger than the kernel buffer size), this method can raise a ValueError. I think it's more useful than wait_send_all_might_not_block.

If PartialSendStream is introduced, SocketStream and FdStream should implement it, but SSLStream shouldn't. And if send_all gets cancelled, all further operations on the stream would raise BrokenResourceError, like SSLStream does today, even for SocketStream and FdStream.

However, there's the question of how to name the "partial-sendable" versions of abc.Stream, abc.HalfCloseableStream, and StapledStream. I have no clear idea about that.

By the way, a receive_some_nowait method for ReceiveStream may also be useful.
