More ergonomic support for graceful shutdown #147
For the record, the reason I didn't do this in the first place is that I feel like … Maybe it will turn out that I'm wrong, and … In the meantime, I'm sort of feeling like maybe we should remove the kwarg from …

While we're at it, here's another idea: maybe we can have a way to say "yes, I know I'm cancelled, but please stop raising it unless an operation actually blocks". Like a shield with a zero-second per-operation timeout. So you could attempt to do things like send a goodbye message, but if it doesn't succeed immediately, oh well, let's move on. It would at least remove the issue of having to pick some magic timeout value way down inside the code...

...Though, I guess this doesn't actually help for your use case with websockets, since the main reason websockets have a goodbye message is so the initiator can then wait until they see the corresponding goodbye message from their peer, and know that both sides successfully processed all the data that was sent. (Since otherwise TCP can lose some data at the end of a connection.) Sending a goodbye message and then immediately closing the socket is pretty useless, but that's all my idea above would allow for.

OTOH I'm not entirely sure I buy that it makes sense to send websocket goodbye messages on control-C in the first place :-). "Potential data loss" sounds bad, but here it just means that the connection might act like you hit control-C a little earlier than you actually did... and who's to say that's not what happened? (Unless you're carefully monitoring the connection state and choosing when to hit control-C with millisecond precision, which seems like a weird hobby.) But maybe the problem is more general – certainly there are a lot of protocols and applications that have some concept of a graceful shutdown. |
At the application level, I think a graceful shutdown might be fairly common. For example, I may want to make a call to a local database (i.e. something with time constraints I understand fairly well), storing the last record acknowledged by a remote WebSocket client. With nursery cancellation as the best way to manage a group of tasks, I think this may come up in many circumstances other than control-C (e.g. explicitly deciding to close a single connection). I am not sure I understand your concerns about adding … |
I think there are two things making me uneasy here:
These two concerns are closely related, because if we're hard-coding timeouts into little pieces of the code that can't see the big picture, then we're very likely to make the wrong trade-offs; OTOH if timeouts are being configured as late as possible and at as high a level as possible, then it's much more plausible that we can make the right trade-offs. I almost feel like what we want is for a top-down cancellation to be able to specify a "grace period" – so the code that calls …
That's enough for cases where the connection is active, so e.g. we can imagine an HTTP server that every time it goes to send a response, it checks for … One obvious way to do this would be to send in an exception, and then code that wants to handle it gracefully can catch it and do whatever clever thing it wants. But... in general, you can't necessarily catch a …

Maybe it's enough to make the soft-shield state toggleable, so you could do something like:

```python
async def handler(ws_conn):
    with soft_shield_enabled:
        while True:
            try:
                with soft_shield_disabled:
                    msg = await ws_conn.next_message()
            except Cancelled:
                await do_cleanup(ws_conn)
                return
            await ws_conn.send_message(make_response(msg))
```

...Need to think about this some more. I feel like there's something here, but it isn't fully gelled yet :-). |
Note: if we do promote graceful cancellation to a first-class concept, then I guess we'll want to rethink … |
|
I'm dubious about the wisdom of getting into the state where you have hours of valuable data accumulated in memory in the first place – maybe you should flush snapshots regularly on a timer? But that said, assuming there's a good reason to do things this way, I think you'd really want to implement this as a task that listens for control-C and then flushes a snapshot. There's nothing wrong with doing that if you have needs that are a little more unusual/extreme, and I think "I have hours of valuable data that I need to make sure get flushed to a database over an async connection while shutting down" counts.
Yeah, you'd have to be careful that the code you put inside the …

```python
async def next_message(self):
    while True:
        message = self._proto_state.get_next_message()
        if message is not None:
            return message
        else:
            # Receive some data and then try again
            data = await self._transport.recv(BUFSIZE)
            self._proto_state.receive_bytes(data)
```

then we're pretty much OK – the only …

However... this gets tricky for some protocols. If … A similar issue arises at the protocol level. For websockets, the … Solving these problems seems really hard :-(. The way these protocols work, once you start trying to send something then you're really committed to doing so – like you just can't stop in the middle of a websocket Pong frame and switch to a Close frame. (For OpenSSL you're even committing to retrying the same OpenSSL API call repeatedly; it's not enough to make sure that the bytes it gave you eventually get sent. But let's ignore that for right now.)

So what would be annoying but possible is to switch the sending to always use a …

Possibly this would become simpler if we change the semantics of …

Another possibility is that we add more shielding states. Specifically, a task could be in the four states listed down the left column here, and the table shows what checkpoints do in each case:
But we don't set these states directly; instead, there are two context managers: one that marks its contents as "soft-cancellation enabled", and one that marks its contents as "state may become corrupted if a cancellation happens". Then the states are:
So... this is really complicated. But the idea is that in the end, a method like … Hmm. |
Can we have a "task that listens for control-C"? I thought this was usually caught by the outer nursery itself after all tasks had canceled. While control-C is an important special case, I would think a problem with a single connection and the cancellation of its associated nursery would be the more routine scenario. I would want a graceful shutdown of these intermediate nurseries, giving a chance to persist a snapshot of the connection's associated application state. Of the options you mention, the four-shielding-state approach sounds the most elegant, although I realize this is beginning to also threaten the elegant simplicity you currently have :-) Still, I do like this idea of providing extra protection where dragons lie while adding soft-cancellation to give time for a graceful shutdown (although that is easy for the guy not implementing it to say!) |
By default, … This blog post goes into a lot more detail about the trade-offs of different ways of handling control-C. In particular, there's discussion of why the default …
Yeah, I'm not sure... some kind of graceful shutdown is a pretty common/fundamental desire, so maybe it's worth it? But this is an area where trio's already ahead of the competition, and it's extremely tricky, and novel, so I'll probably let it stew for a while before making any decisions. One thing that did occur to me is that I've already been thinking that in code like:

```python
async def recv(self, max_bytes):
    try:
        ...  # complicated stuff here
    except Cancelled:
        # our state is broken, but at least we can make sure that
        # any attempt to use it fails fast
        self.forceful_close()
        raise
```

If I wrapped that up into a context manager like … Hmm. |
I wonder if we should care that in principle someone might want to do some sort of graceful shutdown but doesn't care about the integrity of the stream they're reading from right now; it's some other object that needs cleanup. |
This seems like a relevant paper for the whole "cancellation fragility" issue: Kill-Safe Synchronization Abstractions |
Was thinking about this again in the shower today. It seems like we could maybe simplify the whole "soft shield" idea by making the default state be soft-shielded. That is, the semantics would be something like:
Code that's cancellation-oblivious gets the full time to complete. This seems pretty reasonable -- it's a bit annoying for code that's cancellation-oblivious and runs forever, because it will use up the full grace period before getting killed, but it's definitely a better default for any code that normally runs for a finite amount of time, and if infinite loops want good graceful cancellation then they generally need to do something special anyway. If you don't want to set a timeout, you can use …

Then code that wants to do something clever on graceful cancellation can explicitly register some interest -- I guess by flipping a flag to say "please consider graceful cancellations to be cancellations". (That's sufficient to implement …

Certainly that's enough for an HTTP server that wants to stop accepting new keepalive requests when the graceful shutdown is triggered. For the case of the websocket server that wants to send a goodbye over TLS... well, one option is to say that grace periods just don't mix well with renegotiation, sorry. In general renegotiation and similar is even less well-supported than I realized when I wrote the above. Or we could make … |
The terminology here could use some work... graceful is on the tip of my tongue because I just stripped it out of |
Another thing I just realized: for the minimal version of this idea where setting a grace period acts as (a) setting an overall deadline, and (b) setting a flag indicating that everyone should hurry up, and (c) letting specific pieces of code opt-in to receiving a cancellation when this flag is set, then it's actually totally possible to prototype as a third-party library. Having it built in would potentially have extra advantages, specifically in terms of letting the built-in server code integrate and letting |
There's some interesting discussion of graceful shutdown in Martin Sustrik's Structured Concurrency post, from a rather different perspective. |
A few thoughts that could be useful (after seeing the thread here: https://mail.python.org/pipermail/async-sig/2018-January/000437.html):

First, I'd separate out the distinction of what "knobs" a library API might want to expose to its users. These could be things like "finish in this total amount of time no matter what (including clean-up)," "allow x time for clean-up step y (no matter what)," or "aim to finish in this amount of time (excluding clean-up)." Then it would be up to the library author to use that information as it sees fit. For example, if a user specified 30 seconds absolute total time and 10 seconds for clean-up, then the library would compute and allot 20 seconds for the "success" case. Or the library could even tell the user their knob values are incompatible.

Lastly, it would be up to the framework (trio in this case) to expose to the library author the primitives to support using that information. One consequence of this way of thinking is that it seems like it would be best if the framework didn't even have to know about concepts like "shutdown" and "graceful." Instead, it could expose lower-level operations like increasing or decreasing the allotted time, or executing a block of code within a computed amount of time, and it would be up to the library author to map those operations to support graceful shutdown, etc. |
I think once you get into defining a set of knobs whose semantics are specific to the particular function being called, then those are what we call function arguments :-).

In general, there's a big trade-off between semantic richness and universality -- if Trio adds something to the core cancellation semantics, then everyone using trio has to deal with it to some extent, whether it makes sense in their case or not. This can be good – the reason trio has core cancellation semantics in the first place is that libraries that don't support cancellation/timeouts are essentially impossible to use correctly, and getting it right requires some global coordination across different libraries, so it's actually good that we force everyone to think about it and provide a common vocabulary for doing so. The need for universality definitely constrains what we can do though. We can't in general phrase things as "make sure to be finished by time X", because this requires being able to estimate ahead of time how long cleanup operations will take, and there's no way to do that for arbitrary code. And just in general I don't want to provide tons of bells and whistles that nobody uses, or are only useful in corner cases.

The semantic difference between regular execution and a "soft/graceful cancel" is that you're saying you want things to finish up, but allowing them to take their natural time. This is particularly relevant for operations that run forever, like a server accept loop – for them a graceful cancel is basically the same as a cancel. But for an operation with (hopefully) bounded time, like running an HTTP request handler, it can keep going. So it's... at least kinda general? And it definitely has some important use cases (HTTP servers!). But I'm not sure yet whether it's a good generic abstraction or not. |
Prototype here in case anyone wants to experiment with this: https://gist.github.com/njsmith/1c83788289aaed49e091c8281d85a85e (Please do, that would be very helpful for figuring out if it's something we really want to do :-)) |
Relevant to python-trio#886, python-trio#606, python-trio#285, python-trio#147, python-trio#70, python-trio#58, maybe others.

I was continuing my effort to shoehorn linked cancel scopes and graceful cancellation into `CancelScope` earlier today and it was feeling too much of a mess, so I decided to explore other options. This PR is the result. It makes major changes to Trio's cancellation internals, but barely any to Trio's cancellation semantics -- all tests pass except for one that is especially persnickety about `cancel_called`. No new tests or docs yet as I wanted to get feedback on the approach before polishing. An overview:

* New class `CancelBinding` manages a single lexical context (a `with` block or a task) that might get a different cancellation treatment than its surroundings. "All plumbing, no policy."
* Each cancel binding has an effective deadline, a _single_ task, and links to parent and child bindings. Each parent lexically encloses its children. The only cancel bindings with multiple children are the ones immediately surrounding nurseries, and they have one child binding per nursery child task plus maybe one in the nested child.
* Each cancel binding calculates its effective deadline based on its parent's effective deadline and some additional data. The actual calculation is performed by an associated `CancelLogic` instance (a small ABC).
* `CancelScope` now implements `CancelLogic`, providing the deadline/shield semantics we know and love. It manages potentially-multiple `CancelBinding`s.
* Cancel stacks are gone. Instead, each task has an "active" (innermost) cancel binding, which changes as the task moves in and out of cancellation regions. The active cancel binding's effective deadline directly determines whether and when `Cancelled` is raised in the task.
* `Runner.deadlines` stores tasks instead of cancel scopes.

There is no longer a meaningful state of "deadline is in the past but scope isn't cancelled yet" (this is what the sole failing test doesn't like). If the effective deadline of a task's active cancel binding is non-infinite and in the future, it goes in `Runner.deadlines`. If it's in the past, the task has a pending cancellation by definition.

Potential advantages:

* Cancellation becomes extensible without changes to _core, via users writing their own `CancelLogic` and wrapping a core `CancelBinding`(s) around it. We could even move `CancelScope` out of _core if we want to make a point.
* `Nursery.start()` is much simpler.
* Splitting shielding into a separate object from cancellation becomes trivial (they'd be two kinds of `CancelLogic`).
* Most operations that are performed frequently take constant time: checking whether you're cancelled, checking what your deadline is, entering and leaving a cancel binding. I haven't benchmarked, so it's possible we're losing on constant factors or something, but in theory this should be faster than the old approach.
* Since tasks now have well-defined root cancel bindings, I think python-trio#606 becomes straightforward via providing a way to spawn a system task whose cancel binding is a child of something other than the system nursery's cancel binding.

Caveats:

* We call `current_time()` a lot. Not sure if this is worth worrying about, and could probably be cached if so.
* There are probably bugs, because aren't there always?

Current cancel logic:

```python
def compute_effective_deadline(
    self, parent_effective_deadline, parent_extra_info, task
):
    incoming_deadline = inf if self._shield else parent_effective_deadline
    my_deadline = -inf if self._cancel_called else self._deadline
    return min(incoming_deadline, my_deadline), parent_extra_info
```

Want to support a grace period? I'm pretty sure it would work with something like:

```python
def compute_effective_deadline(
    self, parent_effective_deadline, parent_extra_info, task
):
    parent_cleanup_deadline = parent_extra_info.get(
        "effective_cleanup_deadline", parent_effective_deadline
    )
    if self._shield:
        parent_effective_deadline = parent_cleanup_deadline = inf
    my_cleanup_start = min(self._deadline, self._cancel_called_at)
    merged_cleanup_deadline = min(
        parent_cleanup_deadline, my_cleanup_start + self._grace_period
    )
    my_extra_info = parent_extra_info.set(
        "effective_cleanup_deadline", merged_cleanup_deadline
    )
    if self._shield_during_cleanup:
        effective_deadline = merged_cleanup_deadline
    else:
        effective_deadline = min(parent_effective_deadline, my_cleanup_start)
    return effective_deadline, my_extra_info
```

Maybe that's not quite _simple_ but it is miles better than what I was looking at before. :-)
Yep, pretty much. The devil is in the details, but then, that's always true. :-) I no longer think "graceful nurseries" as discussed above are a good idea -- what we Really Want (TM) is more like the "service nurseries" idea plus the base graceful cancellation support, and that's much much easier to think about. So, what's in "base graceful cancellation support"?
I no longer think the "grace period inheritance" bit in my original proposal is worth the complexity. |
|
I think we might be talking about two different things when we talk about "soft cancellation":
I think these are both useful in different circumstances, but I'm more interested in type 2 for the purposes of this thread, because it's a lot harder to support without handling in _core. (Type 1 does great with the …)

Type 1 wants the default to be "soft cancellation does nothing", and you mark the places that should respond to a soft cancellation (an appropriate point in each of the infinite loops); Type 2 wants it to be "soft cancellation is cancellation", and you mark the places that should get the extra time (to a first approximation, all the …).

Type 2 graceful cancellation at a high level probably looks like …
I think this is pretty generally useful?
Agreed, it does seem like the best option is to say "if you want graceful shutdown on Ctrl+C, use
Yeah, service nurseries are only useful if the body of the |
So far I've been imagining Type 2 graceful cancellation support using the tri-state cancel scope model (not cancelled / cancelled but shieldable by "soft-shield" scopes, i.e. "cancelled pending cleanup" / cancelled and only shieldable by "full-shield" scopes, i.e. "fully cancelled"). It occurs to me that we might also model it as an edge-triggered cancellation now (each task receives one … |
Huh. Yeah, I'm definitely thinking of "Type 1" shutdown. I've never encountered the other type before. I'm trying to figure out if I believe in it as a single coherent category or not :-). What makes me nervous is that in your examples, they all feel very specific to the details of the situation. Trio's cancel scopes shouldn't know the difference between SIGTERM and SIGKILL (though maybe OTOH everyone knows what an infinite loop is :-) (which is a nice characterization btw, thank you).

If I hand you some random third-party library, like I dunno, a complex HTTP app running under hypercorn, then what do you expect it to do if you send a "type 2 soft cancel"? What kind of guarantees would you expect it to give, that you could rely on, and file bugs if they weren't met? I may just be suffering from a lack of imagination, because I haven't seen systems that supported these more powerful concepts... do you have any more worked-out examples of where you've run into this, as a Not Exactly Median But Let's Call It Close Enough Trio User (or "NEMBLCICETU" for short)? |
The common thread I'm trying to point at with the "type 2 soft cancellation" is the idea that sometimes, while you can just tear everything down forcefully, and you need to be able to do so correctly to deal with misbehaving peers/networks/etc, you will get better outcomes if you can take a moment to tear things down non-forcefully when you get cancelled. And this probably just adds to your total deadline in practice if you hit the deadline because of a network problem, but maybe the slowdown was for some other reason and the network operations required for your graceful closure will complete quickly.
Definitely! But if Trio's cancel scopes provide a common vocabulary for saying "start sweeping the floors and telling customers we're closing in X seconds, and actually shove them out the door Y seconds after that if they haven't left yet", it's natural for phased shutdown of a process to use that vocabulary. (As for why we should care about supporting SIGTERM-wait-SIGKILL: it gives the child process time to clean up its resources, its own child processes, etc.)
Nope -- but if you can distinguish "this call timed out but the cancellation was processed normally" from "this call timed out and also I didn't hear back about my cancellation request in a reasonable amount of time", that's a useful input into the application-level code for deciding that a network partition has happened and coping with it. Or maybe we're not thinking in terms of network partitions at all, but we do want to see any exception that might have been raised by the remote side as a result of buggy handling of the cancellation.
I was using the same grace period for all the examples, but this example probably makes more sense with a tiny one -- do we really not want to wait even 100 ms for the send queue to drain in this case?
I would expect it to document its behavior in the soft-cancelled state, like "existing requests will be allowed to complete but we'll stop accepting new requests and we'll send CLOSE on all websocket connections once we're done sending queued messages", or whatever. I wouldn't use soft-cancel with a library that didn't define semantics for it, unless I was OK with getting hard-cancel instead. But I also wouldn't use "type 1" soft cancel with a library that didn't define semantics for it, because if it's oblivious then I'm just sleeping for a while before doing a hard cancel, and that's a waste of time. Some more examples of "type 2 soft cancel" being useful, off the top of my head:
|
Also, it's worth mentioning that I haven't personally encountered any cases where I needed to set both a deadline and a grace period, but I think that's largely because I'm not using deadlines very much at all in the stuff I've been writing - it's all on reliable internal networks, and cancellation generally occurs on demand. But where I did have occasion to set a grace period, I wanted "not soft shielded" to be the default. |
There's been some discussion around these topics happening in the structured concurrency sub-forum: https://trio.discourse.group/t/graceful-shutdown/93 |
@oremanj I think I somewhat understand what you're saying here, but I'm still struggling to convince myself that a generic global thing like an ambient soft-cancelled state can handle all these heterogenous cases in an appropriate way.
Oh dear. For the well-behaved case, it seems like you can handle this pretty easily using local knowledge?

```python
with move_on_after(REQUEST_TIMEOUT) as cancel_scope:
    await do_request(...)
if cancel_scope.cancelled_caught:
    with move_on_after(LOGOUT_TIMEOUT):
        await do_logout(...)
```

The nastier case is if we get an outside cancellation, like from a sibling task crashing. But I guess in your synchronous version of this code, you also have some behavior that happens if there's an unexpected exception (e.g. …
I'm guessing that when you shut down the server, then sometimes you'll want to kick everyone off, and other times you'll want to keep the current allocations in place (e.g. an in-place upgrade with some kind of hand-off between the old and new versions). And it would be pretty awkward if you were trying to do an in-place upgrade, but the soft-cancel accidentally kicked everyone off, so you'll need some kind of out-of-band signalling mechanism here anyway?

And for the policy decisions like the current WhizBang3 user getting pre-empted, or taking down WhizBang3 for maintenance, then it seems like the policy engine making these decisions has to have an intimate relationship with the code that sends the "please remove your frobnicator" messages. Can't they use one of Trio's many fine communication channels?

To me the key argument for having soft-cancellation built in is that it would allow coordination between cancellers and cancellees who don't otherwise know about each other (e.g., signal handlers and accept loops).

I'm also trying to imagine: eventually we will write some docs, like "So you want to add graceful shutdown support to your server!" With the "type 1" design, it's something like:
If we make soft-cancel an opt-out thing, then I fear this documentation will become much more complicated... |
I was thinking about your Type 1 and Type 2 cancellation types, and I think they can both be represented by the prototype I made in #941. The "softness" or "gracefulness" is inherited from parent scopes as long as intervening scopes don't explicitly set it. For Type 1, you would do something like:

```python
async def accept_loop():
    while True:
        with trio.CancelScope(graceful=True) as cancel_scope:
            # simulate accept
            await trio.sleep(2)
        if cancel_scope.cancelled_caught:
            # do cleanup behavior for accept cancellation
            # this will immediately exit for hard cancel
            # e.g. send close to a network peer
            await trio.sleep(1)
            break
        # simulate handling request
        await trio.sleep(5)
    # general cleanup could go here
```

For Type 2:

```python
async def soft_cancel_after(timeout, cancel_scope):
    logging.info('calling soft cancel in %ss', timeout)
    with trio.move_on_after(timeout):
        await trio.sleep_forever()
    logging.info('soft cancel called')
    cancel_scope.graceful_cancel()

@asynccontextmanager
@async_generator
async def move_on_after_with_grace(timeout, grace_period):
    # make child scopes graceful
    logging.info('calling hard cancel in %ss', timeout + grace_period)
    with trio.CancelScope(graceful=True,
                          deadline=trio.current_time() + timeout + grace_period
                          ) as cancel_scope:
        async with trio.open_nursery() as nursery:
            nursery.start_soon(soft_cancel_after, timeout, cancel_scope)
            try:
                await yield_(cancel_scope)
            finally:
                cancel_scope.cancel()

async def main():
    timeout = 10
    grace_period = 5
    try:
        async with move_on_after_with_grace(timeout, grace_period) as cancel_scope:
            logging.info('about to call do_send()')
            await do_send()
            logging.info('done calling do_send()')
    finally:
        if cancel_scope.cancelled_caught:
            logging.info('do_send cancelled')

async def do_send():
    while True:
        # do some normal stuff you want to cancel early here
        logging.info('do_send normal 2')
        await trio.sleep(2)
        with trio.CancelScope(graceful=False):  # protect from early cancel
            # simulate send all
            logging.info('do_send send_all 5')
            await trio.sleep(5)

trio.run(main)
```

There's probably something that could be done around … Or maybe my examples are too simplistic to expose the issues with this design.
I feel like you're starting to describe supervision trees from Elixir/Erlang and all the ways they can decide what processes to restart when one dies. You might be able to model OTP and supervision trees, but I'm not sure how much of the actor concurrency model you'd need to replicate to maintain the same invariants. It's certainly powerful though. If you could make it as easy to use in the simple case as making coroutines and awaiting them, that would be a big win in my mind. |
Erlang supervisors seem pretty different to me, because their policies are totally local. In Trio terms, they're like assigning restart policies to specific nurseries. That's an interesting thing, and one I'm hoping we'll see trio libraries experimenting with, but it's a different thing from cancellation, which is all about coordination between far-flung code that may not know about each other at all. |
I guess I was thinking about the in-place upgrade vs hand-off behavior and how different tasks could be slowly recycled out to be the new task implementations. But you're right, the cancellation behavior doesn't have a good analog.

I was trying to think of ways you could abuse … If you want to run an … If there's code you don't control that is running the … In Erlang, if the supervision tree was set up in an amenable way, you could just send a message to the …

If there was a library like Blinker for the trio ecosystem, library authors could at least expose signals to callers without having to pass control objects from outside, but I think every case will be slightly different and special.

To me, graceful cancellation is only meaningful if you will cancel the whole scope in the near future. Anything else isn't really cancellation, it's signaling. And while trio or a library could potentially improve signaling between tasks, it isn't the topic we've been discussing. |
On 20.02.19 23:51, Nick Malaguti wrote:

> if the library hasn't given you a way to send a message inside or let you supply the nursery the workers will spawn in, you don't have many options.

You have the option to supply a patch (which should be accepted gladly, since it's a standard pattern), monkey-patch the library in question, or fork the code.

-- Matthias Urlichs
|
On 20.02.19 23:51, Nick Malaguti wrote:

> If there was a library like Blinker <https://pythonhosted.org/blinker/> for the trio ecosystem

Easy enough to write. Use the existing Blinker library to push messages to recipient queues ^W channels; start tasks with an async-for loop to read, maybe-dispatch, and process them.

> To me, graceful cancellation is only meaningful if you will cancel the whole scope in the near future. Anything else isn't really cancellation, it's signaling. And while trio or a library could potentially improve signaling between tasks, it isn't the topic we've been discussing.

A signalling library moves the responsibility to act on the signal from the sender to the receiver, which is helpful because the latter best knows how to [structure its cancel scopes and nurseries to] gracefully handle the signal. With that in place, the soft-cancellation stuff we are now talking about should be sufficient.

-- Matthias Urlichs
|
Even simpler:

```python
import trio
from blinker import Signal, signal
from typing import Union

accept_loop_signal = Signal()

class NamedCancelScope(trio.CancelScope):
    __slots__ = ['_was_signalled']

    def __init__(self, *, name: Union[Signal, str], **kwargs):
        super().__init__(**kwargs)
        if not isinstance(name, Signal):
            name = signal(name)
        self._was_signalled = False
        name.connect(self._signal_cancel)

    @property
    def was_signalled(self):
        return self._was_signalled

    def _signal_cancel(self, _sender):
        self._was_signalled = True
        self.cancel()

def cancel_named(name: Union[Signal, str]):
    if not isinstance(name, Signal):
        name = signal(name)
    name.send()

# Example usage:

async def accept_loop():
    with NamedCancelScope(name=accept_loop_signal) as cancel_scope:
        await trio.sleep_forever()
    if cancel_scope.was_signalled:
        print('signalled')
    if cancel_scope.cancelled_caught:
        print('cancelled')

async def accept_loop2():
    with NamedCancelScope(name=accept_loop_signal) as cancel_scope:
        await trio.sleep_forever()
    if cancel_scope.was_signalled:
        print('signalled2')
    if cancel_scope.cancelled_caught:
        print('cancelled2')

async def main():
    async with trio.open_nursery() as nursery:
        nursery.start_soon(accept_loop)
        nursery.start_soon(accept_loop2)
        await trio.sleep(3)
        cancel_named(accept_loop_signal)

if __name__ == '__main__':
    trio.run(main)
```

All scopes that share the same name will be cancelled. This doesn't allow you to only cancel the named scopes below your current scope, but I'm not sure if that is an actual use case. There are also things you could try to do around setting deadlines in named scopes (set them all so they each get the same time to finish their work), but again I think a specific use case would better inform the design than trying to support everything. |
There was some good discussion of graceful shutdown issues in gitter today, starting at https://gitter.im/python-trio/general?at=5e1a59a3a859c14fa1ccfd72. |
Maybe it's not helpful to reply to a 5-year-old comment, but to answer this question:

The answer is certainly yes. I can think of two options:

```python
def handle_keyboard_interrupt(nursery: trio.Nursery, signum, frame):
    nursery.cancel_scope.cancel()
```

You would set it up by passing
It seems to me all the facilities for ergonomic shutdown are already present in Trio (but maybe they were added after most of this discussion happened, or maybe my applications are not as complex as others').

A technique that could often work is to use two nurseries at the top level: one that represents the whole run of the program (including shutdown), and one that represents only normal operation (not including shutdown). When you want to shut down, cancel the inner nursery (and set a deadline on the outer one) and signal the remaining tasks in some other way. That way there is no need for shielding, because cancellation is not the mechanism used to tell tasks about graceful shutdown: if a task gets cancelled, it should always give up immediately, because the opportunity to shut down gracefully has already passed.

Here's an example of how it would look. It accepts incoming connections only while the inner nursery is running, but handlers for those connections run in the outer nursery, and a message is sent to each handler to notify it that it should shut down (I'm assuming a memory channel is open for each one anyway, and a list of them is kept up to date in some accessible place; that depends on the application).

```python
async with trio.open_nursery() as outer_nursery:
    async with trio.open_nursery() as inner_nursery:
        signal.signal(signal.SIGINT, partial(handle_keyboard_interrupt, inner_nursery))
        inner_nursery.start_soon(partial(trio.serve_tcp, ..., handler_nursery=outer_nursery))
    # Somehow notify your connection handlers that they should shut down, e.g.:
    for channel in handler_channels:  # Stashed away elsewhere
        channel.send_nowait(None)  # Or whatever to indicate shutdown
    outer_nursery.cancel_scope.deadline = trio.current_time() + 3  # Grace of 3 seconds
```

Personally, I find it easier to understand doing it at the application level like this rather than using a facility baked into Trio, even if one existed. |
[Edit: This issue started as a discussion of the API for cancel scope shielding, motivated by @merrellb using it to catch `Cancelled` errors and do clean-up activities like sending "goodbye" messages in #143, but then morphed into a more general discussion of how trio could help users implement "graceful shutdown" functionality in general, so let's make it the standard issue for that. Original text follows. – @njsmith]

The documentation suggests assigning a cancel scope and setting `shield` as two separate steps. However, `trio.open_cancel_scope` has a `shield` argument, and it seems like it would also make sense with the convenience functions (`trio.move_on_after`, `trio.move_on_at`, `trio.fail_after`, and `trio.fail_at`). I would be happy to make the pull-request if this change makes sense.