-
-
Notifications
You must be signed in to change notification settings - Fork 346
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rethink KeyboardInterrupt protection APIs #733
Comments
I've been looking at this some yesterday and today. I have a few possible approaches that I'd like feedback on. The evolutionary oneI'd like to see us move away from the magic locals key if we can. It adds some runtime overhead at most protection status changes, adds extra stack frames in tracebacks, makes The dynamic mechanism can just be a contextvar as long as the static mechanism protects all the contextvar machinery. We'll use this to provide "system tasks start out KI-protected", and provide context managers for toggling it back and forth. Then roundtripping through a thread (the only place in Trio where we explicitly disable KI protection) will automatically preserve KI status once it preserves other contextvars (#178). I've prototyped this with the static mechanism being a dictionary from code objects to protection state changes. The KI protection decorators just add the decorated function's code object to the dictionary. This works pretty well; it's a little more expensive at import time but effectively free when going from unprotected to protected or vice versa at runtime. It also lets us KI-protect code we don't control, such as the plumbing of The revolutionary oneThe only reason to ever not be KI-protected is so that you can interrupt an infinite loop that's not executing any checkpoints. If we can come up with a better solution for that, can we ditch the KI-protection state changing machinery entirely, and always be "KI protected" (ie KI delivered only at checkpoints). The current KI protection code does a great job at letting you make protection granular per function. But in my experience that's somewhat too granular for typical use. In practice all of my Trio apps use The easiest "better solution for infinite loops" that I can think of is an easy user-accessible "dump task tree" feature. If we think that having Ctrl+C not work in that case is too surprising, it's probably possible to have a SIGINT handler that spawns a watchdog thread using the principles discussed in #591, but the details seem likely to be super finicky. But it's possible to wind up with Ctrl+C not working on current Trio regardless (using shielding or an improperly written abort_func), for reasons that have nothing to do with KI protection or infinite loops, and it would be nice to have one recipe that works for all such cases. Maybe we distinguish between an initial and a subsequent Ctrl+C by default. A medium-aggressiveness compromiseWhat if we tracked KI protection per package? If the innermost frame is in Trio, a Trio dependency that we trust not to introduce infinite loops (outcome, async_generator, etc), or a library written against Trio that has opted in to KI protection, then defer the KI. If the innermost frame is in another non-stdlib package, deliver the KI immediately. If it's in the stdlib or in a Trio dependency that might be implicated in a non-Trio infinite loop (such as sortedcontainers), walk outward until you get to a package that fits one of the other cases, then do what it says. We could combine this with the existing locals-key-based per-function KI protection to allow granularity less than package-level where needed. But the package-level protection would catch almost all cases, so the locals key would see actual use only rarely, which minimizes most of the above-described drawbacks of using it. (The only use we'd need that I can think of is the one at the top of each task.) |
Hmm. So I guess there are three somewhat-but-not-entirely distinct topics here:
Starting with the last one first:
I hear you. Python's standard But... there's a kind of "when in Rome" principle for API designs. If we were designing our own language, maybe we'd do it differently, but if you're working within an existing platform, then a suboptimal-but-consistent design is often better overall than a design that's theoretically more optimal but requires a constant fight with the underlying platform and user expectations. (See also: trying to shove functional programming into an imperative language or vice-versa, using And in this case... whatever the theoretical problems with So it makes me very reluctant to give up the default user-friendliness of "it probably just works, no matter how broken your code is", in return for improving the handling of theoretical edge cases that 99% of users don't even understand and rarely encounter. What we can definitely do though, is make it easier for folks who care to opt-in to the more rigorous/unforgiving approach. In theory that's what
In particular, with (Note, this is also an issue with The big downside of a custom task, of course, is that writing it, making sure it gets cancelled correctly on shutdown, etc., is pretty onerous compared to setting a single flag in However! There's the independent suggestion that maybe when a
In this approach, maybe we should rename it to
These all seem like valuable things to do regardless of whether we get revolutionary or not :-). I think we're coming to the conclusion that if the user hits control-C, and doesn't have their own SIGINT handler registered, then the only possible correct outcome is that And note that this is true regardless of whether they have some kind of So I think it'd be totally legit to add watchdogs, task-dumping, etc., regardless of what else we decide here.
With a library author hat on, I do think this might be overkill. Of course there's some taste involved, but inasmuch as we have an official guideline, I think it should be something like: libraries should use If a OK, so in conclusion, I do think we should continue to distinguish between special protected code and regular unprotected code, rather than always treating the whole program as protected. So we still need some mechanism for doing that:
Ugh yes I would love to kill that code. I think you're right that contextvars and tracking code objects are probably a better system. And I like your point about propagating the context across threads automatically carrying the protection status with it ... though I think we'd need a One thing to watch out for with code objects is that they can potentially be shared by logically distinction functions: def f():
...
f1 = enable_ki_protection(f)
f2 = disable_ki_protection(f) Currently, I don't think we actually care about this, though. It would also be fine to have some hard-coded special cases, like: |
Thanks so much for the very detailed & thoughtful response!
OK, I'm sufficiently convinced that letting KI be raised everywhere in user code is a reasonable default. :-)
Yep, this is still my goal too! I didn't want to raise too many issues at once, but "KI at a delicate moment should cancel the whole run and then reraise KI just as we're exiting" is my next plan on the KI thread after we figure out the API design questions.
I've been thinking of this in the context of two different ways I imagine people will want to think about the question of "can KI be raised here?", which might want to use two different mechanisms.
It's easy to capture and restore a contextvar; it's harder (more expensive, at the very least) to capture and restore the state implied by the code objects of the frames that are currently on the stack. I'd propose that if we have the "local safeguard" mechanism at all, we just make it not get inherited by child tasks nor passed through threads. That way we don't need
This is not currently true, because the enable_ki_protection state is not inherited by child tasks (since child tasks don't hook up to their parent's stack at the Python interpreter level, although boy would it be nice if they did). I think it's a desirable end state though.
You're right, and on further examination I think I don't actually wind up protecting the whole library in practice. I would do so if it were easy though, just so there's one fewer thing to think about. (I think it's more likely that I'll write something KI-unsafe than something that infinite-loops without executing any checkpoints.)
Yep, that would be a semantic difference: under the new system you're not allowed to do things like that. (You can fall back to the contextvar-level approach if you need to.) Anyway, synthesizing all of this into a concrete API proposal...
|
By "override all other considerations", I guess you mean: if the stack walk hits a frame with the magic locals key, then it exits then with that as its answer?
Setting the contextvar at the top of a task works well, but modifying it within the task is pretty tricky. E.g., suppose you have: async def top():
await middle()
@enable_ki_protection
async def middle():
await bottom()
async def bottom():
with take_ki_immediately():
... # <-- signal handler runs here Here, the contextvar has been explicitly set to take the KI immediately... but we don't know which stack level that was set at. So here the stack walker will encounter the Of course, it does work for cases where there are no decorated functions on the stack, and for the one case we have in mind (
Quick thought: I'm not a big fan of try:
trio.run(...)
except KeyboardInterrupt:
print("Exiting due to control-C")
sys.exit(1) ...or whatever. Plus silently exiting is often not the best default behavior anyway :-). |
Yep. This allows
Sure, happy to remove it! SIGINT-for-clean-exit was the convention at my previous employer, and is also useful for running stuff from the command line, but I don't feel strongly about it. |
On further thought, I wonder if contextvars are maybe a red herring. One thing that's making me wonder is that the simplest semantics for "KI disabled" would be "you never get a So I guess the cases we actually care are:
Is there anything else we actually care about?
...for a bit I thought maybe we could handle it by making the signal stack-walking code smart enough to walk through nurseries, so nurseries could propagate the state without having to actually calculate it ahead of time. But that doesn't work. (Also I dunno how signal-safe the nursery bookkeeping data structures are.) I guess we could measure how expensive it is for nurseries to capture the whole stack when created (up to the task root, or the first sync function, which might be cheaper since you can read it directly off Hmm, also, I guess in the I thought I was on the trail of a nice simple solution but it's not quite there yet is it :-) |
Maybe we can exploit somehow that we don't really want a "enable KI delivery" option, only the option to disable KI delivery in particular spans of code? So it's a strict "AND": if the task vetoes KI, then KI doesn't happen. If the stack vetoes KI, then KI doesn't happen. That makes the anomalous case I pointed out (with the |
I think you're right. This is the only sensible way out of this mess IMHO. |
Yes, that was my intent, which I probably clouded by talking about the backward-compatibility behavior of the The reason we need the weird exception about I don't think inheriting the code-object-based state ( To be clear, I think we should document the contextvar mutator as the primary interface, and the code object-based decorator as an optimization that is applicable under specific circumstances and limitations only.
Oh, that's a nice trick! I don't know if it's necessary though; even if we don't want to store the KI state in a contextvar, we could have a "you're in a task, bool yes/no" contextvar, and if it's true then
I don't think I agree. If you call a function which reenables KI, then that function can get a KeyboardInterrupt, which might escape from your call to it and which you thus might need to handle. I think "starting a task which reenables KI can cause you to need to handle KI yourself" is morally equivalent. And at that point it doesn't really matter whether the reenablement is at top level in the new task, or around the Separately, it would be nice to have an extensible facility for "per-task attribute that is inherited via the nursery, not the spawner". Like a ScopedContextVar or something. Let the nursery do copy_context() (cheap by design!) and say ScopedContextVar wraps ContextVar with some logic to look itself up in the current task's parent nursery's context. (Honestly I sometimes wish all contextvars worked this way, but that ship has long since sailed.) |
Yeah, I guess there are a bunch of ways to do this. Using contextvar would mean we have to jump through some special hoops to make sure the I feel like all these options are a little bit awkward but basically workable, so ¯\(ツ)/¯
Yeah, that seems fairly reasonable.
Okay, but this is the idea of the "strict AND" – if we simply don't offer any API for enabling KI, then this case is inexpressible :-). Of course the one flaw in this elegant design is that it doesn't work, because
Sounds like something that should maybe be split off into another issue? |
Another issue with remembering the root frame of each task is that it breaks when you insert your own gizmo into Task.coro that wraps the original one. I know that's Evil(TM) but it's also an incredibly useful tool when writing monitoring and debugging tools, so I don't want to break it gratuitously. Scanning all tasks from the signal handler doesn't have that problem, but I think protecting the contextvars backport will be less hassle than that.
Oh, I see what you were going for now! The "special version of capture/acapture" was the missing piece for me.
Yep, will do :-) I think I have enough to take another stab at implementing this now - thanks for your help! |
@oremanj maybe the first step should be to stop injecting |
This is a bit of a kluge, and will hopefully be cleaned up in the future when we overhaul KeyboardInterrupt (python-triogh-733) and/or the cancellation checking APIs (python-triogh-961). But 'checkpoint()' is a common operation, and this speeds it up by ~7x, so... not bad for a ~4 line change. Before this change: - `await checkpoint()`: ~24k/second - `await cancel_shielded_checkpoint()`: ~180k/second After this change: - `await checkpoint()`: ~170k/second - `await cancel_shielded_checkpoint()`: ~180k/second Benchmark script: ```python import time import trio LOOP_SIZE = 1_000_000 async def main(): start = time.monotonic() for _ in range(LOOP_SIZE): await trio.lowlevel.checkpoint() #await trio.lowlevel.cancel_shielded_checkpoint() end = time.monotonic() print(f"{LOOP_SIZE / (end - start):.2f} schedules/second") trio.run(main) ```
This is a bit of a kluge, and will hopefully be cleaned up in the future when we overhaul KeyboardInterrupt (python-triogh-733) and/or the cancellation checking APIs (python-triogh-961). But 'checkpoint()' is a common operation, and this speeds it up by ~7x, so... not bad for a ~4 line change. Before this change: - `await checkpoint()`: ~24k/second - `await cancel_shielded_checkpoint()`: ~180k/second After this change: - `await checkpoint()`: ~170k/second - `await cancel_shielded_checkpoint()`: ~180k/second Benchmark script: ```python import time import trio LOOP_SIZE = 1_000_000 async def main(): start = time.monotonic() for _ in range(LOOP_SIZE): await trio.lowlevel.checkpoint() #await trio.lowlevel.cancel_shielded_checkpoint() end = time.monotonic() print(f"{LOOP_SIZE / (end - start):.2f} schedules/second") trio.run(main) ```
Sometime before 1.0, we should go through the KeyboardInterrupt protection APIs and think them through again.
There seem to be a limited number of patterns: decorating functions/methods that should always have KI protection enabled; setting the default protection in a new task (can be overridden if the function has a preference); temporarily disabling and then reverting to what it was before (especially, reverting inside a call to
outcome.capture
oracapture
). Some of these are kind of awkward with the current API, or impossible (in some places I think we do disable/enable, rather than disable/revert-to-previous-state).It would be nice to fix #550. This might be doable on 3.7+ by exploiting the atomicity of
Context.run
?We'll also want to look again at the issues around atomicity when exiting context managers, I guess.
Should tasks inherit their default KI protection state from the parent nursery?
The text was updated successfully, but these errors were encountered: