-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
On error handler in pipeline #47
base: main
Are you sure you want to change the base?
Conversation
this on_error handler is executed if any other step in the pipeline failed; and can be e.g. used to trigger a notification on error.
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #47 +/- ##
==========================================
- Coverage 71.07% 70.42% -0.65%
==========================================
Files 20 20
Lines 2199 2330 +131
==========================================
+ Hits 1563 1641 +78
- Misses 517 562 +45
- Partials 119 127 +8 ☔ View full report in Codecov by Sentry. |
We need exactly this feature too, how can we help finishing this up and having a release? |
I'm having a look at it right now 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, I think we really need this feature. So thanks for taking the initiative here.
I think we need to change the implementation a bit though to fit better into the async handling of prunner so we do not cause issues by the on error task execution (see my comment with the locked mutex).
// we use a detached taskRunner and scheduler to run the onError task, to | ||
// run synchronously (as we are already in an async goroutine here), won't have any cycles, | ||
// and to simplify the code. | ||
taskRunner := r.createTaskRunner(j) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we might have an issue cancelation here, since the scheduler / task runner is detached. So this could block the whole thing if the on error task is not finishing.
Locking the pipeline runner mutex (which is for all of prunner) for the whole on error task execution is not good, because we basically block the whole prunner process. It is only okay to write lock the mutex for data structure updates.
I think we need to base this on the *PipelineJob
, where we capture the state/context for each running pipeline job. Maybe it's also enough to run the on error scheduler in a go routine and use the WaitGroup
of *PipelineRunner
to put it into the "normal" waiting behavior.
What about putting the on error task execution in startJob
:
// Run graph asynchronously
r.wg.Add(1)
go func() {
defer r.wg.Done()
lastErr := job.sched.Schedule(graph)
if lastErr != nil {
// TODO Schedule the on error task (sync)
}
r.JobCompleted(job.ID, lastErr)
}()
(we need to implement some kind of first task error state here though, since the last error is not really helpful)
Ideally we would put this behaviour in the taskctl.Scheduler
itself, but it's more generic without real knowledge of output store etc. . I'm thinking of some kind of ghost task that only appears and runs on first error in the pipeline but is already defined. The issue here would be the variables for the failed task stdout/stderr that needs to be put into the variables 🤔.
Thx @skurfuerst and @hlubek for your work on this. To fully explain our use case: We have a long list of exit codes for various failures in the pipeline including a map to translate the error codes for devs and editors in the UI. So the onError task should be able to catch that code and in our case write it to the database for persistence, so we can collect the error reasons for the whole pipeline history and it should also do some cleanup task. |
Any update on this? (Sorry for nagging) |
@skurfuerst Did you find some time to look at my notes about the change? I think we could basically go with the implementation and improve it later, but I fear that we pull in a more complexity and concepts that might be hard to change later. |
@hlubek thanks for your comments :) I sadly did not find time to fix your comments yet; but I can try ro work on this in about two weeks :) In case you can prototype your thoughts, feel free to do so :) All the best, Sebastian |
Ping :D |
Pong ;) |
this on_error handler is executed if any other step in
the pipeline failed; and can be e.g. used to trigger a
notification on error.
The code can already be reviewed; I'll only add a README example.