KEP-74: support argo workflow #2976
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: KunWuLuan.
/ok-to-test
It seems that some sections are still empty. Once you have filled in all sections, we will review this KEP.
@tenzen-y Hi, I filled in the empty sections. Some implementation details will be added after we discuss the options and choose a direction.
Co-authored-by: Kevin Hannon <[email protected]>
Thanks for the updates.
Thank you for driving this @KunWuLuan!
I left my initial questions.
// stage should be implemented as GenericJob
stages := wf.GetActiveStages()
for _, stage := range stages {
	// 1. make sure there is only a single existing instance of the workload.
What do you mean by a single existing instance of the workload here?
For a specific stage, which may refer to a stageGroup in the workflow's status, we create a single kueue.workload to track its resource requests.
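The "single existing instance" idea in the reply above could be sketched roughly as follows. This is only an illustration under stated assumptions: `Workload` is a simplified stand-in rather than the real Kueue type, and `ensureOneWorkload` is a hypothetical helper name, not a function proposed in the KEP.

```go
package main

import "fmt"

// Workload is a simplified stand-in for a Kueue Workload (assumption: only
// the fields needed for this illustration are modeled).
type Workload struct {
	Name  string
	Owner string // name of the stage that owns this workload
}

// ensureOneWorkload (hypothetical helper) picks the first workload owned by
// the given stage as the one to keep and reports any further workloads owned
// by the same stage as duplicates to delete. keep is nil when the stage has
// no workload yet, in which case the caller would create one.
func ensureOneWorkload(stage string, existing []Workload) (keep *Workload, toDelete []Workload) {
	for i := range existing {
		if existing[i].Owner != stage {
			continue
		}
		if keep == nil {
			keep = &existing[i]
			continue
		}
		toDelete = append(toDelete, existing[i])
	}
	return keep, toDelete
}

func main() {
	existing := []Workload{
		{Name: "wl-1", Owner: "stage-a"},
		{Name: "wl-2", Owner: "stage-b"},
		{Name: "wl-3", Owner: "stage-a"},
	}
	keep, dup := ensureOneWorkload("stage-a", existing)
	fmt.Println("keep:", keep.Name, "duplicates:", len(dup))
}
```

With this shape, each active stage is tracked by at most one workload, and stale duplicates are surfaced for cleanup instead of being admitted twice.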

#### How to suspend by stages

In my daily work, I primarily encounter workflow managers in the form of Argo and Tekton. Consequently, this section will focus on these two. Other workflow management systems can draw upon the discussions in this section for implementation guidance.
Do we have any significant differences between Argo/Tekton and Airflow when we execute workflows in Kubernetes?
cc @shravan-achar @akshaychitneni @bigsur0
It seems that Tekton does not support a suspend action.
To answer @andreyvelich's question:
Argo
Argo seems to have generic support for workloads via custom CRD specs. For example, one can support a JobSet as a node in a workflow DAG (Metaflow does this). The default integration for Argo is generally pods. In Argo, it seems that one can put labels/annotations on a pod template as part of the workflow definition.
Argo has the concept of workflows, and one can define reusable templates (via WorkflowTemplates), but creating a workflow usually means that the workflow runs as soon as it is created.
Tekton
Tekton seems to only allow the creation of tasks, which in turn create pods. One can combine tasks into a pipeline. A pipeline is a reusable component. Tasks and Pipelines are created through the API and are essentially saved for reuse. A user creates either a TaskRun or a PipelineRun to execute the task or pipeline. A user can inject metadata by adding labels to the PipelineRun (or TaskRun); those labels/annotations are then injected on each task.
It is also possible to specify different labels/annotations on each task.
Airflow
Airflow is a workflow engine where one defines DAGs in a Python file that calls out to an operator to execute each task. AFAIK Airflow provides support for Job, Pod and SparkOperator (see https://airflow.apache.org/docs/apache-airflow-providers-cncf-kubernetes/stable/operators.html#). I think Airflow support would also work in Kueue via metadata labels.
spec:
  entrypoint: loop-example-depth-1
  templates:
  - name: kueue-suspend
Does it mean that you are going to mutate the existing Workflow with a kueue-suspend step?
Why not just set the suspend: true parameter on every step?
> Does it mean that you are going to mutate the existing Workflow with a kueue-suspend step?
Yes.
> Why not just set the suspend: true parameter on every step?
Changing suspend to false will start all suspended steps. Adding a suspend template before every leaf template allows us to control the start of specific steps. For example, for a DAG like this:
A
| \
B C
I think step B and step C should wait for different workload admissions.
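A rough sketch of that idea, under stated assumptions: `Task` is a toy DAG node rather than the Argo Workflows API type, and `InsertSuspendGates` and the `kueue-suspend-` name prefix are hypothetical. Each task gets its own suspend gate that inherits the task's original dependencies, so B and C can be released independently.

```go
package main

import "fmt"

// Task is a hypothetical, simplified DAG node; it is not the Argo Workflows
// API type.
type Task struct {
	Name         string
	Dependencies []string
	Suspend      bool // true for an inserted suspend gate
}

// gateName returns the (assumed) name of the gate guarding a task.
func gateName(task string) string { return "kueue-suspend-" + task }

// InsertSuspendGates returns a new task list in which every original task
// depends only on its own suspend gate, and the gate takes over the task's
// original dependencies. Resuming one gate starts exactly one task.
func InsertSuspendGates(tasks []Task) []Task {
	out := make([]Task, 0, 2*len(tasks))
	for _, t := range tasks {
		gate := Task{
			Name:         gateName(t.Name),
			Dependencies: append([]string(nil), t.Dependencies...),
			Suspend:      true,
		}
		gated := Task{Name: t.Name, Dependencies: []string{gate.Name}}
		out = append(out, gate, gated)
	}
	return out
}

func main() {
	dag := []Task{
		{Name: "A"},
		{Name: "B", Dependencies: []string{"A"}},
		{Name: "C", Dependencies: []string{"A"}},
	}
	for _, t := range InsertSuspendGates(dag) {
		fmt.Printf("%s deps=%v suspend=%v\n", t.Name, t.Dependencies, t.Suspend)
	}
}
```

In this sketch, B waits on `kueue-suspend-B` and C waits on `kueue-suspend-C`, so admitting the workload for B does not start C.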
args: ["3m"]
```

Option 3. Kueue Webhook Enhancement: A new webhook is added within Kueue to intercept pod creations in the cluster.
How will the Argo Workflow controller reconcile steps if Pods are in the SchedulingGated status?
AFAIK, the SchedulingGated status is almost the same as the Pending status, so it should not have any effect on the Argo Workflow controller. Maybe @terrytangyuan can give us some more information.
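To illustrate the point above: a pod that still carries scheduling gates stays in the Pending phase, so a controller that inspects only the phase cannot tell a gated pod from any other pending pod. The sketch below uses a toy `Pod` struct instead of the Kubernetes API types, and the gate name is a made-up placeholder; whether Argo's controller looks only at the phase is an assumption that would need confirmation.

```go
package main

import "fmt"

// hypotheticalGate is a placeholder gate name used only for this sketch.
const hypotheticalGate = "example.com/kueue-admission"

// Pod is a simplified stand-in for the Kubernetes Pod type; only the fields
// needed for this illustration are modeled.
type Pod struct {
	SchedulingGates []string // mirrors spec.schedulingGates[].name
	Phase           string   // mirrors status.phase
}

// isGated reports whether the pod still carries at least one scheduling gate.
func isGated(p Pod) bool { return len(p.SchedulingGates) > 0 }

// removeGate returns a copy of the pod without the named gate, which is what
// an admission step would do to let the scheduler consider the pod.
func removeGate(p Pod, gate string) Pod {
	kept := make([]string, 0, len(p.SchedulingGates))
	for _, g := range p.SchedulingGates {
		if g != gate {
			kept = append(kept, g)
		}
	}
	p.SchedulingGates = kept
	return p
}

func main() {
	gated := Pod{SchedulingGates: []string{hypotheticalGate}, Phase: "Pending"}
	released := removeGate(gated, hypotheticalGate)
	// Both pods report the same phase; only the gate list differs.
	fmt.Println(gated.Phase, isGated(gated))
	fmt.Println(released.Phase, isGated(released))
}
```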
Co-authored-by: Andrey Velichkevich <[email protected]>
}

func (r *workflowReconciler) ReconcileGenericWorkflow(ctx context.Context, req ctrl.Request, wf GenericWorkflow) {
Why do we need this workflow-dedicated Reconciler?
We want to keep using ReconcileGenericJob. ComposableJob is a good example of how to handle a specially typed Job in the generic Job reconciler.
Could you propose an approach that extends ReconcileGenericJob with a WorkflowJob interface?
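One possible reading of this suggestion, as a sketch only: the generic reconcile path checks for an optional interface through a type assertion and branches when the job implements it. `GenericJob` here is a reduced stand-in for the Kueue interface, and `WorkflowJob`, `ActiveStages`, and `unitsToQueue` are hypothetical names, not part of Kueue.

```go
package main

import "fmt"

// GenericJob is a reduced stand-in for the generic job abstraction; the real
// Kueue interface has more methods.
type GenericJob interface {
	Name() string
}

// WorkflowJob is a hypothetical optional interface that a workflow
// integration could implement on top of GenericJob.
type WorkflowJob interface {
	GenericJob
	ActiveStages() []GenericJob
}

// unitsToQueue returns the names of the units that would each get their own
// workload. Plain jobs yield themselves; jobs that also implement
// WorkflowJob yield their active stages instead.
func unitsToQueue(job GenericJob) []string {
	if wf, ok := job.(WorkflowJob); ok {
		names := make([]string, 0, len(wf.ActiveStages()))
		for _, stage := range wf.ActiveStages() {
			names = append(names, stage.Name())
		}
		return names
	}
	return []string{job.Name()}
}

type simpleJob struct{ name string }

func (j simpleJob) Name() string { return j.name }

type workflow struct {
	simpleJob
	stages []GenericJob
}

func (w workflow) ActiveStages() []GenericJob { return w.stages }

func main() {
	wf := workflow{
		simpleJob: simpleJob{name: "wf"},
		stages:    []GenericJob{simpleJob{name: "wf-stage-1"}, simpleJob{name: "wf-stage-2"}},
	}
	fmt.Println(unitsToQueue(simpleJob{name: "batch-job"}))
	fmt.Println(unitsToQueue(wf))
}
```

The design choice here is that existing integrations remain untouched, because the extra behavior is only reached by types that opt in to the additional interface.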
I will think about this.
Updated the design.
@terrytangyuan Hi, do you have any comments on these options? I think we need feedback from the workflow managers to determine what Kueue can do in its framework.
Thanks for the ping. I'll take a look when I get a chance. Would you be willing to drive and implement the changes needed on the Argo Workflows side too?
@terrytangyuan Oh, yes. I am willing to do this.
Sorry folks. With the k/k enhancements freeze, I had little time to read this. I'll try to make some time soon. cc @mimowo

In this section, we will discuss the advantages and disadvantages of queuing Workflows, Stages, and Tasks individually. Additionally, we will explore how each of these can be implemented within Kueue.

### What are the Stages for Argo Workflows?
It's not clear to me what a stage is.
Is it an Argo concept that I'm missing? If so, can you add a link to its definition in the Argo documentation?
Otherwise, can you please elaborate a bit more?
Stage is not a concept in Argo. I am using a new word to represent a process in a workflow that creates a set of pods. In Argo this may be a TaskGroup or a StepGroup.
You can see the section 'What are the Stages for Argo Workflows'. If there are still questions about the concept of a stage, I will try to explain more in that section.
@KunWuLuan thank you for driving the KEP. I will try to get to it this week. cc @mwielgus
Sorry folks, this is an important KEP, but I will not be able to review it before the 0.9 release is concluded.
Hello friends, does anyone have any comments?
We plan to target this feature at v0.11, the release after the upcoming one, since the v0.10 deadline is coming soon, within 2 weeks.
	BuildSegmentableJob(childRequest ctrl.Request) SegmentableJob
}

---
I think you can drop this code block from the KEP.
What type of PR is this?
/kind documentation
/kind feature
What this PR does / why we need it:
#74
Which issue(s) this PR fixes:
Special notes for your reviewer:
Does this PR introduce a user-facing change?