KEP-74: support argo workflow #2976

Open
wants to merge 13 commits into
base: main
from

Conversation

KunWuLuan
Member

@KunWuLuan KunWuLuan commented Sep 4, 2024

What type of PR is this?

/kind documentation
/kind feature

What this PR does / why we need it:

#74

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/documentation Categorizes issue or PR as related to documentation. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Sep 4, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: KunWuLuan
Once this PR has been reviewed and has the lgtm label, please assign alculquicondor for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 4, 2024
@k8s-ci-robot
Contributor

Hi @KunWuLuan. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 4, 2024

netlify bot commented Sep 4, 2024

Deploy Preview for kubernetes-sigs-kueue ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | fc0837d |
| 🔍 Latest deploy log | https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/670f1bb04733780008909230 |
| 😎 Deploy Preview | https://deploy-preview-2976--kubernetes-sigs-kueue.netlify.app |

@kannon92
Contributor

kannon92 commented Sep 5, 2024

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 5, 2024
@KunWuLuan KunWuLuan changed the title WIP: kep-74: support argo workflow KEP-74: support argo workflow Sep 10, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 10, 2024
@tenzen-y
Member

It seems that some sections are still empty. Once you fill in all sections, we will review this KEP.

@KunWuLuan
Member Author

@tenzen-y Hi, I filled in the empty sections. Some implementation details will be added after we discuss and choose an approach.

@tenzen-y
Member

@tenzen-y Hi, I filled in the empty sections. Some implementation details will be added after we discuss and choose an approach.

Thanks for the updates.
@KunWuLuan Could you resolve CI errors with "make toc-update"?

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 24, 2024
@KunWuLuan KunWuLuan changed the title [WIP] KEP-74: support argo workflow KEP-74: support argo workflow Sep 24, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 24, 2024

@andreyvelich andreyvelich left a comment

Thank you for driving this @KunWuLuan!
I left my initial questions.

```go
// stage should be implemented as GenericJob
stages := wf.GetActiveStages()
for _, stage := range stages {
	// 1. make sure there is only a single existing instance of the workload.
```

What do you mean by single existing instance of workload here ?

Member Author

For a specific stage, which may correspond to a stageGroup in the workflow's status, we create a kueue.workload to track its resource requests.
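
To make the "single existing instance" idea concrete, here is a minimal sketch of one workload per stage. It assumes simplified stand-in types (`Stage`, `Workload`) and an in-memory store in place of the API server; none of these names come from Kueue or Argo, and the naming scheme is only an illustration.

```go
package main

import "fmt"

// Stage is a hypothetical stand-in for a group of pods that a workflow
// creates together; it is not a type from Argo or Kueue.
type Stage struct {
	Workflow    string
	Name        string
	PodCPUMilli []int64 // CPU request of each pod, in millicores
}

// Workload is a simplified stand-in for a Kueue Workload that only
// tracks the total CPU requested by one stage.
type Workload struct {
	Name          string
	TotalCPUMilli int64
}

// workloadName derives a stable name, so the same stage always maps to
// the same workload (assumed naming scheme, for illustration only).
func workloadName(s Stage) string {
	return s.Workflow + "-" + s.Name
}

// ensureWorkload returns the workload for the stage, creating it only if
// none exists yet. The boolean reports whether a new one was created.
func ensureWorkload(store map[string]*Workload, s Stage) (*Workload, bool) {
	name := workloadName(s)
	if wl, ok := store[name]; ok {
		return wl, false
	}
	wl := &Workload{Name: name}
	for _, cpu := range s.PodCPUMilli {
		wl.TotalCPUMilli += cpu
	}
	store[name] = wl
	return wl, true
}

func main() {
	store := map[string]*Workload{}
	stage := Stage{Workflow: "wf", Name: "stage-b", PodCPUMilli: []int64{500, 1500}}

	wl, created := ensureWorkload(store, stage)
	fmt.Println(wl.Name, wl.TotalCPUMilli, created)

	// A second reconcile of the same stage reuses the existing workload.
	_, createdAgain := ensureWorkload(store, stage)
	fmt.Println(createdAgain, len(store))
}
```

Deriving the name deterministically from the workflow and stage is what keeps repeated reconciles from creating duplicates in this sketch.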


#### How to suspend by stages

In my daily work, I primarily encounter workflow managers in the form of Argo and Tekton. Consequently, this section will focus on these two. Other workflow management systems can draw upon the discussions in this section for implementation guidance.

Do we have any significant differences between Argo/Tekton and Airflow when we execute workflows in Kubernetes?
cc @shravan-achar @akshaychitneni @bigsur0

Member Author

It seems that Tekton does not support a suspend action.

Contributor

To answer @andreyvelich's question:

Argo

Argo seems to have generic support for workloads via custom CRD specs. For example, one can run a JobSet as a node in a workflow DAG (Metaflow does this). The default integration for Argo is generally pods. In Argo, it seems that one can put labels/annotations on a pod template as part of the workflow definition.

Argo has the concept of workflows, and one can define reusable templates (via WorkflowTemplates), but creating a workflow usually means that it runs as soon as it is created.

Tekton

Tekton seems to only allow the creation of tasks, which in turn create pods. One can combine tasks into a pipeline, which is a reusable component. Tasks and Pipelines are created through the API and are essentially saved for reuse. A user creates either a TaskRun or a PipelineRun to execute the task or pipeline. A user can inject metadata by adding labels to the PipelineRun (or TaskRun); those labels/annotations are then applied to each task.

It is also possible to specify different labels/annotations on each task.

Airflow

Airflow is a workflow engine where one defines DAGs in a Python file that calls out to an operator to execute each task. AFAIK Airflow provides support for Job, Pod and SparkOperator (see https://airflow.apache.org/docs/apache-airflow-providers-cncf-kubernetes/stable/operators.html#). I think Airflow support would also work in Kueue via metadata labels.

```yaml
spec:
  entrypoint: loop-example-depth-1
  templates:
  - name: kueue-suspend
```

Does this mean that you are going to mutate the existing Workflow by adding a kueue-suspend step?
Why not just set suspend: true on every step?

Member Author

Does this mean that you are going to mutate the existing Workflow by adding a kueue-suspend step?

Yes

Why not just set suspend: true on every step?

Changing suspend to false would start all suspended steps. Adding a suspend template before every leaf template allows us to control the start of specific steps. For example, for a DAG like this:

A
|  \
B  C

I think step B and step C should wait for different workload admissions.
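
A rough sketch of this per-step gating, using a simplified `Task` type rather than the Argo Workflows API. The `kueue-suspend-` prefix, `addGates`, and `resume` are hypothetical names chosen for illustration: each task gets its own suspend gate, so admitting the workload for B releases only B's gate.

```go
package main

import "fmt"

// gatePrefix is an assumed naming convention for the inserted gates.
const gatePrefix = "kueue-suspend-"

// Task is a hypothetical, simplified DAG node. Suspend is true for a
// gate that is still waiting for its workload to be admitted.
type Task struct {
	Name    string
	Deps    []string
	Suspend bool
}

// addGates puts one suspend gate in front of every task. The gate takes
// over the task's dependencies, and the task depends only on its gate.
func addGates(tasks []Task) []Task {
	out := make([]Task, 0, 2*len(tasks))
	for _, t := range tasks {
		gate := Task{Name: gatePrefix + t.Name, Deps: t.Deps, Suspend: true}
		out = append(out, gate, Task{Name: t.Name, Deps: []string{gate.Name}})
	}
	return out
}

// resume releases the gate of a single task and leaves all other gates
// suspended. It reports whether a matching gate was found.
func resume(tasks []Task, name string) bool {
	for i := range tasks {
		if tasks[i].Name == gatePrefix+name {
			tasks[i].Suspend = false
			return true
		}
	}
	return false
}

func main() {
	dag := addGates([]Task{
		{Name: "A"},
		{Name: "B", Deps: []string{"A"}},
		{Name: "C", Deps: []string{"A"}},
	})

	// Workloads for A and B are admitted; C keeps waiting.
	resume(dag, "A")
	resume(dag, "B")

	for _, t := range dag {
		fmt.Printf("%s deps=%v suspended=%v\n", t.Name, t.Deps, t.Suspend)
	}
}
```

With a single suspend flag shared by all steps, B and C could not be released independently, which is the case the gates are meant to cover.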

```
        args: ["3m"]
```

Option 3. Kueue Webhook Enhancement: A new webhook is added within Kueue to intercept pod creations in the cluster.

How will the Argo Workflow controller reconcile steps if Pods are in SchedulingGated status?

Member Author

AFAIK, SchedulingGated status is almost the same as Pending status, so it should not have any effect on the Argo Workflow controller. Maybe @terrytangyuan can give us more information.
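
For illustration, a small sketch of how a webhook could gate a pod and how the gate could be lifted on admission. The `Pod` and `SchedulingGate` types are local stand-ins, not the `k8s.io/api` types, and the gate name is an assumption modeled on the one Kueue's plain Pod integration uses. The pod's phase is never touched: a gated pod stays Pending, which is why the workflow controller should see nothing unusual.

```go
package main

import "fmt"

// admissionGate is assumed to match the gate name used by Kueue's
// plain Pod integration; treat it as an illustrative value.
const admissionGate = "kueue.x-k8s.io/admission"

// SchedulingGate and Pod are local stand-ins for the Kubernetes types.
type SchedulingGate struct{ Name string }

type Pod struct {
	Name            string
	Phase           string
	SchedulingGates []SchedulingGate
}

// gatePod adds the admission gate if it is not already present, as a
// mutating webhook might do at pod creation time.
func gatePod(p *Pod) {
	for _, g := range p.SchedulingGates {
		if g.Name == admissionGate {
			return
		}
	}
	p.SchedulingGates = append(p.SchedulingGates, SchedulingGate{Name: admissionGate})
}

// ungatePod removes only the admission gate and keeps any other gates.
func ungatePod(p *Pod) {
	kept := make([]SchedulingGate, 0, len(p.SchedulingGates))
	for _, g := range p.SchedulingGates {
		if g.Name != admissionGate {
			kept = append(kept, g)
		}
	}
	p.SchedulingGates = kept
}

// schedulable reports whether the scheduler would consider the pod.
func schedulable(p *Pod) bool {
	return len(p.SchedulingGates) == 0
}

func main() {
	pod := &Pod{Name: "step-b", Phase: "Pending"}

	gatePod(pod)
	fmt.Println(pod.Phase, schedulable(pod))

	// Once the workload is admitted, the gate is removed.
	ungatePod(pod)
	fmt.Println(pod.Phase, schedulable(pod))
}
```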

```go
}

func (r *workflowReconciler) ReconcileGenericWorkflow(ctx context.Context, req ctrl.Request, wf GenericWorkflow) {
```
Member

@tenzen-y tenzen-y Sep 25, 2024

Why do we need this workflow-dedicated Reconciler?
We want to keep using ReconcileGenericJob. ComposableJob is a good example of how to handle a special type of Job in the generic Job reconciler.

Could you propose an approach that extends ReconcileGenericJob with a WorkflowJob interface?
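
One way to read this suggestion is as an optional interface that the generic reconciler detects with a type assertion. The sketch below uses heavily simplified, hypothetical interfaces (the real GenericJob interface in Kueue has many more methods, and `ActiveStages` and `workloadNames` are invented here), so it shows only the shape of the pattern, not a proposed API.

```go
package main

import "fmt"

// GenericJob is a pared-down stand-in for the job abstraction handled
// by the generic reconciler.
type GenericJob interface {
	Name() string
}

// WorkflowJob is a hypothetical optional interface for jobs that are
// admitted stage by stage instead of as a whole.
type WorkflowJob interface {
	GenericJob
	ActiveStages() []string
}

// workloadNames is the single generic code path: it yields one workload
// per active stage for a WorkflowJob and one workload for anything else.
func workloadNames(job GenericJob) []string {
	if wf, ok := job.(WorkflowJob); ok {
		stages := wf.ActiveStages()
		names := make([]string, 0, len(stages))
		for _, s := range stages {
			names = append(names, job.Name()+"-"+s)
		}
		return names
	}
	return []string{job.Name()}
}

// batchJob implements only GenericJob.
type batchJob struct{ name string }

func (j batchJob) Name() string { return j.name }

// argoWorkflow additionally implements WorkflowJob.
type argoWorkflow struct {
	name   string
	stages []string
}

func (w argoWorkflow) Name() string           { return w.name }
func (w argoWorkflow) ActiveStages() []string { return w.stages }

func main() {
	fmt.Println(workloadNames(batchJob{name: "train"}))
	fmt.Println(workloadNames(argoWorkflow{name: "wf", stages: []string{"b", "c"}}))
}
```

The benefit of this shape is that existing integrations are untouched: only types that opt in by implementing the extra method take the per-stage path.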

Member Author

I will think about this.

Member Author

Updated the design.

@KunWuLuan
Member Author

@terrytangyuan Hi, do you have any comments on these options? I think we need the feedback of workflow managers to determine what Kueue can do in its framework.

@terrytangyuan
Member

@terrytangyuan Hi, do you have any comments on these options? I think we need the feedback of workflow managers to determine what Kueue can do in its framework.

Thanks for the ping. I'll take a look when I get a chance. Will you be willing to drive and implement the changes needed on Argo Workflows side too?

@KunWuLuan
Member Author

@terrytangyuan Oh, yes. I am willing to do this.

@alculquicondor
Contributor

Sorry folks. With the k/k enhancements freeze, I had little time to read this. I'll try to make some time soon.

cc @mimowo


In this section, we will discuss the advantages and disadvantages of queuing Workflows, Stages, and Tasks individually. Additionally, we will explore how each of these can be implemented within Kueue.

### What are the Stages for Argo Workflows?

It's not clear to me what a stage is.
Is it an Argo concept that I'm missing? If so, can you add a link to its definition in the Argo documentation?
Otherwise, can you please elaborate a bit more?

Member Author

Stage is not a concept in Argo. I am using a new word to represent the process that creates a set of pods in a workflow. In Argo this may be a TaskGroup or a StepGroup.
You can see the section 'What are the Stages for Argo Workflows'. If there are still questions about the concept of a stage, I will try to explain more in that section.

@mimowo
Contributor

mimowo commented Oct 14, 2024

@KunWuLuan thank you for driving the KEP. I will try to get to it this week. cc @mwielgus

@mimowo
Contributor

mimowo commented Oct 17, 2024

Sorry folks, this is an important KEP, but I will not be able to review it before the 0.9 release is concluded.

@KunWuLuan
Member Author

Hello friends, does anyone have any comments?

@tenzen-y
Member

Hello friends, does anyone have any comments?

We expect to target this feature for the release after next (v0.11), since the v0.10 deadline is coming soon, within 2 weeks.

@k8s-ci-robot
Contributor

@KunWuLuan: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kueue-verify-main | fc0837d | link | true | /test pull-kueue-verify-main |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

```go
	BuildSegmentableJob(childRequest ctrl.Request) SegmentableJob
}
```

---
Contributor

I think you can drop this code block from the KEP.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/documentation Categorizes issue or PR as related to documentation. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.