[Feature] Kuberay MultiKueue adapter #3892
base: main
Conversation
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: mszadkow
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/ok-to-test
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
	chmod -R u+w "$(EXTERNAL_CRDS_DIR)/ray-operator" && \
	rm -rf "$(EXTERNAL_CRDS_DIR)/ray-operator"; \
	fi
	mkdir -p "${EXTERNAL_CRDS_DIR}/ray-operator"; \
Yes, I need all of those folders to be able to properly install it.
@@ -70,6 +70,7 @@ IMAGE_TAG ?= $(IMAGE_REPO):$(GIT_TAG)
JOBSET_VERSION = $(shell $(GO_CMD) list -m -f "{{.Version}}" sigs.k8s.io/jobset)
KUBEFLOW_VERSION = $(shell $(GO_CMD) list -m -f "{{.Version}}" github.com/kubeflow/training-operator)
KUBEFLOW_MPI_VERSION = $(shell $(GO_CMD) list -m -f "{{.Version}}" github.com/kubeflow/mpi-operator)
KUBERAY_VERSION = $(shell $(GO_CMD) list -m -f "{{.Version}}" github.com/ray-project/kuberay/ray-operator)
So far it's the version without managedBy.
We don't have a ray-operator image that supports managedBy; that would have to be a custom one...
latest -> 1.2.2, which is the latest release.
@@ -14,4 +14,4 @@ kind: Kustomization
images:
- name: controller
  newName: us-central1-docker.pkg.dev/k8s-staging-images/kueue/kueue
-  newTag: main
+  newTag: v0.10.0-rc.3-53-g215d32ad-dirty
Oops... will clean it up.
export KUBERAY_MANIFEST="${ROOT_DIR}/dep-crds/ray-operator/default/"
export KUBERAY_IMAGE=bitnami/kuberay-operator:${KUBERAY_VERSION/#v}
export KUBERAY_RAY_IMAGE=rayproject/ray:2.9.0
export KUBERAY_RAY_IMAGE_ARM=rayproject/ray:2.9.0-aarch64
this one is for those of us working on macOS; it's vital for development, not so much for prod. I think it should stay.
@@ -136,6 +155,22 @@ function install_mpi {
  kubectl apply --server-side -f "${KUBEFLOW_MPI_MANIFEST}"
}

#$1 - cluster name
function install_kuberay {
  # Extra e2e images required for Kuberay,
that one speeds up test execution, but takes time here...
I am about to measure how much.
remoteJob.Labels[kueue.MultiKueueOriginLabel] = origin

// clearing managedBy enables the remote controller to take over
// remoteJob.Spec.ManagedBy = nil
This can be removed, no problem.
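To make the sync step discussed above concrete, here is a minimal, self-contained sketch of what the adapter does before creating the job on a worker cluster: copy the object, stamp the MultiKueue origin label, and clear spec.managedBy so the remote ray-operator takes over. The struct types and the label key string are hypothetical stand-ins for the real rayv1/kueue API types, kept only so the sketch compiles on its own.

```go
package main

import "fmt"

// RayJobSpec and RayJob are simplified stand-ins for rayv1.RayJob,
// reduced to the fields this sketch touches.
type RayJobSpec struct {
	Suspend   bool
	ManagedBy *string
}

type RayJob struct {
	Labels map[string]string
	Spec   RayJobSpec
}

// syncToRemote builds the copy that goes to the worker cluster:
// it carries over labels, marks the origin, and clears managedBy
// so the remote controller can act on the job.
func syncToRemote(local *RayJob, origin string) *RayJob {
	remote := &RayJob{
		Labels: map[string]string{},
		Spec:   local.Spec,
	}
	for k, v := range local.Labels {
		remote.Labels[k] = v
	}
	// Stand-in for kueue.MultiKueueOriginLabel.
	remote.Labels["kueue.x-k8s.io/multikueue-origin"] = origin
	// Clearing managedBy enables the remote controller to take over.
	remote.Spec.ManagedBy = nil
	return remote
}

func main() {
	mgr := "kueue.x-k8s.io/multikueue"
	local := &RayJob{
		Labels: map[string]string{"app": "demo"},
		Spec:   RayJobSpec{ManagedBy: &mgr},
	}
	remote := syncToRemote(local, "manager-cluster")
	fmt.Println(remote.Labels["kueue.x-k8s.io/multikueue-origin"], remote.Spec.ManagedBy == nil)
}
```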
}

func (b *multikueueAdapter) IsJobManagedByKueue(ctx context.Context, c client.Client, key types.NamespacedName) (bool, string, error) {
	// job := rayv1.RayCluster{}
This can be removed, no problem.
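For context on what IsJobManagedByKueue needs to decide: a job may only be dispatched to a remote cluster when its spec.managedBy names the MultiKueue controller. A minimal sketch of that check, with the controller-name constant and the reason-string format as assumptions rather than the exact values from the codebase:

```go
package main

import "fmt"

// Hypothetical stand-in for the MultiKueue controller name constant;
// the real value lives in the kueue API package.
const multiKueueController = "kueue.x-k8s.io/multikueue"

// isJobManagedByKueue mirrors the (bool, string, error) shape of the
// adapter method: it reports whether the job's spec.managedBy points at
// the MultiKueue controller, and if not, explains why in the string.
func isJobManagedByKueue(managedBy *string) (bool, string) {
	if managedBy == nil || *managedBy != multiKueueController {
		got := ""
		if managedBy != nil {
			got = *managedBy
		}
		return false, fmt.Sprintf("expecting spec.managedBy to be %q, got %q", multiKueueController, got)
	}
	return true, ""
}

func main() {
	ok, reason := isJobManagedByKueue(nil)
	fmt.Println(ok, reason)
}
```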
At the current stage I still need to add a RayCluster e2e test.
@mszadkow: The following test failed:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Force-pushed from d27f12c to ad4df0d.
}

// if the remote exists, just copy the status
if err == nil {
Please also wait for unsuspend. See #3730.
thx
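To illustrate the review point (#3730): the local status should only mirror the remote object once it has actually been unsuspended there, not merely once it exists. A sketch of that guard, using simplified stand-in fields for the rayv1 job status:

```go
package main

import "fmt"

// remoteJob is a hypothetical stand-in for the remote rayv1 object,
// reduced to the fields the guard inspects.
type remoteJob struct {
	Suspend             bool
	JobDeploymentStatus string
}

// shouldCopyStatus returns true only when the remote object was found
// AND is running unsuspended, so a freshly created (still suspended)
// remote job does not overwrite the local status prematurely.
func shouldCopyStatus(found bool, r remoteJob) bool {
	return found && !r.Suspend && r.JobDeploymentStatus != "Suspended"
}

func main() {
	fmt.Println(shouldCopyStatus(true, remoteJob{Suspend: true}))
	fmt.Println(shouldCopyStatus(true, remoteJob{Suspend: false, JobDeploymentStatus: "Running"}))
}
```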
}

// if the remote exists, just copy the status
if err == nil {
And here.
if runtime.GOOS == "darwin" {
	E2eKuberayTestImage = "rayproject/ray:2.9.0-py39-aarch64"
}
fmt.Println(runtime.GOOS, E2eKuberayTestImage)
Suggested change:
- fmt.Println(runtime.GOOS, E2eKuberayTestImage)
+ ginkgo.GinkgoLogr.Info(runtime.GOOS, E2eKuberayTestImage)
Or maybe remove it?
Yes, I will remove it; it was for debugging purposes.
- var _ = ginkgo.Describe("Multikueue", ginkgo.Ordered, ginkgo.ContinueOnFailure, func() {
+ var _ = ginkgo.FDescribe("Multikueue", ginkgo.Ordered, ginkgo.ContinueOnFailure, func() {
Suggested change:
- var _ = ginkgo.FDescribe("Multikueue", ginkgo.Ordered, ginkgo.ContinueOnFailure, func() {
+ var _ = ginkgo.Describe("Multikueue", ginkgo.Ordered, ginkgo.ContinueOnFailure, func() {
correct
@@ -87,7 +88,7 @@ func (j *RayJob) IsSuspended() bool {
}

func (j *RayJob) IsActive() bool {
-	return j.Status.JobDeploymentStatus != rayv1.JobDeploymentStatusSuspended
+	return j.Status.JobDeploymentStatus != rayv1.JobDeploymentStatusNew
Do we need these changes?
Well, that was actually the most important change. Because JobDeploymentStatusSuspended is not an initial status, the workload couldn't be created at all.
Maybe it should be both, i.e. not New and not Suspended, but definitely not Suspended only.
I am not even entirely sure if this should watch the JobDeploymentStatus, as the Job has its own Status.
I think you're right, it should be both.
kueue/pkg/controller/jobframework/interface.go
Lines 57 to 58 in cea42f2

// IsActive returns true if there are any running pods.
IsActive() bool
cc: @andrewsykim
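The conclusion of the thread above (check both statuses) can be sketched as follows. The status constants here are hypothetical mirrors of the rayv1 JobDeploymentStatus values; the real ones live in github.com/ray-project/kuberay.

```go
package main

import "fmt"

// jobDeploymentStatus stands in for rayv1.JobDeploymentStatus.
type jobDeploymentStatus string

const (
	statusNew       jobDeploymentStatus = "New"
	statusSuspended jobDeploymentStatus = "Suspended"
	statusRunning   jobDeploymentStatus = "Running"
)

// isActive treats a RayJob as active only when it is neither freshly
// created (New) nor Suspended: checking Suspended alone blocked workload
// creation, while checking New alone would mark suspended jobs active.
func isActive(s jobDeploymentStatus) bool {
	return s != statusNew && s != statusSuspended
}

func main() {
	fmt.Println(isActive(statusNew), isActive(statusSuspended), isActive(statusRunning))
}
```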
What type of PR is this?
/kind feature
What this PR does / why we need it:
We want to be able to run Kuberay workloads (RayJob, RayCluster) in MultiKueue.
Which issue(s) this PR fixes:
Relates to #3822
Special notes for your reviewer:
Does this PR introduce a user-facing change?