Support KubeRay in MultiKueue via managedBy #3822

Open
mimowo opened this issue Dec 12, 2024 · 10 comments

@mimowo
Contributor

mimowo commented Dec 12, 2024

What would you like to be added:

Support for KubeRay via managedBy in MultiKueue.

The relevant support for the managedBy field has recently been merged in KubeRay (see ray-project/kuberay#2544) and will most likely be released in v1.3.
Until then we can use KubeRay's main branch in Kueue to test that it all works. Once KubeRay is released we can switch to the released version and merge the change into Kueue.

Why is this needed:

Support for KubeRay via managedBy in MultiKueue allows for:

  • ease of setup, since the full KubeRay operator (not just its CRDs) can be installed on the management cluster
  • hybrid deployments, where some RayJobs are dispatched via MultiKueue and others run locally on the management cluster (see the sketch below)
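
For context, a minimal sketch (not taken from this issue) of how a RayJob on the management cluster could opt into MultiKueue via managedBy. The job name, queue name, and image tag are illustrative assumptions; `kueue.x-k8s.io/multikueue` is the controller name Kueue uses for MultiKueue-managed jobs of other kinds:

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: sample-rayjob                        # hypothetical name, for illustration only
  labels:
    kueue.x-k8s.io/queue-name: user-queue    # assumed LocalQueue on the management cluster
spec:
  managedBy: kueue.x-k8s.io/multikueue       # hands reconciliation over to the MultiKueue controller
  entrypoint: python -c "print('hello from ray')"
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0    # assumed Ray image tag; worker groups omitted for brevity
```

The idea is that with managedBy set, the KubeRay operator on the management cluster skips this RayJob and only the worker cluster selected by MultiKueue runs it, which is what makes the hybrid setup above possible.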
@mimowo added the kind/feature label on Dec 12, 2024
@mimowo
Contributor Author

mimowo commented Dec 12, 2024

@mszadkow
Contributor

This ticket requires first providing RayJob and RayCluster MultiKueue adapters and setting up e2e tests, as we don't have them yet.
The first PR will cover the adapter and MultiKueue e2e tests with ray-operator v1.2.2 - no managedBy yet, as that requires the latest version.
The second PR will attempt to work with the latest ray-operator and make use of managedBy.

@mimowo
Contributor Author

mimowo commented Dec 13, 2024

sgtm, we can start with tests using 1.2.2 but without the managedBy field and merge that as a starter. Alternatively, we could already use the latest (main) of KubeRay just for testing purposes - we will merge once KubeRay is released.

@mszadkow
Contributor

Another problem I have right now is that the KubeRay cluster startup time is huge for e2e tests.
It's about 5 minutes. Maybe it's due to the requirement to use the rayproject image, but setting any other image results in the job stalling at initialization.
Normally we use a sleep image just to call anything, but that won't work with a Ray cluster.

@mimowo
Contributor Author

mimowo commented Dec 18, 2024

Ok, in that case we may need to have a separate CI for Ray.

However, let me first understand what exactly you mean by "cluster startup time" - is this the installation of KubeRay, or the time to run the first RayJob? Does it also take long to run follow-up Jobs?

Also, please make sure you are rebased against the main branch, because we recently increased the CPU limits for Kueue to 2000m, which might be relevant here too.

@andrewsykim
Member

> Normally we use a sleep image just to call anything, but that won't work with a Ray cluster.

It may be possible to construct a dummy RayCluster that just calls sleep with a lighter image, but I haven't tried this myself. This assumes the start-up time issue is due to pulling the default Ray images.

@mimowo
Contributor Author

mimowo commented Dec 18, 2024

Thank you @andrewsykim for the suggestion. @mszadkow, can we try that? Maybe you already did and hit some complications?

@mszadkow
Contributor

Well, I have tried to load the image up-front into the test cluster (kind), but then I had space limitation issues, at least locally.
I will have to verify what happens in CI.
@andrewsykim, did you mean I could build a lighter version of rayproject/ray?

@andrewsykim
Member

> @andrewsykim, did you mean I could build a lighter version of rayproject/ray?

No, when you construct a RayJob or RayCluster, you can pass arbitrary images for the Head and Worker pods. So you can try the much smaller images used in other tests that just need to run sleep.
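
For illustration, a rough sketch of where those image overrides go in a RayCluster spec. The busybox image and names are placeholders rather than anything the Kueue e2e tests actually use, and (as the next comment notes) KubeRay still injects its own ray start command, so a non-Ray image may not get past initialization:

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: dummy-raycluster                # hypothetical name, for illustration only
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: busybox:1.36         # placeholder "light" image instead of rayproject/ray
  workerGroupSpecs:
    - groupName: small-group
      replicas: 1
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: busybox:1.36       # placeholder "light" image
```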

@mszadkow
Contributor

I found a way to run another type of image, just with bash and calling sleep.
However, it seems that KubeRay overrides `KUBERAY_GEN_RAY_START_CMD`, and no matter what I put there it ends up as `ray start --head --dashboard-host=0.0.0.0 --metrics-export-port=8080 --block --dashboard-agent-listen-port=52365`.
Then I would have to install Ray into the Docker image...
Let's first go with the rayproject image.
