[jobs] revamp scheduling for managed jobs #4485
If the scheduler is killed before this line (e.g. when running as part of a controller job), we will get stuck, since the job will be submitted but the controller will never start. TODO: figure out how to recover from this case.
We can have a skylet event to monitor the managed job table, like we do for normal unmanaged jobs.
We are already using the existing managed job skylet event for that, but the problem is that if it dies right here, there's no way to know whether the scheduler is just about to start the process or has already died. We need a way to check if the scheduler died, or maybe a timestamp for the WAITING -> LAUNCHING transition.
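A rough sketch of what that timestamp-based check could look like (all names here are hypothetical, not the actual table schema or skylet event API):

```python
# Sketch: record a timestamp on the WAITING -> LAUNCHING transition and let a
# periodic event reset jobs whose scheduler process died before spawning the
# controller. `launching_since`, `scheduler_pid`, and the timeout are made-up
# names for illustration only.
import os
import time

LAUNCH_STARTUP_TIMEOUT = 60  # seconds; assumed grace period


def _pid_alive(pid: int) -> bool:
    """Best-effort liveness check for the scheduler process."""
    try:
        os.kill(pid, 0)
        return True
    except OSError:
        return False


def recover_stuck_launching(jobs: list) -> None:
    """Reset jobs stuck in LAUNCHING whose scheduler died early.

    `jobs` is a list of dicts with hypothetical fields:
    {'job_id', 'state', 'launching_since', 'scheduler_pid'}.
    """
    now = time.time()
    for job in jobs:
        if job['state'] != 'LAUNCHING':
            continue
        waited = now - job['launching_since']
        if waited > LAUNCH_STARTUP_TIMEOUT and not _pid_alive(job['scheduler_pid']):
            # The scheduler died between claiming the job and starting the
            # controller; put the job back so the next scheduling pass can
            # pick it up.
            job['state'] = 'WAITING'
```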
Since we skip any scheduling call when there is a job holding the lock to start, would it be possible that the following happens: `job_done` for job 1. Also, there is a race between this function releasing the scheduling lock vs. the underlying `jobs.controller` calling the scheduling again. Should the underlying `jobs.controller` wait for this scheduler function to set some specific state before proceeding to trigger the scheduler?
If the launch is waiting for resources (in backoff), it will be ALIVE instead of LAUNCHING while sleeping.
https://github.com/skypilot-org/skypilot/pull/4485/files#diff-092d0b9d29f77447088331967c3a39d77095fb44328a6e3e89d614b1236f9ab3R391-R395
This race is not clear to me. I think whatever race you are thinking of is protected by holding the lock whenever a job transitions state.
Consider two processes that call `maybe_start_waiting_jobs` at the same time. Only one will get the lock, and the other will exit immediately, but no matter which process gets the lock, the function will do exactly the same thing. We guarantee this because as soon as the lock is held by either process, the scheduling and job states can only be changed by whichever process is doing the scheduling. Any other job state changes will block until the scheduling finishes.
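To illustrate the pattern being described (a simplified sketch, not the PR's actual implementation; the lock path and the helpers passed in are made up):

```python
# Sketch of the "only one scheduler runs, everyone else exits immediately"
# pattern, assuming a file-based scheduler lock.
import filelock

_SCHEDULER_LOCK = '/tmp/managed_job_scheduler.lock'  # hypothetical path


def maybe_start_waiting_jobs_sketch(get_waiting_jobs, launch_job) -> None:
    """Launch every job that can currently be launched.

    Whichever caller wins the lock performs the same full pass over the
    waiting jobs, so it does not matter which process gets it; the losers
    just return.
    """
    try:
        # Non-blocking acquire: if another process is already scheduling,
        # give up immediately instead of queueing behind it.
        with filelock.FileLock(_SCHEDULER_LOCK, timeout=0):
            # While the lock is held, scheduling/job state can only be
            # changed here; other state transitions block on the same lock.
            for job in get_waiting_jobs():
                launch_job(job)
    except filelock.Timeout:
        # Someone else is scheduling; they will launch any job we would
        # have started, so exiting here is safe.
        return
```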
I guess we should clarify the following:
I assume this may still take a while if the launch/failover is slow, especially when it has to go through a large list of candidate regions/clouds/accelerator types; i.e., the time to schedule the next job may be non-deterministic and depends on the number of candidate locations to search.
By race here, I mean a case similar to the one mentioned above, where we expect that if a job controller gets scheduled, the next job should start scheduling immediately once the `jobs.controller` calls the scheduling. However, due to the "race" (the scheduling function may still be running for the current `jobs.controller` when it calls the schedule for the next job), the schedule for the next job will be skipped until we finish the failover for the launch.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since `maybe_start_waiting_jobs` will directly launch all possible jobs that can be launched, I think this is okay. Will try to clarify this in comments.