[jobs] revamp scheduling for managed jobs #4485
base: master
Conversation
sky/jobs/scheduler.py
Outdated
os.makedirs(logs_dir, exist_ok=True)
log_path = os.path.join(logs_dir, f'{managed_job_id}.log')

pid = subprocess_utils.launch_new_process_tree(
If the scheduler is killed before this line (e.g. when running as part of a controller job), we will get stuck, since the job will be submitted but the controller will never start. TODO: figure out how to recover from this case.
We can have a skylet event to monitor the managed job table, like we do for normal unmanaged jobs.
We are already using the existing managed job skylet event for that, but the problem is that if the scheduler dies right here, there's no way to know whether it was just about to start the process or had already died. We need a way to check if the scheduler died, or maybe a timestamp for the WAITING -> LAUNCHING transition.
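To make the idea concrete, here is a hedged sketch of the kind of recovery check being discussed, not the actual SkyPilot implementation. It assumes the managed-jobs table records a controller pid and a timestamp for the WAITING -> LAUNCHING transition; the dict layout, field names, and 10-minute grace period are illustrative assumptions.

```python
# Hypothetical skylet-event-style check for jobs whose controller never started.
import os
import time
from typing import Dict, List

_LAUNCHING_STALE_SECONDS = 600  # assumed grace period


def _pid_alive(pid: int) -> bool:
    try:
        os.kill(pid, 0)  # signal 0: existence check only, sends nothing
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user


def find_stuck_submissions(jobs: List[Dict]) -> List[Dict]:
    """Return LAUNCHING jobs whose controller process never came up.

    If a job has been in LAUNCHING with no live controller pid for longer
    than the grace period, the scheduler most likely died between submitting
    the job and spawning the controller process.
    """
    stuck = []
    for job in jobs:
        if job['state'] != 'LAUNCHING':
            continue
        pid = job.get('controller_pid')
        if pid is not None and _pid_alive(pid):
            continue
        if time.time() - job['launching_started_at'] > _LAUNCHING_STALE_SECONDS:
            stuck.append(job)
    return stuck
```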
Thanks @cg505 for making this significant change! This is awesome! I glanced over the code, and it mostly looks good. The main concern is the complexity and granularity we have for limiting the number of launches. Please see the comments below.
sky/jobs/constants.py
Outdated
@@ -2,10 +2,12 @@

JOBS_CONTROLLER_TEMPLATE = 'jobs-controller.yaml.j2'
JOBS_CONTROLLER_YAML_PREFIX = '~/.sky/jobs_controller'
JOBS_CONTROLLER_LOGS_DIR = '~/sky_controller_logs'
Can we store the logs in either ~/sky_logs/jobs_controller or ~/.sky?
sky/jobs/scheduler.py
Outdated
print(launching_jobs, alive_jobs)
print(_get_launch_parallelism(), _get_job_parallelism())
Do we need to redirect the logging?
These were debug lines I left in accidentally. I am now using logger.debug. I guess that's probably fine?
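A minimal sketch of that change, assuming SkyPilot's usual logging helper (sky.sky_logging.init_logger); the values logged mirror the debug prints shown in the diff, and the wrapper function is only for illustration.

```python
from sky import sky_logging

logger = sky_logging.init_logger(__name__)


def _log_scheduler_state(launching_jobs: int, alive_jobs: int,
                         launch_parallelism: int, job_parallelism: int) -> None:
    # Debug-level output goes to the scheduler's logger instead of stdout.
    logger.debug(f'Launching jobs: {launching_jobs}, alive jobs: {alive_jobs}')
    logger.debug(f'Launch parallelism: {launch_parallelism}, '
                 f'job parallelism: {job_parallelism}')
```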
sky/jobs/scheduler.py
Outdated
_ACTIVE_JOB_LAUNCH_WAIT_INTERVAL = 0.5


def schedule_step() -> None:
How about calling it schedule_new_job()?
Changed to maybe_start_waiting_jobs(). Open to other names - it might be a bit of a misnomer if it schedules an existing job that wants to launch something.
sky/jobs/scheduler.py
Outdated
def schedule_step() -> None:
    """Determine if any jobs can be launched, and if so, launch them.
"""Determine if any jobs can be launched, and if so, launch them. | |
"""Determine if any jobs can be scheduled, and if so, schedule them. |
sky/jobs/scheduler.py
Outdated
@contextlib.contextmanager
def schedule_active_job_launch(is_running: bool):
I think this function may overcomplicate the problem here. For launches, we do not need to preserve FIFO order. Instead, we can use the schedule_step above to preserve the order in which jobs are scheduled, while using a semaphore to limit the total number of concurrent launches.
Reason: I think we should limit the actual launches at a finer granularity:
skypilot/sky/jobs/recovery_strategy.py
Lines 311 to 321 in 83b2325
sky.launch(
    self.dag,
    cluster_name=self.cluster_name,
    # We expect to tear down the cluster as soon as the job is
    # finished. However, in case the controller dies, set
    # autodown to try and avoid a resource leak.
    idle_minutes_to_autostop=_AUTODOWN_MINUTES,
    down=True,
    detach_setup=True,
    detach_run=True,
    _is_launched_by_jobs_controller=True)
Otherwise, when resource capacity is scarce and a job goes into an hour-long retry loop, it will block other jobs from being able to launch their resources.
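A rough sketch of this suggestion, not the PR's implementation: schedule_step keeps the FIFO ordering and the job-parallelism cap, while a semaphore bounds only the provisioning call itself, so a job sitting in a long retry/backoff loop does not occupy a launch slot. The cap value and helper name are assumptions.

```python
import threading
from typing import Any, Callable

_MAX_CONCURRENT_LAUNCHES = 16  # hypothetical limit
_LAUNCH_SEMAPHORE = threading.Semaphore(_MAX_CONCURRENT_LAUNCHES)


def launch_with_limit(launch_fn: Callable[..., Any], *args, **kwargs) -> Any:
    # Hold the semaphore only for the duration of the actual sky.launch()
    # call; any waiting or backoff between retries happens outside it.
    with _LAUNCH_SEMAPHORE:
        return launch_fn(*args, **kwargs)
```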
With this, schedule_step only needs to check the threshold for the total number of jobs that can run in parallel, and does not need to check the total number of launches.
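In other words, the scheduling loop could reduce to something like the following hypothetical sketch (the helper names are assumptions, not the real scheduler API):

```python
def schedule_step(job_parallelism: int) -> None:
    # With launches bounded separately by the semaphore, only enforce the
    # job-parallelism cap here, picking up waiting jobs in FIFO order.
    while alive_job_count() < job_parallelism:
        job_id = next_waiting_job()  # assumed helper; returns None when empty
        if job_id is None:
            return
        start_controller(job_id)     # assumed helper
```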
Updated how this works significantly. The new version transitions the job to ALIVE (which does not count towards the launching limit) while it is in backoff.
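A hedged sketch of the behavior described in this comment; the state names come from the comment, but every function here is an assumption rather than the real scheduler API: while waiting out a backoff between failed launch attempts, the job gives up its LAUNCHING slot and sits in ALIVE instead.

```python
import time


def launch_with_backoff(job_id: int, backoff_seconds: float,
                        max_attempts: int) -> bool:
    for _ in range(max_attempts):
        wait_until_launch_okay(job_id)  # -> LAUNCHING (counts toward the launch limit)
        if try_launch_once(job_id):     # assumed helper: one sky.launch attempt
            return True
        mark_alive(job_id)              # LAUNCHING -> ALIVE: frees the launch slot
        time.sleep(backoff_seconds)     # backoff happens outside the limit
    return False
```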
/quicktest-core
Detaches the job controller from the Ray worker and the Ray driver program, and uses our own scheduling and parallelism control mechanism.
See the comments in sky/jobs/scheduler.py for more info.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh