[jobs] revamp scheduling for managed jobs #4485

cg505 · 2024-12-19T01:51:23Z

Detaches the job controller from ray worker and the ray driver program, and uses our own scheduling and parallelism control mechanism.

See the commands in sky/jobs/scheduler.py for more info.
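For readers unfamiliar with what "detaching" a controller process involves, the following is a minimal sketch using only the Python standard library. It is not the code in this PR: `launch_detached` is a hypothetical helper, and it assumes a POSIX system where `start_new_session=True` puts the child in its own session so it is not tied to the parent's process group.

```python
import os
import subprocess
from typing import List


def launch_detached(cmd: List[str], log_path: str) -> int:
    """Start `cmd` in its own session, appending its output to `log_path`.

    Hypothetical helper for illustration only; not SkyPilot's implementation.
    """
    log_path = os.path.expanduser(log_path)
    os.makedirs(os.path.dirname(log_path) or '.', exist_ok=True)
    with open(log_path, 'ab') as log_file:
        proc = subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=log_file,
            stderr=subprocess.STDOUT,
            # New session: the child leaves the parent's process group, so
            # signals sent to that group do not reach it (POSIX only).
            start_new_session=True,
        )
    # The child keeps its own copy of the log file descriptor, so closing
    # our handle above does not interrupt its output.
    return proc.pid
```

The returned pid is what a scheduler would need to persist so that a later health check can tell whether the controller is still running.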

Tested (run the relevant ones):

Code formatting: bash format.sh
Any manual or new tests for this PR (please specify below)
All smoke tests: pytest tests/test_smoke.py
Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

sky/jobs/scheduler.py

sky/jobs/state.py

cg505 · 2024-12-19T02:34:41Z

sky/jobs/scheduler.py

+                os.makedirs(logs_dir, exist_ok=True)
+                log_path = os.path.join(logs_dir, f'{managed_job_id}.log')
+
+                pid = subprocess_utils.launch_new_process_tree(


If the scheduler is killed before this line (e.g. when running as part of a controller job), we will get stuck: the job will be submitted but the controller will never start. TODO: figure out how to recover from this case.

We can have a skylet event to monitor managed job table, like we do for normal unmanaged jobs.

We are already using the existing managed job skylet event for that, but the problem is that if it dies right here, there's no way to know whether the scheduler is just about to start the process or has already died. We need a way to check whether the scheduler died, or maybe a timestamp for the WAITING -> LAUNCHING transition.
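One way the two ideas in this thread (a scheduler liveness check and a transition timestamp) could be combined is sketched below. Everything here is an assumption for illustration: `LaunchingRecord`, its fields, `is_stuck`, and the 60-second default are hypothetical and do not reflect the schema in sky/jobs/state.py. The pid check uses `os.kill(pid, 0)`, which is POSIX-only and can be fooled by pid reuse.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class LaunchingRecord:
    """Hypothetical row written when a job moves WAITING -> LAUNCHING."""
    job_id: int
    scheduler_pid: int  # pid of the scheduler that made the transition
    launching_since: float  # time.time() at the transition
    controller_pid: Optional[int] = None  # set once the controller starts


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this pid exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence and permissions
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def is_stuck(record: LaunchingRecord, now: float,
             timeout_seconds: float = 60.0) -> bool:
    """Guess whether a LAUNCHING job lost its scheduler before starting."""
    if record.controller_pid is not None:
        return False  # the controller was started; nothing to recover
    if not pid_is_alive(record.scheduler_pid):
        return True  # the scheduler died between the transition and launch
    # The scheduler is alive but has not started the controller for too long.
    return now - record.launching_since > timeout_seconds
```

A periodic event could call `is_stuck` for every LAUNCHING row and move stuck jobs back to WAITING, though that recovery step is outside what this thread settles.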

Michaelvll

Thanks @cg505 for making this significant change! This is awesome! I glanced at the code, and it mostly looks good. My main concern is the complexity and granularity of how we limit the number of launches. Please see the comments below.

sky/backends/cloud_vm_ray_backend.py

Michaelvll · 2024-12-19T07:38:39Z

sky/jobs/constants.py

@@ -2,10 +2,12 @@

 JOBS_CONTROLLER_TEMPLATE = 'jobs-controller.yaml.j2'
 JOBS_CONTROLLER_YAML_PREFIX = '~/.sky/jobs_controller'
+JOBS_CONTROLLER_LOGS_DIR = '~/sky_controller_logs'


Can we store the logs in either ~/sky_logs/jobs_controller or ~/.sky?

Michaelvll · 2024-12-19T07:56:37Z

sky/jobs/scheduler.py

+    print(launching_jobs, alive_jobs)
+    print(_get_launch_parallelism(), _get_job_parallelism())


Do we need to redirect the logging?

These were debug lines I left in accidentally. I am now using logger.debug. I guess that's probably fine?
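As a small illustration of why `logger.debug` is a reasonable replacement for the stray `print` calls, here is a sketch using the standard `logging` module. The logger name and the `report_counts` helper are illustrative assumptions, not the names used in the PR.

```python
import logging

# Illustrative logger name; the PR may obtain its logger differently.
logger = logging.getLogger('sky.jobs.scheduler')


def report_counts(launching_jobs: int, alive_jobs: int) -> None:
    """Emit scheduler counters at DEBUG level instead of printing them."""
    # Lazy %-style formatting: the message is only built if DEBUG is enabled.
    logger.debug('launching jobs: %d, alive jobs: %d', launching_jobs,
                 alive_jobs)
```

Unlike `print`, these messages stay out of the output unless the logger's level is set to DEBUG, and they go wherever the logger's handlers send them.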

sky/jobs/scheduler.py

Michaelvll · 2024-12-19T08:29:57Z

sky/jobs/scheduler.py

+                os.makedirs(logs_dir, exist_ok=True)
+                log_path = os.path.join(logs_dir, f'{managed_job_id}.log')
+
+                pid = subprocess_utils.launch_new_process_tree(


We can have a skylet event to monitor managed job table, like we do for normal unmanaged jobs.

sky/jobs/scheduler.py

Michaelvll · 2024-12-19T09:00:09Z

sky/jobs/scheduler.py

+_ACTIVE_JOB_LAUNCH_WAIT_INTERVAL = 0.5
+
+
+def schedule_step() -> None:


How about calling it schedule_new_job()?

Changed to maybe_start_waiting_jobs(). Open to other names - it might be a bit of a misnomer if it schedules an existing job that wants to launch something.

Michaelvll · 2024-12-19T09:00:21Z

sky/jobs/scheduler.py

+
+
+def schedule_step() -> None:
+    """Determine if any jobs can be launched, and if so, launch them.


Suggested change

-    """Determine if any jobs can be launched, and if so, launch them.
+    """Determine if any jobs can be scheduled, and if so, schedule them.

Michaelvll · 2024-12-19T09:09:41Z

sky/jobs/scheduler.py

+
+
+@contextlib.contextmanager
+def schedule_active_job_launch(is_running: bool):


I think this function may overcomplicate the problem here. For launches, we do not need to preserve a FIFO order.
Instead, we can use the schedule_step above to preserve the order in which jobs are scheduled, while using a semaphore to limit the total number of launches.
Reason: I think we should limit the actual launches at a finer granularity:

skypilot/sky/jobs/recovery_strategy.py, lines 311 to 321 in 83b2325:

    sky.launch(
        self.dag,
        cluster_name=self.cluster_name,
        # We expect to tear down the cluster as soon as the job is
        # finished. However, in case the controller dies, set
        # autodown to try and avoid a resource leak.
        idle_minutes_to_autostop=_AUTODOWN_MINUTES,
        down=True,
        detach_setup=True,
        detach_run=True,
        _is_launched_by_jobs_controller=True)

Otherwise, when resource capacity is bad and a job goes into an hour-long waiting loop, it will block other jobs from launching their resources.

With this, schedule_step only needs to check the threshold for the total number of jobs that can run in parallel; it doesn't need to check the total number of launches.

Updated how this works significantly. New version will transition to ALIVE (does not count towards launching limit) while in backoff.
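The behavior described in this thread can be modeled roughly as follows. This is an in-memory sketch under stated assumptions, not the implementation in sky/jobs/scheduler.py: the `Scheduler` class, the `DONE` state, and the `submit`, `mark_alive`, and `mark_done` methods are hypothetical, and the real scheduler coordinates separate processes rather than a single object. It also omits the case where an already-running job needs to launch again. The state names WAITING, LAUNCHING, and ALIVE and the function name `maybe_start_waiting_jobs` are taken from the discussion above.

```python
import enum
from collections import OrderedDict
from typing import List


class ScheduleState(enum.Enum):
    WAITING = 'WAITING'  # submitted; controller not started yet
    LAUNCHING = 'LAUNCHING'  # counts toward the launch and job limits
    ALIVE = 'ALIVE'  # running or in backoff; counts toward the job limit only
    DONE = 'DONE'  # hypothetical terminal state for this sketch


class Scheduler:
    """Toy model of FIFO scheduling under two parallelism limits."""

    def __init__(self, launch_parallelism: int, job_parallelism: int) -> None:
        self.launch_parallelism = launch_parallelism
        self.job_parallelism = job_parallelism
        # Insertion order doubles as submission (FIFO) order.
        self.jobs: 'OrderedDict[int, ScheduleState]' = OrderedDict()

    def submit(self, job_id: int) -> None:
        self.jobs[job_id] = ScheduleState.WAITING

    def _count(self, *states: ScheduleState) -> int:
        return sum(1 for state in self.jobs.values() if state in states)

    def maybe_start_waiting_jobs(self) -> List[int]:
        """Move WAITING jobs to LAUNCHING, in order, while limits allow."""
        started = []
        for job_id, state in list(self.jobs.items()):
            if state is not ScheduleState.WAITING:
                continue
            launching = self._count(ScheduleState.LAUNCHING)
            alive = self._count(ScheduleState.LAUNCHING, ScheduleState.ALIVE)
            if (launching >= self.launch_parallelism or
                    alive >= self.job_parallelism):
                break  # stop at the first job that cannot start
            self.jobs[job_id] = ScheduleState.LAUNCHING
            started.append(job_id)
        return started

    def mark_alive(self, job_id: int) -> List[int]:
        """Launch finished or job entered backoff: free its launch slot."""
        self.jobs[job_id] = ScheduleState.ALIVE
        return self.maybe_start_waiting_jobs()

    def mark_done(self, job_id: int) -> List[int]:
        """Job finished: free its job slot."""
        self.jobs[job_id] = ScheduleState.DONE
        return self.maybe_start_waiting_jobs()
```

With `launch_parallelism=1`, a job stuck in backoff no longer holds the only launch slot: calling `mark_alive` on it immediately lets the next waiting job start launching, which addresses the blocking concern raised above.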

cg505 · 2024-12-20T06:47:25Z

/quicktest-core

revamp scheduling for managed jobs

78eef52

cg505 mentioned this pull request Dec 19, 2024

detach the managed job controller from job submission #4458

Closed

6 tasks

Michaelvll reviewed Dec 19, 2024

cg505 added 2 commits December 19, 2024 21:19

simplify locking mechanism

4c54642

additional fixes

aeaaf7b

cg505 marked this pull request as ready for review December 20, 2024 05:34

cg505 requested a review from Michaelvll December 20, 2024 05:34

cg505 changed the title ~~revamp scheduling for managed jobs~~ [jobs/ revamp scheduling for managed jobs Dec 20, 2024

cg505 changed the title ~~[jobs/ revamp scheduling for managed jobs~~ [jobs] revamp scheduling for managed jobs Dec 20, 2024

cg505 added 2 commits December 19, 2024 22:19

fix pid writing

3b4cf44

Merge branch 'master' of github.com:skypilot-org/skypilot into manage…

ea84bb4

…d-jobs-skylet
