Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] make per-cloud catalog lookup parallel #4483

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

aylei
Copy link

@aylei aylei commented Dec 18, 2024

Part of #3159

This is a draft PR with a known issue to be solved.

The test is done locally with the duration of launch to confirm timed (aylei@e662df0). On my laptop, this PR improves the performance in 4 test cases with 7 clouds enabled compared to main (f0ebf13):

  1. sky launch --gpus H100:8, from 2.38s to 2.06s
  2. rm -rf ~/.sky/catalogs/** && sky launch --gpus H100:8 (cold start), from 12.15s to 6.29s
  3. sky launch test.yaml, ordered resources with non-canonical names, from 5.75s to 5.07s
  4. rm -rf ~/.sky/catalogs/** && sky launch test.yaml, same as above, but cold start, from 15.75s to 10.17s

test.yaml:

resources:
  cpus: 8+
  ordered:
    - accelerators: L40S:1
    - accelerators: L40:1
    - accelerators: A10g:1
    - accelerators: A10:1
    - accelerators: A100:1
    - accelerators: H100:1

run: |
  echo 'test done'

Cold-start and test.yaml are chosen deliberately. Because I found that for the case of "--gpu H100:8", the improvement was less than I expected (I expected a ~1s improvement according to former analysis).

The reason is GIL. Through profiling (source: gil.svg), I found that each cloud spends 10~100+ ms on init, which can not be parallelized due to GIL:

image

Actually in my setup only K8S and AWS would file network IO, where K8S ListNode costs ~0.5s and AWS get-caller-identity costs ~1s. So I tested the cold-start case which shows in scenarios with more I/O calls, the performance optimization of concurrency will be more significant.

For test.yaml, I deliberately included non-canonical GPU names to cover the non short-circuit path, where the IO calls will be advanced to get_accelerators.

I thought carefully about whether this PR is a premature optimization and my final thought is that processing each cloud concurrently in multi-cloud scenario is necessary for scalability as skypilot is a multi-cloud solution at its core(IIUC). This is why I did not parallelize the single get_accelerators function but parallelize all catalog lookup calls in _map_clouds_catalog. @Michaelvll What do you think?

Note, since sub thread status is silent yet, sky launch will lose catalog progress after this PR merged. Follow up: #4491

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (see above)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @aylei! Adding the parallelization for all cloud methods makes sense to me. Left some minor comments mainly for interfaces.

Investigating AWS and kubernetes latency should be a great next step. IIRC, trying to fetch the mapping for AWS zone names can cause a large latency even when we don't have AWS enabled.

sky/utils/subprocess_utils.py Outdated Show resolved Hide resolved
sky/optimizer.py Outdated Show resolved Hide resolved
@aylei
Copy link
Author

aylei commented Dec 19, 2024

Thanks @aylei! Adding the parallelization for all cloud methods makes sense to me. Left some minor comments mainly for interfaces.

Investigating AWS and kubernetes latency should be a great next step. IIRC, trying to fetch the mapping for AWS zone names can cause a large latency even when we don't have AWS enabled.

@Michaelvll Thanks for your suggestions❤️. I reproduced the AWS zone name issue by cleaning catalog cache, excluding aws from allowed_clouds in ~/.sky/config.yaml and run sky launch --gpus L40:8, which uses a non-canonical GPU name and will enter the slow path. I think service_catalog should also honor allowed_clouds configuration but this seems involve semantic change, since allowed_clouds is documented as

This field is used to restrict the clouds that SkyPilot will check and use
when running sky check. "

My idea is to apply the same semantics to other commands like launch, wdyt? @Michaelvll

As of normal cases, the zone mapping is usually cached with AWS userID as the cache key (

filename = f'aws/az_mappings-{aws_user_hash}.csv'
), so the only relevant RPC is get-caller-identity in daily usage, with the assumption that the AWS account used rarely changes.

Therefore, for the mainstream scenarios, I currently think there might be no optimization method for the get-caller-identity RPC of AWS and ListNode RPC of K8S without adjusting UX. That is, caching these calls requires extra user attention and should be carefully designed. I will evaluate whether the benefits would payoff.

@aylei aylei marked this pull request as ready for review December 19, 2024 12:31
@aylei
Copy link
Author

aylei commented Dec 19, 2024

@Michaelvll I've addressed the comments and added sub thread status support. Please kindly take a look. I will move further work on #3159 to separate PRs so each of them can be focused and coherent.

Current look when sub thread status is involved:

2024-12-19 20 22 20

2024-12-19 20 23 04

PS: fetch zone mapping is slow indeed on cache miss😂

@aylei aylei requested a review from Michaelvll December 19, 2024 12:42
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @aylei! It looks good to me. Just some minor nits. Let's proceed with the optimization of the second step. : )

Comment on lines +1297 to +1301
feasible_list = subprocess_utils.run_in_parallel(
lambda cloud, r=resources, n=task.num_nodes:
(cloud, cloud.get_feasible_launchable_resources(r, n)),
clouds_list)
for cloud, feasible_resources in feasible_list:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very minor nit: the run_in_parallel keeps the original order of the arguments, so no need to have cloud in the output, but zip(cloud_list, feasible_list) should be fine.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, with this new knowledge, I personally still tend to loosen the behavior dependency on the run_in_parallel function for maintainability, wdyt?

@@ -1,16 +1,21 @@
"""Rich status spinner utils."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! However, since this can affect many other part of the code, let's move this to another PR, and keep this PR clean for only having the parallelism changes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but since status of sub thread is silent on master, there would be a UX degradation in between, is that okay?

@@ -113,6 +114,12 @@ def run_in_parallel(func: Callable,
A list of the return values of the function func, in the same order as the
arguments.
"""
if isinstance(args, collections.abc.Sized):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: how about we change the input args to be List[Any] and change the invocations to use lists to avoid this collections.abc.Sized?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I will try this~

@aylei
Copy link
Author

aylei commented Dec 20, 2024

@Michaelvll ready for the next review

Signed-off-by: Aylei <[email protected]>
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @aylei! LGTM. Could you share a bit of how the UX looks like without the thread-based handling for spinner?

return method(*args, **kwargs)

results = subprocess_utils.run_in_parallel(_execute_catalog_method,
list(clouds), len(clouds))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style nit: use kwargs for non-obvious argument meaning : )

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

sky/utils/subprocess_utils.py Outdated Show resolved Hide resolved
@aylei
Copy link
Author

aylei commented Dec 21, 2024

@Michaelvll

For canonical GPU:
2024-12-21 21 13 01

For non-canonical GPU:
2024-12-21 21 13 30

Fine when no catalog refresh is involved, otherwise it looks a little confusing since Optimizing... or List accelerators... would last several seconds with no detailed message.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants