[Release] Release 0.7.1 #4438
base: releases/0.7.1_pure
…ing (skypilot-org#4264) * fix race condition for setting job status to FAILED during INIT * Fix * fix * format * Add smoke tests * revert pending submit * remove update entirely for the job schedule step * wait for job 32 to finish * fix smoke * move and rename * Add comment * minor
* Avoid job schedule race condition * format * format * Avoid race for cancel
…ounts are specified (skypilot-org#4317) do file mounts if storage is specified
* avoid catching ValueError during failover If the cloud api raises ValueError or a subclass of ValueError during instance termination, we will assume the cluster was downed. Fix this by introducing a new exception ClusterDoesNotExist that we can catch instead of the more general ValueError. * add unit test * lint
…g#4443) * if a newly-created cluster is missing from the cloud, wait before deleting Addresses skypilot-org#4431. * confirm cluster actually terminates before deleting from the db * avoid deleting cluster data outside the primary provision loop * tweaks * Apply suggestions from code review Co-authored-by: Zhanghao Wu <[email protected]> * use usage_intervals for new cluster detection get_cluster_duration will include the total duration of the cluster since its initial launch, while launched_at may be reset by sky launch on an existing cluster. So this is a more accurate method to check. * fix terminating/stopping state for Lambda and Paperspace * Revert "use usage_intervals for new cluster detection" This reverts commit aa6d2e9. * check cloud.STATUS_VERSION before calling query_instances * avoid try/catch when querying instances * update comments --------- Co-authored-by: Zhanghao Wu <[email protected]>
* smoke tests support storage mount only * fix verify command * rename to only_mount
tests/test_smoke.py
Outdated
@@ -1144,7 +1144,7 @@ def test_gcp_stale_job_manual_restart():
 # Ensure the skylet updated the stale job status.
 _get_cmd_wait_until_job_status_contains_without_matching_job(
 cluster_name=name,
-job_status=[JobStatus.FAILED.value],
+job_status=[JobStatus.FAILED],
For hot fixes like this, should we land the fix in master first and then cherry-pick it?
It's due to a merge conflict. The master branch's value is FAILED_DRIVER, which does not exist in version 0.7.1 but is correct in the master branch.
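The merge-conflict explanation above comes down to an enum member that exists on one branch but not the other. The sketch below is a hedged illustration: the `JobStatus` class here is a hypothetical stand-in with made-up members and string values, not the real SkyPilot enum, and `has_status` and `status_values` are helper names invented for this example.

```python
import enum


class JobStatus(enum.Enum):
    """Hypothetical stand-in for the 0.7.1 enum: it has no FAILED_DRIVER member."""
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


def has_status(name: str) -> bool:
    """Return True if this version of the enum defines a member called `name`."""
    return name in JobStatus.__members__


def status_values(statuses):
    """Convert enum members to their string values (what `.value` returns)."""
    return [status.value for status in statuses]
```

On a branch whose enum lacks `FAILED_DRIVER`, writing `JobStatus.FAILED_DRIVER` raises `AttributeError` when the test module runs, so a cherry-picked test has to fall back to a member that the release branch defines.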
Looking at the test failures (checked ones should be fine):
The following do not fail on release/0.7.0, so we should fix them:
Does this issue persist? Since Azure provisioning is relatively slow, it is possible that the test sometimes gets past the initial delay and sometimes does not. Also, I'm a little confused: why is there an expected FAILED status?
I've tried many times with no luck. The failure rate is high, even if it's flaky. Could we fix the flakiness?
After changing the region, I found that this test case needs to be run on the aws controller. If we don't have a controller running, sky launches an azure controller, which then fails due to missing aws credentials. Is this a bug? @Michaelvll
(t-managed-jobs-storage-8b, pid=2429) Traceback (most recent call last):
(t-managed-jobs-storage-8b, pid=2429) File "/home/azureuser/miniconda3/envs/skypilot-runtime/lib/python3.10/runpy.py", line 196, in _run_module_as_main
(t-managed-jobs-storage-8b, pid=2429) return _run_code(code, main_globals, None,
(t-managed-jobs-storage-8b, pid=2429) File "/home/azureuser/miniconda3/envs/skypilot-runtime/lib/python3.10/runpy.py", line 86, in _run_code
(t-managed-jobs-storage-8b, pid=2429) exec(code, run_globals)
(t-managed-jobs-storage-8b, pid=2429) File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/jobs/controller.py", line 583, in <module>
(t-managed-jobs-storage-8b, pid=2429) start(args.job_id, args.dag_yaml, args.retry_until_up)
(t-managed-jobs-storage-8b, pid=2429) File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/jobs/controller.py", line 541, in start
(t-managed-jobs-storage-8b, pid=2429) _cleanup(job_id, dag_yaml=dag_yaml)
(t-managed-jobs-storage-8b, pid=2429) File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/jobs/controller.py", line 480, in _cleanup
(t-managed-jobs-storage-8b, pid=2429) dag, _ = _get_dag_and_name(dag_yaml)
(t-managed-jobs-storage-8b, pid=2429) File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/jobs/controller.py", line 40, in _get_dag_and_name
(t-managed-jobs-storage-8b, pid=2429) dag = dag_utils.load_chain_dag_from_yaml(dag_yaml)
(t-managed-jobs-storage-8b, pid=2429) File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/utils/dag_utils.py", line 101, in load_chain_dag_from_yaml
(t-managed-jobs-storage-8b, pid=2429) task = task_lib.Task.from_yaml_config(task_config, env_overrides)
(t-managed-jobs-storage-8b, pid=2429) File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/task.py", line 438, in from_yaml_config
(t-managed-jobs-storage-8b, pid=2429) storage_obj = storage_lib.Storage.from_yaml_config(storage[1])
(t-managed-jobs-storage-8b, pid=2429) File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/data/storage.py", line 1043, in from_yaml_config
(t-managed-jobs-storage-8b, pid=2429) storage_obj = cls(name=name,
(t-managed-jobs-storage-8b, pid=2429) File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/data/storage.py", line 556, in __init__
(t-managed-jobs-storage-8b, pid=2429) self.add_store(StoreType.S3)
(t-managed-jobs-storage-8b, pid=2429) File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/data/storage.py", line 894, in add_store
(t-managed-jobs-storage-8b, pid=2429) store = store_cls(
(t-managed-jobs-storage-8b, pid=2429) File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/data/storage.py", line 1110, in __init__
(t-managed-jobs-storage-8b, pid=2429) super().__init__(name, source, region, is_sky_managed,
(t-managed-jobs-storage-8b, pid=2429) File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/data/storage.py", line 261, in __init__
(t-managed-jobs-storage-8b, pid=2429) self._validate()
(t-managed-jobs-storage-8b, pid=2429) File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/data/storage.py", line 1156, in _validate
(t-managed-jobs-storage-8b, pid=2429) raise exceptions.ResourcesUnavailableError(
(t-managed-jobs-storage-8b, pid=2429) sky.exceptions.ResourcesUnavailableError: Storage 'store: s3' specified, but AWS access is disabled. To fix, enable AWS by running `sky check`. More info: https://docs.skypilot.co/en/latest/getting-started/installation.html.
It's an aws sync error, not gcp, and it reproduces 100% of the time. @Michaelvll
> I've tried many times with no luck. The failure rate is high, even if it's flaky. Could we fix the flakiness?
Does increasing the initial delay work for you?
This is a known issue; @weih1121 is working on it. See #4512. cc @Michaelvll
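The initial-delay discussion in this thread points at a common source of flakiness: a fixed sleep that is sometimes too short for slow provisioning. One alternative is to poll until the expected status appears or a deadline passes. The sketch below is an assumption-laden illustration; `wait_until_status` and its parameters are hypothetical names and are not the helper used in `tests/test_smoke.py`.

```python
import time
from typing import Callable, Iterable


def wait_until_status(get_status: Callable[[], str],
                      expected: Iterable[str],
                      timeout: float = 600.0,
                      interval: float = 5.0,
                      sleep: Callable[[float], None] = time.sleep,
                      clock: Callable[[], float] = time.monotonic) -> str:
    """Poll get_status() until it returns one of `expected` or `timeout` elapses.

    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    expected_set = set(expected)
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in expected_set:
            return status
        if clock() >= deadline:
            raise TimeoutError(
                f"status {status!r} not in {sorted(expected_set)} "
                f"after {timeout}s")
        sleep(interval)
```

A caller would pass a function that queries the job status and a list such as `["FAILED"]`; slow clouds then simply take more polling rounds instead of missing a fixed window.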
Based on releases/0.7.0; cherry-picks all commits from 0.7.1.
With some manual changes to smoke_tests.py to ensure more smoke tests pass and Buildkite works.
The release should include version 0.7.1 along with the manual changes.
The code used to run the tests below includes version 0.7.1 along with the manual changes.
Smoke tests:
Use Buildkite CI to run the following tests:
pytest tests/test_smoke.py --aws
pytest tests/test_smoke.py --gcp
pytest tests/test_smoke.py --azure
pytest tests/test_smoke.py --kubernetes
All pass except the following failures:
You can view them by clicking the failure in Buildkite:
Manual tests:
Open docs/build/index.html and scroll over “CLI Reference” (ideally, every page) to see if there are missing sections (we once caught the CLI page completely missing due to an import error, and once it had weird blockquotes displayed)
sky -v
backward_compatibility_tests.sh run against 0.7.0 on aws, run by Buildkite
sky launch --num-nodes=75 -c dbg --cpus 2+ --use-spot --down --cloud aws -y
sky show-gpus manual tests
Run a 24-hour+ spot job and ensure it doesn’t OOM:
sky spot launch -n test-oom --cloud aws --cpus 2 sleep 1000000000000000