-
Notifications
You must be signed in to change notification settings - Fork 537
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix smoke tests bug #4492
base: master
Are you sure you want to change the base?
fix smoke tests bug #4492
Conversation
/quicktest-core |
/quicktest-core |
1 similar comment
/quicktest-core |
/quicktest-core |
sky/clouds/gcp.py
Outdated
@@ -73,6 +73,7 @@ | |||
~/google-cloud-sdk/install.sh -q >> {_GCLOUD_INSTALLATION_LOG} 2>&1 && \ | |||
echo "source ~/google-cloud-sdk/path.bash.inc > /dev/null 2>&1" >> ~/.bashrc && \ | |||
source ~/google-cloud-sdk/path.bash.inc >> {_GCLOUD_INSTALLATION_LOG} 2>&1; }}; }} && \ | |||
export GOOGLE_APPLICATION_CREDENTIALS=~/.config/gcloud/application_default_credentials.json && \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have we tested this without using a service account, e.g., with a normal GCP authentication locally or with gcp.remote_identity: SERVICE_ACCOUNT
?
We should not assume that this file exists on the remote.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- If the file
~/.config/gcloud/application_default_credentials.json
does not exist, or it exist but invalud, the sky launch will fail because GCP is not configured.
+ touch ~/.ssh/id_rsa.pub
+ sky launch -y -c t-azure-storage-mount-b8-31 --cloud azure /var/folders/83/zxqx914s57x310rfnhq8kk9r0000gn/T/tmpy0nu9vj0.yaml
D 12-31 14:55:23 skypilot_config.py:228] Using config path: /Users/zepingguo/.sky/config.yaml
D 12-31 14:55:23 skypilot_config.py:233] Config loaded:
D 12-31 14:55:23 skypilot_config.py:233] {}
D 12-31 14:55:23 skypilot_config.py:245] Config syntax check passed.
Task from YAML spec: /var/folders/83/zxqx914s57x310rfnhq8kk9r0000gn/T/tmpy0nu9vj0.yaml
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 1 of 3. Reason: timed out
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 2 of 3. Reason: timed out
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 3 of 3. Reason: timed out
WARNING:google.auth._default:Authentication failed using Compute Engine authentication due to unavailable metadata server.
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 1 of 3. Reason: timed out
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 2 of 3. Reason: timed out
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 3 of 3. Reason: timed out
WARNING:google.auth._default:Authentication failed using Compute Engine authentication due to unavailable metadata server.
Traceback (most recent call last):
File "/Users/zepingguo/miniconda3/envs/sky/bin/sky", line 8, in <module>
sys.exit(cli())
File "/Users/zepingguo/.local/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/Users/zepingguo/.local/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/Users/zepingguo/Desktop/skypilot/sky/utils/common_utils.py", line 366, in _record
return f(*args, **kwargs)
File "/Users/zepingguo/Desktop/skypilot/sky/cli.py", line 838, in invoke
return super().invoke(ctx)
File "/Users/zepingguo/.local/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/zepingguo/.local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/zepingguo/.local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/Users/zepingguo/Desktop/skypilot/sky/utils/common_utils.py", line 386, in _record
return f(*args, **kwargs)
File "/Users/zepingguo/Desktop/skypilot/sky/cli.py", line 1118, in launch
task_or_dag = _make_task_or_dag_from_entrypoint_with_overrides(
File "/Users/zepingguo/Desktop/skypilot/sky/cli.py", line 790, in _make_task_or_dag_from_entrypoint_with_overrides
dag = dag_utils.load_chain_dag_from_yaml(entrypoint, env_overrides=env)
File "/Users/zepingguo/Desktop/skypilot/sky/utils/dag_utils.py", line 101, in load_chain_dag_from_yaml
task = task_lib.Task.from_yaml_config(task_config, env_overrides)
File "/Users/zepingguo/Desktop/skypilot/sky/task.py", line 438, in from_yaml_config
storage_obj = storage_lib.Storage.from_yaml_config(storage[1])
File "/Users/zepingguo/Desktop/skypilot/sky/data/storage.py", line 1045, in from_yaml_config
storage_obj = cls(name=name,
File "/Users/zepingguo/Desktop/skypilot/sky/data/storage.py", line 560, in __init__
self.add_store(StoreType.GCS)
File "/Users/zepingguo/Desktop/skypilot/sky/data/storage.py", line 896, in add_store
store = store_cls(
File "/Users/zepingguo/Desktop/skypilot/sky/data/storage.py", line 1551, in __init__
super().__init__(name, source, region, is_sky_managed,
File "/Users/zepingguo/Desktop/skypilot/sky/data/storage.py", line 263, in __init__
self._validate()
File "/Users/zepingguo/Desktop/skypilot/sky/data/storage.py", line 1596, in _validate
raise exceptions.ResourcesUnavailableError(
sky.exceptions.ResourcesUnavailableError: Storage 'store: gcs' specified, but GCP access is disabled.
- If file
~/.config/gcloud/application_default_credentials.json
is configured as personal account and valid, it will be successful
(sky) ➜ skypilot git:(dev/zeping/fix_smoke_tests) pytest tests/smoke_tests/test_mount_and_storage.py::test_azure_storage_mounts_with_stop --azure
D 12-31 15:02:46 skypilot_config.py:228] Using config path: /Users/zepingguo/.sky/config.yaml
D 12-31 15:02:46 skypilot_config.py:233] Config loaded:
D 12-31 15:02:46 skypilot_config.py:233] {}
D 12-31 15:02:46 skypilot_config.py:245] Config syntax check passed.
bringing up nodes...
[azure_storage_mounts] Test started. Log: less /var/folders/83/zxqx914s57x310rfnhq8kk9r0000gn/T/azure_storage_mounts-s5h1dawv.log
[azure_storage_mounts] Passed.
[azure_storage_mounts] Log: less /var/folders/83/zxqx914s57x310rfnhq8kk9r0000gn/T/azure_storage_mounts-s5h1dawv.log
[azure_storage_mounts]
.
1 passed, 272 warnings in 468.11s (0:07:48)
(sky) ➜ skypilot git:(dev/zeping/fix_smoke_tests)
- If file
~/.config/gcloud/application_default_credentials.json
is configured as service account and valid, it will also be successful
We don't assume this file exists. If the file does not exist, the credential may fail as usual. If the file exists, the credential may succeed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now, update the command to export GOOGLE_APPLICATION_CREDENTIALS
only if it is not set and the file exists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @zpoint! Please see the comments below
sky/clouds/gcp.py
Outdated
@@ -73,6 +73,7 @@ | |||
~/google-cloud-sdk/install.sh -q >> {_GCLOUD_INSTALLATION_LOG} 2>&1 && \ | |||
echo "source ~/google-cloud-sdk/path.bash.inc > /dev/null 2>&1" >> ~/.bashrc && \ | |||
source ~/google-cloud-sdk/path.bash.inc >> {_GCLOUD_INSTALLATION_LOG} 2>&1; }}; }} && \ | |||
export GOOGLE_APPLICATION_CREDENTIALS=~/.config/gcloud/application_default_credentials.json && \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have we tested this without using a service account, e.g., with a normal GCP authentication locally or with gcp.remote_identity: SERVICE_ACCOUNT
?
We should not assume that this file exists on the remote.
@@ -566,7 +566,7 @@ def test_kubernetes_context_failover(): | |||
'kubectl get namespaces --context kind-skypilot | grep test-namespace || ' | |||
'{ echo "Should set the namespace to test-namespace for kind-skypilot. Check the instructions in ' | |||
'tests/test_smoke.py::test_kubernetes_context_failover." && exit 1; }', | |||
'sky show-gpus --cloud kubernetes --region kind-skypilot | grep H100 | grep "1, 2, 3, 4, 5, 6, 7, 8"', | |||
'sky show-gpus --cloud kubernetes --region kind-skypilot | grep H100 | grep "1, 2, 4, 8"', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we make the change here? In k8s, we do all different number of GPUs? @romilbhardwaj
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what I execute on aws t3.2xlarge
instance:
sky local up
kubectl label node --overwrite skypilot-control-plane skypilot.co/accelerator=h100 --context kind-skypilot
kubectl proxy &
sleep 5
curl --header "Content-Type: application/json-patch+json" \
--request PATCH \
--data '[{"op": "add", "path": "/status/capacity/nvidia.com~1gpu", "value": "8"}]' \
http://localhost:8001/api/v1/nodes/skypilot-control-plane/status
kubectl create namespace test-namespace --context kind-skypilot
kubectl config set-context kind-skypilot --namespace=test-namespace --context kind-skypilot
kind create cluster --name kind-cluster-2
kubectl config use-context kind-skypilot
sky check
But the sky show-gpus --cloud kubernetes --region kind-skypilot
only shows 1, 2, 4, 8
@@ -556,6 +556,7 @@ def test_tpu_vm_pod(): | |||
|
|||
|
|||
# ---------- TPU Pod Slice on GKE. ---------- | |||
@pytest.mark.gke |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does it mean by having two marks for a single test? Our current handling of the marks means that if a test function is marked with a specific cloud, it can only be run with --cloud-name
. How can we trigger this test now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not used directly by pytest. Some test cases can run on Kubernetes after executing sky local up
.
However, some tests require a real TPU cluster or GKE resources. I need to schedule these test cases to run on the real GKE cluster instead of the local one. I use this marker to distinguish them on buildkite.
On buildkite, some tests will run on the kubernetes cluster set up by sky local up
, while others will run on the GKE cluster. (I was hoping all tests could run on the local Kubernetes cluster, but it seems that’s not possible.)
Some fix for #4438
Refer to release comment here and some remaining problem after resolved
Change regoin for:
And robust backward compatibility
Add a
gke
mark for pytest and modifygenerate_pipeline.py
. The basic idea is to separate tests marked withgke
to run on the GKE cluster, while other Kubernetes tests can run on the local cluster, without relying on GKE for buildkite.Remove the
me-central2
region test since its restricted and our account don't have permissionTested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh