fix smoke tests bug #4492

Open · wants to merge 13 commits into master
Conversation

@zpoint (Collaborator) commented Dec 20, 2024

Some fixes for #4438.

Refer to the release comment here and some remaining problems after it was resolved.

Change the region for:

 pytest tests/test_smoke.py::test_managed_jobs_storage --azure
 pytest tests/test_smoke.py::test_azure_best_tier_failover --azure
 pytest tests/test_smoke.py::test_azure_disk_tier --azure

And make backward compatibility more robust.

Add a gke mark for pytest and modify generate_pipeline.py. The basic idea is to separate tests marked with gke to run on the GKE cluster, while other Kubernetes tests can run on the local cluster, without relying on GKE for Buildkite.

Remove the me-central2 region test, since it is restricted and our account doesn't have permission.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@zpoint commented Dec 20, 2024

/quicktest-core

@zpoint commented Dec 20, 2024

/quicktest-core

@zpoint commented Dec 25, 2024

/quicktest-core

@zpoint commented Dec 25, 2024

/quicktest-core

@zpoint zpoint modified the milestones: 100 jobs, v0.7.1 Dec 27, 2024
@zpoint zpoint requested review from cg505 and Michaelvll December 27, 2024 03:44
@@ -73,6 +73,7 @@
~/google-cloud-sdk/install.sh -q >> {_GCLOUD_INSTALLATION_LOG} 2>&1 && \
echo "source ~/google-cloud-sdk/path.bash.inc > /dev/null 2>&1" >> ~/.bashrc && \
source ~/google-cloud-sdk/path.bash.inc >> {_GCLOUD_INSTALLATION_LOG} 2>&1; }}; }} && \
export GOOGLE_APPLICATION_CREDENTIALS=~/.config/gcloud/application_default_credentials.json && \
@zpoint (Collaborator, Author):
This can fix the credential issue for #4512, cc @weih1121.

Failing test: pytest tests/smoke_tests/test_mount_and_storage.py::test_azure_storage_mounts_with_stop --azure

Collaborator:

Have we tested this without using a service account, e.g., with a normal GCP authentication locally or with gcp.remote_identity: SERVICE_ACCOUNT?

We should not assume that this file exists on the remote.

@zpoint (Collaborator, Author) commented Dec 31, 2024:

  1. If the file ~/.config/gcloud/application_default_credentials.json does not exist, or it exists but is invalid, sky launch will fail because GCP is not configured:
+ touch ~/.ssh/id_rsa.pub
+ sky launch -y -c t-azure-storage-mount-b8-31 --cloud azure /var/folders/83/zxqx914s57x310rfnhq8kk9r0000gn/T/tmpy0nu9vj0.yaml
D 12-31 14:55:23 skypilot_config.py:228] Using config path: /Users/zepingguo/.sky/config.yaml
D 12-31 14:55:23 skypilot_config.py:233] Config loaded:
D 12-31 14:55:23 skypilot_config.py:233] {}
D 12-31 14:55:23 skypilot_config.py:245] Config syntax check passed.
Task from YAML spec: /var/folders/83/zxqx914s57x310rfnhq8kk9r0000gn/T/tmpy0nu9vj0.yaml
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 1 of 3. Reason: timed out
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 2 of 3. Reason: timed out
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 3 of 3. Reason: timed out
WARNING:google.auth._default:Authentication failed using Compute Engine authentication due to unavailable metadata server.
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 1 of 3. Reason: timed out
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 2 of 3. Reason: timed out
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 3 of 3. Reason: timed out
WARNING:google.auth._default:Authentication failed using Compute Engine authentication due to unavailable metadata server.
Traceback (most recent call last):
  File "/Users/zepingguo/miniconda3/envs/sky/bin/sky", line 8, in <module>
    sys.exit(cli())
  File "/Users/zepingguo/.local/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/Users/zepingguo/.local/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/Users/zepingguo/Desktop/skypilot/sky/utils/common_utils.py", line 366, in _record
    return f(*args, **kwargs)
  File "/Users/zepingguo/Desktop/skypilot/sky/cli.py", line 838, in invoke
    return super().invoke(ctx)
  File "/Users/zepingguo/.local/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/zepingguo/.local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/zepingguo/.local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/Users/zepingguo/Desktop/skypilot/sky/utils/common_utils.py", line 386, in _record
    return f(*args, **kwargs)
  File "/Users/zepingguo/Desktop/skypilot/sky/cli.py", line 1118, in launch
    task_or_dag = _make_task_or_dag_from_entrypoint_with_overrides(
  File "/Users/zepingguo/Desktop/skypilot/sky/cli.py", line 790, in _make_task_or_dag_from_entrypoint_with_overrides
    dag = dag_utils.load_chain_dag_from_yaml(entrypoint, env_overrides=env)
  File "/Users/zepingguo/Desktop/skypilot/sky/utils/dag_utils.py", line 101, in load_chain_dag_from_yaml
    task = task_lib.Task.from_yaml_config(task_config, env_overrides)
  File "/Users/zepingguo/Desktop/skypilot/sky/task.py", line 438, in from_yaml_config
    storage_obj = storage_lib.Storage.from_yaml_config(storage[1])
  File "/Users/zepingguo/Desktop/skypilot/sky/data/storage.py", line 1045, in from_yaml_config
    storage_obj = cls(name=name,
  File "/Users/zepingguo/Desktop/skypilot/sky/data/storage.py", line 560, in __init__
    self.add_store(StoreType.GCS)
  File "/Users/zepingguo/Desktop/skypilot/sky/data/storage.py", line 896, in add_store
    store = store_cls(
  File "/Users/zepingguo/Desktop/skypilot/sky/data/storage.py", line 1551, in __init__
    super().__init__(name, source, region, is_sky_managed,
  File "/Users/zepingguo/Desktop/skypilot/sky/data/storage.py", line 263, in __init__
    self._validate()
  File "/Users/zepingguo/Desktop/skypilot/sky/data/storage.py", line 1596, in _validate
    raise exceptions.ResourcesUnavailableError(
sky.exceptions.ResourcesUnavailableError: Storage 'store: gcs' specified, but GCP access is disabled.
  2. If the file ~/.config/gcloud/application_default_credentials.json is configured with a personal account and is valid, the launch succeeds:
(sky) ➜  skypilot git:(dev/zeping/fix_smoke_tests) pytest tests/smoke_tests/test_mount_and_storage.py::test_azure_storage_mounts_with_stop --azure
D 12-31 15:02:46 skypilot_config.py:228] Using config path: /Users/zepingguo/.sky/config.yaml
D 12-31 15:02:46 skypilot_config.py:233] Config loaded:
D 12-31 15:02:46 skypilot_config.py:233] {}
D 12-31 15:02:46 skypilot_config.py:245] Config syntax check passed.
bringing up nodes...
[azure_storage_mounts] Test started. Log: less /var/folders/83/zxqx914s57x310rfnhq8kk9r0000gn/T/azure_storage_mounts-s5h1dawv.log
[azure_storage_mounts] Passed.
[azure_storage_mounts] Log: less /var/folders/83/zxqx914s57x310rfnhq8kk9r0000gn/T/azure_storage_mounts-s5h1dawv.log
[azure_storage_mounts] 
.
1 passed, 272 warnings in 468.11s (0:07:48)
(sky) ➜  skypilot git:(dev/zeping/fix_smoke_tests)
  3. If the file ~/.config/gcloud/application_default_credentials.json is configured with a service account and is valid, it also succeeds.

We don't assume this file exists. If the file does not exist, the credential check fails as before. If the file exists, the credential check may succeed.

@zpoint (Collaborator, Author):

Updated the command to export GOOGLE_APPLICATION_CREDENTIALS only if it is not already set and the file exists.
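A minimal sketch of that guard (the helper name maybe_export_adc is hypothetical, for illustration; the actual template line in the PR may be written differently):

```shell
#!/usr/bin/env bash
# Sketch: export GOOGLE_APPLICATION_CREDENTIALS only when it is not already
# set AND the application-default-credentials file actually exists.
maybe_export_adc() {
  local creds="${1:-$HOME/.config/gcloud/application_default_credentials.json}"
  if [ -z "${GOOGLE_APPLICATION_CREDENTIALS:-}" ] && [ -f "$creds" ]; then
    export GOOGLE_APPLICATION_CREDENTIALS="$creds"
  fi
}

# Demo with a temporary file standing in for the real credentials path.
tmp=$(mktemp)
unset GOOGLE_APPLICATION_CREDENTIALS
maybe_export_adc "$tmp"
result_with_file="${GOOGLE_APPLICATION_CREDENTIALS:-unset}"

unset GOOGLE_APPLICATION_CREDENTIALS
rm -f "$tmp"
maybe_export_adc "$tmp"
result_without_file="${GOOGLE_APPLICATION_CREDENTIALS:-unset}"

echo "with file:    $result_with_file"
echo "without file: $result_without_file"
```

This way a missing file never overrides an identity that is already configured (e.g. a service account set up by the cloud provider).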

@Michaelvll (Collaborator) left a comment:

Thanks @zpoint! Please see the comments below

@@ -566,7 +566,7 @@ def test_kubernetes_context_failover():
'kubectl get namespaces --context kind-skypilot | grep test-namespace || '
'{ echo "Should set the namespace to test-namespace for kind-skypilot. Check the instructions in '
'tests/test_smoke.py::test_kubernetes_context_failover." && exit 1; }',
'sky show-gpus --cloud kubernetes --region kind-skypilot | grep H100 | grep "1, 2, 3, 4, 5, 6, 7, 8"',
'sky show-gpus --cloud kubernetes --region kind-skypilot | grep H100 | grep "1, 2, 4, 8"',
Collaborator:
Why do we make this change here? In k8s, do we show all the different numbers of GPUs? @romilbhardwaj

@zpoint (Collaborator, Author) commented Dec 31, 2024:

This is what I executed on an AWS t3.2xlarge instance:

sky local up
kubectl label node --overwrite skypilot-control-plane skypilot.co/accelerator=h100 --context kind-skypilot

kubectl proxy &

sleep 5

curl --header "Content-Type: application/json-patch+json" \
  --request PATCH \
  --data '[{"op": "add", "path": "/status/capacity/nvidia.com~1gpu", "value": "8"}]' \
  http://localhost:8001/api/v1/nodes/skypilot-control-plane/status

kubectl create namespace test-namespace --context kind-skypilot
kubectl config set-context kind-skypilot --namespace=test-namespace --context kind-skypilot


kind create cluster --name kind-cluster-2
kubectl config use-context kind-skypilot

sky check

But the sky show-gpus --cloud kubernetes --region kind-skypilot only shows 1, 2, 4, 8

@@ -556,6 +556,7 @@ def test_tpu_vm_pod():


# ---------- TPU Pod Slice on GKE. ----------
@pytest.mark.gke
Collaborator:
What does it mean to have two marks on a single test? With our current handling of marks, a test function marked with a specific cloud can only be run with --cloud-name. How can we trigger this test now?

@zpoint (Collaborator, Author):

This is not used directly by pytest. Some test cases can run on Kubernetes after executing sky local up.

However, some tests require a real TPU cluster or GKE resources. I need to schedule these test cases to run on the real GKE cluster instead of the local one, and I use this marker to distinguish them on Buildkite.

On Buildkite, some tests will run on the Kubernetes cluster set up by sky local up, while others will run on the GKE cluster. (I was hoping all tests could run on the local Kubernetes cluster, but it seems that's not possible.)
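The routing idea above can be sketched as plain shell logic (test names and the name:marks encoding here are hypothetical; the real splitting lives in generate_pipeline.py):

```shell
#!/usr/bin/env bash
# Sketch: split tests into a GKE queue and a local-cluster queue based on
# whether they carry a "gke" mark.
tests="test_tpu_pod_slice_gke:gke test_kubernetes_context_failover: test_smoke_basic:"

gke_queue=""
local_queue=""
for entry in $tests; do
  name="${entry%%:*}"
  marks="${entry#*:}"
  if [ "$marks" = "gke" ]; then
    gke_queue="$gke_queue $name"      # schedule on the real GKE cluster
  else
    local_queue="$local_queue $name"  # schedule on the `sky local up` cluster
  fi
done

echo "GKE queue:   ${gke_queue# }"
echo "local queue: ${local_queue# }"
```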
