-
Notifications
You must be signed in to change notification settings - Fork 537
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Tests] fix smoke test race condition on first run #4494
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Aylei <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @aylei , The code looks good
I’m trying to test it with the following commands:
profile_name=skypilot-v1
aws iam remove-role-from-instance-profile --instance-profile-name "$profile_name" --role-name "$profile_name"
aws iam delete-instance-profile --instance-profile-name "$profile_name"
# Double check its deleted
aws iam get-instance-profile --instance-profile-name "$profile_name"
# Additional tests can be added for easier reproduction
pytest tests/test_smoke.py::test_aws_stale_job_manual_restart tests/test_smoke.py::test_minimal --aws
This seems flaky; it might be because profile.add_role
also needs to address the concurrency issue. Could u also help fix this? @aylei
D 12-27 11:51:18 provisioner.py:151] Failed to provision 't-aws-stale-job-manua-1d-86' on AWS (us-east-2a,us-east-2b,us-east-2c).
D 12-27 11:51:18 provisioner.py:153] bulk_provision for 't-aws-stale-job-manua-1d-86' failed. Stacktrace:
D 12-27 11:51:18 provisioner.py:153] Traceback (most recent call last):
D 12-27 11:51:18 provisioner.py:153] File "/Users/zepingguo/Desktop/skypilot/sky/provision/provisioner.py", line 142, in bulk_provision
D 12-27 11:51:18 provisioner.py:153] return _bulk_provision(cloud, region, cluster_name,
D 12-27 11:51:18 provisioner.py:153] File "/Users/zepingguo/Desktop/skypilot/sky/provision/provisioner.py", line 60, in _bulk_provision
D 12-27 11:51:18 provisioner.py:153] config = provision.bootstrap_instances(provider_name, region_name,
D 12-27 11:51:18 provisioner.py:153] File "/Users/zepingguo/Desktop/skypilot/sky/provision/__init__.py", line 52, in _wrapper
D 12-27 11:51:18 provisioner.py:153] return impl(*args, **kwargs)
D 12-27 11:51:18 provisioner.py:153] File "/Users/zepingguo/Desktop/skypilot/sky/provision/aws/config.py", line 71, in bootstrap_instances
D 12-27 11:51:18 provisioner.py:153] node_cfg['IamInstanceProfile'] = _configure_iam_role(iam)
D 12-27 11:51:18 provisioner.py:153] File "/Users/zepingguo/Desktop/skypilot/sky/provision/aws/config.py", line 254, in _configure_iam_role
D 12-27 11:51:18 provisioner.py:153] profile.add_role(RoleName=role.name)
D 12-27 11:51:18 provisioner.py:153] File "/Users/zepingguo/miniconda3/envs/sky/lib/python3.10/site-packages/boto3/resources/factory.py", line 581, in do_action
D 12-27 11:51:18 provisioner.py:153] response = action(self, *args, **kwargs)
D 12-27 11:51:18 provisioner.py:153] File "/Users/zepingguo/miniconda3/envs/sky/lib/python3.10/site-packages/boto3/resources/action.py", line 88, in __call__
D 12-27 11:51:18 provisioner.py:153] response = getattr(parent.meta.client, operation_name)(*args, **params)
D 12-27 11:51:18 provisioner.py:153] File "/Users/zepingguo/miniconda3/envs/sky/lib/python3.10/site-packages/botocore/client.py", line 569, in _api_call
D 12-27 11:51:18 provisioner.py:153] return self._make_api_call(operation_name, kwargs)
D 12-27 11:51:18 provisioner.py:153] File "/Users/zepingguo/miniconda3/envs/sky/lib/python3.10/site-packages/botocore/client.py", line 1023, in _make_api_call
D 12-27 11:51:18 provisioner.py:153] raise error_class(parsed_response, operation_name)
D 12-27 11:51:18 provisioner.py:153] botocore.errorfactory.LimitExceededException: An error occurred (LimitExceeded) when calling the AddRoleToInstanceProfile operation: Cannot exceed quota for InstanceSessionsPerInstanceProfile: 1
D 12-27 11:51:18 provisioner.py:153]
D 12-27 11:51:18 provisioner.py:158] Terminating the failed cluster.
Signed-off-by: Aylei <[email protected]>
@zpoint Thanks! Looks like
Besides, this quota is not allowed to increase according to https://docs.aws.amazon.com/IAM/latest/APIReference/API_AddRoleToInstanceProfile.html I added a quick fix in the PR, PTAL❤️ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @aylei
LGTM
sky/provision/aws/config.py
Outdated
profile = _get_instance_profile(instance_profile_name) | ||
time.sleep(15) # wait for propagation | ||
assert profile is not None, 'Failed to create instance profile' | ||
|
||
# TODO(aylei): check if the role associated is the one we expect. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You already checked in _ensure_instance_profile_role
. Can we remove the TODO
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check in _ensure_instance_profile_role
only occurs when race happens, i.e. profile.roles
is empty and the subsequent adding role action encounters LimitExceeded
error. We still need to check the role in other cases.
I found this TODO is trivial, just added the missing part and removed the TODO.
Signed-off-by: Aylei <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @aylei
sky/provision/aws/config.py
Outdated
def _check_instance_profile_role(profile: Any, role_name: str): | ||
if profile.roles and profile.roles[0].name != role_name: | ||
# If the associated role is not the role we expect, error out | ||
# to user instead of silently overriding the role. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need the comment here? The logger.fatal
statement below is quite clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, removed the log for simplicity~
Signed-off-by: Aylei <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for fixing this @aylei! It looks good to me.
f'role {role_name} is not the same as the expected ' | ||
f'role. Please remove the existing role from the ' | ||
f'instance profile and try again.') | ||
raise SystemExit(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we use the same handle_boto_error
here as well to keep the error handling consistent and raise the same exc
error?
@@ -150,6 +150,64 @@ def _get_role(role_name: str): | |||
f'{role_name}{colorama.Style.RESET_ALL} from AWS.') | |||
raise exc | |||
|
|||
def _create_instance_profile_if_not_exists(instance_profile_name: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: same for the functions below
def _create_instance_profile_if_not_exists(instance_profile_name: str): | |
def _create_instance_profile_if_not_exists(instance_profile_name: str) -> None: |
Fix the following error when I ran
pytest tests/test_smoke.py
on my laptop, which cause test failure:Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_aws_stale_job_manual_restart
conda deactivate; bash -i tests/backward_compatibility_tests.sh