-
Notifications
You must be signed in to change notification settings - Fork 537
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Storage] Continue storage deletion in sky storage delete
when some fail
#4454
base: master
Are you sure you want to change the base?
Conversation
@cblmemo PTAL, thanks! |
I inserted five new records into the 'global_user_state' database: three records containing the previous storage handle value, and two records with null handle values. andyl@DESKTOP-7FP6SMO ~/skypilot (fix-storage-recover) [1]> sky storage dele
te -a
Deleting 6 storages: skypilot-workdir-andyl-74d418df, test-storage-1, test-storage-2, test-storage-3, test-storage-4, test-storage-5. Proceed? [Y/n]:
Error deleting storage test-storage-3: In global_user_state, storage name 'test-storage-3' does not match handle.storage_name 'skypilot-workdir-andyl-577894a6'
Error deleting storage test-storage-2: In global_user_state, storage name 'test-storage-2' does not match handle.storage_name 'skypilot-workdir-andyl-577894a6'
Error deleting storage test-storage-4: Storage name 'test-storage-4' not found.
Error deleting storage test-storage-1: In global_user_state, storage name 'test-storage-1' does not match handle.storage_name 'skypilot-workdir-andyl-577894a6'
Error deleting storage test-storage-5: Storage name 'test-storage-5' not found.
Deleted GCS bucket skypilot-workdir-andyl-74d418df. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the prompt fix @andylizf ! left some discussions.
""" | ||
# Reference: https://stackoverflow.com/questions/25790279/python-multiprocessing-early-termination # pylint: disable=line-too-long |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not removing this reference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean why remove the reference? I think it's trivial now, and the reference doesn't add much value to the code.
sky/utils/subprocess_utils.py
Outdated
def run_in_parallel(func: Callable, | ||
args: Iterable[Any], | ||
num_threads: Optional[int] = None) -> List[Any]: | ||
num_threads: Optional[int] = None, | ||
continue_on_error: bool = False) -> List[Any]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adding this seems change the return value type from Value
to Union[Value, Exception]
, which looks a little bit strange to me. Why do we need this anyway? aren't we already add the try-except in the func?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. I thought about not adding a function to catch and log these exceptions, but instead handle all exceptions in run_in_parallel
. Anyway, it made sense to have this function have an argument like continue_on_error
.
But then I realized it violated the abstraction concept and would make it harder to print custom logs. So I ended up wrapping the function and logging it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @andylizf ! Mostly looks good to me. After those nits it should be ready to go!
sky/cli.py
Outdated
except (exceptions.StorageBucketDeleteError, PermissionError) as e: | ||
click.secho(f'Error deleting storage {name}: {e}', fg='red') | ||
except ValueError as e: | ||
click.secho(f'Error deleting storage {name}: {e}', fg='red') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
except (exceptions.StorageBucketDeleteError, PermissionError) as e: | |
click.secho(f'Error deleting storage {name}: {e}', fg='red') | |
except ValueError as e: | |
click.secho(f'Error deleting storage {name}: {e}', fg='red') | |
except (exceptions.StorageBucketDeleteError, PermissionError, ValueError) as e: | |
click.secho(f'Error deleting storage {name}: {e}', fg='red') |
can we merge them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, seems like here is one exception that is not being caught. After a second thought, for simplicity, we can probably just do broad except here..
(skypilot) (base) ➜ skypilot git:(remove_token) ✗ sky storage delete -a
Deleting 15 storages: skypilot-workdir-windsey-afa32d67, skypilot-workdir-windsey-81d2115f, skypilot-workdir-windsey-b8fd0382, skypilot-workdir-windsey-499178bb, skypilot-workdir-windsey-dd64cc00, skypilot-workdir-windsey-d04e17a8, skypilot-workdir-windsey-43925bf5, skypilot-workdir-windsey-3375424a, skypilot-workdir-windsey-5e824f17, skypilot-workdir-windsey-db755382, skypilot-workdir-windsey-25270ec6, skypilot-workdir-windsey-1cb4689a, skypilot-workdir-windsey-f40df433, skypilot-workdir-windsey-2ef8a0bf, skypilot-workdir-windsey-58878e24. Proceed? [Y/n]:
No backing stores found. Deleting storage.
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadBucket operation: Forbidden
The above exception was the direct cause of the following exception:
sky.exceptions.StorageBucketGetError: Failed to access existing bucket 'skypilot-workdir-windsey-afa32d67'. This is likely because it is a private bucket you do not have access to.
To fix:
1. If you are trying to create a new bucket: use a different name.
2. If you are trying to connect to an existing bucket: make sure your cloud credentials have access to it. To debug, consider running `aws s3 ls skypilot-workdir-windsey-afa32d67`.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah... I agree
sky/cli.py
Outdated
except Exception as e: # pylint: disable=broad-except | ||
with ux_utils.print_exception_no_traceback(): | ||
raise e | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
except Exception as e: # pylint: disable=broad-except | |
with ux_utils.print_exception_no_traceback(): | |
raise e |
It seems like we dont need it? We can simply not except
it. And any exceptions raised by sky.storage_delete
should already have the surpress traceback context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why they should have the traceback context supressed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cuz this is a user-facing API and we want to keep the error message simple. The content in the error message should be self-explanatory.
sky storage delete
when some fail
Co-authored-by: Tian Xia <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @andylizf ! LGTM. Please lmk when you pass the test and it should be ready to go!
Fix #4050.
Make
sky storage delete
continue deleting remaining storages when some fail, instead of aborting the entire operation. Failed deletions are reported individually.We Implemented this feature via adding a
continue_on_error
parameter inrun_in_parallel
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh