Dask cluster issues (scheduler?) #11
OK, I've replicated the issue of the cluster starting to slow down, and I've read more on worker memory management, which seems to be the culprit. I ran a long-running job processing 5 HLS tiles concurrently out of 190 tiles total (WA 2015-2019). After 46 tiles were completed, the job stalled when a worker stopped. I've attached the logs of the worker (port 44725) and the scheduler: logs_ tls___10.244.13.8_44725.txt. I've also attached screenshots of the cluster dashboards after it froze.
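For reference, a minimal sketch of pulling the same diagnostics programmatically while the cluster is in that frozen state, assuming the client can still reach the scheduler (the address below is a placeholder):

```python
from dask.distributed import Client

client = Client("tls://<scheduler-address>:8786")  # placeholder address

# Logs from every worker, keyed by worker address (e.g. tls://10.244.13.8:44725)
worker_logs = client.get_worker_logs()

# The scheduler's view of each worker: memory, executing task count, etc.
info = client.scheduler_info()
for addr, worker in info["workers"].items():
    metrics = worker.get("metrics", {})
    print(addr, metrics.get("memory"), metrics.get("executing"))
```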
From my (new) understanding of how memory management works, it's likely that the workers start to slow down because they start to consume more memory (two potential causes: a memory leak, or each tile job getting larger as we progress through the years and more Sentinel satellites come online). They slow down because they start to spill memory to disk (if they can); however, there are frequent logs being written from every worker of the form:
At some point the scheduler decides to kill the worker, which ends up freezing the computation, but it doesn't seem to succeed in doing so:
In the worker we see:
To fix this problem I think I need to:
I replicated this issue in this notebook.
Confirmed that a worker over 80% memory usage pauses.
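For reference, the 80% pause comes from the worker memory thresholds in the distributed config. A sketch of the relevant keys (the values shown are the documented defaults; they have to be set in the workers' environment before the workers start, not from the notebook):

```python
import dask

dask.config.set({
    # fraction of the memory limit the worker tries to stay under by spilling to disk
    "distributed.worker.memory.target": 0.60,
    # fraction of process memory at which spilling kicks in regardless
    "distributed.worker.memory.spill": 0.70,
    # fraction at which the worker pauses accepting new tasks (the 80% observed above)
    "distributed.worker.memory.pause": 0.80,
    # fraction at which the nanny kills and restarts the worker
    "distributed.worker.memory.terminate": 0.95,
})
```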
If you're able to, capturing a performance report might help with debugging. Option 1 does sound by far the easiest. Fair warning: you're running into an issue that's been a headache for dask users for years. Lots of things look like a memory leak (see this search on the distributed issue tracker, dask/dask#3530, and others). Most or all of the time it isn't actually dask / distributed's fault, but the workload ends up exhausting worker memory. Dask workers do not behave well when they're close to their memory limit (dask/distributed#2602). I'll try to take a look at the notebook later today or tomorrow to see if anything stands out.
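For completeness, a minimal sketch of capturing such a report around one of the runs; `process_tiles()` is just a hypothetical stand-in for whatever kicks off the tile jobs:

```python
from dask.distributed import performance_report

# Everything computed inside the context manager is profiled; the resulting HTML
# bundles the task stream, worker profiles and bandwidth matrix so it can be shared.
with performance_report(filename="hls-run-report.html"):
    process_tiles()  # hypothetical stand-in for the actual tile-processing call
```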
Thanks for the tips @TomAugspurger. Here is a run today that eventually hung (despite me reducing the concurrency at which I process tiles from 5 to 2, and bumping memory per worker to 8 GB). I didn't observe serious pressure on memory for any workers during this run. I'm not sure whether the cause --> effect is the worker having issues --> scheduler tries to remove it (and fails), or the scheduler tries to remove the worker --> the worker fails (but isn't properly removed). Looking at the performance report, the bandwidth matrix looks strange for the failed worker (it didn't communicate with many others), but maybe that's an artifact of when the matrix is generated? Also, the fact that the number of tasks is exactly 100k stands out (performance report). I will dig through dask issues to see if I can find anything there.
@TomAugspurger Any ideas here, especially around how to ensure a worker is properly removed when it gets disconnected from the scheduler? When the worker and scheduler get disconnected, the worker is never removed; it just hangs and is marked red on the cluster's workers page. I'm a little lost here. Even with force-restarting the cluster every N jobs, this still happens occasionally, so I assume it's just a natural networking issue when a worker gets moved to a different node in AKS or something like that. Here's the scheduler log:
and the corresponding worker log:
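In case it's useful, a sketch of two knobs that might help with workers that drop off but never get removed. The worker address is the one from the logs above and the scheduler address is a placeholder; the `worker-ttl` setting has to be applied where the scheduler starts, not from the notebook:

```python
import dask
from dask.distributed import Client

client = Client("tls://<scheduler-address>:8786")  # placeholder address

# 1. Explicitly retire the stuck worker so the scheduler stops waiting on it.
client.retire_workers(workers=["tls://10.244.13.8:44725"], close_workers=True)

# 2. Have the scheduler drop any worker whose heartbeat goes quiet for too long.
#    (This must be in the scheduler's config/environment to take effect.)
dask.config.set({"distributed.scheduler.worker-ttl": "5 minutes"})
```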
Next try: starting the workers with `--no-reconnect`, so when a worker loses its connection to the scheduler it shuts down instead of trying to reconnect.
Apologies for the delay. Hoping to take a closer look tomorrow.
This will be easiest if I’m able to reproduce it. Is there anything I wouldn’t be able to easily reproduce? Perhaps the part about `az://fia/catalogs/hls_wa_2015-2019.zarr`?
Is there an easy way for me to get sufficiently similar data from the azure open datasets storage account?
… On Jan 21, 2021, at 4:26 PM, Ben Shulman ***@***.*** wrote:
Next try: starting the workers with --no-reconnect so when it loses its connection to the scheduler it shuts down instead of trying to reconnect.
Current branch, current notebook. The zarr needed for reproducing the issue is zipped here (it's just a dataset of HLS tiles/scenes from Azure open datasets). It appears using … Some example logs from workers that got stuck at one point:
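For anyone reproducing this, a sketch of opening that catalog, assuming the zip has been extracted locally; the `az://` variant is commented out because it needs credentials for our storage account (the account name and key below are placeholders):

```python
import xarray as xr

# Local, unzipped copy of the attached catalog
ds = xr.open_zarr("hls_wa_2015-2019.zarr")

# Reading straight from the original container would need adlfs + credentials:
# import fsspec
# mapper = fsspec.get_mapper(
#     "az://fia/catalogs/hls_wa_2015-2019.zarr",
#     account_name="<storage-account>",  # placeholder
#     account_key="<key>",               # placeholder
# )
# ds = xr.open_zarr(mapper)
```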
daskhub helm values:
The Pangeo ML notebook image, which our image is based on, can be used instead.
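As an illustration, a sketch of pointing the workers at that image through Dask Gateway, assuming the daskhub deployment exposes the usual `image` / `worker_memory` cluster options (the exact option names depend on the helm values, so check `gateway.cluster_options()`):

```python
from dask_gateway import Gateway

gateway = Gateway()
options = gateway.cluster_options()
options.image = "pangeo/ml-notebook:latest"  # assumption: pick a tag matching the hub image
options.worker_memory = 8                    # GiB per worker, as in the runs above

cluster = gateway.new_cluster(options)
cluster.scale(20)
client = cluster.get_client()
```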
Great, thanks. I'm able to at least start playing with the computation now. A few meta-comments:
And a few random thoughts:
Combining these points will, I think, make things easier to debug. I haven't had a chance yet to look into the core of your computation.
Thanks for all the comments and insight on improving the computations - we're new to dask and xarray, so any pointers are helpful! Meta-comments:
Random thoughts:
Do you know what the indices in the grouper are like? IIRC it's something like … I'll need to check the latest state of things, but IIRC that kind of pattern is hard for Dask.Array to do efficiently. You're forced to either generate a bunch of tasks or move around a bunch of data. If we're able to identify this as a problem I'll take a closer look later. Apologies for the lack of good examples here. We'll be contributing more and I'll see if anything comes up. Some of the ones in http://gallery.pangeo.io/ (particularly the Landsat gallery) may be worth looking at, but I don't know of any that directly relate to what you're trying to do.
Yeah, it's grouping on …
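To make the pattern concrete, a purely illustrative toy of that kind of grouped median on dask-backed data; the shapes, chunking, and grouper here are invented and are not our actual catalog:

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2015-01-01", "2019-12-31", freq="16D")
da = xr.DataArray(
    np.random.rand(len(time), 366, 366),
    dims=("time", "y", "x"),
    coords={"time": time},
).chunk({"time": 10, "y": 122, "x": 122})

# Each yearly group pulls slices out of many input chunks, so dask either builds
# a large task graph or shuffles a lot of data (the pattern described above).
composite = da.groupby("time.year").median(dim="time")
```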
I refactored so that each job (aka an invocation of calculate_job_median) doesn't use futures and uses … Each job is a totally independent computation, and thus embarrassingly parallel in that sense. So what I am looking to do is ideally submit all the jobs that need to be done to the dask cluster and then have the cluster know to only do so many at once, so as not to overwhelm the total available memory in the cluster. Right now, if I jump from 2 concurrent jobs to 8, the cluster will struggle significantly.
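As one possible stopgap, a sketch of capping the concurrency from the notebook side: run at most N jobs at a time, where each call to calculate_job_median (the function from the notebook) drives its own dask computation on the shared cluster. `max_concurrent` and `all_jobs` are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

max_concurrent = 2
all_jobs = []  # hypothetical list of job arguments (tiles / date ranges)

# calculate_job_median is defined in the notebook; each call blocks until its
# dask computation finishes, so at most max_concurrent jobs are in flight.
with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
    results = list(pool.map(calculate_job_median, all_jobs))
```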
I've been working to scale our HLS notebooks out over larger datasets and larger clusters and have run into some issues:
Things I've investigated:
More investigation is required...