Snapshots Issue - Cannot Serve More Than 10 Concurrent Functions #579
Comments
@DanielLee343 thanks for raising the issue (converting it to a Discussion). The timeout is most likely caused by the underlying disk (/fccd/snapshots) being too slow. We recommend using a node with an SSD. Alternatively, you can increase the timeout value in the code. Remote storage of snapshots is not fully supported yet. However, #465 adds everything necessary to enable this mode, although we haven't tested it extensively yet. You can take a look at the code in that branch.
@ustiugov @amohoste Thanks for the response! I checked out the new_snapshots branch, but when I run the following test in
Running
@DanielLee343 could you check whether the socket paths mentioned in the log actually exist? We might have forgotten to delete them in the cleaning scripts you mentioned. If that's the case, could you submit a PR with a fix to that branch?
@DanielLee343 is the issue still relevant? If not, please close it.
Hi, I have several questions related to snapshotting. I am running one of the TestBenchParallelServe tests in vhive/Makefile on a single-node setup. If I'm correct, this script spawns parallelNum concurrent, identical functions (in my case, helloworld) with both snapshots and REAP enabled. However, the maximum parallelNum it supports on my machine is only 10. With larger values, it fails with the following error:

It fails in createSnapshots() in bench_test.go, where the snapshots are saved in parallel. The error is thrown in createSnapshot, which sends the createSnapshot gRPC request. I'm using a single node on Chameleon Cloud with 128GB RAM and 48 cores of Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz. I doubt it's a hardware capacity issue, because htop shows RAM usage below 4GB for the entire duration of the test. However, your ASPLOS paper says: "We use the helloworld function and consider up to 64 concurrent independent function arrivals". Were you testing on a cluster in which those functions run on different machines? Or do you think it's constrained by concurrent SSD I/O bandwidth? df -h shows:

My second question: it seems you store the snapshot/ws files locally, under /fccd/snapshots/. What if the next invocation of the same function lands on another machine that doesn't have the snapshots? In that case, S3 or a distributed storage solution should be considered, right? If you have better ideas, I'd love to hear them.

Also, lsblk shows tons of entries like this, which I don't understand. Are they created by Firecracker or the vHive CRI?
I appreciate your time reviewing and answering these. Thank you!