Snapshots Issue - Cannot Serve More Than 10 Concurrent Functions #579

Closed

DanielLee343 opened this issue Jul 27, 2022 · 4 comments

@DanielLee343

Hi, I have several questions related to snapshotting. I am running the TestBenchParallelServe benchmark from vhive/Makefile on a single-node setup:

sudo mkdir -m777 -p $(CTRDLOGDIR) && sudo env "PATH=$(PATH)" /usr/local/bin/firecracker-containerd --config /etc/firecracker-containerd/config.toml 1>$(CTRDLOGDIR)/fccd_orch_noupf_log_bench.out 2>$(CTRDLOGDIR)/fccd_orch_noupf_log_bench.err &
sudo env "PATH=$(PATH)" go test $(EXTRAGOARGS) -run TestBenchParallelServe -args $(WITHSNAPSHOTS) $(WITHUPF) -benchDirTest configREAP -metricsTest -funcName helloworld
./scripts/clean_fcctr.sh

If I understand correctly, this target spawns parallelNum concurrent invocations of the same function (helloworld in my case) with both snapshots and REAP enabled. However, the maximum parallelNum my machine supports is only 10; with larger values it fails with the following error:

cc@vhive-inst-01:~/vhive$ make bench 
sudo mkdir -m777 -p /tmp/ctrd-logs && sudo env "PATH=/home/cc/.vscode-server/bin/3b889b090b5ad5793f524b5d1d39fda662b96a2a/bin/remote-cli:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/go/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/usr/local/go/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/usr/local/go/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin" /usr/local/bin/firecracker-containerd --config /etc/firecracker-containerd/config.toml 1>/tmp/ctrd-logs/fccd_orch_noupf_log_bench.out 2>/tmp/ctrd-logs/fccd_orch_noupf_log_bench.err &
sudo env "PATH=/home/cc/.vscode-server/bin/3b889b090b5ad5793f524b5d1d39fda662b96a2a/bin/remote-cli:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/go/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/usr/local/go/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/usr/local/go/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin" go test -v -race -cover -run TestBenchParallelServe -args -snapshotsTest -upfTest -benchDirTest configREAP -metricsTest -funcName helloworld
time="2022-07-27T15:56:12.352968106-04:00" level=info msg="Orchestrator snapshots enabled: true"
time="2022-07-27T15:56:12.353343756-04:00" level=info msg="Orchestrator UPF enabled: true"
time="2022-07-27T15:56:12.353431463-04:00" level=info msg="Orchestrator lazy serving mode enabled: false"
time="2022-07-27T15:56:12.353527886-04:00" level=info msg="Orchestrator UPF metrics enabled: true"
time="2022-07-27T15:56:12.353589272-04:00" level=info msg="Drop cache: true"
time="2022-07-27T15:56:12.353640659-04:00" level=info msg="Bench dir: configREAP"
time="2022-07-27T15:56:12.353695564-04:00" level=info msg="Registering bridges for tap manager"
time="2022-07-27T15:56:12.355698518-04:00" level=info msg="Creating containerd client"
time="2022-07-27T15:56:12.358589476-04:00" level=info msg="Created containerd client"
time="2022-07-27T15:56:12.358791972-04:00" level=info msg="Creating firecracker client"
time="2022-07-27T15:56:12.359535647-04:00" level=info msg="Created firecracker client"
=== RUN   TestBenchParallelServe
time="2022-07-27T15:56:12.370809757-04:00" level=info msg="New function added" fID=plr-fnc image="ghcr.io/ease-lab/helloworld:var_workload" isPinned=true servedTh=0
... (omitted some logs here)
time="2022-07-27T15:56:50.439361536-04:00" level=info msg="Creating snapshot for 1, vmID is 1-0"
time="2022-07-27T15:56:50.441571578-04:00" level=info msg="Orchestrator received CreateSnapshot" vmID=1-0
time="2022-07-27T15:56:50.666040404-04:00" level=info msg="Creating snapshot for 0, vmID is 0-0"
time="2022-07-27T15:56:50.667999825-04:00" level=info msg="Orchestrator received CreateSnapshot" vmID=0-0
time="2022-07-27T15:56:55.440588667-04:00" level=error msg="failed to create snapshot of the VM" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" vmID=1-0
time="2022-07-27T15:56:55.440898692-04:00" level=panic msg="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
panic: (*logrus.Entry) 0xc0002390a0

goroutine 186 [running]:
github.com/sirupsen/logrus.(*Entry).log(0xc000238bd0, 0x0, {0xc0002ae870, 0x43})
	/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:259 +0x95b
github.com/sirupsen/logrus.(*Entry).Log(0xc000238bd0, 0x0, {0xc000d05630, 0x1, 0x1})
	/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:285 +0x8c
github.com/sirupsen/logrus.(*Logger).Log(0xc00021fdc0, 0x0, {0xc000d05630, 0x1, 0x1})
	/root/go/pkg/mod/github.com/sirupsen/[email protected]/logger.go:198 +0x85
github.com/sirupsen/logrus.(*Logger).Panic(...)
	/root/go/pkg/mod/github.com/sirupsen/[email protected]/logger.go:247
github.com/sirupsen/logrus.Panic(...)
	/root/go/pkg/mod/github.com/sirupsen/[email protected]/exported.go:129
github.com/ease-lab/vhive.(*Function).CreateInstanceSnapshot(0xc00041c370)
	/home/cc/vhive/functions.go:463 +0x51b
github.com/ease-lab/vhive.(*Function).Serve.func2()
	/home/cc/vhive/functions.go:308 +0x46
sync.(*Once).doSlow(0xc0001c26d4, 0xc000d05ae8)
	/usr/local/go/src/sync/once.go:68 +0x102
sync.(*Once).Do(0xc0001c26d4, 0xc000536750?)
	/usr/local/go/src/sync/once.go:59 +0x47
github.com/ease-lab/vhive.(*Function).Serve(0xc00041c370, {0x1ca7c15?, 0x1?}, {0x1ca7c15, 0x1}, {0x1cad79c, 0x28}, {0x1c81435, 0x6})
	/home/cc/vhive/functions.go:306 +0xeee
github.com/ease-lab/vhive.(*FuncPool).Serve(0xc00021fdc0?, {0x1e9d988, 0xc0001c2008}, {0x1ca7c15, 0x1}, {0x1cad79c, 0x28}, {0x1c81435, 0x6})
	/home/cc/vhive/functions.go:121 +0xea
github.com/ease-lab/vhive.createSnapshots.func1({0x1ca7c15, 0x1})
	/home/cc/vhive/bench_test.go:297 +0x18a
created by github.com/ease-lab/vhive.createSnapshots
	/home/cc/vhive/bench_test.go:292 +0xb6
exit status 2
FAIL	github.com/ease-lab/vhive	43.155s

It fails in createSnapshots() in bench_test.go, where the snapshots are saved in parallel; the error is thrown in CreateInstanceSnapshot while it sends the CreateSnapshot gRPC request. I'm using a single node on Chameleon Cloud with 128 GB of RAM and 48 cores of Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz. I doubt it's a hardware-capacity issue, because htop shows that RAM usage stays below 4 GB for the entire test duration. However, I see in your ASPLOS paper: "We use the helloworld function and consider up to 64 concurrent independent function arrivals". Are you testing on a cluster where those functions run on different machines? Or do you think it's constrained by concurrent SSD I/O bandwidth? df -h shows:

cc@vhive-inst-01:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             63G     0   63G   0% /dev
tmpfs            13G  7.3M   13G   1% /run
/dev/sda1       230G   73G  148G  33% /
tmpfs            63G     0   63G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            63G     0   63G   0% /sys/fs/cgroup
/dev/loop0       56M   56M     0 100% /snap/core18/1932
/dev/loop1       68M   68M     0 100% /snap/lxd/18150
/dev/loop3       56M   56M     0 100% /snap/core18/2538
/dev/loop4       47M   47M     0 100% /snap/snapd/16292
/dev/loop5       62M   62M     0 100% /snap/core20/1581
/dev/loop6       68M   68M     0 100% /snap/lxd/22753
shm              64M     0   64M   0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/56a07329fbcd10e05e218702b95ed3b5c42d75d55321af0bfdc884db4711cbbf/shm
shm              64M     0   64M   0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/64fd4707bc99b77b6da37ce5cd6f188b5bb001d3e1bbcf9ba35b4f6c734b7d66/shm
... (tons of shms)
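
For reference, the parallel part of the benchmark that hits this deadline looks roughly like the following, as far as I can tell from the stack trace (a paraphrased sketch, not the exact bench_test.go code; the Serve signature and names are simplified):

package sketch

import (
	"context"
	"strconv"
	"sync"
)

// server stands in for *vhive.FuncPool; the real Serve returns more values.
type server interface {
	Serve(ctx context.Context, fID, imageName, payload string) error
}

// parallelServe paraphrases createSnapshots() from bench_test.go: it fires
// parallelNum concurrent invocations of the same image, and the first serve of
// each function also triggers CreateInstanceSnapshot (guarded by sync.Once),
// so parallelNum snapshots are written to disk at roughly the same time.
func parallelServe(pool server, parallelNum int, imageName, payload string) {
	var wg sync.WaitGroup
	for i := 0; i < parallelNum; i++ {
		wg.Add(1)
		go func(fID string) {
			defer wg.Done()
			if err := pool.Serve(context.Background(), fID, imageName, payload); err != nil {
				panic(err)
			}
		}(strconv.Itoa(i))
	}
	wg.Wait()
}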

My second question: it seems the snapshot/working-set files are stored locally, under /fccd/snapshots/. What if the next invocation of the same function lands on another machine that doesn't have the snapshot? In that case S3, or some other distributed storage solution, should be considered, right? Or, if you have better ideas, I'd love to hear them.
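
Purely to illustrate the kind of thing I mean, a snapshot could be uploaded to an object store after creation and fetched on the target node before loading (a hypothetical sketch using the AWS SDK for Go; the bucket, key, and region are placeholders, and this is not something vhive does today):

package sketch

import (
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

// uploadSnapshot copies a local snapshot file to S3 so that another node could
// fetch it before loading the VM. Hypothetical: the paths and bucket are
// placeholders, not part of the current vhive code.
func uploadSnapshot(localPath, bucket, key string) error {
	f, err := os.Open(localPath)
	if err != nil {
		return err
	}
	defer f.Close()

	sess := session.Must(session.NewSession(&aws.Config{
		Region: aws.String("us-east-1"), // placeholder region
	}))
	_, err = s3manager.NewUploader(sess).Upload(&s3manager.UploadInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
		Body:   f,
	})
	return err
}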

Also, lsblk shows a lot of entries like the ones below, which I can't explain. Are they created by Firecracker or by the vHive CRI?

loop0               7:0    0  55.4M  1 loop /snap/core18/1932
loop1               7:1    0  67.8M  1 loop /snap/lxd/18150
loop2               7:2    0   100G  0 loop 
loop3               7:3    0  55.6M  1 loop /snap/core18/2538
loop4               7:4    0    47M  1 loop /snap/snapd/16292
loop5               7:5    0    62M  1 loop /snap/core20/1581
loop6               7:6    0  67.8M  1 loop /snap/lxd/22753
loop7               7:7    0     2G  0 loop 
loop8               7:8    0   100G  0 loop 
loop9               7:9    0     2G  0 loop 
loop10              7:10   0   100G  0 loop 
loop11              7:11   0     2G  0 loop 
loop12              7:12   0   100G  0 loop 
loop13              7:13   0     2G  0 loop 
loop14              7:14   0   100G  0 loop 
loop15              7:15   0     2G  0 loop
...

I appreciate your time reviewing and answering these. Thank you!

@ustiugov
Member

ustiugov commented Aug 7, 2022

@DanielLee343 thanks for raising the Issue (converting it to a Discussion). The timeout is most likely related to the underlying disk (/fccd/snapshots) being too slow. We recommend using a node with an SSD. Alternatively, you can increase the timeout value in the code.
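
For reference, the call that hits the deadline has roughly the following shape (an illustrative sketch, not the exact functions.go code; the 5-second value is inferred from the log timestamps above, and the real constant may differ):

package sketch

import (
	"context"
	"time"
)

// orchestrator stands in for the vhive orchestrator client; illustrative only.
type orchestrator interface {
	CreateSnapshot(ctx context.Context, vmID string) error
}

// createSnapshotBounded bounds the CreateSnapshot RPC with a fixed deadline.
// When many VMs snapshot concurrently on a slow disk, the RPC can overrun the
// deadline and return "context deadline exceeded", which is what the panic in
// the log above reports.
func createSnapshotBounded(orch orchestrator, vmID string) error {
	// 5s is inferred from the gap between "Creating snapshot" (15:56:50) and
	// the DeadlineExceeded error (15:56:55); check functions.go for the actual
	// value before changing it.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return orch.CreateSnapshot(ctx, vmID)
}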

Regarding remote storage of snapshots, this mode is not fully supported yet. However, #465 adds everything necessary to enable it, although we haven't tested it extensively yet. You can take a look at the code in that branch.

@DanielLee343
Author

@ustiugov @amohoste Thanks for the response! I checked out the new_snapshots branch, but when I run the following test from vhive/ctriface/Makefile, it fails with "VM already exists":

./../scripts/clean_fcctr.sh
sudo mkdir -m777 -p $(CTRDLOGDIR) && sudo env "PATH=$(PATH)" /usr/local/bin/firecracker-containerd --config /etc/firecracker-containerd/config.toml 1>$(CTRDLOGDIR)/ctriface_log_upf_man_skip.out 2>$(CTRDLOGDIR)/ctriface_log_upf_man_skip.err &
sudo env "PATH=$(PATH)" go test $(EXTRAGOARGS) -run TestParallelSnapLoad -args $(WITHUPF)
# sudo env "PATH=$(PATH)" go test $(EXTRAGOARGS) -run TestParallelPhasedSnapLoad -args $(WITHUPF)
./../scripts/clean_fcctr.sh
cc@us-pmem-1:~/vhive/ctriface$ make test-skip
./../scripts/clean_fcctr.sh
Killing firecracker agents and VMs
Resetting nftables
Deleting veth* devices created by CNI
Cannot find device "br0"
Cannot find device "br1"
Cleaning in /var/lib/cni/ non-network
find: ‘/var/lib/cni/’: No such file or directory
Removing devmapper devices
Removed fc-dev-thinpool
device-mapper: remove ioctl on fc-dev-thinpool  failed: No such device or address
Command failed.
Cleaning /var/lib/firecracker-containerd/containerd /var/lib/firecracker-containerd/runtime /var/lib/firecracker-containerd/snapshotter
Cleaning /run/firecracker-containerd/*
Cleaning CNI state, e.g., allocated addresses
rm: cannot remove '/var/lib/cni/networks/fcnet*/last_reserved_ip.0': No such file or directory
clean already
rm: cannot remove '/var/lib/cni/networks/fcnet*/19*': No such file or directory
clean already
Cleaning snapshots
Creating a fresh devmapper
0 209715200 thin-pool /dev/loop20 /dev/loop19 128 32768 1 skip_block_zeroing
device-mapper: reload ioctl on fc-dev-thinpool  failed: No such device or address
Command failed.
sudo mkdir -m777 -p /tmp/ctrd-logs && sudo env "PATH=/home/cc/.vscode-server/bin/da76f93349a72022ca4670c1b84860304616aaa2/bin/remote-cli:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/go/bin:/usr/local/go/bin" /usr/local/bin/firecracker-containerd --config /etc/firecracker-containerd/config.toml 1>/tmp/ctrd-logs/ctriface_log_upf_man_skip.out 2>/tmp/ctrd-logs/ctriface_log_upf_man_skip.err &
sudo env "PATH=/home/cc/.vscode-server/bin/da76f93349a72022ca4670c1b84860304616aaa2/bin/remote-cli:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/go/bin:/usr/local/go/bin" go test -v -race -cover -run TestParallelSnapLoad -args -upf
=== RUN   TestParallelSnapLoad
INFO[2022-08-08T21:21:20.145164740Z] Creating containerd client                   
INFO[2022-08-08T21:21:20.146507197Z] Created containerd client                    
INFO[2022-08-08T21:21:20.146717468Z] Creating firecracker client                  
INFO[2022-08-08T21:21:20.147671841Z] Created firecracker client                   
INFO[2022-08-08T21:21:20.161774349Z] Creating image manager                       
INFO[2022-08-08T21:21:20.174838167Z] Registering bridges for tap manager          
WARN[2022-08-08T21:21:26.572613408Z] Failed to Fetch k8s dns clusterIP exit status 1
The connection to the server localhost:8080 was refused - did you specify the right host or port?
 
WARN[2022-08-08T21:21:26.572812308Z] Using google dns 8.8.8.8                     
WARN[2022-08-08T21:21:26.573001912Z] Failed to Fetch k8s dns clusterIP exit status 1
The connection to the server localhost:8080 was refused - did you specify the right host or port?
 
WARN[2022-08-08T21:21:26.573150269Z] Failed to Fetch k8s dns clusterIP exit status 1
The connection to the server localhost:8080 was refused - did you specify the right host or port?
 
WARN[2022-08-08T21:21:26.573201926Z] Using google dns 8.8.8.8                     
WARN[2022-08-08T21:21:26.573321838Z] Using google dns 8.8.8.8                     
WARN[2022-08-08T21:21:26.573502243Z] Failed to Fetch k8s dns clusterIP exit status 1
The connection to the server localhost:8080 was refused - did you specify the right host or port?
 
WARN[2022-08-08T21:21:26.573668929Z] Using google dns 8.8.8.8                     
WARN[2022-08-08T21:21:26.579673205Z] Failed to Fetch k8s dns clusterIP exit status 1
The connection to the server localhost:8080 was refused - did you specify the right host or port?
 
WARN[2022-08-08T21:21:26.579794792Z] Using google dns 8.8.8.8                     
    manual_cleanup_test.go:249: 
                Error Trace:    manual_cleanup_test.go:249
                                                        asm_amd64.s:1571
                Error:          Received unexpected error:
                                rpc error: code = AlreadyExists desc = VM with ID "6" already exists (socket: "unix:///run/containerd/s/fd343345c0a45feeae9834b35238fce37488a37c999de6de974015d49131fb3e")
                                failed to create the microVM in firecracker-containerd
                                github.com/ease-lab/vhive/ctriface.(*Orchestrator).StartVM
                                        /home/cc/vhive/ctriface/iface.go:112
                                github.com/ease-lab/vhive/ctriface.TestParallelSnapLoad.func1
                                        /home/cc/vhive/ctriface/manual_cleanup_test.go:248
                                runtime.goexit
                                        /usr/local/go/src/runtime/asm_amd64.s:1571
                Test:           TestParallelSnapLoad
                Messages:       Failed to start VM, 6
    manual_cleanup_test.go:249: 
                Error Trace:    manual_cleanup_test.go:249
                                                        asm_amd64.s:1571
                Error:          Received unexpected error:
                                rpc error: code = AlreadyExists desc = VM with ID "8" already exists (socket: "unix:///run/containerd/s/8c73ad3bfb0a17fc7625fe04171c76484d5e297abad3cd1439d68bdcdd31b698")
                                failed to create the microVM in firecracker-containerd
                                github.com/ease-lab/vhive/ctriface.(*Orchestrator).StartVM
                                        /home/cc/vhive/ctriface/iface.go:112
                                github.com/ease-lab/vhive/ctriface.TestParallelSnapLoad.func1
                                        /home/cc/vhive/ctriface/manual_cleanup_test.go:248
                                runtime.goexit
                                        /usr/local/go/src/runtime/asm_amd64.s:1571
                Test:           TestParallelSnapLoad
                Messages:       Failed to start VM, 8
    manual_cleanup_test.go:249: 
                Error Trace:    manual_cleanup_test.go:249
                                                        asm_amd64.s:1571
                Error:          Received unexpected error:
                                rpc error: code = AlreadyExists desc = VM with ID "9" already exists (socket: "unix:///run/containerd/s/383956d4d070d7f021a804b4626f049269984b149cd4ec11234cbb1651425096")
                                failed to create the microVM in firecracker-containerd
                                github.com/ease-lab/vhive/ctriface.(*Orchestrator).StartVM
                                        /home/cc/vhive/ctriface/iface.go:112
                                github.com/ease-lab/vhive/ctriface.TestParallelSnapLoad.func1
                                        /home/cc/vhive/ctriface/manual_cleanup_test.go:248
                                runtime.goexit
                                        /usr/local/go/src/runtime/asm_amd64.s:1571
                Test:           TestParallelSnapLoad
                Messages:       Failed to start VM, 9
    manual_cleanup_test.go:249: 
                Error Trace:    manual_cleanup_test.go:249
                                                        asm_amd64.s:1571
                Error:          Received unexpected error:
                                rpc error: code = AlreadyExists desc = VM with ID "7" already exists (socket: "unix:///run/containerd/s/485db199bd2501ca1e8396ecbf526ff3d497c6084ebcf8259e4782c82bd9c93d")
                                failed to create the microVM in firecracker-containerd
                                github.com/ease-lab/vhive/ctriface.(*Orchestrator).StartVM
                                        /home/cc/vhive/ctriface/iface.go:112
                                github.com/ease-lab/vhive/ctriface.TestParallelSnapLoad.func1
                                        /home/cc/vhive/ctriface/manual_cleanup_test.go:248
                                runtime.goexit
                                        /usr/local/go/src/runtime/asm_amd64.s:1571
                Test:           TestParallelSnapLoad
                Messages:       Failed to start VM, 7
    manual_cleanup_test.go:249: 
                Error Trace:    manual_cleanup_test.go:249
                                                        asm_amd64.s:1571
                Error:          Received unexpected error:
                                rpc error: code = AlreadyExists desc = VM with ID "10" already exists (socket: "unix:///run/containerd/s/fdba3d53d99c5125eb592401eb255c51dd84f8eb0c7a1eec8f2678bcd7de7371")
                                failed to create the microVM in firecracker-containerd
                                github.com/ease-lab/vhive/ctriface.(*Orchestrator).StartVM
                                        /home/cc/vhive/ctriface/iface.go:112
                                github.com/ease-lab/vhive/ctriface.TestParallelSnapLoad.func1
                                        /home/cc/vhive/ctriface/manual_cleanup_test.go:248
                                runtime.goexit
                                        /usr/local/go/src/runtime/asm_amd64.s:1571
                Test:           TestParallelSnapLoad
                Messages:       Failed to start VM, 10
INFO[2022-08-08T21:21:26.775675436Z] Removing bridges                             
--- FAIL: TestParallelSnapLoad (6.84s)
FAIL
coverage: 18.2% of statements
exit status 1
FAIL    github.com/ease-lab/vhive/ctriface      6.906s
make: *** [Makefile:68: test-skip] Error 1
cc@us-pmem-1:~/vhive/ctriface$ 

Running scripts/clean_fcctr.sh or scripts/github_runner/clean_cri_runner.sh didn't help.
Note that the first run of TestParallelSnapLoad works, but any test run after that fails, so I guess the microVM sockets aren't being cleaned up? Also, the main branch doesn't have this problem.

@ustiugov
Member

ustiugov commented Aug 9, 2022

@DanielLee343 could you check whether the socket paths mentioned in the log actually exist? We might have forgotten to delete them in the cleanup scripts you mentioned. If that's the case, could you submit a PR with a fix to that branch?
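
For example, something along these lines would show whether stale shim sockets survive the cleanup scripts (an illustrative sketch; the /run/containerd/s/ prefix is taken from the AlreadyExists errors above):

package sketch

import (
	"fmt"
	"path/filepath"
)

// listStaleShimSockets lists leftover firecracker-containerd shim sockets.
// Whether these files should also be removed by clean_fcctr.sh is exactly the
// open question here.
func listStaleShimSockets() ([]string, error) {
	matches, err := filepath.Glob("/run/containerd/s/*")
	if err != nil {
		return nil, err
	}
	for _, m := range matches {
		fmt.Println("leftover shim socket:", m)
	}
	return matches, nil
}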

@ustiugov
Member

@DanielLee343 is the issue still relevant? If not, please close it.
