add: remote snapshotter dialer timeouts to service config #684
Conversation
Signed-off-by: Austin Vazquez <[email protected]>
vsock.WithLogger(log.G(ctx)),
vsock.WithDialTimeout(dialerTimeouts.DialTimeout),
vsock.WithRetryTimeout(dialerTimeouts.RetryTimeout),
vsock.WithRetryInterval(dialerTimeouts.RetryInterval),
This is the value that really helped stabilize tests in the pipeline. By default, the SDK was retrying once every 100ms. I have dialed this frequency down to once every 5s. Yes, the tests are now twice as slow as before, but apparently more stable on Buildkite.
100ms -> 5s is a 50x change. Why did you pick 5s specifically? Would it make sense to reduce it a bit, to something like 1s?
So I noticed the ticker was firing every 100ms and executing code that had up to a 1s timeout. Although I assume separate goroutines handle the execution, I wanted to see if lowering the frequency increased stability. 5s is somewhat arbitrary, but it is based on runs with acknowledge message timeouts of up to 4s that still led to timeouts.
@@ -49,15 +49,15 @@ const (
)

func TestLaunchContainerWithRemoteSnapshotter_Isolated(t *testing.T) {
	integtest.Prepare(t, integtest.WithDefaultNetwork())
Not needed. All CreateVM calls specify a network interface.
Let's have this one in a different commit and/or PR.
Question regarding this: have you tried making just this change and observed any different behavior on the BuildKite tests?
Guilty, yes. WithDefaultNetwork configures the fcnet network as the default network for CreateVM calls that do not specify a network configuration. So it is technically redundant with the network configuration in the create call below. Just a simplification of the test.
_, err = fcClient.CreateVM(ctx, &proto.CreateVMRequest{
	VMID: vmID,
-	KernelArgs: kernelArgs,
+	KernelArgs: integtest.DefaultRuntimeConfig.KernelArgs,
Default kernel args should be enough.
@@ -104,15 +100,14 @@ func launchContainerWithRemoteSnapshotterInVM(ctx context.Context, vmID string)
	AllowMMDS: true,
	CNIConfig: &proto.CNIConfiguration{
		NetworkName: "fcnet",
-		InterfaceName: "veth0",
+		InterfaceName: fmt.Sprintf("veth%s", vmID),
While I don't believe having each microVM on the same interface causes any issue, there also isn't a reason to have them on the same interface, so just place each VM on its own interface.
This is just the name of the interface as seen inside the VM's netns, isn't it? There shouldn't be any sharing by using the same name.
Oh yes, you are correct. I misread the CNI documentation where it stated "traffic would be mirrored on the external veth0 device". The name outside the VM appears to be dynamic as well.
AckMsgTimeoutInSeconds int `toml:"ack_msg_timeout_in_seconds" default:"1"`
DialTimeout string `toml:"dial_timeout" default:"100ms"`
RetryTimeout string `toml:"retry_timeout" default:"10s"`
RetryInterval string `toml:"retry_interval" default:"5s"`
This might be too slow, but we have the knobs now and can tune based on observations.
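For reference, the struct tags above map onto a TOML fragment along these lines. The enclosing table name is hypothetical; only the keys and default values come from the diff.

```toml
# Hypothetical table name; only the keys and values below are taken
# from the struct tags in the diff above.
[snapshotter]
ack_msg_timeout_in_seconds = 1   # AckMsgTimeoutInSeconds
dial_timeout = "100ms"           # per-attempt dial timeout
retry_timeout = "10s"            # overall budget across all retries
retry_interval = "5s"            # wait between attempts
```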
Regarding the timeouts, if we don't have them, would it wait indefinitely without clear errors? I wonder whether we could have a single timeout, instead of asking end-users to budget all of them specifically.
I have not done any experiments to prove what happens without the timeouts. I agree that reducing complexity would be better, as long as the solution remains stable.
@@ -156,6 +156,44 @@ func initResolver(config config.Config) (proxyaddress.Resolver, error) {
	}
}

type dialerTimeouts struct {
I don't like the direction this is heading.
- firecracker-containerd itself uses vsock, but we don't have these configuration values there. What is the difference between firecracker-containerd and the demux snapshotter?
- Exposing all the knobs to end users basically moves the responsibility from ourselves to them, which isn't good.
That's a fair comment. My reasoning for lifting and shifting the responsibility was that, based purely on observations, dial performance seemed to be greatly impacted by host load. I think we could consider mechanisms like exponential backoff to help reduce the number of knobs without losing performance.
Let's pick stable values for our Buildkite first. While we don't have many customers today, I feel these values are too tied to the implementation and hard to tune.
Signed-off-by: Austin Vazquez [email protected]
Issue #, if available:
#673
Description of changes:
Expose all of the dialer timeout knobs at the snapshotter service level. This enables users to tune them based on their host stability.
Buildkite instances used by the pipeline tend to be somewhat slower than developer instances, so this helps stabilize the remote snapshotter integration tests.
Additionally, remove some unneeded code from the remote snapshotter integration tests and make the tests run in parallel again to speed up the build.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.