
add: remote snapshotter dialer timeouts to service config #684

Closed · wants to merge 1 commit

Conversation

@austinvazquez (Contributor) commented on Jun 22, 2022:

Signed-off-by: Austin Vazquez [email protected]

Issue #, if available:
#673

Description of changes:
Expose all of the dialer timeout knobs at the snapshotter service level. This lets users tune the timeouts based on the stability of their hosts.

Buildkite instances used by the pipeline tend to be slower than developer instances, so exposing these knobs helps stabilize the remote snapshotter integration tests.

Additionally, remove some unneeded code from the remote snapshotter integration tests and make the tests run in parallel again to speed up the build.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
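
For context, a minimal Go sketch of how the string-typed knobs from the service config might be parsed into durations before being handed to the vsock dialer options shown in this diff. Only the field names and their TOML tags come from the PR itself; the parseDialerTimeouts helper and the duration-typed fields are assumptions for illustration.

package snapshotter

import (
	"fmt"
	"time"
)

// dialerTimeouts holds the parsed values of the dial_timeout, retry_timeout,
// and retry_interval knobs introduced by this PR (assumed here to be parsed
// from their string TOML values into time.Duration).
type dialerTimeouts struct {
	DialTimeout   time.Duration
	RetryTimeout  time.Duration
	RetryInterval time.Duration
}

// parseDialerTimeouts is a hypothetical helper converting the raw TOML
// strings (e.g. "100ms", "10s", "5s") into durations, failing fast on
// malformed input.
func parseDialerTimeouts(dial, retry, interval string) (dialerTimeouts, error) {
	var t dialerTimeouts
	var err error
	if t.DialTimeout, err = time.ParseDuration(dial); err != nil {
		return t, fmt.Errorf("dial_timeout: %w", err)
	}
	if t.RetryTimeout, err = time.ParseDuration(retry); err != nil {
		return t, fmt.Errorf("retry_timeout: %w", err)
	}
	if t.RetryInterval, err = time.ParseDuration(interval); err != nil {
		return t, fmt.Errorf("retry_interval: %w", err)
	}
	return t, nil
}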

@austinvazquez changed the title from "fix: unique interface name for multi VM test" to "add: remote snapshotter dialer timeouts to service config" on Jun 23, 2022
vsock.WithLogger(log.G(ctx)),
vsock.WithDialTimeout(dialerTimeouts.DialTimeout),
vsock.WithRetryTimeout(dialerTimeouts.RetryTimeout),
vsock.WithRetryInterval(dialerTimeouts.RetryInterval),
@austinvazquez (Contributor, Author):
This is the value that really helped stabilize the tests in the pipeline. By default, the SDK was retrying once every 100ms; I have dialed this frequency down to once every 5s. Yes, the tests are now twice as slow as before, but they appear to be more stable on Buildkite.
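
(For scale, against the 10s retry_timeout default elsewhere in this PR: a 100ms interval permits roughly 10s / 100ms ≈ 100 dial attempts before the retry budget is exhausted, while a 5s interval permits only about 2.)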

Contributor:

100ms -> 5s is a 50x change. Why pick 5s specifically? Would it make sense to reduce it a bit, to something like 1s?

@austinvazquez (Contributor, Author):

So I noticed the ticker was firing every 100ms and executing code that had up to a 1s timeout. Although I assume separate goroutines handle the execution, I wanted to see whether lowering the frequency increased stability. 5s is somewhat arbitrary, but it is based on runs with acknowledge-message timeouts of up to 4s that still led to timeouts.
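
To make the suspected interaction concrete, here is a minimal Go sketch, assuming the SDK's retry loop behaves roughly like a ticker that launches each attempt in its own goroutine (the real implementation may differ; dialWithRetry and the reuse of the hypothetical dialerTimeouts struct sketched earlier are illustrative only):

package snapshotter

import (
	"context"
	"time"
)

// dialWithRetry mimics the suspected behavior: a ticker fires every
// RetryInterval, and each tick launches a dial attempt in its own goroutine
// with its own timeout, until either one attempt succeeds or the overall
// RetryTimeout budget is spent.
func dialWithRetry(ctx context.Context, t dialerTimeouts, dial func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, t.RetryTimeout)
	defer cancel()

	ticker := time.NewTicker(t.RetryInterval)
	defer ticker.Stop()

	success := make(chan struct{}, 1)
	for {
		select {
		case <-ctx.Done():
			return ctx.Err() // overall retry budget exhausted
		case <-success:
			return nil
		case <-ticker.C:
			// With a 100ms interval and attempts that may run for up to 1s,
			// as many as ten attempts can be in flight at once on a loaded
			// host; a 5s interval keeps at most one in flight.
			go func() {
				attemptCtx, attemptCancel := context.WithTimeout(ctx, t.DialTimeout)
				defer attemptCancel()
				if err := dial(attemptCtx); err == nil {
					select {
					case success <- struct{}{}:
					default:
					}
				}
			}()
		}
	}
}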

@@ -49,15 +49,15 @@ const (
)

func TestLaunchContainerWithRemoteSnapshotter_Isolated(t *testing.T) {
integtest.Prepare(t, integtest.WithDefaultNetwork())
@austinvazquez (Contributor, Author):
Not needed; all CreateVM calls specify a network interface.

Contributor:
Let's put this one in a different commit and/or PR.

@ginglis13 (Contributor), Jun 23, 2022:

Question regarding this: have you tried making just this change and observed any difference in behavior in the Buildkite tests?

@austinvazquez (Contributor, Author):
Guilty, yes. WithDefaultNetwork configures the fcnet network as the default for CreateVM calls that do not specify a network configuration, so it is technically redundant with this and the network configuration below in the create call. This is just a simplification of the test.
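
To illustrate the redundancy being described, a hedged fragment based on the excerpts in this diff (the NetworkInterfaces field and the FirecrackerNetworkInterface type name are assumptions from context, not confirmed by the diff):

// With the default network configured once during test setup:
integtest.Prepare(t, integtest.WithDefaultNetwork())

// ...a CreateVM call that omits any network configuration already lands on
// fcnet, so an explicit per-VM CNI block like the one below is redundant:
_, err = fcClient.CreateVM(ctx, &proto.CreateVMRequest{
	VMID: vmID,
	NetworkInterfaces: []*proto.FirecrackerNetworkInterface{{
		AllowMMDS: true,
		CNIConfig: &proto.CNIConfiguration{
			NetworkName:   "fcnet",
			InterfaceName: "veth0",
		},
	}},
})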

_, err = fcClient.CreateVM(ctx, &proto.CreateVMRequest{
VMID: vmID,
-	KernelArgs: kernelArgs,
+	KernelArgs: integtest.DefaultRuntimeConfig.KernelArgs,
@austinvazquez (Contributor, Author):
Default kernel args should be enough.

@@ -104,15 +100,14 @@ func launchContainerWithRemoteSnapshotterInVM(ctx context.Context, vmID string)
AllowMMDS: true,
CNIConfig: &proto.CNIConfiguration{
NetworkName: "fcnet",
InterfaceName: "veth0",
InterfaceName: fmt.Sprintf("veth%s", vmID),
@austinvazquez (Contributor, Author):
While I don't believe having each microVM on the same interface causes any issues, there also isn't a reason to have them on the same interface, so this just places each VM on its own interface.

Contributor:

This is just the name of the interface as seen inside the VM's netns, isn't it? There shouldn't be any sharing by using the same name.

@austinvazquez (Contributor, Author), Jun 23, 2022:

Oh yes, you are correct. I misread the CNI documentation where it stated that "traffic would be mirrored on the external veth0 device". The name outside the VM appears to be dynamic as well.

AckMsgTimeoutInSeconds int `toml:"ack_msg_timeout_in_seconds" default:"1"`
DialTimeout string `toml:"dial_timeout" default:"100ms"`
RetryTimeout string `toml:"retry_timeout" default:"10s"`
RetryInterval string `toml:"retry_interval" default:"5s"`
@austinvazquez (Contributor, Author):

This might be too slow, but we have the knobs now and can tune them based on observations.

Contributor:

Regarding the timeouts: if we didn't have them, would it wait indefinitely without clear errors? I wonder whether we could have a single timeout instead of asking end users to budget each of them individually.

@austinvazquez (Contributor, Author):

I have not done any experiments to prove what happens without the timeouts. I agree that reducing complexity would be better, as long as the solution ends up stable.

@austinvazquez austinvazquez marked this pull request as ready for review June 23, 2022 15:16
@austinvazquez austinvazquez requested a review from a team as a code owner June 23, 2022 15:16
@@ -156,6 +156,44 @@ func initResolver(config config.Config) (proxyaddress.Resolver, error) {
}
}

type dialerTimeouts struct {
@kzys (Contributor), Jun 23, 2022:

I don't like the direction this is heading.

  • firecracker-containerd itself uses vsock, but we don't expose these configuration values there. What is the difference between firecracker-containerd and the demux snapshotter?
  • Exposing all of the knobs to end users essentially moves the responsibility from us to them, which isn't good.

@austinvazquez (Contributor, Author):

That's a fair comment. My reasoning for lifting and shifting the responsibility was, based purely on observation, that dial performance seemed to be heavily impacted by host load. I think we could consider mechanisms like exponential backoff to reduce the number of knobs without giving up performance.
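
As a sketch of that direction, a single-knob alternative with one overall timeout and capped exponential backoff between sequential attempts; the function name and constants are illustrative only, not part of this PR:

package snapshotter

import (
	"context"
	"time"
)

// dialWithBackoff sketches a single-knob alternative: one overall timeout,
// with capped exponential backoff between sequential dial attempts.
func dialWithBackoff(ctx context.Context, overall time.Duration, dial func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, overall)
	defer cancel()

	backoff := 100 * time.Millisecond // initial wait between attempts
	for {
		if err := dial(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // single overall budget exhausted
		case <-time.After(backoff):
		}
		// Double the wait after each failure, capped at 5s so retries keep
		// coming even when the host is under sustained load.
		if backoff *= 2; backoff > 5*time.Second {
			backoff = 5 * time.Second
		}
	}
}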

Contributor:

Let's pick stable values for our Buildkite first. While we don't have many customers today, I feel these values are too tied to the implementation and would be hard to tune.


@austinvazquez (Contributor, Author):
Superseded by #689, #690
