Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add tcp user timeout config knob #14

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

pH14
Copy link

@pH14 pH14 commented Mar 9, 2023

Currently tokio-postgres exposes two knobs to maintain healthy connections: a connect timeout and keep-alives settings that apply directly to the TCP socket. These cover the cases of connection establishment and for maintaining idle connections, but do not cover the case of an active/established socket that does not hear a response from the receiver for a long period of time. By default it can take 15-20m (15 retries with exponential backoff. the # of retries is controlled by tcp_retries2) for a connection to be killed under these circumstances.

The generally recommended solution to this problem is to set TCP_USER_TIMEOUT to cap the total amount of time a socket waits to receive a response after it is established. https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/ has a great writeup of this case under "Busy ESTAB socket is not forever".

I haven't found a super satisfying way of testing this yet, but staging it here for now.

@pH14
Copy link
Author

pH14 commented Mar 23, 2023

OK this was weirdly hard to verify -- I spun up a scratch machine (this option is Linux specific), started environmentd, and then attached strace -e trace=setsockopt -f -p <envd> and from there we can see a steady stream of syscalls setting this option:

[pid 475309] setsockopt(86, SOL_TCP, TCP_USER_TIMEOUT, [5000], 4) = 0
[pid 475309] setsockopt(93, SOL_TCP, TCP_USER_TIMEOUT, [5000], 4) = 0
...

@pH14
Copy link
Author

pH14 commented Mar 23, 2023

Not sure what to do about the failing tests. cc @benesch wouldn't mind a look at the change itself / any tips for CI here

@benesch
Copy link
Member

benesch commented Mar 24, 2023

CI here has been broken for years I’m afraid! Recommend filing the patch upstream, where CI does work. Then the only thing that can bite us is a rebase issue, which is fairly unlikely, and we get plenty of end-to-end coverage of this library in Materialize.

@pH14 pH14 force-pushed the tcp-user-timeout branch from 18ee7eb to 1b30118 Compare March 27, 2023 21:08
@pH14 pH14 force-pushed the tcp-user-timeout branch from 1b30118 to 54a3c5e Compare March 27, 2023 21:10
@pH14
Copy link
Author

pH14 commented Mar 27, 2023

sfackler#1007 merged upstream. I didn't cherry-pick since the change is so few lines / it didn't apply cleanly... not sure if that's cool or not. Otherwise, I think we're good to go here, though I'm not able to merge myself (cc @benesch)

@benesch
Copy link
Member

benesch commented Mar 28, 2023

Thanks, @pH14! I figured I'd just integrate the latest upstream changes (including yours) into the master branch, following the instructions here: https://github.com/MaterializeInc/rust-postgres#integrating-upstream-changes. I "fixed" the branch protection settings (I think) to not require CI to pass.

I think a cargo update -p tokio-postgres in the Materialize repo should pick this up now!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants