Never closed sockets when remote ZYRE node goes offline #4729

Open
stephan57160 opened this issue Aug 20, 2024 · 3 comments

@stephan57160
Contributor

Issue description

On our ZYRE production server, once in a while, we observe never-closed sockets.
Sometimes, it goes up to 200 sockets to the same remote ZYRE node.

Environment

  • libzmq version (commit hash if unreleased): 3.4
  • OS: reproduced on
    • Linux CentOS (32 & 64 bits - x86 and ARM),
    • Rocky (64 bits) (x86)

Minimal test code / Steps to reproduce the issue

  1. Start ZYRE node A
  2. Start ZYRE node B
  3. On node A, 2 TCP sockets are seen with Node B:
    • Node A connected to Node B (used to send data to B).
    • Node B connected to Node A (used to receive data from B).
  4. Node B goes offline (out of WiFi coverage, Ethernet cable unplugged, Windows hibernation, ...)
  5. On node A, after some time, the ZYRE layer detects that node B is no longer present, and peer B is destroyed along with the socket to it (node A to B).

What's the actual result? (include assertion message & call stack if applicable)

The socket from node B to node A is never closed, even if

  • the node B application is restarted, or
  • node B is rebooted.

Note:
This is not visible if the application on node B is stopped properly (thanks to the TCP layer sending a TCP RESET).

What's the expected result?

Sockets from remote nodes should be closed automatically when the remote disappears:

  • Either the ZYRE peer destruction should close them,
  • or TCP KEEPALIVE, enabled from the ZYRE application, should.

I could not get a working implementation in either of those 2 cases.
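
To make the KEEPALIVE option concrete: on Linux, keepalive is governed by an idle time, a probe interval and a probe count, and a silently vanished peer is dropped roughly after idle + interval × count seconds. The sketch below only does that arithmetic; the struct and function names are hypothetical and not part of LIBZMQ or ZYRE.

```cpp
// Illustrative keepalive settings; the field names are hypothetical.
struct KeepaliveParams {
    int idle_s;      // quiet time before the first probe is sent
    int interval_s;  // gap between successive unanswered probes
    int probes;      // unanswered probes tolerated before the kernel gives up
};

// Approximate time for the kernel to declare a silent peer dead,
// counted from the last data exchanged on the connection.
constexpr int worst_case_detection_seconds(const KeepaliveParams& p)
{
    return p.idle_s + p.interval_s * p.probes;
}
```

With the Linux system defaults (7200 s idle, 75 s interval, 9 probes) this gives 7875 s, and only for sockets that have keepalive enabled at all. Shorter values such as 30 / 5 / 3 would bring it down to about 45 s.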

Possible solution

I dug into LIBZMQ and ZYRE for quite some time.
I tried different approaches, but I never managed to get access to the ACCEPT()ed socket
in this particular scenario.

Finally, I have a draft workaround that enables TCP KEEPALIVE right after a particular ACCEPT() in tcp_listener.cpp.
Basically, the idea is:

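  sock = accept(s_);                   // descriptor of the incoming connection (node B to node A)
  ...
  tune_tcp_keepalives(sock, x, y, y);  // enable TCP KEEPALIVE on that descriptor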
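
For readers unfamiliar with what such a call boils down to, here is a minimal standalone sketch using plain POSIX/Linux socket options. It is not the code from LIBZMQ or from the draft workaround: `enable_keepalive` and `read_option` are hypothetical helpers, and `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` are assumed to be available, as they are on Linux.

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper (not libzmq's tune_tcp_keepalives): turns on TCP
// keepalive for a connected or freshly accept()ed descriptor.
// Returns true only if every setsockopt call succeeds.
bool enable_keepalive(int fd, int idle_s, int interval_s, int probes)
{
    const int on = 1;
    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) == 0
        && setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof idle_s) == 0
        && setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_s, sizeof interval_s) == 0
        && setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof probes) == 0;
}

// Reads one integer socket option back, or returns -1 on error.
int read_option(int fd, int level, int name)
{
    int value = 0;
    socklen_t len = sizeof value;
    return getsockopt(fd, level, name, &value, &len) == 0 ? value : -1;
}
```

In an accept path this would be called on the descriptor returned by `accept()`, before it is handed to the rest of the stack. The values to pass (the x, y, y above) are a tuning decision that this sketch does not settle.
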

@keith-dev
Contributor

It sounds reasonable to me. Have you considered submitting a change that fixes the problem?

@stephan57160
Contributor Author

@keith-dev I can provide a draft PR this afternoon, with what we have been using for a few months.
It works for us, but it might need more attention/suggestions.

@stephan57160
Contributor Author

See PR 4761.
