
Hang on shutdown #541

Open
hmel opened this issue Mar 10, 2021 · 13 comments

hmel commented Mar 10, 2021

Compiled libbitcoin 3.6 against boost 1.75.
Starting from a freshly initialized directory, I let bs run for a couple of days without problems. After that I hit Ctrl-C to shut it down. Since then I have waited more than two hours for the server to shut down, but nothing happens. Attached is the backtrace for the process.
gdb.txt

evoskuil (Member) commented:

It appears that the server's zeromq worker thread did not terminate. What version of zeromq did you build against?

evoskuil self-assigned this Mar 15, 2021
evoskuil added the bug label Mar 15, 2021
evoskuil modified the milestones: 3.1, 3.0 Mar 15, 2021

hmel commented Mar 18, 2021

Stock zeromq package from Debian testing, version 4.3.4-1.

evoskuil (Member) commented:

Have you attempted to sync with the server endpoints shut down as I suggested?


hmel commented Mar 25, 2021

Yes, the problem seems to go away. I have successfully stopped and restarted multiple times up to about 450k blocks.

evoskuil (Member) commented:

That's the problem with a race condition: it can be hard to reproduce, which makes it hard to spot.


hmel commented Apr 26, 2021

I got another hang on shutdown, this time with the server endpoints disabled. The machine was under disk I/O load because it was compiling at the same time as the bs server was doing its initial sync. The local Bitcoin Core node was NOT running, but I had outbound_connections = 10.
Attached are the backtrace, log and config.
backtrace.txt
bs.cfg.txt
log.txt
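
For reference, a minimal sketch of how this setting might appear in bs.cfg. Only outbound_connections = 10 is taken from the comment above; the section name and comment are assumptions based on stock configs, so check the config generated by your own build.

```
[network]
# Number of outbound P2P connections maintained during initial sync.
outbound_connections = 10
```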


evoskuil commented Apr 26, 2021

I’ve rewritten bc::system::resubscriber and bc::system::subscriber. These are at the core of the network message pump that drives network, node and server. These have been problematic for a long time, since they are the only possible source of deadlock due to reentry in the entire code base. I found a way to not only preclude any chance of reentry but to also free up message handlers to execute in parallel (basically the same problem). This will prevent any complex deadlocks and should dramatically improve performance due to allowing full parallelism.
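
To make the reentry problem concrete, here is a minimal sketch of the pattern being eliminated. This is not libbitcoin's actual subscriber/resubscriber code; the class and names are hypothetical, and it only illustrates how invoking handlers while holding the subscriber's own mutex deadlocks when a handler re-enters the subscriber on the same thread.

```cpp
#include <functional>
#include <mutex>
#include <vector>

// Hypothetical reentrancy-prone subscriber (illustration only).
class subscriber
{
public:
    using handler = std::function<void(int)>;

    void subscribe(handler notify)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        handlers_.push_back(std::move(notify));
    }

    void relay(int message)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        for (const auto& notify: handlers_)
            notify(message);            // handlers are invoked under the lock
    }

private:
    std::mutex mutex_;
    std::vector<handler> handlers_;
};

int main()
{
    subscriber subs;

    // The handler re-enters subscribe() on the same thread while relay()
    // still holds mutex_, so the second lock acquisition never succeeds
    // (formally undefined behavior for std::mutex, in practice a deadlock).
    subs.subscribe([&subs](int) { subs.subscribe([](int) {}); });
    subs.relay(42);                     // hangs here
}
```

Moving handler invocation out of the critical section is one way to preclude reentry; per the description above, the rewrite also frees handlers to run in parallel.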

Once I’m done unit testing I’ll commit this change on new Libbitcoin v3 and v4 branches. This will allow it to be tried out by simply modifying install.sh to pull libbitcoin-system from this feature branch.

It's possible that the increased parallelism could expose other issues in the implementation, but I don't find that likely. It also remains possible for deadlock to occur, though not due to a reentry attempt; all critical sections are now provably straight-path. However, due to shared threadpool allocation (which I plan to eliminate), starvation could still cause a deadlock, but only if there is a math error in the code.

The only other source of deadlock would be a database result object being held while another is requested. This could only happen in blockchain result handlers, which are easily inspected, and I think are clean. The deadlock arises when a database resize waits on release of the first result object, while the holder of that object waits on the next result being returned, which is in turn waiting on the database resize.

Both of these deadlock potentials are easy to spot and fix. The reentry issue was complex and really only resolvable by eliminating the subscriber code that made it possible. So I expect this will be resolved soon.


evoskuil commented Apr 26, 2021

This issue does not pertain in any way to machine resources or other applications. It’s an internal threading design flaw that produces a deadlock race condition. So sometimes you see it and sometimes you don’t. If a deadlock has occurred on a given thread(s) it will always manifest as a shutdown hang, because the shutdown waits on thread coalescence and that will never happen once deadlock has been reached.

Coalescence is the process of threads completing their work and rejoining the threadpool. When stop is signaled, no more work is queued, and when the threadpool is fully joined, the network/node/server exits.
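
As a rough illustration of coalescence (hypothetical code, not libbitcoin's actual threadpool): stop marks the queue as closed so no more work is accepted, workers return once the remaining work is drained, and shutdown joins each of them. If any worker is stuck in a deadlocked task it never returns, the corresponding join() blocks forever, and the process hangs exactly as described.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical thread pool showing shutdown by coalescence.
class threadpool
{
public:
    explicit threadpool(std::size_t size)
    {
        for (std::size_t i = 0; i < size; ++i)
            threads_.emplace_back([this]() { work(); });
    }

    void post(std::function<void()> task)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!stopped_)
            queue_.push(std::move(task));
        condition_.notify_one();
    }

    // Coalescence: signal stop, let each worker drain and return, then join.
    // A deadlocked worker never returns, so join() blocks and shutdown hangs.
    void stop_and_join()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopped_ = true;                    // no more work is queued after this
        }
        condition_.notify_all();
        for (auto& thread: threads_)
            thread.join();                      // hangs here if any worker is deadlocked
    }

private:
    void work()
    {
        while (true)
        {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                condition_.wait(lock, [this]() { return stopped_ || !queue_.empty(); });
                if (stopped_ && queue_.empty())
                    return;                     // this thread coalesces
                task = std::move(queue_.front());
                queue_.pop();
            }
            task();                             // a task that blocks forever prevents coalescence
        }
    }

    std::mutex mutex_;
    std::condition_variable condition_;
    std::queue<std::function<void()>> queue_;
    std::vector<std::thread> threads_;
    bool stopped_ = false;
};
```

In this sketch the owner is expected to call stop_and_join() before the pool is destroyed.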


hmel commented Apr 27, 2021

Doesn't that mean that if a deadlock occurs the server's performance will degrade? Not being able to shut it down cleanly makes the situation rather bad, since killing it would result in a corrupted database. I've been trying to do an initial sync for weeks now. I have power outages every week and have to start over. I tried stopping the server every other day and making a backup of the DB so I don't lose the progress. Is there any way to stop the server with a consistent DB and then resume it after the copy is done?


evoskuil commented May 9, 2021

If a deadlock occurs there is a bug and all subsequent behavior is undefined; it's not a performance issue. There is no way to recover. If you have a hard shutdown and the store had unflushed writes (which is detected at startup), the store is corrupt.


parazyd commented May 10, 2021

For what it's worth, this is also happening on v4 when using mainnet, both with and without libconsensus.
Testnet seems to be okay either way.

evoskuil (Member) commented:

V4 is not production code; it is incomplete and should not be expected to function.


OutiSig commented Jun 12, 2023

> That's the problem with a race condition: it can be hard to reproduce, which makes it hard to spot.

Could you please publish the compiled working bs-linux-x64 v3.6 to the Releases page?
I've been compiling bs many times, and every time this bug is present.
Thanks.
