
Hang on shutdown #541

Open
hmel opened this issue Mar 10, 2021 · 13 comments

hmel commented Mar 10, 2021

Compiled libbitcoin 3.6 against boost 1.75.
Starting from a freshly initialized directory, I let bs run for a couple of days without problems. After that I hit Ctrl-C to shut it down. Since then I have waited more than two hours for the server to shut down, but nothing happens. Attached is the backtrace for the process.
gdb.txt

evoskuil (Member) commented:

It appears that the server's zeromq worker thread did not terminate. What version of zeromq did you build against?

evoskuil self-assigned this Mar 15, 2021
evoskuil added the bug label Mar 15, 2021
evoskuil modified the milestones: 3.1, 3.0 Mar 15, 2021

hmel commented Mar 18, 2021

Stock zeromq package from Debian testing, version 4.3.4-1.

evoskuil (Member) commented:

Have you attempted to sync with the server endpoints shut down as I suggested?


hmel commented Mar 25, 2021

Yes, the problem seems to go away. I have successfully stopped and restarted multiple times up to about 450k blocks.

evoskuil (Member) commented:

That's the problem with a race condition: it can be hard to reproduce, which makes it hard to spot.


hmel commented Apr 26, 2021

I got another hang on shutdown, this time with the server endpoints disabled. The machine was under disk I/O load because it was compiling at the same time as the bs server was doing its initial sync. The local Bitcoin Core node was NOT running, but I had outbound_connections = 10.
Attached are the backtrace, log and config.
backtrace.txt
bs.cfg.txt
log.txt
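
For reference, a minimal sketch of how this setting might appear in bs.cfg. Only outbound_connections = 10 is taken from the comment above; the section name and comment are assumptions based on stock configs, so check the config generated by your own build.

```
[network]
# Number of outbound P2P connections maintained during initial sync.
outbound_connections = 10
```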


evoskuil commented Apr 26, 2021

I’ve rewritten bc::system::resubscriber and bc::system::subscriber. These are at the core of the network message pump that drives network, node and server. These have been problematic for a long time, since they are the only possible source of deadlock due to reentry in the entire code base. I found a way to not only preclude any chance of reentry but to also free up message handlers to execute in parallel (basically the same problem). This will prevent any complex deadlocks and should dramatically improve performance due to allowing full parallelism.
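
To make the reentry problem concrete, here is a minimal sketch of the pattern being eliminated. This is not libbitcoin's actual subscriber/resubscriber code; the class and names are hypothetical, and it only illustrates how invoking handlers while holding the subscriber's own mutex deadlocks when a handler re-enters the subscriber on the same thread.

```cpp
#include <functional>
#include <mutex>
#include <vector>

// Hypothetical reentrancy-prone subscriber (illustration only).
class subscriber
{
public:
    using handler = std::function<void(int)>;

    void subscribe(handler notify)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        handlers_.push_back(std::move(notify));
    }

    void relay(int message)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        for (const auto& notify: handlers_)
            notify(message);            // handlers are invoked under the lock
    }

private:
    std::mutex mutex_;
    std::vector<handler> handlers_;
};

int main()
{
    subscriber subs;

    // The handler re-enters subscribe() on the same thread while relay()
    // still holds mutex_, so the second lock acquisition never succeeds
    // (formally undefined behavior for std::mutex, in practice a deadlock).
    subs.subscribe([&subs](int) { subs.subscribe([](int) {}); });
    subs.relay(42);                     // hangs here
}
```

Moving handler invocation out of the critical section is one way to preclude reentry; per the description above, the rewrite also frees handlers to run in parallel.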

Once I’m done unit testing I’ll commit this change on new Libbitcoin v3 and v4 branches. This will allow it to be tried out by simply modifying install.sh to pull libbitcoin-system from this feature branch.

It's possible that the increased parallelism could expose other issues in the implementation, but I don't find that likely. It also remains possible for deadlock to occur, though not due to a reentry attempt; all critical sections are now provably straight-path. However, due to shared threadpool allocation (which I plan to eliminate), starvation could still cause a deadlock, but only if there is a math error in the code.

The only other source of deadlock would be a database result object being held while another is requested. This could only happen in blockchain result handlers, which are easily inspected, and I think are clean. The deadlock arises when a database resize waits on release of the first result object, while the holder of that object waits on the next result being returned, which is in turn waiting on the database resize.

Both of these deadlock potentials are easy to spot and fix. The reentry issue was complex and really only resolvable by eliminating the subscriber code that made it possible. So I expect this will be resolved soon.


evoskuil commented Apr 26, 2021

This issue does not pertain in any way to machine resources or other applications. It’s an internal threading design flaw that produces a deadlock race condition. So sometimes you see it and sometimes you don’t. If a deadlock has occurred on a given thread(s) it will always manifest as a shutdown hang, because the shutdown waits on thread coalescence and that will never happen once deadlock has been reached.

Coalescence is the process of threads completing their work and rejoining the threadpool. When stop is signaled, no more work is queued, and when the threadpool is fully joined, the network/node/server exits.
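
As a rough illustration of coalescence (hypothetical code, not libbitcoin's actual threadpool): stop marks the queue as closed so no more work is accepted, workers return once the remaining work is drained, and shutdown joins each of them. If any worker is stuck in a deadlocked task it never returns, the corresponding join() blocks forever, and the process hangs exactly as described.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical thread pool showing shutdown by coalescence.
class threadpool
{
public:
    explicit threadpool(std::size_t size)
    {
        for (std::size_t i = 0; i < size; ++i)
            threads_.emplace_back([this]() { work(); });
    }

    void post(std::function<void()> task)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!stopped_)
            queue_.push(std::move(task));
        condition_.notify_one();
    }

    // Coalescence: signal stop, let each worker drain and return, then join.
    // A deadlocked worker never returns, so join() blocks and shutdown hangs.
    void stop_and_join()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopped_ = true;                    // no more work is queued after this
        }
        condition_.notify_all();
        for (auto& thread: threads_)
            thread.join();                      // hangs here if any worker is deadlocked
    }

private:
    void work()
    {
        while (true)
        {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                condition_.wait(lock, [this]() { return stopped_ || !queue_.empty(); });
                if (stopped_ && queue_.empty())
                    return;                     // this thread coalesces
                task = std::move(queue_.front());
                queue_.pop();
            }
            task();                             // a task that blocks forever prevents coalescence
        }
    }

    std::mutex mutex_;
    std::condition_variable condition_;
    std::queue<std::function<void()>> queue_;
    std::vector<std::thread> threads_;
    bool stopped_ = false;
};
```

In this sketch the owner is expected to call stop_and_join() before the pool is destroyed.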


hmel commented Apr 27, 2021

Doesn't that mean that if a deadlock occurs the server's performance will degrade? Not being able to shut it down cleanly makes the situation rather bad, since killing it would result in a corrupted database. I've been trying to do an initial sync for weeks now. I have power outages every week and have to start over. I tried stopping the server every other day and making a backup of the DB so I don't lose the progress. Is there any way to stop the server with a consistent DB and then resume it after the copy is done?


evoskuil commented May 9, 2021

If a deadlock occurs there is a bug and all subsequent behavior is undefined; it's not a performance issue. There is no way to recover. If you have a hard shutdown and the store had unflushed writes (which is detected at startup), the store is corrupt.


parazyd commented May 10, 2021

For what it's worth, this is also happening on v4 when using mainnet, both with and without libconsensus.
Testnet seems to be okay either way.

evoskuil (Member) commented:

V4 is not production code; it is incomplete and should not be expected to function.


OutiSig commented Jun 12, 2023

> That's the problem with a race condition: it can be hard to reproduce, which makes it hard to spot.

Could you please publish the compiled working bs-linux-x64 v3.6 to the Releases page?
I've been compiling bs many times, and every time this bug is present.
Thanks.
