Skip to content
This repository has been archived by the owner on Jun 3, 2020. It is now read-only.

Validator disconnect on step regression #323

Open
mdyring opened this issue Jul 28, 2019 · 7 comments
Open

Validator disconnect on step regression #323

mdyring opened this issue Jul 28, 2019 · 7 comments

Comments

@mdyring
Copy link
Contributor

mdyring commented Jul 28, 2019

In below log, 5 validators are connected to single tmkms process on same chain-id. It appears that during a Proposal, some validators (.240 and .108 below) get disconnected due to step regression.

Is this as expected? It does not seem to happen with validators attempting to double sign with different block ids, so I would think not.

Disconnecting seems undesirable when trying to do HA. ;-)

tmkms[21952]: 18:52:22 [info] [gaia-13004@tcp://18.196.17.223:26659] signed Proposal:60E46FCB09 at h/r/s 88801/0/3 (0 ms)
tmkms[21952]: 18:52:22 [error] [gaia-13004:tcp://35.156.81.246:26659] attempted double sign at h/r/s: 88801/0/3 (60E46FCB09 != 85CC5AD4DD)
tmkms[21952]: 18:52:22 [error] [gaia-13004:tcp://54.93.169.244:26659] attempted double sign at h/r/s: 88801/0/3 (60E46FCB09 != 9DFE0E1AE9)
tmkms[21952]: 18:52:22 [info] [gaia-13004@tcp://18.196.17.223:26659] signed PreVote:60E46FCB09 at h/r/s 88801/0/6 (0 ms)
tmkms[21952]: 18:52:22 [error] [gaia-13004@tcp://35.158.126.240:26659] attempted double sign: step regression: round regression at height:88801 round:0 last step:6 new step:3
tmkms[21952]: 18:52:22 [error] [gaia-13004@tcp://18.196.63.108:26659] attempted double sign: step regression: round regression at height:88801 round:0 last step:6 new step:3
tmkms[21952]: 18:52:22 [info] [gaia-13004@tcp://35.156.81.246:26659] signed PreVote:60E46FCB09 at h/r/s 88801/0/6 (0 ms)
tmkms[21952]: 18:52:22 [info] [gaia-13004@tcp://54.93.169.244:26659] signed PreVote:60E46FCB09 at h/r/s 88801/0/6 (0 ms)
tmkms[21952]: 18:52:22 [info] [gaia-13004@tcp://35.156.81.246:26659] signed PreCommit:60E46FCB09 at h/r/s 88801/0/6 (0 ms)
tmkms[21952]: 18:52:22 [info] [gaia-13004@tcp://18.196.17.223:26659] signed PreCommit:60E46FCB09 at h/r/s 88801/0/6 (0 ms)
tmkms[21952]: 18:52:22 [info] [gaia-13004@tcp://54.93.169.244:26659] signed PreCommit:60E46FCB09 at h/r/s 88801/0/6 (0 ms)
tmkms[21952]: 18:52:23 [info] KMS node ID: 20E2B9BE81DD6CBB410B14FB59EBD4DBC9E0228C
tmkms[21952]: 18:52:23 [info] KMS node ID: 20E2B9BE81DD6CBB410B14FB59EBD4DBC9E0228C
tmkms[21952]: 18:52:23 [warn] [gaia-13004] 35.158.126.240:26659: unverified validator peer ID! (ADCA89A67DAA5FFC2D2EFFD8AD33B1DC8DCD9466)
tmkms[21952]: 18:52:23 [info] [gaia-13004@tcp://35.158.126.240:26659] connected to validator successfully
tmkms[21952]: 18:52:23 [warn] [gaia-13004] 18.196.63.108:26659: unverified validator peer ID! (00ACD9BDBE36B2933231545AEBF1D799DD06D83B)
tmkms[21952]: 18:52:23 [info] [gaia-13004@tcp://18.196.63.108:26659] connected to validator successfully
@tarcieri
Copy link
Contributor

tarcieri commented Jul 29, 2019

@mdyring I'd suggest opening a corresponding issue on https://github.com/tendermint/tendermint since I do not believe the KMS is closing the connection here, but rather my best guess is Tendermint is closing the connection after receiving a double-signing error response. It'd be good to confirm these are both the case, though.

@mdyring
Copy link
Contributor Author

mdyring commented Jul 29, 2019

@tarcieri done, thanks. From looking at Tendermint log it looks like it sees the connection close, I have added log output to tendermint/tendermint#3844.

@tarcieri
Copy link
Contributor

tarcieri commented Jul 29, 2019

Here's the relevant KMS codepath. It returns an error response to Tendermint, but otherwise considers the request successfully handled and therefore it shouldn't abort the connection.

You can also turn on debug logging on the KMS by passing the -v or --verbose option to tmkms start.

@tarcieri
Copy link
Contributor

tarcieri commented Jul 29, 2019

From looking at Tendermint log it looks like it sees the connection close, I have added log output to tendermint/tendermint#3844.

I looked at the attached log output but I don't see what you're talking about re: Tendermint showing the connection being remotely closed by the KMS.

@mdyring
Copy link
Contributor Author

mdyring commented Jul 29, 2019

Sorry, should have been clearer: First line has a "EOF" at the end, indicating the connection was closed.

This seems to indicate the connection closure is unexpected for Tendermint, but it could also just be a case of Tendermint closing the connection earlier and then failing on a read later and logging that.

@ebuchman
Copy link
Contributor

ebuchman commented Aug 23, 2019

Here's the relevant KMS codepath.

Isn't this only for the StateErrorKind::DoubleSign ? Martin's error seems to be of StateErrorKind::StepRegression ie. https://github.com/tendermint/kms/blob/3b3aabf8eab64959dad29ee2d5b747dc5fe813ac/src/chain/state.rs#L92-L99 which means sign would return an error, which I think exits the request_loop ?

@tarcieri
Copy link
Contributor

@ebuchman good catch, that appears to be what's happening.

Perhaps the KMS should send back an error to Tendermint on StateErrorKind::StepRegression as well.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants