
Raft: Force kill a node during snapshot retrieval goes into limbo #10225

Closed
manishrjain opened this issue Oct 29, 2018 · 6 comments
Comments

@manishrjain (Contributor) commented Oct 29, 2018

I'm seeing a strange issue where a Raft replica that has fallen behind receives a snapshot from the leader (!raft.IsEmptySnapshot(rd.Snapshot)). In Dgraph's case, retrieving the snapshot contents can take some time. If the node is force-killed during this window, then after it restarts, irrespective of the state of its write-ahead log, the leader never tries to bring it back up to date: it neither receives the snapshot again nor gets any subsequent updates.

When a node is force-killed, it gets no chance to call ReportSnapshot(SnapshotFailure). I would expect that on restart the leader would compare the node's state against its own and start streaming the (then) latest snapshot plus the proposals since then, but that doesn't seem to happen at all. See the attached logs:
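
For reference, a minimal sketch of the follower-side Ready loop being described, assuming the etcd raft API (import paths vary by etcd version); pullSnapshotFromLeader is a hypothetical stand-in for Dgraph's slow retrieval step, not its actual code:

package raftworker

import (
	"log"

	"go.etcd.io/etcd/raft"
	"go.etcd.io/etcd/raft/raftpb"
)

// runReadyLoop drains Ready from the raft node. When a non-empty snapshot
// arrives, the follower must fetch and apply its contents before advancing;
// a force kill during that fetch is the window described in this issue.
func runReadyLoop(node raft.Node, storage *raft.MemoryStorage,
	pullSnapshotFromLeader func(raftpb.Snapshot) error) {
	for rd := range node.Ready() {
		if !raft.IsEmptySnap(rd.Snapshot) {
			// This pull can take a long time; if the process is killed here,
			// the leader is never told that the transfer failed.
			if err := pullSnapshotFromLeader(rd.Snapshot); err != nil {
				log.Fatalf("snapshot retrieval failed: %v", err)
			}
			if err := storage.ApplySnapshot(rd.Snapshot); err != nil {
				log.Fatalf("apply snapshot: %v", err)
			}
		}
		if err := storage.Append(rd.Entries); err != nil {
			log.Fatalf("append entries: %v", err)
		}
		// ... persist rd.HardState, send rd.Messages, apply rd.CommittedEntries ...
		node.Advance()
	}
}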

bank-dg3 | I1029 23:23:12.410171       1 node.go:83] 3 became follower at term 2
bank-dg3 | I1029 23:23:12.410218       1 node.go:83] raft.node: 3 elected leader 1 at term 2
bank-dg3 | I1029 23:23:12.418372       1 node.go:83] 3 [commit: 2128, lastindex: 2128, lastterm: 2] starts to restore snapshot [index: 3104, term: 2]
bank-dg3 | I1029 23:23:12.418520       1 node.go:83] log [committed=2128, applied=2069, unstable.offset=2070, len(unstable.Entries)=59] starts to restore snapshot [index: 3104, term: 2]
bank-dg3 | I1029 23:23:12.418559       1 node.go:83] 3 restored progress of 1 [next = 3105, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
bank-dg3 | I1029 23:23:12.418610       1 node.go:83] 3 restored progress of 2 [next = 3105, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
bank-dg3 | I1029 23:23:12.418628       1 node.go:83] 3 restored progress of 3 [next = 3105, match = 3104, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
bank-dg3 | I1029 23:23:12.418644       1 node.go:83] 3 [commit: 3104] restored snapshot [index: 3104, term: 2]
bank-dg3 | I1029 23:23:12.424310       1 draft.go:689] Waiting for applyCh to reach 2069 before taking snapshot
bank-dg3 | I1029 23:23:12.424353       1 draft.go:693] ---> SNAPSHOT: {Context:id:1 group:1 addr:"dg1:7180"  Index:3104 ReadTs:3006 XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}. Group 1 from node id 1
bank-dg3 | I1029 23:23:12.530379       1 predicate.go:57] Sleeping for 15s...
bank-dg3 | I1029 23:23:14.503837       1 run.go:355] Caught Ctrl-C. Terminating now (this may take a few seconds)...

<Gets force killed after this. The following logs are after restart>

bank-dg3 | I1029 23:24:47.781449       1 draft.go:75] Node ID: 3 with GroupID: 1
bank-dg3 | I1029 23:24:47.781484       1 node.go:151] Setting raft.Config to: &{ID:3 peers:[] ElectionTick:100 HeartbeatTick:1 Storage:0xc42ac6e750 Applied:2059 MaxSizePerMsg:262144 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x1be5988}
bank-dg3 | I1029 23:24:47.781555       1 node.go:264] Found Snapshot.Metadata: {ConfState:{Nodes:[1 2 3] XXX_unrecognized:[]} Index:2059 Term:2 XXX_unrecognized:[]}
bank-dg3 | I1029 23:24:47.781573       1 node.go:275] Found hardstate: {Term:2 Vote:0 Commit:2069 XXX_unrecognized:[]}
bank-dg3 | I1029 23:24:47.781834       1 node.go:284] Group 1 found 12 entries
bank-dg3 | I1029 23:24:47.781840       1 draft.go:1055] Restarting node for group: 1
bank-dg3 | I1029 23:24:47.781848       1 node.go:173] Setting conf state to nodes:1 nodes:2 nodes:3 
bank-dg3 | I1029 23:24:47.782119       1 storage.go:227] Setting first index: 2060
bank-dg3 | I1029 23:24:47.782178       1 node.go:83] 3 became follower at term 2
bank-dg3 | I1029 23:24:47.782212       1 node.go:83] newRaft 3 [peers: [1,2,3], term: 2, commit: 2069, applied: 2059, lastindex: 2070, lastterm: 2]
bank-dg3 | I1029 23:24:47.782220       1 draft.go:1075] Restart node complete
bank-dg3 | I1029 23:24:47.782488       1 groups.go:519] Got address of a Zero server: zero1:5080
bank-dg3 | I1029 23:24:47.789679       1 draft.go:314] Skipping snapshot at 2059, because found one at 2059
bank-dg3 | I1029 23:24:50.382678       1 node.go:83] 3 is starting a new election at term 2
bank-dg3 | I1029 23:24:50.382844       1 node.go:83] 3 became pre-candidate at term 2
bank-dg3 | I1029 23:24:50.382868       1 node.go:83] 3 received MsgPreVoteResp from 3 at term 2
bank-dg2 | I1029 23:24:50.390952       1 node.go:83] 2 [logterm: 2, index: 3105, vote: 0] rejected MsgPreVote from 3 [logterm: 2, index: 2070] at term 2
bank-dg2 | W1029 23:24:50.391129       1 node.go:341] No healthy connection to node Id: 3 addr: [dg3:7183], err: Unhealthy connection
bank-dg3 | I1029 23:24:50.389614       1 node.go:83] 3 [logterm: 2, index: 2070] sent MsgPreVote request to 1 at term 2
bank-dg3 | I1029 23:24:50.389867       1 node.go:83] 3 [logterm: 2, index: 2070] sent MsgPreVote request to 2 at term 2
bank-dg1 | I1029 23:24:50.392675       1 node.go:83] 1 [logterm: 2, index: 3105, vote: 1] rejected MsgPreVote from 3 [logterm: 2, index: 2070] at term 2
bank-dg3 | I1029 23:24:52.407004       1 node.go:83] 3 became follower at term 2
bank-dg3 | I1029 23:24:52.407153       1 node.go:83] raft.node: 3 elected leader 1 at term 2

Now bank-dg3 is in limbo. No updates arrive; it just sits waiting, and raft.Ready never fires.

Any ideas what is going on?

P.S. This is related to: hypermodeinc/dgraph#2698
P.P.S. The 15s sleep is artificially induced to help narrow this down; I can reproduce the limbo reliably.

@manishrjain (Contributor, Author) commented:

Some more debugging. It looks like the leader records in memory that the last action for the follower was to apply a snapshot, and it pauses all probing until that action succeeds (state = ProgressStateSnapshot, waiting = true). If the follower gets killed and restarts, the leader makes no effort to send the snapshot again (even if it has a newer one).

I think this could be made more robust by having the leader retry when the node restarts.

I1030 01:15:03.052501       1 draft.go:318] Creating snapshot at index: 2119. ReadTs: 2048.
I1030 01:15:03.053878       1 storage.go:532] CreateSnapshot i=2119, cs=nodes:1 nodes:2 nodes:3 
I1030 01:15:57.463246       1 node.go:83] 1 [logterm: 2, index: 2120, vote: 1] rejected MsgPreVote from 3 [logterm: 2, index: 1092] at term 2
I1030 01:16:01.444556       1 node.go:83] 1 [logterm: 2, index: 2120, vote: 1] rejected MsgPreVote from 3 [logterm: 2, index: 1092] at term 2
I1030 01:16:03.695864       1 pool.go:183] Connection established with dg3:7183
I1030 01:16:03.711770       1 node.go:79] 1 [firstindex: 2120, commit: 2120] sent snapshot[index: 2119, term: 2] to 3 [next = 1349, match = 1092, state = ProgressStateReplicate, waiting = false, pendingSnapshot = 0]
I1030 01:16:03.711828       1 node.go:79] 1 paused sending replication messages to 3 [next = 1349, match = 1092, state = ProgressStateSnapshot, waiting = true, pendingSnapshot = 2119]

@bdarnell (Contributor) commented:

How is the leader supposed to know that the follower has restarted? In CockroachDB, snapshots are pushed from the leader over TCP, which lets the leader detect failure and call ReportSnapshot appropriately. If you're sending snapshots via a pull by the follower, I think your leader should have a timeout and call ReportSnapshot if the pull is not started in time (or slows/stops once it is running). Either way, this is mainly a job for the application-provided transport code; I don't see a great way for the raft state machine to handle this on its own. (Resetting from ProgressStateSnapshot to ProgressStateProbe after some number of ticks would help keep the node from getting stuck forever, but the ideal value would be application-dependent and would be best integrated with the transport.)
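
A rough sketch of that leader-side timeout, assuming the etcd raft API; the ack channel here is a hypothetical signal fed by the application's transport when the follower confirms the transfer:

package raftworker

import (
	"time"

	"go.etcd.io/etcd/raft"
)

// superviseSnapshot waits for the follower to acknowledge a snapshot transfer.
// If no ACK arrives before the timeout, it reports a failure so the leader
// leaves ProgressStateSnapshot and resumes probing the follower.
func superviseSnapshot(node raft.Node, followerID uint64,
	ack <-chan struct{}, timeout time.Duration) {
	select {
	case <-ack:
		node.ReportSnapshot(followerID, raft.SnapshotFinish)
	case <-time.After(timeout):
		node.ReportSnapshot(followerID, raft.SnapshotFailure)
	}
}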

@manishrjain (Contributor, Author) commented:

Thanks for the explanation, @bdarnell. The purpose of ReportSnapshot wasn't clear before, due to the lack of documentation around the call. I've created a PR to that effect.

For the fix, I made the snapshot pull bi-directional, so the follower can send an ACK back to the leader. If the ACK fails, the leader reports a snapshot failure and resumes log probing. That fixed the issue.
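
A hedged sketch of how that ACK path might look on the leader side; streamSnapshotAndWaitForACK is a hypothetical transport helper, not Dgraph's actual function:

package raftworker

import "go.etcd.io/etcd/raft"

// sendSnapshotToFollower streams the snapshot over a bi-directional channel
// and waits for the follower's ACK. On any error (including the follower being
// force-killed mid-transfer), it reports failure so the leader resumes probing.
func sendSnapshotToFollower(node raft.Node, followerID uint64,
	streamSnapshotAndWaitForACK func(peer uint64) error) {
	if err := streamSnapshotAndWaitForACK(followerID); err != nil {
		node.ReportSnapshot(followerID, raft.SnapshotFailure)
		return
	}
	node.ReportSnapshot(followerID, raft.SnapshotFinish)
}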

P.S. Congrats on launching CockroachDB as a service. Great work!

@brk0v commented Oct 31, 2018

@manishrjain As an idea: what about implementing the snapshot logic a bit differently? The Raft snapshot could contain only meta information about your backend storage. The leader would send the follower a very tiny snapshot and run ReportSnapshot() immediately. The follower would then use this meta information to connect to the leader over a side channel and download the data in the raft Ready() goroutine before sending rd.Messages, so the leader would resend the snapshot if the follower is force-killed.
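
A rough follower-side sketch of that idea, assuming the etcd raft API; snapshotMeta and pullFromLeader are made-up names, and the key point is that the raft snapshot only carries a small descriptor:

package raftworker

import (
	"encoding/json"

	"go.etcd.io/etcd/raft"
)

// snapshotMeta is the tiny payload carried in the raft snapshot: just enough
// for the follower to pull the real data over a side channel.
type snapshotMeta struct {
	LeaderAddr string // where to pull the backend data from
	ReadTs     uint64 // which version of the backend store to pull
}

// handleMetaSnapshot decodes the descriptor and downloads the real data before
// the snapshot is applied. If the node dies during the download, it never
// applies the snapshot, so the leader simply resends the small descriptor.
func handleMetaSnapshot(rd raft.Ready, storage *raft.MemoryStorage,
	pullFromLeader func(snapshotMeta) error) error {
	if raft.IsEmptySnap(rd.Snapshot) {
		return nil
	}
	var meta snapshotMeta
	if err := json.Unmarshal(rd.Snapshot.Data, &meta); err != nil {
		return err
	}
	if err := pullFromLeader(meta); err != nil {
		return err
	}
	return storage.ApplySnapshot(rd.Snapshot)
}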

It might also be interesting for you to check out issue #10219.

@manishrjain (Contributor, Author) commented:

That's close to what we're already doing:
https://github.com/dgraph-io/dgraph/blob/master/worker/draft.go#L675-L710

Re: the report-immediately idea. Reporting success doesn't help: the leader would still block sending any updates until the follower has advanced its state. And the follower should really only advance its Raft state once it has the snapshot -- we don't want a lagging follower to artificially advance its state (without having the actual data) and end up becoming leader.

@brk0v commented Nov 1, 2018

@manishrjain
Yes, I can see now that you changed your snapshot logic (I had actually asked a question about the old one here: https://discuss.dgraph.io/t/worker-raft-and-snapshot-logic/3473).

Regarding the bi-directional communication: I don't know why, but etcdserver/api/rafthttp works for me without a bi-directional channel for pulling snapshots. My guess is that the reason is the built-in Prober, or maybe something else (there is an additional goroutine for raft heartbeats, for example: https://github.com/etcd-io/etcd/blob/e8b940f268a80c14f7082589f60cbfd3de531d12/etcdserver/api/rafthttp/peer.go#L190).

I haven't had time to dig into this part. With the etcdserver transport I can kill/block a node while it is pulling a snapshot, then restart it, and it gets a new snapshot from the leader without problems. Maybe that will help you with the new snapshot design.

P.S. Thank you for Badger ;)
