
Raft: Force kill a node during snapshot retrieval goes into limbo #10225

Closed
manishrjain opened this issue Oct 29, 2018 · 6 comments
Comments

@manishrjain (Contributor) commented Oct 29, 2018

I'm seeing a strange issue where a Raft replica that has fallen behind receives a snapshot from the leader (!raft.IsEmptySnapshot(rd.Snapshot)). In Dgraph's case, retrieving the snapshot contents can take some time. If the node is force-killed during this window, then after it restarts, irrespective of the state of its write-ahead log, the leader never tries to bring it back up to date: it neither receives the snapshot again nor gets any subsequent updates.

When a node is force-killed, it gets no chance to call ReportSnapshot(SnapshotFailure). I would expect that on restart the leader would compare the node's state against its own and start streaming the (then) latest snapshot plus the proposals since then, but that doesn't seem to happen at all. See the attached logs:
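
For reference, a minimal sketch of the follower-side Ready loop being described, assuming the etcd raft API (import paths vary by etcd version); pullSnapshotFromLeader is a hypothetical stand-in for Dgraph's slow retrieval step, not its actual code:

package raftworker

import (
	"log"

	"go.etcd.io/etcd/raft"
	"go.etcd.io/etcd/raft/raftpb"
)

// runReadyLoop drains Ready from the raft node. When a non-empty snapshot
// arrives, the follower must fetch and apply its contents before advancing;
// a force kill during that fetch is the window described in this issue.
func runReadyLoop(node raft.Node, storage *raft.MemoryStorage,
	pullSnapshotFromLeader func(raftpb.Snapshot) error) {
	for rd := range node.Ready() {
		if !raft.IsEmptySnap(rd.Snapshot) {
			// This pull can take a long time; if the process is killed here,
			// the leader is never told that the transfer failed.
			if err := pullSnapshotFromLeader(rd.Snapshot); err != nil {
				log.Fatalf("snapshot retrieval failed: %v", err)
			}
			if err := storage.ApplySnapshot(rd.Snapshot); err != nil {
				log.Fatalf("apply snapshot: %v", err)
			}
		}
		if err := storage.Append(rd.Entries); err != nil {
			log.Fatalf("append entries: %v", err)
		}
		// ... persist rd.HardState, send rd.Messages, apply rd.CommittedEntries ...
		node.Advance()
	}
}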

bank-dg3 | I1029 23:23:12.410171       1 node.go:83] 3 became follower at term 2
bank-dg3 | I1029 23:23:12.410218       1 node.go:83] raft.node: 3 elected leader 1 at term 2
bank-dg3 | I1029 23:23:12.418372       1 node.go:83] 3 [commit: 2128, lastindex: 2128, lastterm: 2] starts to restore snapshot [index: 3104, term: 2]
bank-dg3 | I1029 23:23:12.418520       1 node.go:83] log [committed=2128, applied=2069, unstable.offset=2070, len(unstable.Entries)=59] starts to restore snapshot [index: 3104, term: 2]
bank-dg3 | I1029 23:23:12.418559       1 node.go:83] 3 restored progress of 1 [next = 3105, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
bank-dg3 | I1029 23:23:12.418610       1 node.go:83] 3 restored progress of 2 [next = 3105, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
bank-dg3 | I1029 23:23:12.418628       1 node.go:83] 3 restored progress of 3 [next = 3105, match = 3104, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
bank-dg3 | I1029 23:23:12.418644       1 node.go:83] 3 [commit: 3104] restored snapshot [index: 3104, term: 2]
bank-dg3 | I1029 23:23:12.424310       1 draft.go:689] Waiting for applyCh to reach 2069 before taking snapshot
bank-dg3 | I1029 23:23:12.424353       1 draft.go:693] ---> SNAPSHOT: {Context:id:1 group:1 addr:"dg1:7180"  Index:3104 ReadTs:3006 XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}. Group 1 from node id 1
bank-dg3 | I1029 23:23:12.530379       1 predicate.go:57] Sleeping for 15s...
bank-dg3 | I1029 23:23:14.503837       1 run.go:355] Caught Ctrl-C. Terminating now (this may take a few seconds)...

<Gets force killed after this. The following logs are after restart>

bank-dg3 | I1029 23:24:47.781449       1 draft.go:75] Node ID: 3 with GroupID: 1
bank-dg3 | I1029 23:24:47.781484       1 node.go:151] Setting raft.Config to: &{ID:3 peers:[] ElectionTick:100 HeartbeatTick:1 Storage:0xc42ac6e750 Applied:2059 MaxSizePerMsg:262144 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x1be5988}
bank-dg3 | I1029 23:24:47.781555       1 node.go:264] Found Snapshot.Metadata: {ConfState:{Nodes:[1 2 3] XXX_unrecognized:[]} Index:2059 Term:2 XXX_unrecognized:[]}
bank-dg3 | I1029 23:24:47.781573       1 node.go:275] Found hardstate: {Term:2 Vote:0 Commit:2069 XXX_unrecognized:[]}
bank-dg3 | I1029 23:24:47.781834       1 node.go:284] Group 1 found 12 entries
bank-dg3 | I1029 23:24:47.781840       1 draft.go:1055] Restarting node for group: 1
bank-dg3 | I1029 23:24:47.781848       1 node.go:173] Setting conf state to nodes:1 nodes:2 nodes:3 
bank-dg3 | I1029 23:24:47.782119       1 storage.go:227] Setting first index: 2060
bank-dg3 | I1029 23:24:47.782178       1 node.go:83] 3 became follower at term 2
bank-dg3 | I1029 23:24:47.782212       1 node.go:83] newRaft 3 [peers: [1,2,3], term: 2, commit: 2069, applied: 2059, lastindex: 2070, lastterm: 2]
bank-dg3 | I1029 23:24:47.782220       1 draft.go:1075] Restart node complete
bank-dg3 | I1029 23:24:47.782488       1 groups.go:519] Got address of a Zero server: zero1:5080
bank-dg3 | I1029 23:24:47.789679       1 draft.go:314] Skipping snapshot at 2059, because found one at 2059
bank-dg3 | I1029 23:24:50.382678       1 node.go:83] 3 is starting a new election at term 2
bank-dg3 | I1029 23:24:50.382844       1 node.go:83] 3 became pre-candidate at term 2
bank-dg3 | I1029 23:24:50.382868       1 node.go:83] 3 received MsgPreVoteResp from 3 at term 2
bank-dg2 | I1029 23:24:50.390952       1 node.go:83] 2 [logterm: 2, index: 3105, vote: 0] rejected MsgPreVote from 3 [logterm: 2, index: 2070] at term 2
bank-dg2 | W1029 23:24:50.391129       1 node.go:341] No healthy connection to node Id: 3 addr: [dg3:7183], err: Unhealthy connection
bank-dg3 | I1029 23:24:50.389614       1 node.go:83] 3 [logterm: 2, index: 2070] sent MsgPreVote request to 1 at term 2
bank-dg3 | I1029 23:24:50.389867       1 node.go:83] 3 [logterm: 2, index: 2070] sent MsgPreVote request to 2 at term 2
bank-dg1 | I1029 23:24:50.392675       1 node.go:83] 1 [logterm: 2, index: 3105, vote: 1] rejected MsgPreVote from 3 [logterm: 2, index: 2070] at term 2
bank-dg3 | I1029 23:24:52.407004       1 node.go:83] 3 became follower at term 2
bank-dg3 | I1029 23:24:52.407153       1 node.go:83] raft.node: 3 elected leader 1 at term 2

Now bank-dg3 is in limbo. No updates arrive; it just sits waiting, and raft.Ready never fires.

Any ideas what is going on?

P.S. This is related to: hypermodeinc/dgraph#2698
P.P.S. The 15s sleep is artificially induced to help narrow this down; I can reproduce the limbo reliably.

@manishrjain (Contributor, Author) commented:

Some more debugging. It looks like the leader records in memory that the last action for the follower was to apply a snapshot, and it pauses all probing until that action succeeds (state = ProgressStateSnapshot, waiting = true). If the follower gets killed and restarts, the leader makes no effort to send the snapshot again (even if it has a newer one).

I think this could be made more robust by having the leader retry when the node restarts.

I1030 01:15:03.052501       1 draft.go:318] Creating snapshot at index: 2119. ReadTs: 2048.
I1030 01:15:03.053878       1 storage.go:532] CreateSnapshot i=2119, cs=nodes:1 nodes:2 nodes:3 
I1030 01:15:57.463246       1 node.go:83] 1 [logterm: 2, index: 2120, vote: 1] rejected MsgPreVote from 3 [logterm: 2, index: 1092] at term 2
I1030 01:16:01.444556       1 node.go:83] 1 [logterm: 2, index: 2120, vote: 1] rejected MsgPreVote from 3 [logterm: 2, index: 1092] at term 2
I1030 01:16:03.695864       1 pool.go:183] Connection established with dg3:7183
I1030 01:16:03.711770       1 node.go:79] 1 [firstindex: 2120, commit: 2120] sent snapshot[index: 2119, term: 2] to 3 [next = 1349, match = 1092, state = ProgressStateReplicate, waiting = false, pendingSnapshot = 0]
I1030 01:16:03.711828       1 node.go:79] 1 paused sending replication messages to 3 [next = 1349, match = 1092, state = ProgressStateSnapshot, waiting = true, pendingSnapshot = 2119]

@bdarnell (Contributor) commented:

How is the leader supposed to know that the follower has restarted? In CockroachDB, snapshots are pushed from the leader over TCP, which lets the leader detect failure and call ReportSnapshot appropriately. If you're sending snapshots via a pull by the follower, I think your leader should have a timeout and call ReportSnapshot if the pull is not started in time (or slows/stops once it is running). Either way, this is mainly a job for the application-provided transport code; I don't see a great way for the raft state machine to handle this on its own. (Resetting from ProgressStateSnapshot to ProgressStateProbe after some number of ticks would help keep the node from getting stuck forever, but the ideal value would be application-dependent and would be best integrated with the transport.)
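
A rough sketch of that leader-side timeout, assuming the etcd raft API; the ack channel here is a hypothetical signal fed by the application's transport when the follower confirms the transfer:

package raftworker

import (
	"time"

	"go.etcd.io/etcd/raft"
)

// superviseSnapshot waits for the follower to acknowledge a snapshot transfer.
// If no ACK arrives before the timeout, it reports a failure so the leader
// leaves ProgressStateSnapshot and resumes probing the follower.
func superviseSnapshot(node raft.Node, followerID uint64,
	ack <-chan struct{}, timeout time.Duration) {
	select {
	case <-ack:
		node.ReportSnapshot(followerID, raft.SnapshotFinish)
	case <-time.After(timeout):
		node.ReportSnapshot(followerID, raft.SnapshotFailure)
	}
}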

@manishrjain (Contributor, Author) commented:

Thanks for the explanation, @bdarnell. The purpose of ReportSnapshot wasn't clear before, due to the lack of documentation around the call. I've created a PR to that effect.

For the fix, I made the snapshot pull bi-directional, so the follower can send an ACK back to the leader. If the ACK fails, the leader reports a snapshot failure and resumes log probing. That fixed the issue.
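
A hedged sketch of how that ACK path might look on the leader side; streamSnapshotAndWaitForACK is a hypothetical transport helper, not Dgraph's actual function:

package raftworker

import "go.etcd.io/etcd/raft"

// sendSnapshotToFollower streams the snapshot over a bi-directional channel
// and waits for the follower's ACK. On any error (including the follower being
// force-killed mid-transfer), it reports failure so the leader resumes probing.
func sendSnapshotToFollower(node raft.Node, followerID uint64,
	streamSnapshotAndWaitForACK func(peer uint64) error) {
	if err := streamSnapshotAndWaitForACK(followerID); err != nil {
		node.ReportSnapshot(followerID, raft.SnapshotFailure)
		return
	}
	node.ReportSnapshot(followerID, raft.SnapshotFinish)
}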

P.S. Congrats on launching CockroachDB as a service. Great work!

@brk0v commented Oct 31, 2018

@manishrjain As an idea: what about implementing the snapshot logic a bit differently? The Raft snapshot could contain only meta information about your backend storage. The leader would send the follower a very tiny snapshot and run ReportSnapshot() immediately. The follower would then use this meta information to connect to the leader over a side channel and download the data in the raft Ready() goroutine before sending rd.Messages, so the leader would resend the snapshot if the follower is force-killed.
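
A rough follower-side sketch of that idea, assuming the etcd raft API; snapshotMeta and pullFromLeader are made-up names, and the key point is that the raft snapshot only carries a small descriptor:

package raftworker

import (
	"encoding/json"

	"go.etcd.io/etcd/raft"
)

// snapshotMeta is the tiny payload carried in the raft snapshot: just enough
// for the follower to pull the real data over a side channel.
type snapshotMeta struct {
	LeaderAddr string // where to pull the backend data from
	ReadTs     uint64 // which version of the backend store to pull
}

// handleMetaSnapshot decodes the descriptor and downloads the real data before
// the snapshot is applied. If the node dies during the download, it never
// applies the snapshot, so the leader simply resends the small descriptor.
func handleMetaSnapshot(rd raft.Ready, storage *raft.MemoryStorage,
	pullFromLeader func(snapshotMeta) error) error {
	if raft.IsEmptySnap(rd.Snapshot) {
		return nil
	}
	var meta snapshotMeta
	if err := json.Unmarshal(rd.Snapshot.Data, &meta); err != nil {
		return err
	}
	if err := pullFromLeader(meta); err != nil {
		return err
	}
	return storage.ApplySnapshot(rd.Snapshot)
}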

It might also be interesting for you to check out issue #10219.

@manishrjain (Contributor, Author) commented:

That's close to what we're already doing:
https://github.com/dgraph-io/dgraph/blob/master/worker/draft.go#L675-L710

Re: the report-immediately idea. Reporting success doesn't help: the leader would still block sending any updates until the follower has advanced its state. And the follower should really only advance its Raft state once it has the snapshot -- we don't want a lagging follower to artificially advance its state (without having the actual data) and end up becoming leader.

@brk0v commented Nov 1, 2018

@manishrjain
Yes, I can see now that you changed your snapshot logic (I had actually asked a question about the old one here: https://discuss.dgraph.io/t/worker-raft-and-snapshot-logic/3473).

Regarding the bi-directional communication: I don't know why, but etcdserver/api/rafthttp works for me without a bi-directional channel for pulling snapshots. My guess is that the reason is the built-in Prober, or maybe something else (there is an additional goroutine for raft heartbeats, for example: https://github.com/etcd-io/etcd/blob/e8b940f268a80c14f7082589f60cbfd3de531d12/etcdserver/api/rafthttp/peer.go#L190).

I haven't had time to dig into this part. With the etcdserver transport I can kill/block a node while it is pulling a snapshot, then restart it, and it gets a new snapshot from the leader without problems. Maybe that will help you with the new snapshot design.

P.S. Thank you for Badger ;)
