Reliable message delivery #132

xosmig · 2022-07-14T12:21:15Z

Context

Most distributed protocols assume reliable message delivery between non-Byzantine participants.
In practice, we also need to be able to garbage-collect messages that are no longer relevant, even if they haven't yet reached the destination (e.g., once a quorum of responses is collected, many protocols do not care if the message reaches the rest of the nodes).

We should also take into account that transport layer connections can be spontaneously lost (due to some issues in the network) and that nodes may crash and recover.

Naive approach

A naive solution would be to simply invoke SendMessage repeatedly, every second or so, until the message is no longer relevant.
This is what, I believe, is currently implemented in ISS.
A slightly less naive solution would also add acknowledgments to avoid wasting resources.

Issues with the naive approach

Retransmission is non-trivial: a static retransmission delay would be either too small (in which case we would simply overload our own networking stack) or too large (in which case we would waste time).
There is already a retransmission mechanism in the transport layer, which we should take advantage of. Indeed, if the networking module uses TCP or QUIC, unless the connection is lost, resending messages would make zero sense as the repeated messages would simply queue up in the buffer and would all be delivered.
In the current implementations of the networking module, the connections are established once and, if lost, are never recovered. Hence, in the current implementation, retransmissions do not really do anything useful (if the connection is not lost, reliable delivery is already guaranteed by the transport layer).

Some possible solutions

Possible solution 1

We could probably try to use some existing solutions (would libp2p pub-sub be suitable?).

Possible solution 2

First of all, make the net module responsible for automatically recovering lost connections.
It should also send notifications up-stack when connections are lost and recovered in order to let other modules know that some messages could have been lost. More precisely, the networking module can export the following interface:

SendMessage(msg Message, destNodes []NodeID, msgID int)
notification MessageReceived(from NodeID, msg Message)
notification ConnectionLost(peer NodeID)
notification ConnectionRestored(peer NodeID)
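A minimal Go sketch of how this interface could look, with the notifications modeled as a listener interface. The names `NetListener` and `lossTracker` and the `NodeID`/`Message` placeholder types are assumptions made for illustration, not Mir's actual types. The listener only records which peers went through a disconnect, i.e. which peers may have missed messages.

```go
package main

// NodeID and Message are placeholder types for this sketch.
type NodeID string
type Message []byte

// Net is the operation the net module would export.
type Net interface {
	SendMessage(msg Message, destNodes []NodeID, msgID int)
}

// NetListener groups the notifications the net module would send up-stack.
type NetListener interface {
	MessageReceived(from NodeID, msg Message)
	ConnectionLost(peer NodeID)
	ConnectionRestored(peer NodeID)
}

// lossTracker is a hypothetical listener that remembers which peers were
// disconnected and therefore need a resynchronization after reconnecting.
type lossTracker struct {
	down        map[NodeID]bool
	needsResync map[NodeID]bool
}

func newLossTracker() *lossTracker {
	return &lossTracker{
		down:        make(map[NodeID]bool),
		needsResync: make(map[NodeID]bool),
	}
}

// MessageReceived is a no-op here; a real module would process the message.
func (t *lossTracker) MessageReceived(from NodeID, msg Message) {}

// ConnectionLost marks the peer as currently unreachable.
func (t *lossTracker) ConnectionLost(peer NodeID) {
	t.down[peer] = true
}

// ConnectionRestored flags the peer for resync only if a loss was seen before.
func (t *lossTracker) ConnectionRestored(peer NodeID) {
	if t.down[peer] {
		delete(t.down, peer)
		t.needsResync[peer] = true
	}
}

// Compile-time check that lossTracker satisfies the listener interface.
var _ NetListener = (*lossTracker)(nil)
```

A module up-stack could consult `needsResync` to decide which peers to exchange state with, instead of retransmitting blindly to everyone.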

This way, we will be able to implement a reliablenet module on top of the net module that takes advantage of the built-in retransmission mechanisms of the transport layer whenever possible.

The reliablenet module can guarantee FIFO at-least-once delivery and expose the following API:

SendMessage(msg Message, destNodes []NodeID, tag string)
CancelMessage(tag string)
notification MessageReceived(from NodeID, msg Message, msgID int)
AcknowledgeMessage(msgID int) -- used to notify the reliablenet module that the message was processed by the module up-stack and does not need to be retransmitted in case of recovery.

The reliablenet module will send each message once and, when the connection is lost and recovered, will exchange messages to determine which messages need to be retransmitted. It will also need to use persistent storage to keep track of the prefix of acknowledged message ids.

The reliablenet module will guarantee at-least-once delivery in the case when the recipient node fails and recovers, but the message can be lost if the sender node fails.
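The sender side of this idea can be sketched in Go roughly as follows. Everything here is an assumption for illustration: the `sender` type, the per-destination sequence numbers, and `OnReconnect`, which stands in for the message exchange after a reconnect by taking the length of the prefix the peer reports as acknowledged. Persistence and the actual write to the net module are left out.

```go
package main

import "sort"

// NodeID and Message are placeholder types for this sketch.
type NodeID string
type Message []byte

// pending is one message not yet known to be acknowledged by a destination.
type pending struct {
	seq int    // per-destination sequence number, defines FIFO order
	tag string // tag used by CancelMessage
	msg Message
}

// sender is a hypothetical sender-side state of reliablenet.
type sender struct {
	nextSeq map[NodeID]int
	queue   map[NodeID][]pending // kept sorted by seq
}

func newSender() *sender {
	return &sender{nextSeq: map[NodeID]int{}, queue: map[NodeID][]pending{}}
}

// SendMessage records msg for every destination. Handing the message to the
// net module (exactly once) is omitted here.
func (s *sender) SendMessage(msg Message, destNodes []NodeID, tag string) {
	for _, d := range destNodes {
		s.queue[d] = append(s.queue[d], pending{s.nextSeq[d], tag, msg})
		s.nextSeq[d]++
	}
}

// CancelMessage drops all pending messages with the given tag.
func (s *sender) CancelMessage(tag string) {
	for d, q := range s.queue {
		kept := q[:0]
		for _, p := range q {
			if p.tag != tag {
				kept = append(kept, p)
			}
		}
		s.queue[d] = kept
	}
}

// OnReconnect is called after the connection to peer is restored and the peer
// has reported that it acknowledged all sequence numbers below ackedPrefix.
// It forgets those messages and returns the rest, in FIFO order, for retransmission.
func (s *sender) OnReconnect(peer NodeID, ackedPrefix int) []Message {
	q := s.queue[peer]
	i := sort.Search(len(q), func(i int) bool { return q[i].seq >= ackedPrefix })
	s.queue[peer] = q[i:]
	out := make([]Message, 0, len(q)-i)
	for _, p := range q[i:] {
		out = append(out, p.msg)
	}
	return out
}
```

In this sketch nothing is retransmitted while the connection stays up, which matches the intent of relying on the transport layer whenever possible.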

Possible solution 2.5

The net module may also not recover lost connections automatically, but simply notify the module up-stack that the connection is lost and export an API for reconnecting.

xosmig · 2022-07-14T12:48:44Z

Author

As noted by @sergefdrv, this is related to #8

0 replies

sergefdrv · 2022-07-14T12:48:51Z

I remember discussing this with @matejpavlovic once. I would suggest a solution where we introduce a notion of message queues (buffers) with specified properties such as reliable delivery, FIFO order, etc. Those message queues would ensure the promised properties until explicitly garbage-collected.

16 replies

sergefdrv Jul 14, 2022

We'd have separate queues for each epoch so that the amount of messages is reasonably limited.

sergefdrv Jul 14, 2022

I did something remotely resembling this in the MinBFT project. It has a very different architecture compared to Mir. Only a reliable FIFO queue is implemented here. Its networking API is defined here and used somewhere here.

xosmig Jul 14, 2022
Author

I have a couple more questions regarding the concept of message queues:

- Do you want the recipient node to explicitly start listening on a queue?
  - It may be a bit tricky to deal with if the sender node is faster and starts sending messages before the recipient starts listening.
- Do you want the recipient node to also explicitly close (i.e., garbage-collect) the queue?

sergefdrv Jul 14, 2022

Yes, I'm thinking in terms of a pull-style approach where the nodes initiate "connections" and start pulling (requesting) messages from the queues they need. Perhaps we should have both incoming and outgoing message queues, so that we can explicitly open and close them. We'd also need to limit the amount of incoming buffered data, either by setting a fixed upper bound or through some flow control mechanism.

sergefdrv Jul 14, 2022

Maybe we can define queues as long-living instances, but comprised of segments which can be activated and garbage-collected.

adlrocha · 2022-07-14T15:44:19Z

Adding my take on the matter; if my understanding is off, please let me know (or feel free to completely disregard this message). My feeling is that we are mixing two things here: the network abstraction that the protocol will use (where the discussion about FIFO queues, event notifications, etc. makes sense), and the underlying transport substrate used by the protocol. One may impact the other.

First of all, libp2p uses a Stream abstraction that allows us to create long-lasting connections and detect when we are disconnected quite easily (which is nice). For the network substrate, the main question is what kind of communications we're going to have between nodes. If we need to send the same message to several parties I would use GossipSub right away, as we can assume (per the protocol's properties) that every peer subscribed to the topic will receive the message in a bounded amount of time (no need for retransmissions).

If, on the other hand, we need 1-to-1 message exchange between peers, I would implement an ad-hoc libp2p protocol for the transport layer. In this case, we may need to handle retransmissions. For this case, I would keep a stream open for each peer and I'd use a "fire-and-forget" approach to improve the latency in the happy path. The retransmission policy really depends on the protocol (i.e. Mir). If we have the concept of "protocol progress", I would send the message, wait a bit for progress, and then either remove the message from the queue or retransmit, minimizing the number of retransmissions. Disconnections and other network events can easily be detected through the opened peer stream. I would really avoid retransmitting every second; we can probably use better heuristics from Mir in the network abstraction (if you think it is useful, I can read up on Mir's operation in more depth to try to elaborate a bit more on the policy).

1 reply

sergefdrv Jul 15, 2022

@adlrocha Thank you for your feedback! I think we were more focused here on the network abstraction which would provide a convenient interface for (1) specifying the desired message delivery properties, and (2) managing garbage collection (which is essential in case of reliable delivery guarantees). At the same time, the network abstraction should be able to take advantage of the properties already provided by the underlying transport substrate in order to avoid unnecessary overhead, such as periodic re-transmission over reliable connections.

matejpavlovic · 2022-08-22T16:09:26Z

Yes, reliable message delivery is a general concept common to many distributed protocols, and having it encapsulated in an abstraction with a separate implementation would definitely be useful. The naive approach currently implemented in ISS is definitely sub-optimal, with all the disadvantages mentioned by @xosmig here and @sergefdrv here and in other places earlier.

I think "Solution 2" is a good start and I could imagine something like that to form the basis of an advanced communication abstraction. (I think it's better than "Solution 2.5", which seems to put more load on the application than necessary in this context.)
I think it could also be used to build the message queues @sergefdrv mentioned, e.g. by having multiple dynamically created instances of reliablenet (or any variant of it with potentially different guarantees), each representing a single message queue and implementing its own retransmission and crash-recovery protocol.

When it comes to acknowledging messages (and assuming an acknowledgment allows the sender to garbage-collect the corresponding message), I think there is a fundamental constraint to have in mind:
In order to have reliable end-to-end delivery and crash-recovery at the same time, some form of application-level acknowledgment needs to happen, or the receiver node must persistently store all received messages before delivering them to the protocol for processing.
That means that reliablenet either

- only acknowledges a message when told to do so by the higher level of the stack, or
- first persistently stores a delivered message and only then acknowledges it to its counterpart on the sender node and informs higher levels of the stack.

Otherwise, there will always be a scenario where a node restarts just after acknowledging the reception of a message, without the top of the networking stack having seen it.
The sending side also has to (explicitly or implicitly) persist all the messages before sending them, if we want reliable delivery. (By implicitly persisting messages I mean persisting something (e.g. in the WAL) that guarantees the recreation of reliablenet's state.)

Note that this doesn't mean that each message needs to be acknowledged individually. Sometimes that might be appropriate, sometimes acknowledgments could be batched, and sometimes acknowledgments could even be implicit through garbage collection. (The latter is probably equivalent to deleting message queues when no longer needed.)
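The second option (persist first, then acknowledge and deliver) can be sketched in Go as below. The `Store` interface, the `receiver` type and its callbacks, and the in-memory `memStore` are all hypothetical names introduced for this sketch; a real implementation would write to durable storage such as a WAL.

```go
package main

import "errors"

// Message is a placeholder type for this sketch.
type Message []byte

// Store is a hypothetical persistent store (e.g. backed by a WAL).
type Store interface {
	Append(msgID int, msg Message) error
}

// receiver is a hypothetical receiving side of reliablenet.
type receiver struct {
	store   Store
	sendAck func(msgID int)              // acknowledges to the sender's reliablenet
	deliver func(msgID int, msg Message) // informs higher levels of the stack
}

// onReceive persists the message before acknowledging it. If persisting
// fails, no acknowledgment is sent, so the sender still considers the
// message undelivered and can retransmit it later.
func (r *receiver) onReceive(msgID int, msg Message) error {
	if err := r.store.Append(msgID, msg); err != nil {
		return err
	}
	r.sendAck(msgID)
	r.deliver(msgID, msg)
	return nil
}

// memStore is an in-memory stand-in for persistent storage, for demonstration only.
type memStore struct {
	fail  bool
	saved map[int]Message
}

func (m *memStore) Append(msgID int, msg Message) error {
	if m.fail {
		return errors.New("storage unavailable")
	}
	m.saved[msgID] = msg
	return nil
}
```

With the first option, `sendAck` would instead be triggered by an explicit call from the module up-stack, and the `Store` would not be needed on the receiving side.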

For the garbage collection itself, I think the tags shared by multiple messages are going in a good direction. I can easily imagine an extension to that, where there would be a total order on the tags and a greater tag would garbage-collect everything associated with lower tags (especially with FIFO channels this makes sense). This concept is already used in Mir and the "tags" are called RetentionIndex.
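A small Go sketch of garbage collection with totally ordered tags follows. The `outbox` type is hypothetical, and `RetentionIndex` is declared locally as a placeholder named after the concept mentioned above; it is not meant to reproduce Mir's definition.

```go
package main

// Message is a placeholder type for this sketch.
type Message []byte

// RetentionIndex is a totally ordered tag (local placeholder).
type RetentionIndex uint64

// outbox is a hypothetical buffer of messages grouped by retention index.
type outbox struct {
	byIndex map[RetentionIndex][]Message
	low     RetentionIndex // everything below low has been garbage-collected
}

func newOutbox() *outbox {
	return &outbox{byIndex: make(map[RetentionIndex][]Message)}
}

// Add stores msg under ri. It refuses messages whose index was already
// garbage-collected and reports whether the message was stored.
func (o *outbox) Add(ri RetentionIndex, msg Message) bool {
	if ri < o.low {
		return false
	}
	o.byIndex[ri] = append(o.byIndex[ri], msg)
	return true
}

// GarbageCollect drops every message with an index strictly below upTo
// and returns how many messages were removed.
func (o *outbox) GarbageCollect(upTo RetentionIndex) int {
	removed := 0
	for ri, msgs := range o.byIndex {
		if ri < upTo {
			removed += len(msgs)
			delete(o.byIndex, ri)
		}
	}
	if upTo > o.low {
		o.low = upTo
	}
	return removed
}
```

A single `GarbageCollect(upTo)` call replaces one cancellation per tag, which is why a total order on tags fits FIFO channels well.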

1 reply

sergefdrv Aug 29, 2022

I think if we employ pull style communication over stream transport then we'll neither need to store received messages persistently, nor send any acknowledgement between nodes. The node that receives messages would initiate a stream connection indicating to the other node what messages it expects to receive from that node (e.g. indicate its last retention index after recovery); the latter would then stream messages back. If some messages were garbage-collected then there will be a checkpoint certificate standing for the missing messages.
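The pull-style exchange could look roughly like this in Go. The types `PullRequest`, `PullResponse`, and `source` are hypothetical, and the checkpoint certificate is reduced to the retention index it covers; a real certificate would carry signatures or similar proof.

```go
package main

// Message is a placeholder type for this sketch.
type Message []byte

// PullRequest is sent by the receiving node when it opens the stream.
type PullRequest struct {
	From uint64 // first retention index the receiver still needs
}

// PullResponse is what the serving node streams back.
type PullResponse struct {
	// Checkpoint, if non-nil, stands in for messages below this index
	// that were already garbage-collected on the serving node.
	Checkpoint *uint64
	Messages   []Message
}

// source is the serving node's view of its retained messages.
type source struct {
	low  uint64 // lowest retention index still retained
	next uint64 // one past the highest retention index in use
	msgs map[uint64][]Message
}

// Serve answers a pull request: retained messages from the requested index
// on, preceded by a checkpoint if part of the requested range is gone.
func (s *source) Serve(req PullRequest) PullResponse {
	var resp PullResponse
	start := req.From
	if start < s.low {
		cp := s.low
		resp.Checkpoint = &cp
		start = s.low
	}
	for ri := start; ri < s.next; ri++ {
		resp.Messages = append(resp.Messages, s.msgs[ri]...)
	}
	return resp
}
```

Because the receiver states what it needs after recovery, the serving node does not have to track per-message acknowledgments in this sketch.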

xosmig · 2022-08-29T10:53:39Z

Author

I am no longer sure whether trying to take advantage of transport-layer retransmission was a good idea.

It seems that we sometimes may want to drop messages in Mir. Notably, when the module (or just the part of the protocol) that should process the message is not initialized yet (see Dynamically creating and deleting modules in Mir #127 and Availability module integration #192). This may happen due to asynchrony: one node may be more advanced in the protocol than others. Avoiding dropping the messages seems to be tricky and buffering is not always an option either as we may easily run out of memory.
We need protocol-level acknowledgments anyway to support temporary loss of connection and node recovery.
For the same reasons, we would need to retransmit some messages manually anyway (in the reliablenet module implementation).

Hence, it seems reasonable to actually build our own retransmission mechanism on top of a transport-layer protocol that does not have built-in retransmission (the obvious candidate is UDP). However, the issue is that we would need to also implement some flow-control mechanisms to avoid DOSing the network with retransmissions when the throughput is lower than expected.

9 replies

xosmig Aug 29, 2022
Author

I believe we should not try reinventing such complicated things

I am not sure whether trying to build on top of TCP would actually bring us any benefits in terms of complexity. Perhaps, the end-to-end argument may apply in this case.

sergefdrv Aug 29, 2022

After connection was lost and recovered or the receiving node crashed and recovered

As I mentioned earlier, I'm sure there are solutions for that.

sergefdrv Aug 29, 2022

I think the communication abstraction (Net module) should be responsible for convenient communication strategies for higher-level protocols (other modules), like best-effort or reliable delivery, so the end-to-end arguments should be applied with a grain of salt. We don't expect perfect reliability from the transport, but certain reliability guarantees may help us to avoid reinventing the wheel and excessive complexity.

matejpavlovic Aug 30, 2022

The end-to-end argument is spot on here I think and, as Sergey just said, we probably shouldn't expect perfect reliability from the transport. I see that there might have been a little misunderstanding here about the guarantees we had in mind during this discussion. I was, for example, assuming "reliable delivery" (title of this discussion) to be the same as "perfect reliability" (Sergey's last comment).

To move forward, we need to clarify this by defining what exact delivery guarantees we should discuss and how they would be helpful if they were abstracted in a module. Then we can see whether it's worth implementing such a module and how to do it. So, very concretely, what do you mean by "certain reliability guarantees"?

Btw, even if the formal guarantees of a fancy Net module implementation are technically not stronger at all, I can still imagine some performance benefit in (optimistically) leveraging the properties of the lower levels of the network stack, so the focus does not necessarily need to be on formal guarantees that are strictly stronger than best effort broadcast...

That said, we should also take a step back and see how important it is to solve this problem now. I agree that it is, in general, an important and interesting one, but it might be wiser to return to it when we have concrete examples / situations, where (while implementing a concrete protocol, e.g. Narwhal) additional delivery guarantees would be helpful. That would probably help us understand what concrete delivery guarantees we might want to abstract away and whether the benefit they provide is worth implementing such an abstraction.

xosmig Aug 30, 2022
Author

I think the communication abstraction (Net module) should be responsible for convenient communication strategies for higher-level protocols

As I described in "Possible Solution 2", the idea is to have a net module that provides low-level access to the transport layer and a reliablenet module that provides reliable delivery guarantees.
The end-to-end argument applies to the transport-layer protocol.
As for the reliablenet module, the idea is to overcome the end-to-end argument by having some feedback from the higher-level protocol via AcknowledgeMessage events (see "Possible Solution 2").

I'm sure there are solutions for that.

I never said that there were no solutions. The point was that we cannot fully offload retransmission to the transport layer, so the benefits of trying to rely on it are not clear.
