Summary: A reliable multicast framework for light-weight sessions and application level framing

From: Andrew Miklas <agmiklas_at_cs.toronto.edu>
Date: Tue, 31 Oct 2006 01:26:24 -0500

This paper describes the Scalable Reliable Multicast (SRM) protocol, a
transport-layer protocol that supplies (as its name implies) reliable
messaging for multicast groups in a scalable way. Note that the protocol
assumes the existence of a network-layer protocol that supports multicast
(i.e., IP Multicast). The protocol also assumes that the application will
provide some sort of naming scheme for "application data units" that the
protocol can use to report transmission errors and request retransmission.

The authors begin by describing why reliable multicast is difficult. The
basic problem is that many of the techniques used in unicast networks don't
carry over well to multicast. For example, a simple design might have each
receiver send ACKs back to the transmitter. However, this design will not
scale, since the sender will suffer "ACK implosion" as its downstream link is
jammed by all the incoming ACKs. Using unicast ACKs would also
require the sender to both know the identity of every receiver, and
independently track the progress of each. This would break one of the key
abstractions of IP Multicast -- senders don't track receivers, but instead
simply publish to a channel to which some number of receivers are subscribed.

The proposed system requires that the receivers detect loss (by noticing gaps
in the sequence space), and send a request (for repair) packet to the group
at some random time in the future. If another host that has also lost the
data hears this request first, it cancels its own. In this way,
failures of a link affecting many hosts should result in only a few (ideally
one) retransmit requests. The actual retransmit request is fulfilled by a
repair packet, which is also sent via multicast so that any other hosts
affected by the loss can receive the missing data. One interesting property
of the system is that any host that has the data may respond to the
retransmit request -- it doesn't have to be the original sender.
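The randomized suppression mechanism can be sketched as a small simulation.
Note that the timer window, propagation delay, and function names below are
my own illustrative assumptions; the paper's actual request timers are drawn
from intervals scaled by each receiver's estimated distance from the source.

```python
import random

def simulate_repair_requests(num_receivers, timer_window=1.0,
                             prop_delay=0.1, seed=42):
    """Sketch of SRM-style request suppression (hypothetical parameters).

    Every receiver that noticed the gap draws a uniform random timer.
    The earliest timer fires and its request is multicast to the group;
    any receiver whose timer expires before that request arrives
    (prop_delay later) also sends a duplicate, and everyone else cancels
    its pending request. Returns the number of requests actually sent.
    """
    rng = random.Random(seed)
    timers = sorted(rng.uniform(0.0, timer_window)
                    for _ in range(num_receivers))
    first = timers[0]
    # Requests that fire before the first request could be heard are
    # unavoidable duplicates; all later timers are suppressed.
    return sum(1 for t in timers if t <= first + prop_delay)

# With instantaneous delivery, suppression is perfect: one request per loss.
print(simulate_repair_requests(20, prop_delay=0.0))  # 1
```

The intuition this captures is that duplicate requests only come from
receivers whose timers fire within one propagation delay of the earliest
timer, so spreading timers over a wide enough window keeps the expected
number of requests close to one regardless of group size.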

The paper provides a great deal of detail on the retransmission method. The
authors also present both analytic and simulation results to confirm that
their approach performs well, and they are careful to examine a variety
of topologies to see how the protocol behaves in a number of
pathological cases. However, they don't (appear to) present any experimental
data. This is unfortunate, since they appear to have already built an
application that uses their protocol. Perhaps they didn't feel that their
network and application would be able to adequately stress the protocol?

I found this paper pretty long. I'm not sure it had to be this long in order
to present these ideas. If they really required this space, maybe it could
have been split up into two publications?

The request-retransmit and repair mechanism described in this paper involves
notifying all hosts in a multicast group of a failure. Furthermore, all
hosts hear the repair message. It seems like this might result in
multicast storms -- a condition where the group spends lots of bandwidth
handling retransmits, which creates higher congestion levels, which in turn
causes more loss requiring retransmits, etc. They might have addressed this
with their "Local Recovery" scheme, though.