Review: Resilient Overlay Networks

From: Fareha Shafique <fareha.s_at_gmail.com>
Date: Sat, 14 Oct 2006 16:35:35 -0400

The paper describes a Resilient Overlay Network (RON) architecture, which is
an application-layer overlay on top of the underlying Internet routing
substrate that greatly improves the reliability of Internet packet delivery
by detecting and recovering from outages and path failures more quickly than
the current wide-area routing protocols, namely BGP. BGP trades off fault
tolerance for routing scalability. RON works by deploying nodes in different
Internet routing domains that cooperate amongst themselves to forward data
for each other. The nodes also monitor the quality of the underlying
Internet between themselves and use this information to decide whether to
route packets directly over the Internet or by way of other RON nodes,
optimizing application-specific routing metrics. In contrast to BGP, RON
trades off scalability for resilience and can scale to only 50 nodes because
of the bandwidth overhead of aggressively probing and monitoring every path
between its nodes in order to detect problems.
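To illustrate why probing every path limits the size of a RON, here is a rough sketch of how the number of monitored paths grows with the number of nodes. The function name is hypothetical and the sketch only counts ordered node pairs; it does not model the paper's actual probe sizes or intervals.

```python
def directed_paths(num_nodes: int) -> int:
    """Count ordered (source, destination) pairs in a full mesh of nodes."""
    return num_nodes * (num_nodes - 1)


# Path count grows quadratically with the number of nodes.
for n in (10, 25, 50):
    print(n, directed_paths(n))
```

Since each of these paths is probed periodically, the total probe traffic grows roughly with the square of the node count under this assumption.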
The most noteworthy finding from the experiments and analysis is that in
most cases, forwarding packets via at most one intermediate RON node was
sufficient both for recovering from failures and for improving communication
latency.
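As a minimal sketch of what a one-intermediate-hop choice could look like, the following compares the direct path with every path through a single other node and keeps the lowest-latency one. The names (`best_path`, the `latency` dictionary keyed by node pairs) are assumptions for illustration, not the paper's implementation; a missing entry is treated as an unreachable link.

```python
from typing import Dict, List, Tuple

INF = float("inf")


def best_path(
    src: str,
    dst: str,
    nodes: List[str],
    latency: Dict[Tuple[str, str], float],
) -> Tuple[List[str], float]:
    """Return the lowest-latency path using at most one intermediate node."""
    # Start with the direct Internet path; a missing entry means "down".
    best = ([src, dst], latency.get((src, dst), INF))
    for mid in nodes:
        if mid in (src, dst):
            continue
        # Latency of the two-segment path src -> mid -> dst.
        cost = latency.get((src, mid), INF) + latency.get((mid, dst), INF)
        if cost < best[1]:
            best = ([src, mid, dst], cost)
    return best


# Hypothetical measurements in milliseconds.
measured = {("A", "B"): 100.0, ("A", "C"): 20.0, ("C", "B"): 30.0}
print(best_path("A", "B", ["A", "B", "C"], measured))
```

With these made-up numbers the indirect path through C is chosen because 20 + 30 is less than 100; if the direct entry were missing entirely, the same loop would route around the failure.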
The authors list three design goals:
1. Failure detection and recovery in less than 20 seconds to enable nodes to
communicate with each other in the face of problems with underlying Internet
paths connecting them.
2. Tighter integration of routing and path selection with the application
because poor conditions for one application may be acceptable to another.
3. Expressive policy routing, which governs the choice of paths in the
network. It is possible to choose a path based on latency, loss and
throughput.
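The second goal can be sketched as follows: the same candidate paths are ranked differently depending on which metric the application cares about. This is an illustration only. The function names and numbers are hypothetical, and combining per-link loss as 1 - (1 - p1)(1 - p2) assumes losses on the two segments are independent, which is a simplification.

```python
import math
from typing import Dict, Iterable, List


def path_loss(link_losses: Iterable[float]) -> float:
    """Loss rate of a multi-segment path, assuming independent per-link loss."""
    return 1.0 - math.prod(1.0 - p for p in link_losses)


def path_latency(link_latencies: Iterable[float]) -> float:
    """Latency of a multi-segment path is the sum of its segments."""
    return sum(link_latencies)


def pick(candidates: Dict[str, List[dict]], metric: str) -> str:
    """Return the name of the candidate path that minimizes the given metric."""
    scorers = {
        "loss": lambda links: path_loss(l["loss"] for l in links),
        "latency": lambda links: path_latency(l["latency_ms"] for l in links),
    }
    return min(candidates, key=lambda name: scorers[metric](candidates[name]))


# Hypothetical measurements for one direct path and one indirect path.
candidates = {
    "direct": [{"loss": 0.30, "latency_ms": 40.0}],
    "via_hop": [
        {"loss": 0.01, "latency_ms": 35.0},
        {"loss": 0.02, "latency_ms": 30.0},
    ],
}
print(pick(candidates, "loss"), pick(candidates, "latency"))
```

With these made-up values, a loss-sensitive application would take the indirect path while a latency-sensitive one would stay on the direct path.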
The paper then describes the design and implementation of a RON architecture
in quite some detail.
The authors evaluated three main areas over two RON datasets, one with 12
nodes (RON1) and the other with 16 (RON2):
1. RON's ability to detect outages and recover quickly from them
2. Performance failures and RON's ability to improve the loss rate, latency,
and throughput of badly performing paths.
3. Two aspects of RON's routing protocols:
   a. effectiveness of its one-intermediate-hop strategy compared to more
general alternatives
   b. stability of RON-generated routes.
The major results:
1. RON was able to successfully detect and recover from a majority of all
complete outages and all periods of sustained loss rates of 30% or more.
2. RON takes an average of 18 seconds to route around a failure and can do
so in the face of a flooding attack.
3. RON also overcomes performance failures, substantially improving the loss
rates, latency and TCP throughput of badly performing Internet paths.
Finally, the paper discusses some weak points of the proposed RON
architecture, including scalability (which they claim is not a big problem
since most important distributed applications can benefit from it), misuse
of a RON by misbehaving nodes, and also problems caused by nodes behind a
NAT or firewall.
The paper is well written, providing clear details as well as pointing out
shortcomings along the way, such as the use of bidirectional information to
optimize unidirectional loss rates, which may lead to worse loss rates. The
idea presented is interesting; however, its practicality is questionable.
Received on Sat Oct 14 2006 - 16:36:21 EDT
