Summary: Improving the Reliability of Internet Paths with One-hop Source Routing

From: Andrew Miklas <agmiklas_at_cs.toronto.edu>
Date: Tue, 17 Oct 2006 03:03:19 -0400

This paper looks at a lighter-weight take on the overlay routing
explored in the RON paper. In particular, the authors investigate a
stateless routing scheme: when a client finds it can't reach a
server, it simply selects four random intermediaries and asks them to
forward its traffic to the server. As in the RON paper, the idea is
that a failure affecting the path between a client and a server might
not affect the path between the client and some intermediary, or the
path between that intermediary and the server.
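A minimal sketch of this recovery logic in Python is below; the
intermediary list, the forward_via() helper, and the parameter k are
hypothetical stand-ins, not the authors' implementation.

    import random

    def recover(server, intermediaries, forward_via, k=4):
        # Try to reach `server` through k randomly chosen intermediaries.
        # forward_via(node, server) is a hypothetical helper that asks
        # `node` to relay traffic to `server` and returns True if the
        # relayed path works.
        for node in random.sample(intermediaries, k):
            if forward_via(node, server):
                return node  # route the flow through this node
        return None  # recovery failed; the default path stays broken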

The paper opens with a very thorough measurement study. The authors
determine how often path failures occur and where in the network they
arise. Most of the study carefully separates results where the host
being contacted is a broadband client from those where the host is a
dedicated server. After finding that paths to dedicated servers fail
mostly in the Internet core (surprisingly, the last-hop connection
often appears to be more reliable than the core), the authors
hypothesize that one-hop routing might be effective for flows
involving servers.

This hypothesis is tested for both server and broadband flows. The
measurement study suggests that server path failures could be
corrected 66% of the time, and broadband failures 39% of the time.
The study also spends some time exploring different one-hop routing
options. For example, the authors tried varying the selection
algorithm (i.e. history-k, BGP-paths-k, random-k) and the number of
intermediaries used; a sketch of two of these policies follows.
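To make the policies concrete, here is a hedged sketch of how the two
simplest selection rules might be expressed; the history table and
its contents are invented for illustration, and BGP-paths-k would
additionally need AS-path data, which is omitted here.

    import random

    def random_k(intermediaries, k):
        # random-k: pick k nodes uniformly, keeping no per-destination state.
        return random.sample(intermediaries, k)

    def history_k(history, server, k):
        # history-k: prefer intermediaries that previously recovered
        # paths to this server; `history` maps server -> ranked list.
        return history.get(server, [])[:k]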

They also empirically determined how many failures along the default
path should be tolerated before switching to an intermediary. They
considered factors such as the probability that the next intermediary
routing attempt would succeed given that N previous attempts had
failed, and the overhead of initiating intermediary routing after N
drops along the default path. Finally, they computed the probability
of the flow being recovered within T seconds both with and without
the one-hop routing policy.
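The flavor of this calculation can be shown with a small sketch; the
trial format and field names are invented here, not taken from the
paper's traces.

    def p_recovery_given_n_failures(trials, n):
        # Estimate P(one-hop attempt succeeds | at least n consecutive
        # default-path probes were lost). Each trial is a dict such as
        # {"default_losses": 3, "one_hop_ok": True} (hypothetical format).
        relevant = [t for t in trials if t["default_losses"] >= n]
        if not relevant:
            return 0.0
        return sum(t["one_hop_ok"] for t in relevant) / len(relevant)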

Armed with these results, the authors then built a tool that
implemented their random-4 policy. This seemed to be a simple
user-mode utility that ran on both the source and intermediary nodes.
When the source node found that the path was experiencing loss, it
would contact a random intermediary and ask it to begin forwarding
traffic. The intermediary used the NAT support already present in
Linux to satisfy this request. The performance of their tool seemed
to match their predictions fairly well.
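On the intermediary side, the forwarding step could plausibly be set
up with Linux's standard iptables NAT rules; the sketch below is a
guess at the general shape, not the authors' actual configuration,
and the function name and port argument are hypothetical.

    import subprocess

    def start_forwarding(client_ip, server_ip, port):
        # Allow the kernel to forward packets between interfaces.
        subprocess.run(["sysctl", "-w", "net.ipv4.ip_forward=1"], check=True)
        # Rewrite packets the client sends us on `port` so they go to
        # the server instead (destination NAT).
        subprocess.run(["iptables", "-t", "nat", "-A", "PREROUTING",
                        "-s", client_ip, "-p", "tcp", "--dport", str(port),
                        "-j", "DNAT", "--to-destination", server_ip],
                       check=True)
        # Masquerade so the server replies to us and we relay back.
        subprocess.run(["iptables", "-t", "nat", "-A", "POSTROUTING",
                        "-d", server_ip, "-j", "MASQUERADE"], check=True)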

I found this paper interesting because it had essentially two
measurement studies. Typically, papers seem to present an idea, an
implementation, and then an evaluation of the idea. This paper
presented a measurement study that both motivated the problem and
allowed the authors to begin testing hypotheses right away. Only
after the authors fully understood the problem and had a solution they
believed would perform well did they implement it. After they
deployed their tool, they then evaluated it with another measurement study.

I suppose it's possible that the authors developed the tool in
parallel with their initial study. However, from the reader's
perspective it seemed that by the time the actual tool was written,
every detail had been so carefully analyzed that the authors would
have been fairly certain that things would work out.