Total Recall Review

From: Troy Ronda <ronda_REMOVE_THIS_FROM_EMAIL_FIRST_at_cs.toronto.edu>
Date: Thu, 24 Nov 2005 10:52:27 -0500

Total Recall: System Support for Automated Availability Management Review
Review By: Troy Ronda

Storage availability is a desired property that is poorly understood. Many
users, for example, set up their systems without a clear understanding of
the impact of their configuration. An individual approach to availability
breaks down when dealing with large-scale distributed systems.
Peer-to-peer systems, in particular, are fragile as users join and leave
over short time periods. Even worse, hosts permanently leave the system,
leading to large amounts of churn. TotalRecall relies on three approaches:
availability prediction, redundancy management and dynamic repair.
Availability prediction continuously monitors the system and predicts host
availability. The challenge, for P2P storage, is that systems are rarely
homogeneous. This means that predictions must be made with empirical
measurement instead of a priori models. Redundancy management selects the
redundancy mechanism based on workload and policy. This is challenging
because it is not trivially clear how much redundancy is necessary.
Increased redundancy leads to increased overhead (and inefficiency).
Solutions range from simple replication to erasure codes. There is a
trade-off between run-time efficiency and storage cost. Dynamic repair
uses long time-scale predictions to initiate repair actions. The key
parameters are the amount of redundancy used in the system and how quickly
the system should react to departing hosts. The spectrum of repair policies
ranges from lazy to eager. A lazy policy defers repairs as long as
possible. An eager policy re-replicates immediately following a host
departure. The two policies trade off redundancy levels against repair
overhead. The TotalRecall storage system allows other systems to build upon
it. Examples include backup systems, file-sharing services and file
systems. It implements an automatic approach to the ideas previously
discussed in this review. Users will specify a policy and the underlying
storage system (TotalRecall) will provide guarantees.
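
As a rough illustration of the redundancy and repair trade-offs summarized
above, here is a minimal Python sketch. It is not the paper's algorithm: the
function names are hypothetical, every host is assumed to be up
independently with the same probability p, and the repair policy is reduced
to a single threshold.

```python
from math import comb


def replication_availability(p: float, replicas: int) -> float:
    """P(at least one of `replicas` copies is up), assuming independent hosts."""
    return 1.0 - (1.0 - p) ** replicas


def erasure_availability(p: float, needed: int, total: int) -> float:
    """P(at least `needed` of `total` fragments are up), assuming independent hosts."""
    return sum(
        comb(total, i) * p**i * (1.0 - p) ** (total - i)
        for i in range(needed, total + 1)
    )


def replicas_needed(p: float, target: float) -> int:
    """Smallest replica count whose availability reaches `target`."""
    if not (0.0 < p < 1.0 and 0.0 < target < 1.0):
        raise ValueError("p and target must lie strictly between 0 and 1")
    replicas = 1
    while replication_availability(p, replicas) < target:
        replicas += 1
    return replicas


def needs_repair(live_copies: int, repair_threshold: int) -> bool:
    """Hypothetical policy: repair once reachable copies drop below the threshold."""
    return live_copies < repair_threshold
```

With hosts that are up half the time, replicas_needed(0.5, 0.99) comes to 7
copies, which shows how quickly storage overhead grows under plain
replication. In this sketch an eager policy sets repair_threshold to the
full redundancy level, so any departure triggers a repair, while a lazy
policy sets it lower and pays for the delay with extra redundancy up front.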

This paper is an enjoyable read. It presents a simple API for a
distributed storage system (as P2P systems often seem to have). The main
point of this paper is well taken: human administrators are poorly suited
for managing storage availability in a distributed system. It is simply too
difficult for a human to manually replicate objects onto unpredictable
machines. The prediction models and policy choices are both interesting
and a valuable contribution. It certainly makes sense to empirically
measure the system and select the best choice of policy based on load. The
simulation and test-bench are well explained. I suppose the system could
be used for all three applications given. File sharing is obviously a
good application. I have doubts that anyone would use it for a file system
or a backup system.

I guess the point I want to make is somewhat cynical. Since availability
prediction for heterogeneous P2P systems is difficult, perhaps we should
start considering P2P networks consisting of homogeneous nodes. I suppose
I do not understand why an organization would use a peer-to-peer storage
system if they want an availability guarantee. The system administrator
can statically make replication policies in an intranet setting. At an
individual host level, users can statically buy a portable hard drive and
statically replicate to it. Another problem is that the model assumes
hosts fail independently but, as the other reading pointed out, this is a
bad assumption. Interesting questions like the form of meta-data are not given
enough coverage. I would also like to know if these types of systems
provide hierarchical searching, etc. I also do not see how the system
actually provides a guarantee to the user. How do we ensure that data is
not improperly written by another user? There do not seem to be any
guarantees here!
Received on Thu Nov 24 2005 - 10:52:38 EST
