|
Date
|
Presenter
|
Topic
|
Presenter Bio
|
|
Sep 19
|
Kuei (Jack) Sun (Friday, 2:00pm, BA5205)
|
Robust Consistency Checking for Modern Filesystems
We describe our approach to building a runtime file system checker
for the emerging Linux Btrfs file system. Such checkers verify the consistency of
file system metadata update operations before they are committed to disk, thus
preventing corrupted updates from becoming durable. The consistency checks in
Btrfs are complex and need to be expressed clearly so that they can be reasoned
about and implemented reliably, so we propose writing the checks declaratively.
This approach reduces the complexity of the checks, ensures their independence,
and helps identify the correct abstractions in the checker. It also shows how the
checker can be made robust against arbitrary file system corruption.
|
Kuei (Jack) Sun is a third-year PhD student supervised by Prof. Ashvin Goel and Prof. Angela Demke Brown. His current research focuses on the specification of file system formats and the applications that can benefit from it. His research interests include the design, implementation and optimization of systems software, with an emphasis on storage systems. He is also recreationally interested in programming language theory and artificial intelligence.
|
|
Sep 29
|
Yongle Zhang (Monday, 12:00pm, BA5205)
|
lprof: A Non-intrusive Request Flow Profiler for Distributed Systems
Applications implementing cloud services, such as HDFS, Hadoop YARN, Cassandra, and HBase, are mostly built as distributed systems designed to scale. In order to analyze and debug the performance of these systems effectively and efficiently, it is essential to understand the performance behavior of service requests, both in aggregate and individually.
lprof is a profiling tool that automatically reconstructs the execution flow of each request in a distributed application. In contrast to existing approaches that require instrumentation, lprof infers the request flow entirely from runtime logs and thus does not require any modifications to source code. lprof first statically analyzes an application’s binary code to infer how logs can be parsed so that the dispersed and intertwined log entries can be stitched together and associated with specific individual requests.
We validate lprof using the four widely used distributed services mentioned above. Our evaluation shows lprof’s precision in request extraction is 90%, and lprof is helpful in diagnosing 65% of the sampled real-world performance anomalies.
|
Yongle Zhang is a second-year Ph.D. student. He works with Professor Ding Yuan. His research focuses on system reliability. Currently, he is working on utilizing existing logs to better understand distributed systems.
|
|
Oct 1
|
Prof. Ding Yuan (Wednesday, 12:00pm, BA5256)
|
Simple Testing Can Prevent Most Critical Failures
An Analysis of Production Failures in Distributed Data-intensive Systems
Large, production quality distributed systems still fail periodically,
and do so sometimes catastrophically, where
most or all users experience an outage or data loss. We
present the result of a comprehensive study investigating
198 randomly selected, user-reported failures that occurred
on Cassandra, HBase, Hadoop Distributed File
System (HDFS), Hadoop MapReduce, and Redis, with
the goal of understanding how one or multiple faults
eventually evolve into a user-visible failure. We found
that from a testing point of view, almost all failures require
only 3 or fewer nodes to reproduce, which is good
news considering that these services typically run on a
very large number of nodes. However, multiple inputs
are needed to trigger the failures with the order between
them being important. Finally, we found the error logs
of these systems typically contain sufficient data on both
the errors and the input events that triggered the failure,
enabling the diagnosis and the reproduction of the production
failures.
We found the majority of catastrophic failures could
easily have been prevented by performing simple testing
on error handling code – the last line of defense – even
without an understanding of the software design. We extracted
three simple rules from the bugs that have led to
some of the catastrophic failures, and developed a static
checker, Aspirator, capable of locating these bugs. Over
30% of the catastrophic failures would have been prevented
had Aspirator been used and the identified bugs
fixed. Running Aspirator on the code of 9 distributed systems
located 143 bugs and bad practices that have been
fixed or confirmed by the developers.
|
Ding Yuan is an assistant professor in the Department of Electrical and Computer Engineering
at the University of Toronto. He received his PhD from the University of Illinois, Urbana-Champaign
under the supervision of Yuanyuan Zhou.
|
|
Dec 3
|
Prof. Cristiana Amza (Wednesday, 12:00pm, SFB560!)
|
Stage-Aware Anomaly Detection through Tracking Log Points
We introduce Stage-aware Anomaly Detection (SAAD), a low overhead
real-time solution for detecting runtime anomalies in
storage systems. Modern storage server architectures are multi-threaded
and structured as a set of modules, which we call stages.
SAAD leverages this to collect stage-level log summaries at runtime
and to perform statistical analysis across stage instances.
Stages that generate rare execution flows and/or register unusually
high duration for regular flows at run-time indicate anomalies.
SAAD makes two key contributions: i) limits the search space for
root causes, by pinpointing specific anomalous code stages, and ii)
reduces compute and storage requirements for log analysis, while
preserving accuracy, through a novel technique based on log summarization.
We evaluate SAAD on three distributed storage systems:
HBase, Hadoop Distributed File System (HDFS), and Cassandra.
We show that, with practically zero overhead, we uncover
various anomalies in real time.
|
Cristiana Amza is an associate professor in the Department of Electrical and Computer Engineering
at the University of Toronto. She received her PhD from Rice University. Her research interests are in
the design, implementation and evaluation of distributed systems. Her current work focuses on scaling
and consistency issues in web server technologies and distributed databases.
|
|
Dec 10
|
Nosayba El-Sayed (Wednesday, 12:00pm, BA5256)
|
Understanding different design tradeoffs in HPC checkpoint-scheduling policies
The most commonly used fault tolerance method in HPC systems is “checkpoint/restart”, where an application writes periodic checkpoints of its state to stable storage that it can restart from in the case of a failure. In the first part of this work, our goal is to identify methods for optimizing the checkpointing process that are easy to use in practice and at the same time achieve high quality solutions. We evaluate an array of methods for optimizing the checkpoint interval, some previously known as well as new ones that we propose, using real-world failure logs. We show that a very simple closed-form solution can easily be adapted for use in practice and achieves near-optimal performance.
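The abstract does not say which closed-form solution is meant. As an illustration only, the sketch below assumes Young's well-known first-order approximation, in which the checkpoint interval is the square root of twice the checkpoint cost times the mean time between failures (MTBF). The function names and the example numbers are hypothetical and are not taken from the talk.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Checkpoint interval (seconds) from Young's first-order approximation.

    Assumption: this is one well-known closed-form estimate, not necessarily
    the one evaluated in the talk. It is reasonable when the checkpoint cost
    is much smaller than the MTBF.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def overhead_fraction(interval_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order estimate of the fraction of time lost.

    The first term is time spent writing checkpoints; the second is the
    expected rework after a failure (on average half an interval is lost).
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    # Hypothetical numbers: a 60-second checkpoint and one failure per day.
    cost, mtbf = 60.0, 24 * 3600.0
    tau = young_interval(cost, mtbf)
    print(f"interval: {tau:.0f} s, estimated overhead: {overhead_fraction(tau, cost, mtbf):.2%}")
```

Under this model the two overhead terms are equal at the chosen interval, so checkpointing more often or less often than that both increase the estimated lost time.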
In the second part of this work, we evaluate these methods from an energy point of view. As the scale of HPC clusters continues to grow, their increasing energy consumption is emerging as a serious design concern. Checkpoint scheduling policies have traditionally been optimized and analysed from a performance perspective. The energy profile of these policies, and how to optimize them for energy savings (rather than performance), remain poorly understood. We provide an extensive analysis of the energy/performance tradeoffs associated with various checkpointing policies. We estimate the energy overhead for a given policy, and provide simple formulas to optimize checkpoint scheduling for energy, with or without a bound on runtime. We then evaluate and compare the runtime-optimized and energy-optimized versions of the different methods using trace-driven simulations based on failure logs from 10 production HPC clusters. Our results show ample room for achieving high energy savings with a low runtime overhead when using non-constant (adaptive) checkpointing methods that exploit characteristics of HPC failures. We also analyze the impact of energy-optimized checkpointing on the storage subsystem and study how to optimize for energy with a bound on I/O time.
|
Nosayba El-Sayed is a fourth-year PhD student in the Computer Science department at the University of Toronto.
She is part of the Systems and Networks group, and her advisor is Professor Bianca Schroeder.
|