CSL seminars - Fall 2014



Location and time: BA5256/BA5205/SFB560, Time and Day below


 

Date

Presenter

Topic

Presenter Bio

Sep 19

Kuei (Jack) Sun (Friday, 2:00pm, BA5205)

Robust Consistency Checking for Modern Filesystems
We describe our approach to building a runtime file system checker for the emerging Linux Btrfs file system. Such checkers verify the consistency of file system metadata update operations before they are committed to disk, thus preventing corrupted updates from becoming durable. The consistency checks in Btrfs are complex and need to be expressed clearly so that they can be reasoned about and implemented reliably, so we propose writing the checks declaratively. This approach reduces the complexity of the checks, ensures their independence, and helps identify the correct abstractions in the checker. It also shows how the checker can be made robust against arbitrary file system corruption.


Kuei (Jack) Sun is a third-year PhD student supervised by Prof. Ashvin Goel and Prof. Angela Demke Brown. His current research focuses on the specification of file system formats and the applications that can benefit from it. His research interests include the design, implementation, and optimization of systems software, with an emphasis on storage systems. He is also recreationally interested in programming language theory and artificial intelligence.

Sep 29

Yongle Zhang (Monday, 12:00pm, BA5205)

lprof: A Non-intrusive Request Flow Profiler for Distributed Systems
Applications implementing cloud services, such as HDFS, Hadoop YARN, Cassandra, and HBase, are mostly built as distributed systems designed to scale. In order to analyze and debug the performance of these systems effectively and efficiently, it is essential to understand the performance behavior of service requests, both in aggregate and individually.
lprof is a profiling tool that automatically reconstructs the execution flow of each request in a distributed application. In contrast to existing approaches that require instrumentation, lprof infers the request flow entirely from runtime logs and thus does not require any modifications to source code. lprof first statically analyzes an application’s binary code to infer how logs can be parsed, so that the dispersed and intertwined log entries can be stitched together and associated with specific individual requests.
We validate lprof using the four widely used distributed services mentioned above. Our evaluation shows lprof’s precision in request extraction is 90%, and lprof is helpful in diagnosing 65% of the sampled real-world performance anomalies.


Yongle Zhang is a second year Ph.D. student. He works with Professor Ding Yuan. His research area is focused on system reliability. Currently, he is working on utilizing existing logs to better understand distributed systems.

Oct 1

Prof. Ding Yuan (Wednesday, 12:00pm, BA5256)

Simple Testing Can Prevent Most Critical Failures
An Analysis of Production Failures in Distributed Data-intensive Systems

Large, production-quality distributed systems still fail periodically, and sometimes do so catastrophically, where most or all users experience an outage or data loss. We present the result of a comprehensive study investigating 198 randomly selected, user-reported failures that occurred on Cassandra, HBase, Hadoop Distributed File System (HDFS), Hadoop MapReduce, and Redis, with the goal of understanding how one or multiple faults eventually evolve into a user-visible failure. We found that, from a testing point of view, almost all failures require only 3 or fewer nodes to reproduce, which is good news considering that these services typically run on a very large number of nodes. However, multiple inputs are needed to trigger the failures, and the order between them is important. Finally, we found that the error logs of these systems typically contain sufficient data on both the errors and the input events that triggered the failure, enabling the diagnosis and reproduction of the production failures.

We found that the majority of catastrophic failures could easily have been prevented by performing simple testing on error handling code, the last line of defense, even without an understanding of the software design. We extracted three simple rules from the bugs that have led to some of the catastrophic failures, and developed a static checker, Aspirator, capable of locating these bugs. Over 30% of the catastrophic failures would have been prevented had Aspirator been used and the identified bugs fixed. Running Aspirator on the code of 9 distributed systems located 143 bugs and bad practices that have been fixed or confirmed by the developers.


Ding Yuan is an assistant professor in the Department of Electrical and Computer Engineering at the University of Toronto. He received his PhD from the University of Illinois, Urbana-Champaign under the supervision of Yuanyuan Zhou.

Dec 3

Prof. Cristiana Amza (Wednesday, 12:00pm, SFB560!)

Stage-Aware Anomaly Detection through Tracking Log Points
We introduce Stage-aware Anomaly Detection (SAAD), a low-overhead real-time solution for detecting runtime anomalies in storage systems. Modern storage server architectures are multi-threaded and structured as a set of modules, which we call stages. SAAD leverages this to collect stage-level log summaries at runtime and to perform statistical analysis across stage instances. Stages that generate rare execution flows and/or register unusually high duration for regular flows at runtime indicate anomalies. SAAD makes two key contributions: i) it limits the search space for root causes by pinpointing specific anomalous code stages, and ii) it reduces compute and storage requirements for log analysis, while preserving accuracy, through a novel technique based on log summarization. We evaluate SAAD on three distributed storage systems: HBase, Hadoop Distributed File System (HDFS), and Cassandra. We show that, with practically zero overhead, we uncover various anomalies in real time.


Cristiana Amza is an associate professor in the Department of Electrical and Computer Engineering at the University of Toronto. She received her PhD from Rice University. Her research interests are in the design, implementation and evaluation of distributed systems. Her current work focuses on scaling and consistency issues in web server technologies and distributed databases.

Dec 10

Nosayba El-Sayed (Wednesday, 12:00pm, BA5256)

Understanding different design tradeoffs in HPC checkpoint-scheduling policies.
The most commonly used fault tolerance method in HPC systems is “checkpoint/restart”, where an application writes periodic checkpoints of its state to stable storage that it can restart from in the case of a failure. In the first part of this work, our goal is to identify methods for optimizing the checkpointing process that are easy to use in practice and at the same time achieve high quality solutions. We evaluate an array of methods for optimizing the checkpoint interval, some previously known as well as new ones that we propose, using real-world failure logs. We show that a very simple closed-form solution can easily be adapted for use in practice and achieves near-optimal performance.
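The abstract does not say which closed-form solution is meant. As one illustration only, a widely cited first-order approximation (commonly attributed to Young) sets the checkpoint interval to sqrt(2 · C · M), where C is the time to write one checkpoint and M is the mean time between failures. The sketch below assumes that formula; the function names and the example numbers are hypothetical and are not taken from the talk.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order optimal checkpoint interval, sqrt(2 * C * M).

    Assumption: this is Young's approximation, used here as a stand-in
    for the unnamed closed-form solution in the abstract.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def waste_fraction(interval_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Approximate fraction of time lost under the same first-order model.

    C / tau       : time spent writing checkpoints
    tau / (2 * M) : expected rework after a failure (half an interval on average)
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    cost = 300.0      # hypothetical: 5 minutes to write a checkpoint
    mtbf = 86400.0    # hypothetical: one failure per day on average
    tau = young_interval(cost, mtbf)
    print(f"interval: {tau:.0f} s, waste: {waste_fraction(tau, cost, mtbf):.3f}")
```

Under this model, checkpointing either twice as often or half as often as the computed interval increases the estimated waste, which is the tradeoff the interval is meant to balance.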

In the second part of this work, we evaluate these methods from an energy point of view. As the scale of HPC clusters continues to grow, their increasing energy consumption is emerging as a serious design concern. Checkpoint scheduling policies have traditionally been optimized and analysed from a performance perspective. The energy profile of these policies, and how to optimize them for energy savings rather than performance, are not yet well understood. We provide an extensive analysis of the energy/performance tradeoffs associated with various checkpointing policies. We estimate the energy overhead for a given policy, and provide simple formulas to optimize checkpoint scheduling for energy, with or without a bound on runtime. We then evaluate and compare the runtime-optimized and energy-optimized versions of the different methods using trace-driven simulations based on failure logs from 10 production HPC clusters. Our results show ample room for achieving high energy savings with a low runtime overhead when using non-constant (adaptive) checkpointing methods that exploit characteristics of HPC failures. We also analyze the impact of energy-optimized checkpointing on the storage subsystem and study how to optimize for energy with a bound on I/O time.


Nosayba El-Sayed is a fourth-year PhD student in the Computer Science department at the University of Toronto. She is part of the Systems and Networks group, and her advisor is Professor Bianca Schroeder.