Papers
For each meeting, readings are assigned. Usually, the readings will consist of two or three computer systems papers. The papers selected for this course are either classic papers or papers from recent top conferences. You are expected to read these papers thoroughly and submit a review BEFORE arriving at class on Wednesdays.
Each paper will be briefly presented by a student in the class, who will also lead the discussion of that paper. For each class meeting, we identify the topic and papers below; for each, we also try to identify good sources for background reading and for further investigation.
Enter your paper reviews on the course review site.
Electronic versions of the papers are available from the course review site.
(NOTE: This schedule is not set in stone; it may change during the term.)
Week 1 - January 14: Welcome to CSC 2227
This first meeting will be largely organizational. We will discuss how the class is going to work and what will be covered. In addition, we will quickly recap material you should already know: what defines operating systems and distributed systems, what keeps them interesting after all these years, and how the various topics in the course fit together.
There are no readings for this week. The following items are intended to help you refresh your memory of operating systems, prepare for the course, and assess your own knowledge of the prerequisite material. Sample solutions to the OS exam questions will be posted next week.
- First Lecture notes
- OS Self Assessment (pdf)
- Concurrency Self Assessment (pdf) (Skip 8-25, 32, 36, 40, 43-45, 52)
Week 2 - January 21: Historical Distributed Systems
presented by TBD
Read and review the following papers:
- Grapevine: an exercise in distributed computing
Andrew D. Birrell, Roy Levin, Michael D. Schroeder, and Roger M. Needham. In Communications of the ACM, Vol. 25, No. 4, pp. 260-274, April 1982.
http://doi.acm.org/10.1145/358468.358487
- A Comparison of Two Distributed Systems: Amoeba and Sprite
Fred Douglis, M. Frans Kaashoek, John K. Ousterhout, and Andrew S. Tanenbaum. In Computing Systems, Vol. 4, No. 3, pp. 353-384, December 1991.
https://www.usenix.org/legacy/publications/compsystems/1991/fall_douglis.pdf
Additional suggested reading
- The LOCUS distributed operating system
Bruce Walker, Gerald Popek, Robert English, Charles Kline, and Greg Thiel. In Proceedings of the ninth ACM Symposium on Operating Systems Principles (SOSP '83), pp. 49-70, October 1983.
- Experience with Grapevine: the growth of a distributed system.
Michael D. Schroeder, Andrew D. Birrell, and Roger M. Needham. In ACM Transactions on Computer Systems, Vol. 2, No. 1, pp. 3-23, February 1984.
http://doi.acm.org/10.1145/2080.2081
- Distributed Operating Systems
Andrew S. Tanenbaum and Robbert Van Renesse. In ACM Computing Surveys, Vol. 17, No. 4, pp. 419-470, December 1985.
http://doi.acm.org/10.1145/6041.6074
- The V distributed system
David Cheriton. In Communications of the ACM, Vol. 31, No. 3, pp. 314-333, March 1988.
http://doi.acm.org/10.1145/42392.42400
Week 3 - January 28: Consensus
Read and review the following papers:
- Paxos Made Simple
Leslie Lamport. November 2001.
http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf
- In Search of an Understandable Consensus Algorithm
Diego Ongaro and John Ousterhout. In Proceedings of the 2014 USENIX Annual Technical Conference, pp. 305-320, June 2014.
https://www.usenix.org/conference/atc14/technical-sessions/presentation/ongaro
Additional suggested reading
- There is more consensus in Egalitarian parliaments
Iulian Moraru, David G. Andersen, and Michael Kaminsky. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP '13), pp. 358-372, November 2013.
http://doi.acm.org/10.1145/2517349.2517350
- Paxos Made Practical
David Mazieres. 2007.
http://www.scs.stanford.edu/~dm/home/papers/paxos.pdf
- Viewstamped Replication Revisited
Barbara Liskov and James Cowling. MIT technical report MIT-CSAIL-TR-2012-021, July 2012.
http://pmg.csail.mit.edu/papers/vr-revisited.pdf
Week 4 - Feb. 4: Coordination Services
Read and review the following papers:
- The Chubby Lock Service for Loosely-Coupled Distributed Systems
Mike Burrows. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI'06), pp. 335--350, November 2006.
https://www.usenix.org/legacy/event/osdi06/tech/burrows.html
- ZooKeeper: Wait-free Coordination for Internet-scale Systems
Patrick Hunt, Mahadev Konar, Flavio P. Junqueira and Benjamin Reed. In Proceedings of the 2010 USENIX Annual Technical Conference (ATC'10), pp. 145--158, June 2010.
https://www.usenix.org/legacy/events/atc10/tech/full_papers/Hunt.pdf
Additional suggested reading
Week 5 - Feb. 11: Distributed Hash Tables
Read and review the following papers:
- Chord: A scalable peer-to-peer lookup service for internet applications.
Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan. In Proceedings of the 2001 Conference on Applications, technologies, architectures, and protocols for computer communications (SIGCOMM '01), pp. 149-160, August 2001.
http://doi.acm.org/10.1145/383059.383071
- Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems
Antony I. T. Rowstron and Peter Druschel. In Proceedings of the IFIP/ACM International Conference on Distributed Systems Platforms (Middleware '01), pp. 329-350, November 2001.
http://research.microsoft.com/en-us/um/people/antr/PAST/pastry.pdf
Additional suggested reading
READING WEEK - Feb. 18: NO CLASS
Week 6 - Feb. 25: Key-Value Stores
Read and review the following papers:
- Dynamo: Amazon’s Highly Available Key-value Store
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels. In Proceedings of the 21st ACM SIGOPS Symposium on Operating Systems Principles (SOSP '07), pp. 205-220, October 2007.
http://doi.acm.org/10.1145/1294261.1294281
- HyperDex: a distributed, searchable key-value store
Robert Escriva, Bernard Wong, and Emin Gün Sirer. In Proceedings of the ACM SIGCOMM 2012 Conference on Applications, technologies, architectures, and protocols for computer communication (SIGCOMM '12), pp. 25--36, August 2012.
http://doi.acm.org/10.1145/2342356.2342360
Additional suggested reading
Week 7 - Mar. 4: In-Memory Distributed Computing and Storage
Read and review the following papers:
- Fast Crash Recovery in RAMCloud
Diego Ongaro, Stephen M. Rumble, Ryan Stutsman, John Ousterhout, and Mendel Rosenblum. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP'11), pp. 29--41, October 2011.
http://doi.acm.org/10.1145/2043556.2043560
- FaRM: Fast Remote Memory
Aleksandar Dragojević, Dushyanth Narayanan, Orion Hodson, and Miguel Castro. In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation (NSDI'14), pp. 401--414, April 2014.
https://www.usenix.org/system/files/conference/nsdi14/nsdi14-paper-dragojevic.pdf
Additional suggested reading
- MICA: A Holistic Approach to Fast In-Memory Key-Value Storage
Hyeontaek Lim, Dongsu Han, David G. Andersen, and Michael Kaminsky. In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation (NSDI'14), pp. 429--444, April 2014.
https://www.usenix.org/system/files/conference/nsdi14/nsdi14-paper-lim.pdf
Week 8 - Mar. 11: Distributed File Systems
Read and review the following papers:
- Ceph: A Scalable, High-Performance Distributed File System
Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Carlos Maltzahn. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI'06), pp. 307--320, November 2006.
https://www.usenix.org/legacy/event/osdi06/tech/weil.html
- GPFS: A Shared-Disk File System for Large Computing Clusters
Frank Schmuck and Roger Haskin. In Proceedings of the First USENIX Conference on File and Storage Technologies (FAST'02), pp. 231--244, January 2002.
https://www.usenix.org/legacy/publications/library/proceedings/fast02/full_papers/schmuck/schmuck.pdf
Additional suggested reading
- Scale and performance in a distributed file system
John H. Howard, Michael L. Kazar, Sherri G. Menees, David A. Nichols, M. Satyanarayanan, Robert N. Sidebotham, and Michael J. West. In ACM Transactions on Computer Systems, Vol. 6, No. 1, pp. 51--81, February 1988.
http://doi.acm.org/10.1145/35037.35059
- Frangipani: a scalable distributed file system
Chandramohan A. Thekkath, Timothy Mann, and Edward K. Lee. In Proceedings of the 16th ACM symposium on Operating Systems Principles (SOSP '97), pp. 224-237, October 1997.
http://doi.acm.org/10.1145/268998.266694
- The Google file system
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP '03), pp. 29--43, October 2003.
http://doi.acm.org/10.1145/945445.945450
Week 9 - Mar. 20 (proposed rescheduled meeting): Programming Frameworks/Models
Read and review the following papers:
- Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems (EuroSys'07), pp. 59-72, March 2007.
http://doi.acm.org/10.1145/1272996.1273005
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. In Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI'12), pp. 15--28, April 2012.
https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
Additional suggested reading
- MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat. In Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation (OSDI'04), pp. 137--150, San Francisco, CA, December 2004.
https://www.usenix.org/legacy/events/osdi04/tech/dean.html
- Naiad: A Timely Dataflow System
Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martin Abadi. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP'13), pp. 439--455, November 2013.
http://doi.acm.org/10.1145/2517349.2522738
Week 10 - Mar. 25: Scheduling and Load Balancing
Read and review the following papers:
- Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling
Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker, and Ion Stoica. In Proceedings of the 5th European conference on Computer systems (EuroSys '10), pp. 265--278, 2010.
http://doi.acm.org/10.1145/1755913.1755940
- Omega: flexible, scalable schedulers for large compute clusters
Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, and John Wilkes. In Proceedings of the 8th ACM European Conference on Computer Systems (EuroSys '13), pp. 351--364, 2013.
http://doi.acm.org/10.1145/2465351.2465386
Additional suggested reading
- Quincy: fair scheduling for distributed computing clusters
Michael Isard, Vijayan Prabhakaran, Jon Currey, Udi Wieder, Kunal Talwar, and Andrew Goldberg. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (SOSP '09), pp. 261--276, October 2009.
http://doi.acm.org/10.1145/1629575.1629601
- Sparrow: distributed, low latency scheduling
Kay Ousterhout, Patrick Wendell, Matei Zaharia, and Ion Stoica. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP '13), pp. 69--84, November 2013.
http://doi.acm.org/10.1145/2517349.2522716
- Apollo: scalable and coordinated scheduling for cloud-scale computing
Eric Boutin, Jaliya Ekanayake, Wei Lin, Bing Shi, Jingren Zhou, Zhengping Qian, Ming Wu, and Lidong Zhou. In Proceedings of the 11th USENIX conference on Operating Systems Design and Implementation (OSDI'14), pp. 285--300, October 2014.
https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-boutin_0.pdf
Week 11 - Apr. 1: Distributed Performance Analysis & Debugging
Read and review the following papers:
- DieCast: Testing Distributed Systems with an Accurate Scale Model
Diwaker Gupta, Kashi V. Vishwanath, and Amin Vahdat. In Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation (NSDI'08), pp. 407--421, April 2008.
https://www.usenix.org/legacy/event/nsdi08/tech/full_papers/gupta/gupta.pdf
- The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services
Michael Chow, David Meisner, Jason Flinn, Daniel Peek, and Thomas F. Wenisch. In Proceedings of the 11th USENIX conference on Operating Systems Design and Implementation (OSDI'14), pp. 217--231, October 2014.
https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-chow.pdf
Additional suggested reading
- Using Magpie for request extraction and workload modelling
Paul Barham, Austin Donnelly, Rebecca Isaacs and Richard Mortier. In Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation (OSDI'04), pp. 259--272, December 2004.
https://www.usenix.org/legacy/event/osdi04/tech/full_papers/barham/barham.pdf
- Replay Debugging for Distributed Applications
Dennis Geels, Gautam Altekar, Scott Shenker, and Ion Stoica. In Proceedings of the 2006 USENIX Annual Technical Conference (ATC'06), pp. 289--300, June 2006.
https://www.usenix.org/legacy/event/usenix06/tech/geels/geels.pdf
Week 12 - Apr. 8: Very Large Systems (Experiences)
Read and review the following papers:
- Operating Systems Support for Planetary-Scale Network Services
Andy Bavier, Mic Bowman, Brent Chun, David Culler, Scott Karlin, Steve Muir, Larry Peterson, Timothy Roscoe, Tammo Spalink and Mike Wawrzoniak. In Proceedings of the 1st Symposium on Networked Systems Design and Implementation (NSDI'04), pp. 253--266, March 2004.
https://www.usenix.org/legacy/events/nsdi04/tech/full_papers/bavier/bavier.pdf
- Distributed Computing in Practice: The Condor Experience
Douglas Thain, Todd Tannenbaum, and Miron Livny. In Concurrency and Computation: Practice and Experience, Vol. 17, No. 2-4, pp. 323--356, February 2005.
http://research.cs.wisc.edu/htcondor/doc/condor-practice.pdf (authors' version)
Additional suggested reading