Time and State in Distributed SystemsExpository papers and blog postsImpossibility resultsExpository papers and blog postsConsensus and state machine replicationExpository papers and blog postsDistributed algorithmsExpository papers and blog postsMiscellaneousExpository papers and blog postsCloud computing, big data storage/processingExpository papers
I talked about the importance of reading foundational papers last week. To followup, here is my compilation of foundational papers in the distributed systems area. (I focused on the core distributed systems area, and did not cover networking, security, distributed ledgers, verification work etc. I even left out distributed transactions, I hope to cover them at a later date.)
I classified the papers by subject, and listed them in chronological order. I also listed expository papers and blog posts at the end of each section.
Time, Clocks, and the Ordering of Events in a Distributed System. Leslie Lamport, Commn. of the ACM, 1978.
Distributed Snapshots: Determining Global States of a Distributed System. K. Mani Chandy Leslie Lamport, ACM Transactions on Computer Systems, 1985.
Virtual Time and Global States of Distributed Systems. Mattern, F. 1988.
There is No Now. Justin Sheehy, ACM Queue 2015
Why Logical Clocks are Easy. Carlos Baquero and Nuno Preguiça, ACM Queue 2016.
Coordinated Attack or Two Generals problem is a fundamental impossibility in distributed systems. It started more of folks theorem, so instead of pointing to a paper, I provide a link to the wikipedia page.
Impossibility of Distributed Consensus with One Faulty Process, Fischer, Lynch and Patterson, JACM, 1985
Unreliable Failure Detectors for Reliable Distributed Systems, Tushar Deepak Chandra and Sam Toueg, Journal of the ACM, 1996.
Harvest, Yield, and Scalable Tolerant Systems, Armando Fox, Eric A. Brewer, 1999
CAP Twelve years later: How the rules have changed, Eric Brewer, 2012.
Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems. B. Oki and B. Liskov. SOSP 1988
Implementing Fault-Tolerant Services Using the State Machine Approach: a Tutorial, Fred Schneider, 1990
How to build a highly available system with consensus, Butler Lampson, 1996
Practical Byzantine Fault Tolerance. Miguel Castro, Barbara Liskov. OSDI 1999.
Chain Replication for Supporting High Throughput and Availability. Robbert van Renesse and Fred B. Schneider, OSDI 2004.
ZooKeeper: Wait-free coordination for Internet-scale systems. Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, Benjamin Reed, Usenix ATC 2010.
Tango: Distributed Data Structures over a Shared Log, Mahesh Balakrishnan, Dahlia Malkhi, Ted Wobber, Ming Wu, Vijayan Prabhakaran, Michael Wei, John D. Davis, Sriram Rao, Tao Zou, Aviad Zuckk. SOSP 2013.
There is more consensus in Egalitarian parliaments. Iulian Moraru, David G. Andersen, Michael Kaminsky, SOSP 2013.
Flexible Paxos: Quorum intersection revisited. Heidi Howard, Dahlia Malkhi, Alexander Spiegelman. 2016.
WormSpace: A modular foundation for simple, verifiable distributed systems. Ji-Yong Shin, Jieung Kim, Wolf Honore, Hernán Vanzetto, Srihari Radhakrishnan, Mahesh Balakrishnan, Zhong Shao, SOCC'19.
In Search of an Understandable Consensus Algorithm. Diego Ongaro, John Ousterhout, Usenix ATC, 2014.
Paxos Made Moderately Complex. Robbert Van Renesse and Deniz Altinbuken, ACM Computing Surveys, 2015.
Consensus in the Cloud: Paxos Systems Demystified. Ailidani Ailijiang, Aleksey Charapko, Murat Demirbas, 2016.
Self-stabilizing systems in spite of distributed control, Edsgar W. Dijkstra, CACM 1974.
The Drinking Philosophers Problem, K. M. Chandy, J. Misra, ACM TOPLAS 1984
Sparse partitions, Baruch Awerbuch, David Peleg, FOCS 1990.
Distributed reset, Anish Arora, Mohamed Gouda, 1994
The Arrow Distributed Directory Protocol, Michael J. Demmer, M. Herlihy, DISC 1998.
Hints for computer system design, Butler Lampson, 1983
The role of distributed state, John Ousterhout, 1990
SEDA: An Architecture for Well-Conditioned, Scalable Internet Services. Matt Welsh, David Culler, and Eric Brewer, SOSP 2001
Crash only software, George Candea, Armando Fox, HotOS 2003
Lessons from Giant-Scale Services. Eric A. Brewer, IEEE Internet Computing, 2001.
MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat, OSDI 2004.
Optimistic Replication, Yasushi Saito and Marc Shapiro, 2005.
Dynamo: Amazon’s Highly Available Key-Value Store. Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels, ACM SIGOPS 2007.
On designing and deploying Internet scale services, James Hamilton, LISA 2007
Life beyond Distributed Transactions: an Apostate's Opinion, Pat Helland, CIDR 2007.
Conflict-free Replicated Data Types. Marc Shapiro, Nuno Preguiça, Carlos Baquero, Marek Zawirski, 2011.
Consistency Analysis in Bloom: a CALM and Collected Approach, Peter Alvaro, Neil Conway, Joseph M. Hellerstein, William R. Marczak, CIDR 2011.
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI 2012.
Tail at scale. Jeff Dean, Luiz Andre Barroso, Commn of the ACM, 2013.
Spanner: Google’s Globally Distributed Database, ACM, 2013.
Above the Clouds: A Berkeley View of Cloud Computing. Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy H. Katz, Andrew Konwinski, Gunho Lee, David A. Patterson, Ariel Rabkin, Ion Stoica, Matei Zaharia, 2009.