🕸️

Foundational Distributed Systems Papers

Original Source


I talked about the importance of reading foundational papers last week. To followup, here is my compilation of foundational papers in the distributed systems area. (I focused on the core distributed systems area, and did not cover networking, security, distributed ledgers, verification work etc. I even left out distributed transactions, I hope to cover them at a later date.)
I classified the papers by subject, and listed them in chronological order. I also listed expository papers and blog posts at the end of each section.
 

Time and State in Distributed Systems

Time, Clocks, and the Ordering of Events in a Distributed System. Leslie Lamport, Commn. of the ACM,  1978.
Distributed Snapshots: Determining Global States of a Distributed System. K. Mani Chandy Leslie Lamport, ACM Transactions on Computer Systems, 1985.

Expository papers and blog posts

There is No Now. Justin Sheehy, ACM Queue 2015
Why Logical Clocks are Easy. Carlos Baquero and Nuno Preguiça, ACM Queue 2016.

Impossibility results

Coordinated Attack or Two Generals problem is a fundamental impossibility in distributed systems. It started more of folks theorem, so instead of pointing to a paper, I provide a link to the wikipedia page.
Impossibility of Distributed Consensus with One Faulty Process, Fischer, Lynch and Patterson, JACM, 1985
Unreliable Failure Detectors for Reliable Distributed Systems, Tushar Deepak Chandra and Sam Toueg, Journal of the ACM, 1996.
Harvest, Yield, and Scalable Tolerant Systems, Armando Fox, Eric A. Brewer, 1999

Expository papers and blog posts

Consensus and state machine replication

Practical Byzantine Fault Tolerance. Miguel Castro, Barbara Liskov. OSDI 1999.
Chain Replication for Supporting High Throughput and Availability. Robbert van Renesse and Fred B. Schneider, OSDI 2004.
ZooKeeper: Wait-free coordination for Internet-scale systems. Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, Benjamin Reed, Usenix ATC 2010.
Tango: Distributed Data Structures over a Shared Log, Mahesh Balakrishnan, Dahlia Malkhi, Ted Wobber, Ming Wu, Vijayan Prabhakaran, Michael Wei, John D. Davis, Sriram Rao, Tao Zou, Aviad Zuckk. SOSP 2013.
There is more consensus in Egalitarian parliaments. Iulian Moraru, David G. Andersen, Michael Kaminsky, SOSP 2013.
Flexible Paxos: Quorum intersection revisited. Heidi Howard, Dahlia Malkhi, Alexander Spiegelman. 2016.
WormSpace: A modular foundation for simple, verifiable distributed systems. Ji-Yong Shin, Jieung Kim, Wolf Honore, Hernán Vanzetto, Srihari Radhakrishnan, Mahesh Balakrishnan, Zhong Shao, SOCC'19.

Expository papers and blog posts

In Search of an Understandable Consensus Algorithm. Diego Ongaro, John Ousterhout, Usenix ATC, 2014.
Paxos Made Moderately Complex. Robbert Van Renesse and Deniz Altinbuken, ACM Computing Surveys, 2015.
Consensus in the Cloud: Paxos Systems Demystified. Ailidani Ailijiang, Aleksey Charapko, Murat Demirbas, 2016.

Distributed algorithms

The Drinking Philosophers Problem, K. M. Chandy, J. Misra, ACM TOPLAS 1984
Sparse partitions, Baruch Awerbuch, David Peleg, FOCS 1990.
Distributed reset, Anish Arora, Mohamed Gouda, 1994
The Arrow Distributed Directory Protocol, Michael J. Demmer, M. Herlihy, DISC 1998.

Expository papers and blog posts

Miscellaneous

Hints for computer system design, Butler Lampson, 1983
The role of distributed state, John Ousterhout, 1990
SEDA: An Architecture for Well-Conditioned, Scalable Internet Services. Matt Welsh, David Culler, and Eric Brewer, SOSP 2001
Crash only software, George Candea, Armando Fox, HotOS 2003

Expository papers and blog posts

Cloud computing, big data storage/processing

Lessons from Giant-Scale Services. Eric A. Brewer, IEEE Internet Computing, 2001.
MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat, OSDI 2004.
Optimistic Replication, Yasushi Saito and Marc Shapiro, 2005.
Dynamo: Amazon’s Highly Available Key-Value Store. Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels, ACM SIGOPS 2007.
Conflict-free Replicated Data Types. Marc Shapiro, Nuno Preguiça, Carlos Baquero, Marek Zawirski, 2011.
Consistency Analysis in Bloom: a CALM and Collected Approach, Peter Alvaro, Neil Conway, Joseph M. Hellerstein, William R. Marczak, CIDR 2011.
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI 2012.
Tail at scale. Jeff Dean, Luiz Andre Barroso, Commn of the ACM, 2013.

Expository papers

Above the Clouds: A Berkeley View of Cloud Computing. Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy H. Katz, Andrew Konwinski, Gunho Lee, David A. Patterson, Ariel Rabkin, Ion Stoica, Matei Zaharia, 2009.