Google File System
Author: Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung
This foundational paper describes the Google File System (GFS), a scalable distributed file system for large distributed data-intensive applications.
Tags: Storage, Distributed Systems
MapReduce: Simplified Data Processing on Large Clusters
Author: Jeffrey Dean, Sanjay Ghemawat
This paper is an essential read on the MapReduce programming model, which enables processing vast amounts of data across many machines.
Tags: Data Processing, Distributed Systems
Dynamo: Amazon's Highly Available Key-Value Store
Author: Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, Werner Vogels
In this research paper from Amazon, you will learn about Dynamo, a key-value store designed for high availability and scalability and used to manage the state of various Amazon services.
Tags: Storage, Databases, Distributed Systems
Bigtable: A Distributed Storage System for Structured Data
Author: Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber
This paper details Bigtable, Google's distributed storage system for managing structured data designed to scale to a very large size.
Tags: Storage, Databases
The Chubby Lock Service for Loosely-Coupled Distributed Systems
Author: Mike Burrows
This paper presents Chubby, a lock service for loosely-coupled distributed systems designed to manage coarse-grained locks.
Tags: Consensus, Distributed Systems
Paxos Made Simple
Author: Leslie Lamport
A plain-English presentation of the Paxos consensus algorithm, a foundation for understanding how distributed systems reach agreement.
Tags: Consensus, Foundations
Raft Consensus Algorithm
Author: Diego Ongaro, John Ousterhout
Raft is a consensus algorithm designed as an alternative to Paxos, with understandability as an explicit design goal.
Tags: Consensus, Distributed Systems
Spanner: Google's Globally-Distributed Database
Author: James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, Dale Woodford
This paper introduces Spanner, Google's scalable, multi-version, globally-distributed, and synchronously-replicated database.
Tags: Databases, Distributed Systems, Consensus
The Log-Structured Merge-Tree (LSM-Tree)
Author: Patrick O'Neil, Edward Cheng, Dieter Gawlick, Elizabeth O'Neil
The LSM-Tree paper introduces an index structure that improves write performance in databases by deferring and batching index updates, which is crucial for write-heavy systems.
Tags: Databases, Storage
Kafka: A Distributed Messaging System for Log Processing
Author: Jay Kreps, Neha Narkhede, Jun Rao
This paper describes Kafka, a distributed messaging system that is highly scalable and fault-tolerant, widely used for real-time data pipelines.
Tags: Data Processing, Messaging, Infrastructure
Cassandra — A Decentralized Structured Storage System
Author: Avinash Lakshman, Prashant Malik
This paper introduces Cassandra, a decentralized storage system designed to handle large amounts of data across many commodity servers.
Tags: Storage, Databases, Distributed Systems
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
Author: Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica
Learn about Apache Mesos, a resource management platform that allows multiple distributed systems to efficiently share cluster resources.
Tags: Infrastructure, Distributed Systems
The CAP Theorem
Author: Eric Brewer
Brewer's CAP theorem, first posed as a conjecture, states that a distributed data store cannot simultaneously guarantee all three of consistency, availability, and partition tolerance.
Tags: Foundations, Distributed Systems
The Tail at Scale
Author: Jeffrey Dean, Luiz André Barroso
This paper discusses the phenomenon of long latency tails in large-scale services and how to mitigate their effects.
Tags: Foundations, Infrastructure
End-to-End Arguments in System Design
Author: Jerome H. Saltzer, David P. Reed, David D. Clark
A seminal paper that introduces the end-to-end argument, a principle in system design that helps in deciding where to place functions in a networked system.
Tags: Foundations, Networking
The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines
Author: Luiz André Barroso, Urs Hölzle
This book introduces the concept of warehouse-scale computing and discusses the design of datacenters that function as single massive computers.
Tags: Infrastructure, Foundations
Pregel: A System for Large-Scale Graph Processing
Author: Grzegorz Malewicz, Matthew H. Austern, Aart J.C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, Grzegorz Czajkowski
Pregel is a system designed by Google for processing large-scale graphs efficiently using a vertex-centric model.
Tags: Data Processing, Distributed Systems
The SWIM Gossip Protocol
Author: Abhinandan Das, Indranil Gupta, Ashish Motivala
This paper describes the SWIM protocol, a scalable, weakly-consistent, infection-style process group membership protocol.
Tags: Networking, Distributed Systems, Consensus
Dapper: A Large-Scale Distributed Systems Tracing Infrastructure
Author: Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, Chandan Shanbhag
This paper presents Dapper, Google's large-scale distributed systems tracing infrastructure for monitoring and diagnosing complex systems.
Tags: Infrastructure, Observability
ZooKeeper: Wait-Free Coordination for Internet-Scale Systems
Author: Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, Benjamin Reed
ZooKeeper is a coordination service for distributed applications, providing primitives such as configuration maintenance, synchronization, and naming.
Tags: Consensus, Distributed Systems, Infrastructure
Ceph: A Scalable, High-Performance Distributed File System
Author: Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, Carlos Maltzahn
Ceph is a distributed file system that provides high performance, reliability, and scalability, designed for a wide range of storage applications.
Tags: Storage, Distributed Systems
Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases
Author: Alexandre Verbitski, Anurag Gupta, Debanjan Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvili, Xiaofeng Bao
This paper discusses the design considerations behind Amazon Aurora, a high throughput cloud-native relational database.
Tags: Databases, Infrastructure
Borg, Omega, and Kubernetes
Author: Brendan Burns, Brian Grant, David Oppenheimer, Eric Brewer, John Wilkes
This paper examines the relationship between Borg, Omega, and Kubernetes, providing insights into the evolution of cluster management systems at Google.
Tags: Infrastructure, Distributed Systems
In Search of an Understandable Consensus Algorithm
Author: Diego Ongaro, John Ousterhout
This paper presents the Raft consensus algorithm, designed to be more understandable than Paxos while providing similar functionality.
Tags: Consensus, Foundations
Hive: A Warehousing Solution Over a Map-Reduce Framework
Author: Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, Raghotham Murthy
This paper discusses Apache Hive, a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
Tags: Data Processing, Databases
Zanzibar: Google's Consistent, Global Authorization System
Author: Ruoming Pang, Ramón Cáceres, Mike Burrows, Zhifeng Chen, Pratik Dave, Nathan Germer, Alexander Golynski, Kevin Graney, Nate Klingner, Alexander Lloyd, Sagar Menai, Sabrina Mutch, Satishchandra Rayaprolu, David Remy, Jeffrey Stucker
Describes Zanzibar, Google's authorization system for consistent access control across billions of objects.
Tags: Security, Distributed Systems
Attention Is All You Need
Author: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
The paper that introduced the Transformer architecture, replacing recurrence with self-attention mechanisms. It is the foundation behind GPT, BERT, and most modern large language models.
Tags: Machine Learning, Foundations
Time, Clocks, and the Ordering of Events in a Distributed System
Author: Leslie Lamport
Lamport's seminal 1978 paper that defines the happens-before relation and logical clocks, establishing the theoretical foundation for reasoning about ordering in distributed systems.
Tags: Foundations, Distributed Systems, Consensus
A Relational Model of Data for Large Shared Data Banks
Author: Edgar F. Codd
The 1970 paper that introduced the relational model of data. It defines relations (tables) and normalization, the concepts that underpin SQL and the relational database systems built since.
Tags: Databases, Foundations
Bitcoin: A Peer-to-Peer Electronic Cash System
Author: Satoshi Nakamoto
The original Bitcoin whitepaper proposing a decentralized digital currency using proof-of-work consensus, hash chains, and peer-to-peer networking. It is the starting point for later blockchain systems.
Tags: Consensus, Networking, Security
Spark: Cluster Computing with Working Sets
Author: Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica
Introduces Spark (later Apache Spark) and the Resilient Distributed Dataset (RDD) abstraction, enabling in-memory cluster computing. The paper reports that Spark can outperform Hadoop by about 10x on iterative machine learning jobs.
Tags: Data Processing, Distributed Systems
Dremel: Interactive Analysis of Web-Scale Datasets
Author: Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis
Describes Dremel, Google's interactive ad-hoc query system for analysis of read-only nested data. The architecture behind BigQuery and the columnar format that inspired Apache Parquet.
Tags: Data Processing, Databases
Conflict-free Replicated Data Types (CRDTs)
Author: Marc Shapiro, Nuno Preguiça, Carlos Baquero, Marek Zawirski
Formalizes CRDTs, data structures that can be replicated across multiple nodes and updated without coordination, and that converge to the same state once all replicas have received the same updates. Essential for offline-first and collaborative applications.
Tags: Consensus, Distributed Systems, Foundations
Scaling Memcache at Facebook
Author: Rajesh Nishtala, Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C. Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, David Stafford, Tony Tung, Venkateshwaran Venkataramani
How Facebook scaled Memcached to handle billions of requests per second across multiple data centers. A masterclass in caching infrastructure at internet scale.
Tags: Infrastructure, Storage, Distributed Systems
Gorilla: A Fast, Scalable, In-Memory Time Series Database
Author: Tuomas Pelkonen, Scott Franklin, Justin Teller, Paul Cavallaro, Qi Huang, Justin Meza, Kaushik Veeraraghavan
Facebook's in-memory time series database optimized for writes, reads, and high availability. Introduces a novel timestamp and value compression scheme achieving 12x compression.
Tags: Databases, Observability, Infrastructure
F1: A Distributed SQL Database That Scales
Author: Jeff Shute, Radek Vingralek, Bart Samwel, Ben Handy, Chad Whipkey, Eric Rollins, Mircea Oancea, Kyle Littlefield, David Menestrina, Stephan Ellner, John Cieslewicz, Ian Rae, Traian Stancescu, Himani Apte
Describes F1, Google's distributed relational database built on top of Spanner that replaced a sharded MySQL system for Google AdWords — handling one of the largest OLTP workloads on Earth.
Tags: Databases, Distributed Systems