The Google File System
This foundational paper describes the Google File System (GFS), a scalable distributed file system for large distributed data-intensive applications.
MapReduce: Simplified Data Processing on Large Clusters
This paper is an essential read on MapReduce, a programming model for processing vast amounts of data in parallel across many machines.
Dynamo: Amazon's Highly Available Key-Value Store
In this research paper from Amazon, you will learn about Dynamo, a key-value store designed for high availability and scalability that Amazon uses to manage the state of various services.
Bigtable: A Distributed Storage System for Structured Data
This paper details Bigtable, Google's distributed storage system for structured data, which is designed to scale to petabytes of data across thousands of commodity servers.
The Chubby Lock Service for Loosely-Coupled Distributed Systems
This paper presents Chubby, a service that provides coarse-grained locking and reliable low-volume storage for loosely-coupled distributed systems.
Paxos Made Simple
Leslie Lamport's plain-English explanation of the Paxos consensus algorithm, a foundation for building fault-tolerant distributed systems.
Raft Consensus Algorithm
Raft is a consensus algorithm designed as a more understandable alternative to Paxos, making it easier to learn and to implement correctly.
Spanner: Google's Globally-Distributed Database
This paper introduces Spanner, Google's scalable, multi-version, globally-distributed, and synchronously-replicated database.
The Log-Structured Merge-Tree (LSM-Tree)
The LSM-Tree paper introduces a disk-based index structure that defers and batches index changes, which keeps inserts cheap in write-heavy systems.
Kafka: A Distributed Messaging System for Log Processing
This paper describes Kafka, a distributed messaging system that is highly scalable and fault-tolerant, widely used for real-time data pipelines.
Cassandra - A Decentralized Structured Storage System
This paper introduces Cassandra, a decentralized storage system designed to handle large amounts of data across many commodity servers.
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
Learn about Apache Mesos, a resource management platform that allows multiple distributed systems to efficiently share cluster resources.
The CAP Theorem
This paper introduces the CAP theorem: a distributed data store cannot simultaneously guarantee consistency, availability, and partition tolerance, so when a network partition occurs it must give up either consistency or availability.
The Tail at Scale
This paper discusses the phenomenon of long latency tails in large-scale services and how to mitigate their effects.
End-to-End Arguments in System Design
A seminal paper that introduces the end-to-end argument, a principle in system design that helps in deciding where to place functions in a networked system.
The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines
This book introduces the concept of warehouse-scale computing and discusses how to design datacenters that function as single massive computers.
Pregel: A System for Large-Scale Graph Processing
Pregel is a system designed by Google for processing large-scale graphs efficiently using a vertex-centric model.
The SWIM Gossip Protocol
This paper describes the SWIM protocol, a scalable, weakly-consistent, infection-style process group membership protocol.
Dapper: A Large-Scale Distributed Systems Tracing Infrastructure
This paper presents Dapper, Google's large-scale distributed systems tracing infrastructure for monitoring and diagnosing complex systems.
ZooKeeper: Wait-Free Coordination for Internet-Scale Systems
ZooKeeper is a coordination service for distributed applications, providing primitives such as configuration maintenance, synchronization, and naming.
Ceph: A Scalable, High-Performance Distributed File System
Ceph is a distributed file system that provides high performance, reliability, and scalability, designed for a wide range of storage applications.
Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases
This paper discusses the design considerations behind Amazon Aurora, a high-throughput, cloud-native relational database.
Borg, Omega, and Kubernetes
This paper examines the relationship between Borg, Omega, and Kubernetes, providing insights into the evolution of cluster management systems at Google.
In Search of an Understandable Consensus Algorithm
This paper presents the Raft consensus algorithm, designed to be more understandable than Paxos while offering equivalent fault tolerance and performance.
Distributing and Querying the "Big Data" with Apache Hive
This paper discusses Apache Hive, a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
Zanzibar: Google's Consistent, Global Authorization System
This paper describes Zanzibar, Google's global system for storing permissions and performing consistent authorization checks, which scales to trillions of access control lists.