Google File System

Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung

This foundational paper describes the Google File System (GFS), a scalable distributed file system for large distributed data-intensive applications.

MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean, Sanjay Ghemawat

This white paper is an essential read on the MapReduce programming model that enables processing vast amounts of data across many machines.

Dynamo: Amazon's Highly Available Key-Value Store

Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, Werner Vogels

In this research paper from AWS, you will learn about Dynamo, Amazon's key-value store designed for high availability and scalability, used to manage the state of various services.

Bigtable: A Distributed Storage System for Structured Data

Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber

This paper details Bigtable, Google's distributed storage system for managing structured data designed to scale to a very large size.

The Chubby Lock Service for Loosely-Coupled Distributed Systems

Mike Burrows

This paper presents Chubby, a lock service for loosely-coupled distributed systems designed to manage coarse-grained locks.

Paxos Made Simple

Leslie Lamport

A simplified explanation of the Paxos consensus algorithm, which is foundational for understanding distributed systems and achieving consensus.

Raft Consensus Algorithm

Diego Ongaro, John Ousterhout

An approachable and understandable consensus algorithm designed as an alternative to Paxos, providing better understandability and manageability.

Spanner: Google's Globally-Distributed Database

James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, Dale Woodford

This paper introduces Spanner, Google's scalable, multi-version, globally-distributed, and synchronously-replicated database.

The Log-Structured Merge-Tree (LSM-Tree)

Patrick O'Neil, Edward Cheng, Dieter Gawlick, Elizabeth O'Neil

The LSM-Tree paper introduces a method for improving write performance in databases, which is crucial for high-write systems.

Kafka: A Distributed Messaging System for Log Processing

Jay Kreps, Neha Narkhede, Jun Rao

This paper describes Kafka, a distributed messaging system that is highly scalable and fault-tolerant, widely used for real-time data pipelines.

Cassandra — A Decentralized Structured Storage System

Avinash Lakshman, Prashant Malik

This paper introduces Cassandra, a decentralized storage system designed to handle large amounts of data across many commodity servers.

Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center

Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica

Learn about Apache Mesos, a resource management platform that allows multiple distributed systems to efficiently share cluster resources.

The CAP Theorem

Eric Brewer

This white paper introduces the CAP Theorem, which states that it is impossible for a distributed data store to simultaneously provide consistency, availability, and partition tolerance.

The Tail at Scale

Jeffrey Dean, Luiz André Barroso

This paper discusses the phenomenon of long latency tails in large-scale services and how to mitigate their effects.

The End-to-End Argument in System Design

Jerome H. Saltzer, David P. Reed, David D. Clark

A seminal paper that introduces the end-to-end argument, a principle in system design that helps in deciding where to place functions in a networked system.

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines

Luiz André Barroso, Urs Hölzle

This paper introduces the concept of warehouse-scale computing and discusses the design of datacenters that function as single massive computers.

Pregel: A System for Large-Scale Graph Processing

Grzegorz Malewicz, Matthew H. Austern, Aart J.C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, Grzegorz Czajkowski

Pregel is a system designed by Google for processing large-scale graphs efficiently using a vertex-centric model.

The SWIM Gossip Protocol

Abhinandan Das, Indranil Gupta, Ashish Motivala

This paper describes the SWIM protocol, a scalable, weakly-consistent, infection-style process group membership protocol.

Dapper: A Large-Scale Distributed Systems Tracing Infrastructure

Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, Chandan Shanbhag

This paper presents Dapper, Google's large-scale distributed systems tracing infrastructure for monitoring and diagnosing complex systems.

ZooKeeper: Wait-Free Coordination for Internet-Scale Systems

Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, Benjamin Reed

ZooKeeper is a coordination service for distributed applications, providing primitives such as configuration maintenance, synchronization, and naming.

Ceph: A Scalable, High-Performance Distributed File System

Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, Carlos Maltzahn

Ceph is a distributed file system that provides high performance, reliability, and scalability, designed for a wide range of storage applications.

Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases

Alexandre Verbitski, Anurag Gupta, Debanjan Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvili, Xiaofeng Bao

This paper discusses the design considerations behind Amazon Aurora, a high throughput cloud-native relational database.

Borg, Omega, and Kubernetes

Brendan Burns, Brian Grant, David Oppenheimer, Eric Brewer, John Wilkes

This paper examines the relationship between Borg, Omega, and Kubernetes, providing insights into the evolution of cluster management systems at Google.

In Search of an Understandable Consensus Algorithm

Diego Ongaro, John Ousterhout

This paper presents the Raft consensus algorithm, designed to be more understandable than Paxos while providing similar functionality.

Distributing and Querying the "Big Data" with Apache Hive

Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, Raghotham Murthy

This paper discusses Apache Hive, a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.

Zanzibar: Google's Consistent, Global Authorization System

Ruoming Pang, Ramón Cáceres, Mike Burrows, Zhifeng Chen, Pratik Dave, Nathan Germer, Alexander Golynski, Kevin Graney, Nate Klingner, Alexander Lloyd, Sagar Menai, Sabrina Mutch, Satishchandra Rayaprolu, David Remy, Jeffrey Stucker

Describes Zanzibar, Google's authorization system for consistent access control across billions of objects.