This is the list of papers we read:
- Lamport Time Clocks
- Spanner: Google’s Globally-Distributed Database
- The Chubby Lock Service for Loosely-Coupled Distributed Systems
- A note on distributed computing
- The Byzantine Generals Problem
- Your computer is already a distributed system. Why isn't your OS?
- How Complex Systems Fail
- Fast and Message-Efficient Global Snapshot Algorithms for Large-Scale Distributed Systems
- Automatic Management of Partitioned, Replicated Search Services
- Simple Testing Can Prevent Most Critical Failures
- Dynamo: Amazon’s Highly Available Key-value Store
- ZooKeeper: Wait-free coordination for Internet-scale systems
- Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services
- SEDA: An Architecture for Well-Conditioned, Scalable Internet Services
- Kafka: a Distributed Messaging System for Log Processing
- DistributedLog: A high performance replicated log service
- The Log: What every software engineer should know about real-time data's unifying abstraction
- Social Hash: an Assignment Framework for Optimizing Distributed Systems Operations on Social Networks
- Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications
- MapReduce: Simplified Data Processing on Large Clusters
- Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
- MillWheel: Fault-Tolerant Stream Processing at Internet Scale
- Snowflake - Unique ID Generation. “No two snowflakes are alike.”
- The Hadoop Distributed File System
- Gorilla: A Fast, Scalable, In-Memory Time Series Database
- Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial
- Meltdown
- Spectre Attacks: Exploiting Speculative Execution
- Communicating Sequential Processes
- The Tail at Scale
- Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web
- Dapper, A Large Scale Distributed Systems Tracing Infrastructure
- The many faces of consistency
- SDPaxos: Building efficient semi-decentralized geo-replicated state machines
- Dataflow Model
- How to read a paper
- Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network
- Characterizing, Modeling, and Benchmarking RocksDB Key-Value Workloads at Facebook
- Harvest, Yield, and Scalable Tolerant Systems
- Lineage Stash: Fault Tolerance Off the Critical Path
- F1 Query: Declarative Querying at Scale
- The Architectural Implications of Facebook’s DNN-based Personalized Recommendation
- EFLOPS: Algorithm and System Co-design for a High Performance Distributed Training Platform
Other papers mentioned but not discussed:
Updated 2018-05-31: added additional papers
Updated 2018-06-20: added additional papers. Linkified a few more
Updated 2018-10-06: added additional papers. Linkified a few more
Updated 2019-07-10: added additional papers. Linkified a few more
Updated 2022-04-05: added additional papers. Linkified a few more