Chord & Distributed Hash Tables
The Central Directory Problem
You have 10 million files spread across 50,000 machines. Someone asks: "Which machine has file X?" A central server that tracks every file-to-machine mapping works, but it is a single point of failure and a bottleneck. When the directory server goes down, nobody can find anything.
Distributed Hash Tables eliminate the central directory. Every node in the network can answer the question "where is file X?" by routing the query through a small number of intermediate nodes. No single node has the full picture. No single node is critical. The system works as long as a reasonable fraction of nodes are alive.
Chord: The Foundation
Chord was published by Stoica et al. at MIT in 2001 and became the reference DHT algorithm. Other DHTs (Kademlia, Pastry, Tapestry) solve the same problem with slightly different routing strategies, but the core ideas are easiest to see in Chord.
The Ring
Chord arranges node IDs and key IDs on a circular space from 0 to 2^m - 1 (typically m=160, using SHA-1 hashes). Both nodes and keys get IDs by hashing: node_id = SHA1(IP_address) and key_id = SHA1(key_name).
A key is stored on its successor: the first node whose ID is greater than or equal to the key's ID, walking clockwise around the ring. If the key's ID is 30 and nodes exist at IDs 14, 28, 42, 56, then key 30 goes to node 42.
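As a sanity check on the placement rule, here is a minimal sketch. The sorted global list of node IDs is purely illustrative; no real Chord node ever holds the full list:

```python
from bisect import bisect_left

def successor(node_ids, key_id, ring_size):
    """First node ID >= key_id, wrapping to the smallest ID past the end.
    node_ids must be sorted."""
    i = bisect_left(node_ids, key_id % ring_size)
    return node_ids[i] if i < len(node_ids) else node_ids[0]

nodes = [14, 28, 42, 56]
assert successor(nodes, 30, 64) == 42   # the example above
assert successor(nodes, 60, 64) == 14   # wraps around the ring
```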
This is similar to consistent hashing, but the crucial difference is that in consistent hashing (as used by Cassandra or Memcached), every node knows about every other node. In Chord, each node only knows about a handful of other nodes, and lookups happen by routing through intermediaries.
Naive Lookup: Walking the Ring
The simplest Chord lookup walks around the ring one node at a time. Ask your successor: "Do you own key 30?" If not, they ask their successor, and so on until someone says yes. This works but takes O(n) hops in a network of n nodes.
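In code, the naive walk is a one-line loop; `owns_key` and `successor` here are hypothetical attributes of a node object:

```python
def naive_lookup(start, key_id):
    """Walk successors until we reach the key's owner: O(n) network hops."""
    node = start
    while not node.owns_key(key_id):   # "do you own key_id?"
        node = node.successor          # one round trip per iteration
    return node
```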
Finger Tables: The Shortcut
Finger tables make Chord fast. Each node maintains m entries in its finger table. Entry i points to the successor of (current_node + 2^(i-1)) on the ring. So if you are node 4 on a ring of size 64:
- Finger 1: successor of 5
- Finger 2: successor of 6
- Finger 3: successor of 8
- Finger 4: successor of 12
- Finger 5: successor of 20
- Finger 6: successor of 36
Each finger points to a node roughly twice as far away as the previous finger. This gives you logarithmic coverage of the ring.
To look up a key, find the finger that gets you closest to the key without overshooting, and forward the query to that node. That node repeats the process. Each hop covers at least half the remaining distance, giving O(log n) hops total.
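Here is a compact, single-process sketch of that loop. It assumes fingers and successor pointers are already correct; `between` implements clockwise interval tests on the ring, and the structure follows the pseudocode in the Chord paper, though the class itself is illustrative:

```python
M = 6  # ID bits: a ring of size 2^6 = 64, matching the example above

def between(x, a, b, inclusive_right=False):
    """True if x lies in the clockwise ring interval (a, b) (or (a, b])."""
    if a < b:
        return a < x < b or (inclusive_right and x == b)
    return x > a or x < b or (inclusive_right and x == b)  # wraps past 0

class ChordNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.predecessor = None
        self.successor = self            # fixed up when the node joins
        self.fingers = [self] * M        # fingers[i] ~ successor(node_id + 2^i)
        self.keys = {}

    def closest_preceding_finger(self, key_id):
        # Farthest finger that lands strictly between us and the key,
        # i.e. the biggest jump that does not overshoot.
        for f in reversed(self.fingers):
            if between(f.node_id, self.node_id, key_id):
                return f
        return self

    def find_successor(self, key_id):
        node = self
        # Each jump covers at least half the remaining distance: O(log n) hops.
        while not between(key_id, node.node_id, node.successor.node_id,
                          inclusive_right=True):
            nxt = node.closest_preceding_finger(key_id)
            if nxt is node:              # no closer finger: successor owns it
                break
            node = nxt
        return node.successor
```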
Node Join
When a new node N joins (see the sketch after this list):
- N contacts any existing node and asks for the successor of N's ID. This tells N where it belongs on the ring.
- N adopts its successor's key range (keys between N's predecessor and N now belong to N instead of N's successor).
- N initializes its finger table by querying existing nodes.
- N notifies its successor and predecessor about its existence.
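A sketch of those four steps, reusing the `ChordNode` class and `between` helper from above and ignoring concurrent joins (which stabilization repairs):

```python
def join(new, bootstrap):
    # Step 1: any existing node can route us to our place on the ring.
    succ = bootstrap.find_successor(new.node_id)
    new.successor = succ
    new.predecessor = succ.predecessor
    # Step 2: adopt the successor's keys in (new.predecessor, new].
    for key_id in list(succ.keys):
        if between(key_id, new.predecessor.node_id, new.node_id,
                   inclusive_right=True):
            new.keys[key_id] = succ.keys.pop(key_id)
    # Step 3: build the finger table with lookups through the bootstrap node.
    for i in range(M):
        new.fingers[i] = bootstrap.find_successor((new.node_id + 2**i) % 2**M)
    # Step 4: tell the neighbors we exist.
    new.predecessor.successor = new
    succ.predecessor = new
```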
Stabilization
Finger tables can go stale. Nodes join and leave, and the pointers in finger tables might point to nodes that no longer exist. Chord runs a stabilization protocol in the background:
- Periodically verify your successor and predecessor are still alive.
- Periodically refresh finger table entries by re-running the successor query for each finger.
- Fix any inconsistencies found.
During stabilization, lookups might fail or take extra hops. The protocol is self-healing (eventually all finger tables converge to the correct state), but there is a window of vulnerability during churn.
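In sketch form, continuing the code above (the Chord paper refreshes one random finger per round; refreshing all of them here keeps the sketch short):

```python
def stabilize(node):
    """One periodic round: check for a node that slipped in between us
    and our successor, and tell the successor about ourselves."""
    x = node.successor.predecessor
    if x is not None and between(x.node_id, node.node_id,
                                 node.successor.node_id):
        node.successor = x
    notify(node.successor, node)

def notify(node, candidate):
    """Adopt candidate as predecessor if it is closer than the current one."""
    if node.predecessor is None or between(candidate.node_id,
                                           node.predecessor.node_id,
                                           node.node_id):
        node.predecessor = candidate

def fix_fingers(node):
    """Refresh finger entries by re-running the successor query for each."""
    for i in range(M):
        node.fingers[i] = node.find_successor((node.node_id + 2**i) % 2**M)
```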
Kademlia: The Practical Choice
Kademlia (2002) is the DHT that actually won in practice. BitTorrent, IPFS, Ethereum, and eMule's Kad network all use Kademlia or close variants. What makes it better than Chord for real-world use?
XOR Distance
Chord uses clockwise ring distance. Kademlia uses XOR as its distance metric: distance(A, B) = A XOR B. XOR distance has a useful property that Chord's ring distance lacks: symmetry. If A is close to B, then B is equally close to A.
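A few lines demonstrate the symmetry claim, contrasted with clockwise distance on a 16-position ring:

```python
def xor_distance(a, b):
    return a ^ b

def clockwise(a, b, ring=16):
    return (b - a) % ring            # Chord-style clockwise ring distance

# XOR is symmetric by definition...
assert xor_distance(0b1011, 0b0010) == xor_distance(0b0010, 0b1011) == 0b1001
# ...while clockwise ring distance is not.
assert clockwise(2, 11) == 9 and clockwise(11, 2) == 7
```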
Why does symmetry matter? In Chord, a node that routes a query through you does not help you in return (their position on the ring is not symmetrically useful). In Kademlia, every lookup teaches you about nodes that are useful for your own routing. The network improves its routing tables as a natural byproduct of handling queries.
k-Buckets
Instead of Chord's finger table with one entry per bit position, Kademlia maintains k-buckets. For each bit position i (from 0 to 159 for 160-bit IDs), the node keeps a list of up to k contacts whose XOR distance from it falls in the range [2^i, 2^(i+1)).
The parameter k (typically 20) controls redundancy. With k contacts per bucket, the node has multiple options for routing a query. If one contact is down, try another. This built-in redundancy makes Kademlia much more resilient to churn than Chord, which has a single pointer per finger.
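Picking the bucket for a contact is a single bit-length computation on the XOR distance; a minimal sketch:

```python
def bucket_index(my_id, other_id):
    """Bucket i holds contacts at XOR distance in [2^i, 2^(i+1)): that is
    the index of the highest bit where the two IDs differ."""
    d = my_id ^ other_id
    assert d != 0, "a node does not bucket itself"
    return d.bit_length() - 1

assert bucket_index(0b0000, 0b0001) == 0   # distance 1
assert bucket_index(0b0000, 0b1010) == 3   # distance 10, in [8, 16)
```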
Parallel Lookups
Kademlia lookups are iterative and parallel. To find the node responsible for key K (see the sketch after this list):
- Pick alpha (typically 3) nodes from your closest k-bucket to K.
- Send FIND_NODE(K) to all alpha nodes simultaneously.
- As responses come back, they include the responding node's closest contacts to K.
- Pick the alpha closest nodes from all responses and query them.
- Repeat until no closer nodes are found.
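A simplified synchronous sketch of this loop. Contact objects are assumed to expose a `node_id` integer, `rpc_find_node(peer, key)` stands in for the network call, and a real client would issue each batch of alpha queries concurrently:

```python
K, ALPHA = 20, 3

def iterative_find_node(start_contacts, key, rpc_find_node):
    """Converge on the K contacts closest (by XOR distance) to key."""
    shortlist = sorted(start_contacts, key=lambda c: c.node_id ^ key)[:K]
    queried = set()
    while True:
        batch = [c for c in shortlist if c not in queried][:ALPHA]
        if not batch:                        # nobody closer left to ask
            return shortlist
        for contact in batch:                # in practice: in parallel
            queried.add(contact)
            for learned in rpc_find_node(contact, key):
                if learned not in shortlist:
                    shortlist.append(learned)
        # Keep only the K closest candidates seen so far.
        shortlist.sort(key=lambda c: c.node_id ^ key)
        shortlist = shortlist[:K]
```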
Parallel queries dramatically reduce lookup latency compared to Chord's sequential hop-by-hop approach. In a 10,000-node network, a Chord lookup can take ~13 sequential round trips (log2 10,000 ≈ 13.3). A Kademlia lookup might complete in 3-4 rounds of parallel queries.
Peer Eviction
Kademlia has a clever heuristic for maintaining k-buckets: prefer older contacts. When a bucket is full and a new contact is discovered, ping the oldest contact. If it responds, keep it and discard the new one. If it does not respond, replace it.
Why prefer old contacts? Long-lived nodes tend to stay alive longer (empirically true in P2P networks). A node that has been online for 6 hours is more likely to be online in an hour than a node that just joined. This heuristic makes routing tables more stable without any explicit protocol for it.
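The rule fits in a few lines; `ping` stands in for a liveness check over the network, and the deque keeps the least recently seen contact at the head:

```python
from collections import deque

K = 20

def update_bucket(bucket: deque, contact, ping):
    """Kademlia's bucket maintenance: refresh known contacts, admit new
    ones while there is room, and otherwise keep the oldest contact if
    it still answers a ping."""
    if contact in bucket:
        bucket.remove(contact)
        bucket.append(contact)       # move to tail: most recently seen
    elif len(bucket) < K:
        bucket.append(contact)
    elif ping(bucket[0]):
        bucket.rotate(-1)            # oldest lives: keep it, drop newcomer
    else:
        bucket.popleft()             # oldest is dead: replace it
        bucket.append(contact)
```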
Production: BitTorrent's Mainline DHT
BitTorrent's Mainline DHT is the largest Kademlia deployment, with tens of millions of simultaneous participants. It is used for "trackerless" torrents: finding peers for a torrent without contacting a central tracker server.
The key is the info_hash of the torrent (a 160-bit SHA-1 hash). Peers that have or want a torrent "announce" themselves on the DHT by storing their IP address under the info_hash. Other peers look up the info_hash to discover them.
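For reference, the key derivation is just a SHA-1 over the torrent's metadata; a sketch, with the bencoding of the info dictionary elided:

```python
import hashlib

def dht_key(bencoded_info: bytes) -> bytes:
    """Mainline DHT key for a torrent: the SHA-1 digest of the bencoded
    `info` dictionary from the .torrent file. 20 bytes = a 160-bit ID in
    the same space as node IDs, so the normal Kademlia lookup applies."""
    return hashlib.sha1(bencoded_info).digest()
```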
Scale brings challenges that the academic papers do not cover:
Routing table pollution. With millions of nodes, many are behind NATs and unreachable. Kademlia's ping-before-evict strategy handles this, but routing tables fill up with stale contacts faster than in a small network.
Eclipse attacks. An attacker creates many nodes with IDs close to a target info_hash, intercepting all lookups for that torrent. BitTorrent mitigates this with rate limiting on new contacts and preferring established peers.
Churn. In a P2P network, nodes come and go constantly. BitTorrent users close their clients, change IP addresses, and rejoin. The DHT must continuously repair itself. Kademlia's redundant k-buckets (k=8 in BitTorrent) provide enough resilience that lookups succeed despite high churn.
IPFS and Content Addressing
IPFS uses a Kademlia variant for content discovery. The key difference from BitTorrent: IPFS hashes content itself (content-addressing) rather than hashing metadata about content. Looking up a content hash in the DHT returns the set of peers that can serve that content.
IPFS adds a "provider record" system on top of Kademlia. When a node stores content, it publishes a provider record to the k closest nodes to the content's hash. Lookups for that content query those k nodes and retrieve the provider list.
The challenge IPFS faces is latency. A Kademlia lookup across the global IPFS network takes 1-5 seconds due to the geographic spread of nodes. For a file-sharing use case, this is fine. For real-time applications, it is too slow. This is why IPFS layered Bitswap (a direct exchange protocol between known peers) on top of the DHT: use the DHT for discovery, then exchange data directly.
DHTs vs. Gossip vs. Consistent Hashing
These three concepts are related but solve different problems:
Consistent hashing distributes keys across a known set of servers. Every node knows every other node. Used inside datacenters where membership is managed centrally. Cassandra, DynamoDB, Memcached.
Gossip protocols spread information (membership changes, health status) through a network. They are a communication mechanism, not a lookup mechanism. Often used alongside consistent hashing.
DHTs combine consistent hashing with a routing protocol for networks where no single node knows the full membership. Used in P2P systems where nodes come and go without central coordination.
In practice, datacenter systems use consistent hashing (full membership knowledge, O(1) lookup), and P2P systems use DHTs (partial membership knowledge, O(log n) lookup). The DHT routing overhead only makes sense when you cannot afford to give every node a full membership list.
Limitations
Latency. Every hop in a DHT is a network round trip. For internet-scale P2P networks, that means 50-200ms per hop, so O(log n) hops translates to lookups measured in hundreds of milliseconds to seconds. That is too slow for anything latency-sensitive.
Security. Open DHTs are vulnerable to Sybil attacks (create many fake nodes), eclipse attacks (surround a target key), and routing attacks (return wrong results). No satisfactory general-purpose solution exists. Production systems use application-specific mitigations.
Complexity. Running a DHT correctly requires handling network partitions, stale routing tables, replication, key migration, and NAT traversal. The algorithm is simple on paper but the engineering is substantial.
Key Points
- A Distributed Hash Table (DHT) maps keys to nodes in a decentralized network with no central coordinator. Any node can look up any key by routing through O(log n) intermediate nodes, even in a network with millions of participants
- Chord places both keys and nodes on a circular ID space (ring) of size 2^m. Each key is stored on its successor node, the first node whose ID is equal to or follows the key's ID clockwise on the ring
- Finger tables make lookups fast. Each node maintains a table of O(log n) entries pointing to nodes at exponentially increasing distances around the ring. Instead of hopping one node at a time, lookups jump halfway to the target, then a quarter, then an eighth
- Kademlia (used in BitTorrent and IPFS) uses XOR distance instead of clockwise distance. This makes routing symmetric (if A is close to B, then B is close to A) and enables parallel lookups where a node queries multiple peers simultaneously
- Node joins and departures require updating finger tables and transferring keys. Chord's stabilization protocol runs periodically to fix inconsistencies, but there is a window where lookups can fail during churn. Production systems add replication on top to handle this
Common Mistakes
- ✗ Confusing DHTs with consistent hashing for server-side load balancing. Consistent hashing (like what Cassandra uses) assumes a central view of the membership. DHTs work when no single node knows the full membership. They solve different problems
- ✗ Assuming DHT lookups are fast in absolute terms. O(log n) hops across the internet means O(log n) network round trips, each adding 50-200ms of latency. A lookup in a 10,000-node DHT might take 500-2000ms. This is fine for file sharing, terrible for database queries
- ✗ Not planning for churn. In peer-to-peer networks, nodes join and leave constantly. Each departure can orphan keys if replication is insufficient. Production DHTs replicate keys across multiple successor nodes and run background repair processes
- ✗ Ignoring Sybil attacks. In an open DHT, an attacker can create thousands of nodes and position them strategically to intercept or censor specific keys. Defending against this requires proof-of-work, social trust graphs, or other identity mechanisms