Chord & Distributed Hash Tables
The Central Directory Problem
You have 10 million files spread across 50,000 machines. Someone asks: "Which machine has file X?" A central server that tracks every file-to-machine mapping works, but it is a single point of failure and a bottleneck. When the directory server goes down, nobody can find anything.
Distributed Hash Tables eliminate the central directory. Every node in the network can answer the question "where is file X?" by routing the query through a small number of intermediate nodes. No single node has the full picture. No single node is critical. The system works as long as a reasonable fraction of nodes are alive.
Chord: The Foundation
Chord was published by Stoica et al. at MIT in 2001 and became the reference DHT algorithm. Other DHTs (Kademlia, Pastry, Tapestry) solve the same problem with slightly different routing strategies, but the core ideas are easiest to see in Chord.
The Ring
Chord arranges node IDs and key IDs on a circular space from 0 to 2^m - 1 (typically m=160, using SHA-1 hashes). Both nodes and keys get IDs by hashing: node_id = SHA1(IP_address) and key_id = SHA1(key_name).
A key is stored on its successor: the first node whose ID is greater than or equal to the key's ID, walking clockwise around the ring. If the key's ID is 30 and nodes exist at IDs 14, 28, 42, 56, then key 30 goes to node 42.
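As a sanity check on the placement rule, here is a minimal sketch. The sorted global list of node IDs is purely illustrative; no real Chord node ever holds the full list:

```python
from bisect import bisect_left

def successor(node_ids, key_id, ring_size):
    """First node ID >= key_id, wrapping to the smallest ID past the end.
    node_ids must be sorted."""
    i = bisect_left(node_ids, key_id % ring_size)
    return node_ids[i] if i < len(node_ids) else node_ids[0]

nodes = [14, 28, 42, 56]
assert successor(nodes, 30, 64) == 42   # the example above
assert successor(nodes, 60, 64) == 14   # wraps around the ring
```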
This is similar to consistent hashing, but the crucial difference is that in consistent hashing (as used by Cassandra or Memcached), every node knows about every other node. In Chord, each node only knows about a handful of other nodes, and lookups happen by routing through intermediaries.
Naive Lookup: Walking the Ring
The simplest Chord lookup walks around the ring one node at a time. Ask your successor: "Do you own key 30?" If not, they ask their successor, and so on until someone says yes. This works but takes O(n) hops in a network of n nodes.
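In code, the naive walk is a one-line loop; `owns_key` and `successor` here are hypothetical attributes of a node object:

```python
def naive_lookup(start, key_id):
    """Walk successors until we reach the key's owner: O(n) network hops."""
    node = start
    while not node.owns_key(key_id):   # "do you own key_id?"
        node = node.successor          # one round trip per iteration
    return node
```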
Finger Tables: The Shortcut
Finger tables make Chord fast. Each node maintains m entries in its finger table. Entry i points to the successor of (current_node + 2^(i-1)) on the ring. So if you are node 4 on a ring of size 64:
- Finger 1: successor of 5
- Finger 2: successor of 6
- Finger 3: successor of 8
- Finger 4: successor of 12
- Finger 5: successor of 20
- Finger 6: successor of 36
Each finger points to a node roughly twice as far away as the previous finger. This gives you logarithmic coverage of the ring.
To look up a key, find the finger that gets you closest to the key without overshooting, and forward the query to that node. That node repeats the process. Each hop covers at least half the remaining distance, giving O(log n) hops total.
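Here is a compact, single-process sketch of that loop. It assumes fingers and successor pointers are already correct; `between` implements clockwise interval tests on the ring, and the structure follows the pseudocode in the Chord paper, though the class itself is illustrative:

```python
M = 6  # ID bits: a ring of size 2^6 = 64, matching the example above

def between(x, a, b, inclusive_right=False):
    """True if x lies in the clockwise ring interval (a, b) (or (a, b])."""
    if a < b:
        return a < x < b or (inclusive_right and x == b)
    return x > a or x < b or (inclusive_right and x == b)  # wraps past 0

class ChordNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.predecessor = None
        self.successor = self            # fixed up when the node joins
        self.fingers = [self] * M        # fingers[i] ~ successor(node_id + 2^i)
        self.keys = {}

    def closest_preceding_finger(self, key_id):
        # Farthest finger that lands strictly between us and the key,
        # i.e. the biggest jump that does not overshoot.
        for f in reversed(self.fingers):
            if between(f.node_id, self.node_id, key_id):
                return f
        return self

    def find_successor(self, key_id):
        node = self
        # Each jump covers at least half the remaining distance: O(log n) hops.
        while not between(key_id, node.node_id, node.successor.node_id,
                          inclusive_right=True):
            nxt = node.closest_preceding_finger(key_id)
            if nxt is node:              # no closer finger: successor owns it
                break
            node = nxt
        return node.successor
```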
Node Join
When a new node N joins (see the sketch after this list):
- N contacts any existing node and asks for the successor of N's ID. This tells N where it belongs on the ring.
- N adopts its successor's key range (keys between N's predecessor and N now belong to N instead of N's successor).
- N initializes its finger table by querying existing nodes.
- N notifies its successor and predecessor about its existence.
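A sketch of those four steps, reusing the `ChordNode` class and `between` helper from above and ignoring concurrent joins (which stabilization repairs):

```python
def join(new, bootstrap):
    # Step 1: any existing node can route us to our place on the ring.
    succ = bootstrap.find_successor(new.node_id)
    new.successor = succ
    new.predecessor = succ.predecessor
    # Step 2: adopt the successor's keys in (new.predecessor, new].
    for key_id in list(succ.keys):
        if between(key_id, new.predecessor.node_id, new.node_id,
                   inclusive_right=True):
            new.keys[key_id] = succ.keys.pop(key_id)
    # Step 3: build the finger table with lookups through the bootstrap node.
    for i in range(M):
        new.fingers[i] = bootstrap.find_successor((new.node_id + 2**i) % 2**M)
    # Step 4: tell the neighbors we exist.
    new.predecessor.successor = new
    succ.predecessor = new
```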
Stabilization
Finger tables can go stale. Nodes join and leave, and the pointers in finger tables might point to nodes that no longer exist. Chord runs a stabilization protocol in the background:
- Periodically verify your successor and predecessor are still alive.
- Periodically refresh finger table entries by re-running the successor query for each finger.
- Fix any inconsistencies found.
During stabilization, lookups might fail or take extra hops. The protocol is self-healing (eventually all finger tables converge to the correct state), but there is a window of vulnerability during churn.
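In sketch form, continuing the code above (the Chord paper refreshes one random finger per round; refreshing all of them here keeps the sketch short):

```python
def stabilize(node):
    """One periodic round: check for a node that slipped in between us
    and our successor, and tell the successor about ourselves."""
    x = node.successor.predecessor
    if x is not None and between(x.node_id, node.node_id,
                                 node.successor.node_id):
        node.successor = x
    notify(node.successor, node)

def notify(node, candidate):
    """Adopt candidate as predecessor if it is closer than the current one."""
    if node.predecessor is None or between(candidate.node_id,
                                           node.predecessor.node_id,
                                           node.node_id):
        node.predecessor = candidate

def fix_fingers(node):
    """Refresh finger entries by re-running the successor query for each."""
    for i in range(M):
        node.fingers[i] = node.find_successor((node.node_id + 2**i) % 2**M)
```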
Kademlia: The Practical Choice
Kademlia (2002) is the DHT that actually won in practice. BitTorrent, IPFS, Ethereum, and eMule's Kad network all use Kademlia or close variants. What makes it better than Chord for real-world use?
XOR Distance
Chord uses clockwise ring distance. Kademlia uses XOR as its distance metric: distance(A, B) = A XOR B. XOR distance has a useful property that Chord's ring distance lacks: symmetry. If A is close to B, then B is equally close to A.
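A few lines demonstrate the symmetry claim, contrasted with clockwise distance on a 16-position ring:

```python
def xor_distance(a, b):
    return a ^ b

def clockwise(a, b, ring=16):
    return (b - a) % ring            # Chord-style clockwise ring distance

# XOR is symmetric by definition...
assert xor_distance(0b1011, 0b0010) == xor_distance(0b0010, 0b1011) == 0b1001
# ...while clockwise ring distance is not.
assert clockwise(2, 11) == 9 and clockwise(11, 2) == 7
```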
Why does symmetry matter? In Chord, a node that routes a query through you does not help you in return (their position on the ring is not symmetrically useful). In Kademlia, every lookup teaches you about nodes that are useful for your own routing. The network improves its routing tables as a natural byproduct of handling queries.
k-Buckets
Instead of Chord's finger table with one entry per bit position, Kademlia maintains k-buckets. For each bit position i (from 0 to 159 for 160-bit IDs), the node keeps a list of up to k contacts whose XOR distance from it falls in the range [2^i, 2^(i+1)).
The parameter k (typically 20) controls redundancy. With k contacts per bucket, the node has multiple options for routing a query. If one contact is down, try another. This built-in redundancy makes Kademlia much more resilient to churn than Chord, which has a single pointer per finger.
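Picking the bucket for a contact is a single bit-length computation on the XOR distance; a minimal sketch:

```python
def bucket_index(my_id, other_id):
    """Bucket i holds contacts at XOR distance in [2^i, 2^(i+1)): that is
    the index of the highest bit where the two IDs differ."""
    d = my_id ^ other_id
    assert d != 0, "a node does not bucket itself"
    return d.bit_length() - 1

assert bucket_index(0b0000, 0b0001) == 0   # distance 1
assert bucket_index(0b0000, 0b1010) == 3   # distance 10, in [8, 16)
```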
Parallel Lookups
Kademlia lookups are iterative and parallel. To find the node responsible for key K (see the sketch after this list):
- Pick alpha (typically 3) nodes from your closest k-bucket to K.
- Send FIND_NODE(K) to all alpha nodes simultaneously.
- As responses come back, they include the responding node's closest contacts to K.
- Pick the alpha closest nodes from all responses and query them.
- Repeat until no closer nodes are found.
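A simplified synchronous sketch of this loop. Contact objects are assumed to expose a `node_id` integer, `rpc_find_node(peer, key)` stands in for the network call, and a real client would issue each batch of alpha queries concurrently:

```python
K, ALPHA = 20, 3

def iterative_find_node(start_contacts, key, rpc_find_node):
    """Converge on the K contacts closest (by XOR distance) to key."""
    shortlist = sorted(start_contacts, key=lambda c: c.node_id ^ key)[:K]
    queried = set()
    while True:
        batch = [c for c in shortlist if c not in queried][:ALPHA]
        if not batch:                        # nobody closer left to ask
            return shortlist
        for contact in batch:                # in practice: in parallel
            queried.add(contact)
            for learned in rpc_find_node(contact, key):
                if learned not in shortlist:
                    shortlist.append(learned)
        # Keep only the K closest candidates seen so far.
        shortlist.sort(key=lambda c: c.node_id ^ key)
        shortlist = shortlist[:K]
```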
Parallel queries dramatically reduce lookup latency compared to Chord's sequential hop-by-hop approach. In a 10,000-node network, a Chord lookup can take ~13 sequential round trips (log2 10,000 ≈ 13.3). A Kademlia lookup might complete in 3-4 rounds of parallel queries.
Peer Eviction
Kademlia has a clever heuristic for maintaining k-buckets: prefer older contacts. When a bucket is full and a new contact is discovered, ping the oldest contact. If it responds, keep it and discard the new one. If it does not respond, replace it.
Why prefer old contacts? Long-lived nodes tend to stay alive longer (empirically true in P2P networks). A node that has been online for 6 hours is more likely to be online in an hour than a node that just joined. This heuristic makes routing tables more stable without any explicit protocol for it.
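The rule fits in a few lines; `ping` stands in for a liveness check over the network, and the deque keeps the least recently seen contact at the head:

```python
from collections import deque

K = 20

def update_bucket(bucket: deque, contact, ping):
    """Kademlia's bucket maintenance: refresh known contacts, admit new
    ones while there is room, and otherwise keep the oldest contact if
    it still answers a ping."""
    if contact in bucket:
        bucket.remove(contact)
        bucket.append(contact)       # move to tail: most recently seen
    elif len(bucket) < K:
        bucket.append(contact)
    elif ping(bucket[0]):
        bucket.rotate(-1)            # oldest lives: keep it, drop newcomer
    else:
        bucket.popleft()             # oldest is dead: replace it
        bucket.append(contact)
```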
Production: BitTorrent's Mainline DHT
BitTorrent's Mainline DHT is the largest Kademlia deployment, with tens of millions of simultaneous participants. It is used for "trackerless" torrents: finding peers for a torrent without contacting a central tracker server.
The key is the info_hash of the torrent (a 160-bit SHA-1 hash). Peers that have or want a torrent "announce" themselves on the DHT by storing their IP address under the info_hash. Other peers look up the info_hash to discover them.
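For reference, the key derivation is just a SHA-1 over the torrent's metadata; a sketch, with the bencoding of the info dictionary elided:

```python
import hashlib

def dht_key(bencoded_info: bytes) -> bytes:
    """Mainline DHT key for a torrent: the SHA-1 digest of the bencoded
    `info` dictionary from the .torrent file. 20 bytes = a 160-bit ID in
    the same space as node IDs, so the normal Kademlia lookup applies."""
    return hashlib.sha1(bencoded_info).digest()
```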
Scale brings challenges that the academic papers do not cover:
Routing table pollution. With millions of nodes, many are behind NATs and unreachable. Kademlia's ping-before-evict strategy handles this, but routing tables fill up with stale contacts faster than in a small network.
Eclipse attacks. An attacker creates many nodes with IDs close to a target info_hash, intercepting all lookups for that torrent. BitTorrent mitigates this with rate limiting on new contacts and preferring established peers.
Churn. In a P2P network, nodes come and go constantly. BitTorrent users close their clients, change IP addresses, and rejoin. The DHT must continuously repair itself. Kademlia's redundant k-buckets (k=8 in BitTorrent) provide enough resilience that lookups succeed despite high churn.
IPFS and Content Addressing
IPFS uses a Kademlia variant for content discovery. The key difference from BitTorrent: IPFS hashes content itself (content-addressing) rather than hashing metadata about content. Looking up a content hash in the DHT returns the set of peers that can serve that content.
IPFS adds a "provider record" system on top of Kademlia. When a node stores content, it publishes a provider record to the k closest nodes to the content's hash. Lookups for that content query those k nodes and retrieve the provider list.
The challenge IPFS faces is latency. A Kademlia lookup across the global IPFS network takes 1-5 seconds due to the geographic spread of nodes. For a file-sharing use case, this is fine. For real-time applications, it is too slow. This is why IPFS layered Bitswap (a direct exchange protocol between known peers) on top of the DHT: use the DHT for discovery, then exchange data directly.
DHTs vs. Gossip vs. Consistent Hashing
These three concepts are related but solve different problems:
Consistent hashing distributes keys across a known set of servers. Every node knows every other node. Used inside datacenters where membership is managed centrally. Cassandra, DynamoDB, Memcached.
Gossip protocols spread information (membership changes, health status) through a network. They are a communication mechanism, not a lookup mechanism. Often used alongside consistent hashing.
DHTs combine consistent hashing with a routing protocol for networks where no single node knows the full membership. Used in P2P systems where nodes come and go without central coordination.
In practice, datacenter systems use consistent hashing (full membership knowledge, O(1) lookup), and P2P systems use DHTs (partial membership knowledge, O(log n) lookup). The DHT routing overhead only makes sense when you cannot afford to give every node a full membership list.
Limitations
Latency. Every hop in a DHT is a network round trip. For internet-scale P2P networks, that means 50-200ms per hop, so O(log n) hops translates to lookups measured in hundreds of milliseconds to seconds. That is too slow for anything latency-sensitive.
Security. Open DHTs are vulnerable to Sybil attacks (create many fake nodes), eclipse attacks (surround a target key), and routing attacks (return wrong results). No satisfactory general-purpose solution exists. Production systems use application-specific mitigations.
Complexity. Running a DHT correctly requires handling network partitions, stale routing tables, replication, key migration, and NAT traversal. The algorithm is simple on paper but the engineering is substantial.
Key Points
- A Distributed Hash Table (DHT) maps keys to nodes in a decentralized network with no central coordinator. Any node can look up any key by routing through O(log n) intermediate nodes, even in a network with millions of participants
- Chord places both keys and nodes on a circular ID space (ring) of size 2^m. Each key is stored on its successor node, the first node whose ID is equal to or follows the key's ID clockwise on the ring
- Finger tables make lookups fast. Each node maintains a table of O(log n) entries pointing to nodes at exponentially increasing distances around the ring. Instead of hopping one node at a time, lookups jump halfway to the target, then a quarter, then an eighth
- Kademlia (used in BitTorrent and IPFS) uses XOR distance instead of clockwise distance. This makes routing symmetric (if A is close to B, then B is close to A) and enables parallel lookups where a node queries multiple peers simultaneously
- Node joins and departures require updating finger tables and transferring keys. Chord's stabilization protocol runs periodically to fix inconsistencies, but there is a window where lookups can fail during churn. Production systems add replication on top to handle this
Common Mistakes
- ✗ Confusing DHTs with consistent hashing for server-side load balancing. Consistent hashing (like what Cassandra uses) assumes a central view of the membership. DHTs work when no single node knows the full membership. They solve different problems
- ✗ Assuming DHT lookups are fast in absolute terms. O(log n) hops across the internet means O(log n) network round trips, each adding 50-200ms of latency. A lookup in a 10,000-node DHT might take 500-2000ms. This is fine for file sharing, terrible for database queries
- ✗ Not planning for churn. In peer-to-peer networks, nodes join and leave constantly. Each departure can orphan keys if replication is insufficient. Production DHTs replicate keys across multiple successor nodes and run background repair processes
- ✗ Ignoring Sybil attacks. In an open DHT, an attacker can create thousands of nodes and position them strategically to intercept or censor specific keys. Defending against this requires proof-of-work, social trust graphs, or other identity mechanisms