Skip Lists
Architecture
The Linked List Problem
Sorted linked lists are great for insertion (splice a node in O(1) once you have the position) but terrible for search (O(n) to find that position). Binary search does not work because you cannot jump to the middle of a linked list. Balanced binary search trees (AVL, red-black) solve this but come with complex rebalancing logic that is tricky to implement, tricky to debug, and tricky to make concurrent.
William Pugh asked a simple question in his 1990 paper "Skip Lists: A Probabilistic Alternative to Balanced Trees": what if you could express "jump to the middle" by adding extra forward pointers at different levels? That question led to the skip list.
How Skip Lists Work
Picture a sorted linked list. Now imagine layering multiple linked lists on top of it. The bottom layer contains every element. The next layer up contains roughly every other element. The layer above that contains roughly every fourth element. And so on.
To search for a value, start at the top-left corner (the head of the highest level). Walk right along the current level until the next node is greater than your target. Then drop down one level and continue. Repeat until you find your target or reach the bottom level.
This is exactly how an express subway works. The express train skips stations. You ride the express until you are close to your destination, then switch to the local. The express line is the higher level. The local line is level 1.
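The descent described above can be sketched in a few lines of Python. The list here is hand-built with fixed levels purely for illustration (a real skip list assigns levels randomly):

```python
class Node:
    def __init__(self, key, level):
        self.key = key
        # forward[i] is the next node at level i (level 0 is the full list)
        self.forward = [None] * level

def search(head, target):
    node = head
    # Start at the highest level; walk right, then drop down.
    for lvl in reversed(range(len(head.forward))):
        while node.forward[lvl] is not None and node.forward[lvl].key < target:
            node = node.forward[lvl]
    # We are now at level 0, standing just before where target would be.
    node = node.forward[0]
    return node is not None and node.key == target

# Hand-built two-level list: 1, 3, 5, 7, 9 on the bottom; 3 and 7 also on level 1.
head = Node(None, 2)
n1, n3, n5, n7, n9 = Node(1, 1), Node(3, 2), Node(5, 1), Node(7, 2), Node(9, 1)
head.forward = [n1, n3]
n1.forward[0] = n3
n3.forward = [n5, n7]
n5.forward[0] = n7
n7.forward = [n9, None]
n9.forward[0] = None

print(search(head, 7))  # True
print(search(head, 4))  # False
```

Note that any node reachable through a level-i pointer always has at least i+1 levels itself, which is why the inner loop never indexes past a node's forward array.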
Insertion
To insert a value, first search for the correct position (using the multi-level traversal above). Along the way, record the last node visited at each level. These are the nodes whose forward pointers might need updating.
Next, decide the level of the new node. Flip a biased coin: with probability p, go up a level. Keep flipping until you lose or hit the maximum level. With p=0.25, about 75% of nodes stop at level 1, about 18.75% top out at level 2, about 4.7% at level 3, and so on.
Finally, splice the new node into each level up to its assigned level by updating the forward pointers of the predecessors you recorded during the search.
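A minimal insertion sketch, assuming an illustrative MAX_LEVEL of 16 and p=0.25:

```python
import random

MAX_LEVEL = 16   # illustrative cap; should cover log base 1/P of expected size
P = 0.25         # promotion probability (the value Redis uses)

class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level

class SkipList:
    def __init__(self):
        self.head = Node(None, MAX_LEVEL)
        self.level = 1  # highest level currently in use

    def _random_level(self):
        # Flip a biased coin: with probability P, go up one more level.
        level = 1
        while random.random() < P and level < MAX_LEVEL:
            level += 1
        return level

    def insert(self, key):
        # Step 1: search, recording the last node visited at each level.
        update = [self.head] * MAX_LEVEL
        node = self.head
        for lvl in reversed(range(self.level)):
            while node.forward[lvl] is not None and node.forward[lvl].key < key:
                node = node.forward[lvl]
            update[lvl] = node
        # Step 2: pick a random level for the new node.
        new_level = self._random_level()
        self.level = max(self.level, new_level)
        # Step 3: splice in by updating the recorded predecessors.
        new_node = Node(key, new_level)
        for lvl in range(new_level):
            new_node.forward[lvl] = update[lvl].forward[lvl]
            update[lvl].forward[lvl] = new_node

    def to_list(self):
        # The bottom level is always sorted.
        out, node = [], self.head.forward[0]
        while node is not None:
            out.append(node.key)
            node = node.forward[0]
        return out

sl = SkipList()
for k in [5, 1, 9, 3, 7]:
    sl.insert(k)
print(sl.to_list())  # [1, 3, 5, 7, 9]
```

Whatever levels the coin flips assign, the bottom level stays sorted, so `to_list` is deterministic even though the structure above it is random.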
Deletion
Search for the node, record predecessors at each level, and update their forward pointers to skip over the deleted node. If the deleted node was the only node at the highest level, reduce the effective maximum level.
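A deletion sketch over a small hand-built list. For brevity it omits the final step above; a production implementation would also shrink the list's effective level when the top level empties out:

```python
class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level

def delete(head, key):
    # Search, recording the predecessor at each level.
    update = [None] * len(head.forward)
    node = head
    for lvl in reversed(range(len(head.forward))):
        while node.forward[lvl] is not None and node.forward[lvl].key < key:
            node = node.forward[lvl]
        update[lvl] = node
    target = node.forward[0]
    if target is None or target.key != key:
        return False  # key not present
    # Unlink the target at every level where it appears.
    for lvl in range(len(target.forward)):
        if update[lvl].forward[lvl] is target:
            update[lvl].forward[lvl] = target.forward[lvl]
    return True

def keys(head):
    out, node = [], head.forward[0]
    while node is not None:
        out.append(node.key)
        node = node.forward[0]
    return out

# Hand-built list: 1, 3, 5 on the bottom; 3 also on level 1.
head = Node(None, 2)
n1, n3, n5 = Node(1, 1), Node(3, 2), Node(5, 1)
head.forward = [n1, n3]
n1.forward[0] = n3
n3.forward = [n5, None]
n5.forward[0] = None

delete(head, 3)
print(keys(head))  # [1, 5]
```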
The Math
Expected search time is O(log n) with high probability. The intuition: each level contains roughly a fraction p of the elements of the level below it, so the expected number of levels is O(log_{1/p} n). At each level, you examine an expected O(1/p) nodes before dropping down. So total work is O((1/p) * log_{1/p} n) = O(log n).
The constant factor depends on p. With p=0.5, you examine about 2 nodes per level across about log_2(n) levels. With p=0.25, you examine about 4 nodes per level across about log_4(n) levels. The simple estimate (1/p) * log_{1/p}(n) works out almost identically for the two, so the real trade-off is elsewhere: p=0.25 uses less memory (an expected 1/(1-p) ≈ 1.33 pointers per node instead of 2) at the cost of slightly higher variance in search times.
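Plugging numbers into these estimates makes the trade-off concrete. This is a back-of-the-envelope sketch; Pugh's paper gives a more careful analysis in which the constants differ slightly:

```python
from math import log

def expected_pointers_per_node(p):
    # P(level >= k) = p**(k-1); summing the geometric series gives 1/(1-p).
    return 1 / (1 - p)

def expected_search_cost(p, n):
    # ~1/p nodes examined per level, across ~log base 1/p of n levels.
    return (1 / p) * log(n, 1 / p)

for p in (0.5, 0.25):
    print(p, expected_pointers_per_node(p), round(expected_search_cost(p, 10**6), 1))
```

Under this crude model, p=0.5 and p=0.25 examine the same expected number of nodes on a million-element list, while p=0.25 stores a third fewer pointers per node.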
Why Redis Chose Skip Lists
When Salvatore Sanfilippo (antirez) designed Redis sorted sets, he needed a data structure that supports:
- Insert and delete in O(log n)
- Lookup by score in O(log n)
- Range queries (give me all elements with score between 10 and 50)
- Rank queries (what is the rank of this element?)
Both balanced BSTs and skip lists handle the first three. But skip lists make range queries trivially efficient: find the start of the range, then walk the bottom-level linked list forward until you pass the end. No tree traversal, no stack, no in-order iterator state. Just follow forward pointers.
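A range-query sketch over a hand-built list: one multi-level descent to find the start of the range, then a plain bottom-level walk:

```python
class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level

def range_query(head, lo, hi):
    # Multi-level descent to find the first key >= lo...
    node = head
    for lvl in reversed(range(len(head.forward))):
        while node.forward[lvl] is not None and node.forward[lvl].key < lo:
            node = node.forward[lvl]
    node = node.forward[0]
    # ...then a bottom-level walk until we pass hi.
    out = []
    while node is not None and node.key <= hi:
        out.append(node.key)
        node = node.forward[0]
    return out

# Hand-built list: 10..50 on the bottom; 30 also on level 1.
head = Node(None, 2)
n10, n20, n30, n40, n50 = Node(10, 1), Node(20, 1), Node(30, 2), Node(40, 1), Node(50, 1)
head.forward = [n10, n30]
n10.forward[0] = n20
n20.forward[0] = n30
n30.forward = [n40, None]
n40.forward[0] = n50
n50.forward[0] = None

print(range_query(head, 15, 45))  # [20, 30, 40]
```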
For rank queries, Redis augments each skip list node with a span field: the number of nodes skipped by each forward pointer. By summing spans during traversal, you can compute the rank of any element in O(log n) time. Doing this with a balanced BST requires augmenting every node with a subtree size, which is doable but adds complexity to every rotation.
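A sketch of rank lookup with spans. The list and its spans are hand-built here for illustration; a real implementation (Redis's zslGetRank, for example) maintains the spans incrementally on every insert and delete:

```python
class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level
        # span[i]: how many bottom-level steps forward[i] jumps over
        self.span = [0] * level

def rank(head, key):
    # Sum spans as we descend; the total is the 1-based rank.
    r, node = 0, head
    for lvl in reversed(range(len(head.forward))):
        while node.forward[lvl] is not None and node.forward[lvl].key <= key:
            r += node.span[lvl]
            node = node.forward[lvl]
        if node.key == key:
            return r
    return None  # key not present

# Hand-built list 10..50 with precomputed spans; 30 and 50 also on level 1.
head = Node(None, 2)
n10, n20, n40 = Node(10, 1), Node(20, 1), Node(40, 1)
n30, n50 = Node(30, 2), Node(50, 2)
head.forward, head.span = [n10, n30], [1, 3]
n10.forward, n10.span = [n20], [1]
n20.forward, n20.span = [n30], [1]
n30.forward, n30.span = [n40, n50], [1, 2]
n40.forward, n40.span = [n50], [1]
n50.forward, n50.span = [None, None], [0, 0]

print(rank(head, 40))  # 4
```

The search path visits O(log n) nodes either way; the spans just let you count how many bottom-level elements each jump covers.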
Antirez has also cited implementation simplicity. A skip list is about half the code of a red-black tree and easier to get right. For a project like Redis, where correctness is critical and the data structure is a core building block, simplicity has real value.
Concurrent Skip Lists
This is where skip lists really shine compared to balanced trees.
A balanced BST insertion might trigger rotations that affect multiple nodes across the tree. Making this concurrent requires either coarse-grained locking (which kills throughput) or fine-grained locking on individual nodes (which is complex and prone to deadlocks).
Skip list insertions only modify the predecessors at each level. Two insertions into different parts of the list touch different nodes and do not interfere. Fine-grained locking is straightforward: lock only the predecessors.
Lock-free skip lists go further. The key insight: you can break insertion into level-by-level pointer updates, each of which is a single compare-and-swap (CAS) operation. If the CAS fails (because another thread modified the same pointer), retry. The Herlihy and Shavit algorithm (from "The Art of Multiprocessor Programming") is the standard reference.
Java's ConcurrentSkipListMap implements this. It is the only concurrent sorted map in the Java standard library, and it exists specifically because skip lists lend themselves to lock-free implementation in a way that balanced trees do not.
Skip Lists as Memtables
LevelDB and RocksDB use skip lists for their in-memory write buffer (the memtable). Here is why.
Writes in an LSM tree go to the memtable first. The memtable must support fast concurrent inserts (multiple writer threads) and fast concurrent reads (queries that check the memtable before checking on-disk SSTables). When the memtable reaches a size threshold, it is frozen (made read-only) and flushed to an SSTable on disk.
A skip list fits this workload perfectly:
- Fast inserts: O(log n) with minimal locking
- Concurrent reads during writes: readers traverse forward pointers that are updated atomically. A reader might see a slightly stale view (missing the very latest insert), but it will never see a corrupted data structure. This is acceptable for the memtable use case.
- Sorted output for flushing: when it is time to flush to an SSTable, just walk the bottom-level linked list from head to tail. The output is already sorted, which is exactly what SSTable files need.
- No rebalancing: unlike a tree, there are no structural changes that require locking large portions of the data structure. An insert at position X does not affect the structure at position Y.
RocksDB also offers a hash-based skip list variant where the key space is hash-partitioned into buckets, each with its own skip list. This reduces contention further for workloads where keys are distributed across many prefixes.
Comparison with Balanced BSTs
| Property | Skip List | Red-Black Tree |
|---|---|---|
| Search | O(log n) expected | O(log n) worst case |
| Insert | O(log n) expected | O(log n) worst case + rotations |
| Delete | O(log n) expected | O(log n) worst case + rotations |
| Range query | Walk linked list | In-order traversal |
| Implementation | ~100 lines | ~300 lines |
| Concurrent access | Natural (CAS per level) | Hard (rotations touch multiple nodes) |
| Memory | 1.33-2 pointers/node | 3 pointers + color bit/node |
| Worst case | O(n) theoretically | O(log n) guaranteed |
The worst-case O(n) for skip lists sounds scary, but it happens with probability roughly 1/n^c for a constant c that depends on p. With a million elements, the probability of anything close to O(n) is astronomically small. In practice, skip list performance is as reliable as balanced tree performance.
The real trade-off is guaranteed worst case (BST) vs. implementation simplicity and concurrency friendliness (skip list). For single-threaded applications where worst-case guarantees matter (real-time systems), use a balanced tree. For concurrent applications or when simplicity matters, skip lists are hard to beat.
Deterministic Skip Lists
Randomized skip lists rely on random level assignment. Deterministic skip lists (like 1-2 skip lists or 1-2-3 skip lists) assign levels based on structural invariants, similar to how B-trees maintain balance through splitting. These give O(log n) worst case but lose the simplicity advantage and the concurrency friendliness. They exist primarily as theoretical curiosities. In practice, the randomized version is what people use.
The Memory Layout Problem
On modern hardware, cache performance matters as much as algorithmic complexity. Linked structures like skip lists have poor spatial locality: each node is allocated separately, and following a pointer often means a cache miss. B-trees win on cache performance because they pack many keys into a single cache-line-sized node.
For in-memory data, this can matter. A B-tree with a branching factor of 16 fits 15 keys per node, and those keys are likely in the same cache line. A skip list touches one node per level per step, and each node could be anywhere in memory.
For the memtable use case, this is acceptable because the memtable is small enough to mostly fit in L2/L3 cache. For larger data structures, consider cache-oblivious alternatives or B-trees. Redis skip lists work well because Redis values are reference-counted objects, so the skip list nodes contain pointers to values rather than the values themselves, and the working set is typically cache-friendly.
Key Points
- A skip list is a layered linked list where higher layers skip over multiple elements, giving you O(log n) search, insert, and delete on a data structure that is conceptually much simpler than a balanced binary tree
- Level assignment is randomized: each new node gets level 1, then flips a coin repeatedly to decide if it should also appear at level 2, level 3, and so on. This probabilistic approach gives expected O(log n) performance without any rebalancing logic
- Redis chose skip lists over balanced trees for sorted sets (ZSET) because skip lists are simpler to implement, naturally support range queries, and are easier to reason about for concurrent access. Antirez has said as much in the Redis documentation
- LevelDB and RocksDB use skip lists for their in-memory memtable because skip lists support concurrent reads without locking and concurrent inserts with minimal synchronization. When the memtable fills up, it flushes to an SSTable on disk
- The space overhead is modest. With p=0.25 (each node has a 25% chance of being promoted to the next level), the expected number of pointers per node is 1.33. With p=0.5, it is 2. In practice, the pointer overhead is small compared to the keys and values stored
Common Mistakes
- Using p=0.5 (50% promotion probability) without thinking about it. p=0.5 gives marginally better search constants but uses more memory (2 pointers per node on average). p=0.25 uses less memory (1.33 pointers) with only slightly slower search. Redis uses p=0.25
- Setting maxLevel too low. The maximum level should be about log base 1/p of the expected number of elements, i.e. log_{1/p}(n). With 1 million elements and p=0.25, you need at least 10 levels. Too few levels and the top layers do not skip enough, degrading toward linear search
- Forgetting that skip lists are probabilistic. A single skip list can get unlucky and hit O(n) performance if the random level assignments are pathological. In practice, with a reasonable random number generator, this essentially never happens, but it is good to know the worst case
- Implementing concurrent skip lists naively. Lock-free skip list insertion requires careful ordering of pointer updates and memory barriers. The Herlihy and Shavit algorithm is the standard reference, but getting it right is harder than it looks