Skip Lists
Architecture
The Linked List Problem
Sorted linked lists are great for insertion (splice a node in O(1) once you have the position) but terrible for search (O(n) to find that position). Binary search does not work because you cannot jump to the middle of a linked list. Balanced binary search trees (AVL, red-black) solve this but come with complex rebalancing logic that is tricky to implement, tricky to debug, and tricky to make concurrent.
William Pugh asked a simple question in his 1990 paper "Skip Lists: A Probabilistic Alternative to Balanced Trees": what if you could express "jump to the middle" by adding extra forward pointers at different levels? That question led to the skip list.
How Skip Lists Work
Picture a sorted linked list. Now imagine layering multiple linked lists on top of it. The bottom layer contains every element. The next layer up contains roughly every other element. The layer above that contains roughly every fourth element. And so on.
To search for a value, start at the top-left corner (the head of the highest level). Walk right along the current level until the next node is greater than your target. Then drop down one level and continue. Repeat until you find your target or reach the bottom level.
This is exactly how an express subway works. The express train skips stations. You ride the express until you are close to your destination, then switch to the local. The express line is the higher level. The local line is level 1.
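The descent described above can be sketched in a few lines of Python. The list here is hand-built with fixed levels purely for illustration (a real skip list assigns levels randomly):

```python
class Node:
    def __init__(self, key, level):
        self.key = key
        # forward[i] is the next node at level i (level 0 is the full list)
        self.forward = [None] * level

def search(head, target):
    node = head
    # Start at the highest level; walk right, then drop down.
    for lvl in reversed(range(len(head.forward))):
        while node.forward[lvl] is not None and node.forward[lvl].key < target:
            node = node.forward[lvl]
    # We are now at level 0, standing just before where target would be.
    node = node.forward[0]
    return node is not None and node.key == target

# Hand-built two-level list: 1, 3, 5, 7, 9 on the bottom; 3 and 7 also on level 1.
head = Node(None, 2)
n1, n3, n5, n7, n9 = Node(1, 1), Node(3, 2), Node(5, 1), Node(7, 2), Node(9, 1)
head.forward = [n1, n3]
n1.forward[0] = n3
n3.forward = [n5, n7]
n5.forward[0] = n7
n7.forward = [n9, None]
n9.forward[0] = None

print(search(head, 7))  # True
print(search(head, 4))  # False
```

Note that any node reachable through a level-i pointer always has at least i+1 levels itself, which is why the inner loop never indexes past a node's forward array.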
Insertion
To insert a value, first search for the correct position (using the multi-level traversal above). Along the way, record the last node visited at each level. These are the nodes whose forward pointers might need updating.
Next, decide the level of the new node. Flip a biased coin: with probability p, go up a level. Keep flipping until you lose or hit the maximum level. With p=0.25, about 75% of nodes stop at level 1, about 18.75% top out at level 2, about 4.7% at level 3, and so on.
Finally, splice the new node into each level up to its assigned level by updating the forward pointers of the predecessors you recorded during the search.
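A minimal insertion sketch, assuming an illustrative MAX_LEVEL of 16 and p=0.25:

```python
import random

MAX_LEVEL = 16   # illustrative cap; should cover log base 1/P of expected size
P = 0.25         # promotion probability (the value Redis uses)

class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level

class SkipList:
    def __init__(self):
        self.head = Node(None, MAX_LEVEL)
        self.level = 1  # highest level currently in use

    def _random_level(self):
        # Flip a biased coin: with probability P, go up one more level.
        level = 1
        while random.random() < P and level < MAX_LEVEL:
            level += 1
        return level

    def insert(self, key):
        # Step 1: search, recording the last node visited at each level.
        update = [self.head] * MAX_LEVEL
        node = self.head
        for lvl in reversed(range(self.level)):
            while node.forward[lvl] is not None and node.forward[lvl].key < key:
                node = node.forward[lvl]
            update[lvl] = node
        # Step 2: pick a random level for the new node.
        new_level = self._random_level()
        self.level = max(self.level, new_level)
        # Step 3: splice in by updating the recorded predecessors.
        new_node = Node(key, new_level)
        for lvl in range(new_level):
            new_node.forward[lvl] = update[lvl].forward[lvl]
            update[lvl].forward[lvl] = new_node

    def to_list(self):
        # The bottom level is always sorted.
        out, node = [], self.head.forward[0]
        while node is not None:
            out.append(node.key)
            node = node.forward[0]
        return out

sl = SkipList()
for k in [5, 1, 9, 3, 7]:
    sl.insert(k)
print(sl.to_list())  # [1, 3, 5, 7, 9]
```

Whatever levels the coin flips assign, the bottom level stays sorted, so `to_list` is deterministic even though the structure above it is random.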
Deletion
Search for the node, record predecessors at each level, and update their forward pointers to skip over the deleted node. If the deleted node was the only node at the highest level, reduce the effective maximum level.
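A deletion sketch over a small hand-built list. For brevity it omits the final step above; a production implementation would also shrink the list's effective level when the top level empties out:

```python
class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level

def delete(head, key):
    # Search, recording the predecessor at each level.
    update = [None] * len(head.forward)
    node = head
    for lvl in reversed(range(len(head.forward))):
        while node.forward[lvl] is not None and node.forward[lvl].key < key:
            node = node.forward[lvl]
        update[lvl] = node
    target = node.forward[0]
    if target is None or target.key != key:
        return False  # key not present
    # Unlink the target at every level where it appears.
    for lvl in range(len(target.forward)):
        if update[lvl].forward[lvl] is target:
            update[lvl].forward[lvl] = target.forward[lvl]
    return True

def keys(head):
    out, node = [], head.forward[0]
    while node is not None:
        out.append(node.key)
        node = node.forward[0]
    return out

# Hand-built list: 1, 3, 5 on the bottom; 3 also on level 1.
head = Node(None, 2)
n1, n3, n5 = Node(1, 1), Node(3, 2), Node(5, 1)
head.forward = [n1, n3]
n1.forward[0] = n3
n3.forward = [n5, None]
n5.forward[0] = None

delete(head, 3)
print(keys(head))  # [1, 5]
```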
The Math
Expected search time is O(log n) with high probability. The intuition: each level contains roughly a fraction p of the elements of the level below it, so the expected number of levels is O(log_{1/p} n). At each level, you examine an expected O(1/p) nodes before dropping down. So total work is O((1/p) * log_{1/p} n) = O(log n).
The constant factor depends on p. With p=0.5, you examine about 2 nodes per level across about log_2(n) levels. With p=0.25, you examine about 4 nodes per level across about log_4(n) levels. The simple estimate (1/p) * log_{1/p}(n) works out almost identically for the two, so the real trade-off is elsewhere: p=0.25 uses less memory (an expected 1/(1-p) ≈ 1.33 pointers per node instead of 2) at the cost of slightly higher variance in search times.
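Plugging numbers into these estimates makes the trade-off concrete. This is a back-of-the-envelope sketch; Pugh's paper gives a more careful analysis in which the constants differ slightly:

```python
from math import log

def expected_pointers_per_node(p):
    # P(level >= k) = p**(k-1); summing the geometric series gives 1/(1-p).
    return 1 / (1 - p)

def expected_search_cost(p, n):
    # ~1/p nodes examined per level, across ~log base 1/p of n levels.
    return (1 / p) * log(n, 1 / p)

for p in (0.5, 0.25):
    print(p, expected_pointers_per_node(p), round(expected_search_cost(p, 10**6), 1))
```

Under this crude model, p=0.5 and p=0.25 examine the same expected number of nodes on a million-element list, while p=0.25 stores a third fewer pointers per node.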
Why Redis Chose Skip Lists
When Salvatore Sanfilippo (antirez) designed Redis sorted sets, he needed a data structure that supports:
- Insert and delete in O(log n)
- Lookup by score in O(log n)
- Range queries (give me all elements with score between 10 and 50)
- Rank queries (what is the rank of this element?)
Both balanced BSTs and skip lists handle the first three. But skip lists make range queries trivially efficient: find the start of the range, then walk the bottom-level linked list forward until you pass the end. No tree traversal, no stack, no in-order iterator state. Just follow forward pointers.
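A range-query sketch over a hand-built list: one multi-level descent to find the start of the range, then a plain bottom-level walk:

```python
class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level

def range_query(head, lo, hi):
    # Multi-level descent to find the first key >= lo...
    node = head
    for lvl in reversed(range(len(head.forward))):
        while node.forward[lvl] is not None and node.forward[lvl].key < lo:
            node = node.forward[lvl]
    node = node.forward[0]
    # ...then a bottom-level walk until we pass hi.
    out = []
    while node is not None and node.key <= hi:
        out.append(node.key)
        node = node.forward[0]
    return out

# Hand-built list: 10..50 on the bottom; 30 also on level 1.
head = Node(None, 2)
n10, n20, n30, n40, n50 = Node(10, 1), Node(20, 1), Node(30, 2), Node(40, 1), Node(50, 1)
head.forward = [n10, n30]
n10.forward[0] = n20
n20.forward[0] = n30
n30.forward = [n40, None]
n40.forward[0] = n50
n50.forward[0] = None

print(range_query(head, 15, 45))  # [20, 30, 40]
```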
For rank queries, Redis augments each skip list node with a span field: the number of nodes skipped by each forward pointer. By summing spans during traversal, you can compute the rank of any element in O(log n) time. Doing this with a balanced BST requires augmenting every node with a subtree size, which is doable but adds complexity to every rotation.
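A sketch of rank lookup with spans. The list and its spans are hand-built here for illustration; a real implementation (Redis's zslGetRank, for example) maintains the spans incrementally on every insert and delete:

```python
class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level
        # span[i]: how many bottom-level steps forward[i] jumps over
        self.span = [0] * level

def rank(head, key):
    # Sum spans as we descend; the total is the 1-based rank.
    r, node = 0, head
    for lvl in reversed(range(len(head.forward))):
        while node.forward[lvl] is not None and node.forward[lvl].key <= key:
            r += node.span[lvl]
            node = node.forward[lvl]
        if node.key == key:
            return r
    return None  # key not present

# Hand-built list 10..50 with precomputed spans; 30 and 50 also on level 1.
head = Node(None, 2)
n10, n20, n40 = Node(10, 1), Node(20, 1), Node(40, 1)
n30, n50 = Node(30, 2), Node(50, 2)
head.forward, head.span = [n10, n30], [1, 3]
n10.forward, n10.span = [n20], [1]
n20.forward, n20.span = [n30], [1]
n30.forward, n30.span = [n40, n50], [1, 2]
n40.forward, n40.span = [n50], [1]
n50.forward, n50.span = [None, None], [0, 0]

print(rank(head, 40))  # 4
```

The search path visits O(log n) nodes either way; the spans just let you count how many bottom-level elements each jump covers.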
Antirez has also cited implementation simplicity. A skip list is about half the code of a red-black tree and easier to get right. For a project like Redis, where correctness is critical and the data structure is a core building block, simplicity has real value.
Concurrent Skip Lists
This is where skip lists really shine compared to balanced trees.
A balanced BST insertion might trigger rotations that affect multiple nodes across the tree. Making this concurrent requires either coarse-grained locking (which kills throughput) or fine-grained locking on individual nodes (which is complex and prone to deadlocks).
Skip list insertions only modify the predecessors at each level. Two insertions into different parts of the list touch different nodes and do not interfere. Fine-grained locking is straightforward: lock only the predecessors.
Lock-free skip lists go further. The key insight: you can break insertion into level-by-level pointer updates, each of which is a single compare-and-swap (CAS) operation. If the CAS fails (because another thread modified the same pointer), retry. The Herlihy and Shavit algorithm (from "The Art of Multiprocessor Programming") is the standard reference.
Java's ConcurrentSkipListMap implements this. It is the only concurrent sorted map in the Java standard library, and it exists specifically because skip lists lend themselves to lock-free implementation in a way that balanced trees do not.
Skip Lists as Memtables
LevelDB and RocksDB use skip lists for their in-memory write buffer (the memtable). Here is why.
Writes in an LSM tree go to the memtable first. The memtable must support fast concurrent inserts (multiple writer threads) and fast concurrent reads (queries that check the memtable before checking on-disk SSTables). When the memtable reaches a size threshold, it is frozen (made read-only) and flushed to an SSTable on disk.
A skip list fits this workload perfectly:
- Fast inserts: O(log n) with minimal locking
- Concurrent reads during writes: readers traverse forward pointers that are updated atomically. A reader might see a slightly stale view (missing the very latest insert), but it will never see a corrupted data structure. This is acceptable for the memtable use case.
- Sorted output for flushing: when it is time to flush to an SSTable, just walk the bottom-level linked list from head to tail. The output is already sorted, which is exactly what SSTable files need.
- No rebalancing: unlike a tree, there are no structural changes that require locking large portions of the data structure. An insert at position X does not affect the structure at position Y.
RocksDB also offers a hash-based skip list variant where the key space is hash-partitioned into buckets, each with its own skip list. This reduces contention further for workloads where keys are distributed across many prefixes.
Comparison with Balanced BSTs
| Property | Skip List | Red-Black Tree |
|---|---|---|
| Search | O(log n) expected | O(log n) worst case |
| Insert | O(log n) expected | O(log n) worst case + rotations |
| Delete | O(log n) expected | O(log n) worst case + rotations |
| Range query | Walk linked list | In-order traversal |
| Implementation | ~100 lines | ~300 lines |
| Concurrent access | Natural (CAS per level) | Hard (rotations touch multiple nodes) |
| Memory | 1.33-2 pointers/node | 3 pointers + color bit/node |
| Worst case | O(n) theoretically | O(log n) guaranteed |
The worst-case O(n) for skip lists sounds scary, but it happens with probability roughly 1/n^c for a constant c that depends on p. With a million elements, the probability of anything close to O(n) is astronomically small. In practice, skip list performance is as reliable as balanced tree performance.
The real trade-off is guaranteed worst case (BST) vs. implementation simplicity and concurrency friendliness (skip list). For single-threaded applications where worst-case guarantees matter (real-time systems), use a balanced tree. For concurrent applications or when simplicity matters, skip lists are hard to beat.
Deterministic Skip Lists
Randomized skip lists rely on random level assignment. Deterministic skip lists (like 1-2 skip lists or 1-2-3 skip lists) assign levels based on structural invariants, similar to how B-trees maintain balance through splitting. These give O(log n) worst case but lose the simplicity advantage and the concurrency friendliness. They exist primarily as theoretical curiosities. In practice, the randomized version is what people use.
The Memory Layout Problem
On modern hardware, cache performance matters as much as algorithmic complexity. Linked structures like skip lists have poor spatial locality: each node is allocated separately, and following a pointer often means a cache miss. B-trees win on cache performance because they pack many keys into a single cache-line-sized node.
For in-memory data, this can matter. A B-tree with a branching factor of 16 fits 15 keys per node, and those keys are likely in the same cache line. A skip list touches one node per level per step, and each node could be anywhere in memory.
For the memtable use case, this is acceptable because the memtable is small enough to mostly fit in L2/L3 cache. For larger data structures, consider cache-oblivious alternatives or B-trees. Redis skip lists work well because Redis values are reference-counted objects, so the skip list nodes contain pointers to values rather than the values themselves, and the working set is typically cache-friendly.
Key Points
- A skip list is a layered linked list where higher layers skip over multiple elements, giving you O(log n) search, insert, and delete on a data structure that is conceptually much simpler than a balanced binary tree
- Level assignment is randomized: each new node gets level 1, then flips a coin repeatedly to decide if it should also appear at level 2, level 3, and so on. This probabilistic approach gives expected O(log n) performance without any rebalancing logic
- Redis chose skip lists over balanced trees for sorted sets (ZSET) because skip lists are simpler to implement, naturally support range queries, and are easier to reason about for concurrent access. Antirez has said as much in the Redis documentation
- LevelDB and RocksDB use skip lists for their in-memory memtable because skip lists support concurrent reads without locking and concurrent inserts with minimal synchronization. When the memtable fills up, it flushes to an SSTable on disk
- The space overhead is modest. With p=0.25 (each node has a 25% chance of being promoted to the next level), the expected number of pointers per node is 1.33. With p=0.5, it is 2. In practice, the pointer overhead is small compared to the keys and values stored
Common Mistakes
- Using p=0.5 (50% promotion probability) without thinking about it. p=0.5 gives marginally better search constants but uses more memory (2 pointers per node on average). p=0.25 uses less memory (1.33 pointers) with only slightly slower search. Redis uses p=0.25
- Setting maxLevel too low. The maximum level should be about log base 1/p of the expected number of elements, i.e. log_{1/p}(n). With 1 million elements and p=0.25, you need at least 10 levels. Too few levels and the top layers do not skip enough, degrading toward linear search
- Forgetting that skip lists are probabilistic. A single skip list can get unlucky and hit O(n) performance if the random level assignments are pathological. In practice, with a reasonable random number generator, this essentially never happens, but it is good to know the worst case
- Implementing concurrent skip lists naively. Lock-free skip list insertion requires careful ordering of pointer updates and memory barriers. The Herlihy and Shavit algorithm is the standard reference, but getting it right is harder than it looks