HashMap
An array of buckets, a hash function, and linked lists for collisions. Simple pieces that combine into the most used data structure in software engineering.
Key Abstractions
HashMap: Public API for put, get, remove. Manages capacity and triggers resizing.
Entry: Key-value-hash triple. Stores the precomputed hash to avoid rehashing during resize.
HashFunction: Strategy interface for computing hash codes. Swappable for different key types.
Class Diagram
How It Works
Every HashMap boils down to the same three-step dance: hash the key, pick a bucket, and handle collisions. You take a key, run it through a hash function to get an integer, then map that integer to an array index. When two keys land on the same index (and they will), you need a collision resolution strategy.
We use separate chaining here. Each bucket holds a linked list of entries. When a collision happens, the new entry gets prepended to the list at that index. It sounds like it could get slow, but as long as the load factor stays below 0.75, most chains have zero or one entry. Average case stays O(1).
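The three-step dance can be sketched in miniature before the full implementation below. This is a toy sketch with a fixed bucket count and list-of-pairs chains; the names here are illustrative only.

```python
# Toy hash map: fixed 8 buckets, each bucket a list of [key, value] pairs.
# The real implementation below resizes dynamically and uses linked nodes.
buckets = [[] for _ in range(8)]

def put(key, value):
    idx = hash(key) % 8                # step 1 & 2: hash the key, pick a bucket
    for pair in buckets[idx]:          # step 3: collision -> scan the chain
        if pair[0] == key:
            pair[1] = value            # existing key: update in place
            return
    buckets[idx].append([key, value])  # new key: join the chain

def get(key):
    idx = hash(key) % 8
    for k, v in buckets[idx]:
        if k == key:
            return v
    return None                        # missing key

put("a", 1)
put("a", 2)                            # update, not duplicate
print(get("a"))  # 2
```

With 8 buckets and more than 8 keys, some chains necessarily hold multiple pairs; lookups still work, just with a short linear scan.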
Requirements
Functional
- put(key, value) inserts a new entry or updates an existing one
- get(key) returns the associated value, or null/None if the key doesn't exist
- remove(key) deletes the entry and returns whether it existed
- Automatically resizes when the number of entries exceeds the load factor threshold
Non-Functional
- O(1) amortized time for put, get, and remove
- Thread-safe for concurrent access
- Hash function should be swappable via Strategy pattern
Design Decisions
Why separate chaining over open addressing?
Open addressing (linear probing, double hashing) stores everything in the main array. Sounds cache-friendly, and it is. But deletions become painful. You need tombstone markers, and over time those tombstones degrade performance unless you periodically rehash. Separate chaining handles deletions cleanly since you just unlink a node. It also degrades more gracefully under high load. When chains grow, performance drops linearly. Open addressing can hit clustering cascades where a single hot region tanks the whole table.
Why power-of-2 capacity?
With an arbitrary capacity, you compute the bucket index using modulo: hash % capacity. Division is slow. With power-of-2 capacity, you replace it with a bitmask: hash & (capacity - 1). Same result, much faster. The tradeoff is that poor hash functions with patterns in the low bits will cluster more. That's why we include bit-spreading in the hash function, mixing high bits into low bits to counteract this.
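Both the equivalence and the effect of bit spreading are easy to check directly (capacity 16 here matches the implementation's default):

```python
# With power-of-2 capacity, hash & (capacity - 1) equals hash % capacity
# for non-negative hashes. capacity = 16 -> mask = 0b1111.
capacity = 16
mask = capacity - 1

for h in (0, 7, 16, 12345, 2**31 - 1):
    assert h % capacity == h & mask

# Bit spreading: fold high bits into low bits so keys that differ only
# in their high bits don't all land in the same bucket.
def spread(h: int) -> int:
    return h ^ (h >> 16)

a, b = 0x10000, 0x20000                    # differ only above bit 16
print(a & mask, b & mask)                  # 0 0 -- same bucket without spreading
print(spread(a) & mask, spread(b) & mask)  # 1 2 -- separated after spreading
```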
Why rehash at 0.75 load factor?
It's a balance between space and time. At 0.5 you waste half your memory. At 1.0 you have an average chain length of 1 but worst-case chains get long enough to hurt. 0.75 is the sweet spot that Java's HashMap settled on after benchmarking, and it works well in practice. Most buckets hold 0 or 1 entries, and resize frequency stays reasonable.
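The "most buckets hold 0 or 1 entries" claim can be checked numerically: under a uniform hash, bucket occupancy is approximately Poisson-distributed with mean equal to the load factor.

```python
import math

# Fraction of buckets holding exactly k entries at load factor lf is
# approximately e^-lf * lf^k / k! (Poisson approximation, uniform hashing).
for lf in (0.5, 0.75, 1.0):
    p0 = math.exp(-lf)            # empty buckets
    p1 = math.exp(-lf) * lf       # single-entry buckets
    print(f"load {lf}: ~{p0:.0%} empty, ~{p1:.0%} single-entry")
```

At 0.75, roughly 80% of buckets hold zero or one entry, so the average successful lookup touches barely more than one node.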
Why store the hash code in each Entry?
During resize, every single entry needs to be re-indexed into the new bucket array. If you don't store the hash, you have to recompute it for every entry. For objects with expensive hashCode() implementations (like long strings), that adds up fast. One extra int per entry is a small price for avoiding redundant computation during resize.
Interview Follow-ups
- "How would you handle a hash-flooding attack?" Use a randomized hash function or switch to a balanced tree for buckets once chain length exceeds a threshold (this is what Java 8+ does at length 8). Randomization prevents attackers from crafting inputs that all hash to the same bucket.
- "What about concurrent reads without locking?" You could use a lock-free approach with volatile reads on the bucket array and CAS operations for writes. Java's ConcurrentHashMap uses segment-level locking and lock-free reads as a middle ground.
- "How would you implement iteration?" Walk the bucket array left to right, follow each chain to completion. Modification during iteration is the hard part. You either snapshot or use a fail-fast counter that detects structural changes.
- "Why not use a balanced BST per bucket instead of a linked list?" For most workloads, chains are short enough that linked list traversal is faster due to lower overhead. But if you expect adversarial inputs, trees cap worst-case lookup at O(log n) per bucket instead of O(n).
Code Implementation
from __future__ import annotations
from abc import ABC, abstractmethod
from threading import Lock
from typing import Any, Iterator


class HashFunction(ABC):
    """Strategy interface for hashing keys."""

    @abstractmethod
    def hash(self, key: Any) -> int: ...


class DefaultHashFunction(HashFunction):
    """Uses Python's built-in hash with bit spreading to reduce clustering."""

    def hash(self, key: Any) -> int:
        h = hash(key)
        # Spread high bits into low bits (similar to Java's HashMap)
        return h ^ (h >> 16)


class Entry:
    """Key-value-hash triple. Linked via next pointer for chaining."""

    __slots__ = ("key", "value", "hash_code", "next")

    def __init__(self, key: Any, value: Any, hash_code: int):
        self.key = key
        self.value = value
        self.hash_code = hash_code
        self.next: Entry | None = None


class HashMap:
    """Hash map with separate chaining, dynamic resizing, and pluggable hash function."""

    _DEFAULT_CAPACITY = 16
    _LOAD_FACTOR = 0.75

    def __init__(self, capacity: int = _DEFAULT_CAPACITY, hash_fn: HashFunction | None = None):
        # Round capacity up to the nearest power of 2
        self._capacity = self._next_power_of_two(capacity)
        self._size = 0
        self._buckets: list[Entry | None] = [None] * self._capacity
        self._hash_fn = hash_fn or DefaultHashFunction()
        self._lock = Lock()

    @staticmethod
    def _next_power_of_two(n: int) -> int:
        if n <= 1:
            return 1
        n -= 1
        n |= n >> 1
        n |= n >> 2
        n |= n >> 4
        n |= n >> 8
        n |= n >> 16
        return n + 1

    def _index(self, hash_code: int) -> int:
        return hash_code & (self._capacity - 1)

    def put(self, key: Any, value: Any) -> None:
        with self._lock:
            h = self._hash_fn.hash(key)
            idx = self._index(h)
            entry = self._buckets[idx]

            while entry is not None:
                if entry.hash_code == h and entry.key == key:
                    entry.value = value
                    return
                entry = entry.next

            new_entry = Entry(key, value, h)
            new_entry.next = self._buckets[idx]
            self._buckets[idx] = new_entry
            self._size += 1

            if self._size > self._capacity * self._LOAD_FACTOR:
                self._resize()

    def get(self, key: Any) -> Any | None:
        with self._lock:
            h = self._hash_fn.hash(key)
            idx = self._index(h)
            entry = self._buckets[idx]

            while entry is not None:
                if entry.hash_code == h and entry.key == key:
                    return entry.value
                entry = entry.next
            return None

    def remove(self, key: Any) -> bool:
        with self._lock:
            h = self._hash_fn.hash(key)
            idx = self._index(h)
            entry = self._buckets[idx]
            prev = None

            while entry is not None:
                if entry.hash_code == h and entry.key == key:
                    if prev is None:
                        self._buckets[idx] = entry.next
                    else:
                        prev.next = entry.next
                    self._size -= 1
                    return True
                prev = entry
                entry = entry.next
            return False

    def _resize(self) -> None:
        # Called from put with the lock already held, so no extra locking here.
        old_buckets = self._buckets
        self._capacity *= 2
        self._buckets = [None] * self._capacity
        self._size = 0

        for head in old_buckets:
            entry = head
            while entry is not None:
                # Reuse the stored hash; no need to recompute it.
                self._put_internal(entry.key, entry.value, entry.hash_code)
                entry = entry.next

    def _put_internal(self, key: Any, value: Any, hash_code: int) -> None:
        # Lock-free insert used only during resize, while the lock is held.
        idx = self._index(hash_code)
        new_entry = Entry(key, value, hash_code)
        new_entry.next = self._buckets[idx]
        self._buckets[idx] = new_entry
        self._size += 1

    def __len__(self) -> int:
        return self._size

    def __contains__(self, key: Any) -> bool:
        # Note: cannot distinguish a stored None value from a missing key.
        return self.get(key) is not None

    def keys(self) -> Iterator[Any]:
        # Not synchronized; avoid structural modification while iterating.
        for head in self._buckets:
            entry = head
            while entry is not None:
                yield entry.key
                entry = entry.next


if __name__ == "__main__":
    hm = HashMap(capacity=4)

    # Basic put and get
    hm.put("name", "Alice")
    hm.put("age", 30)
    hm.put("city", "Seattle")
    print(f"get('name') = {hm.get('name')}")  # Alice
    print(f"get('age') = {hm.get('age')}")    # 30
    print(f"size = {len(hm)}")                # 3

    # Update existing key
    hm.put("age", 31)
    print(f"get('age') after update = {hm.get('age')}")  # 31

    # Force collisions and resize by adding more entries
    for i in range(20):
        hm.put(f"key_{i}", i)
    print(f"size after bulk insert = {len(hm)}")  # 23
    print(f"get('key_15') = {hm.get('key_15')}")  # 15

    # Remove
    removed = hm.remove("name")
    print(f"remove('name') = {removed}")                   # True
    print(f"get('name') after remove = {hm.get('name')}")  # None

    # Contains
    print(f"'city' in map = {'city' in hm}")  # True
    print(f"'name' in map = {'name' in hm}")  # False
    print("All operations completed successfully.")

Common Mistakes
- ✗ Using key equality without checking hash first. Compare hashes before calling equals() to short-circuit expensive comparisons.
- ✗ Forgetting to handle null keys or duplicate keys on put(). Duplicates should update the value, not create a second entry.
- ✗ Not resizing. Without resize, chains grow linearly and your O(1) map degrades to O(n).
- ✗ Rehashing with the old capacity instead of the new one. Every entry must be re-indexed against the new capacity.
Key Points
- ✓ Separate chaining handles collisions by prepending to the linked list at each bucket index.
- ✓ Power-of-2 capacity lets you replace modulo with bitwise AND. index = hash & (capacity - 1) is faster.
- ✓ Rehashing at 0.75 load factor keeps average chain length short, preserving O(1) amortized access.
- ✓ Storing the hash in each Entry avoids recomputing it during resize. Small cost, big win at scale.