System Design: URL Shortener (10B Short URLs, 100K Redirects/sec)
Goal. Build a URL shortener for 10 billion stored URLs, 100K redirect lookups/sec, and 1K creates/sec. Custom aliases, expiration, click analytics, and cache-hit redirect latency under 5ms p99 server-side.
TL;DR. This is a read-heavy key-value lookup with an async analytics sidecar. Each region mints IDs from its own counter with zero cross-region coordination, shuffled and Base62-encoded into 7-character codes (62⁷ ≈ 2⁴² IDs). ScyllaDB stores the mappings, replicated active-active across three regions. The read path is CDN → local LRU → Valkey → Scylla, with each layer absorbing what the previous one missed. Kafka + Flink + ClickHouse run analytics asynchronously, off the critical path.
Pick your path
| Time | Read this | Covers |
|---|---|---|
| 2 min | TL;DR + §1 + §11.2 Zipf callout | The shape of the system and its most fragile assumption |
| 15 min | §1–§13 | Every core design decision, interview-grade |
| 30 min | Full post | Production detail, ops, and appendices |
1. Final Architecture
Three independent paths. The read path is fast (sub-5ms server-side on cache hit), the write path is regional and not latency-sensitive, the analytics path is fully off the critical path.
Write path:
Client → Nearest region → Create Service
→ Regional counter (Scylla LWT) → shuffle → Base62
→ ScyllaDB (NetworkTopologyStrategy, RF=3) → Valkey warm
Read path:
Client → CDN → Nearest region → Redirect Service
→ Local LRU → Valkey → ScyllaDB LOCAL_ONE (on miss) → 302
Analytics path:
Redirect Service → Kafka (async) → Flink (5s batches) → ClickHouse
Every section below zooms into one piece of this picture.
2. Problem Statement
A URL shortener is an afternoon prototype and a months-long production system. Three things make it hard at scale:
- Unique IDs with no coordination. At 1K creates/sec across regions, a single auto-increment row is a bottleneck. Random IDs need a DB read per write. Hash truncation collides at 42 bits (the birthday paradox bites hard around 10B rows).
- Redirect latency under skewed load. 100K/sec of reads, a Zipf-shaped hot set, and a sub-5ms p99 target on cache hit. You can't touch the DB on the hot path.
- Analytics without coupling. Every redirect produces a click event. Writing those synchronously adds 10–50ms to every 302. They have to go through a queue.
Scale numbers.
- 10B URLs stored
- 1K creates/sec (~86M/day)
- 100K redirects/sec (~8.6B/day), distributed across regions
- 100:1 read-to-write ratio
- 200 B average long URL; 7-character short code
3. Functional Requirements
| ID | Requirement | Priority |
|---|---|---|
| FR-01 | Create a short URL from a long URL, returning a unique 7-character code | P0 |
| FR-02 | Redirect short URL to original long URL via HTTP 301/302 | P0 |
| FR-03 | Support custom aliases (user-chosen short codes) | P0 |
| FR-04 | URL expiration: optional TTL (1 day, 7 days, 30 days, 1 year, never) | P0 |
| FR-05 | Click analytics: total clicks, clicks over time, geographic distribution | P1 |
| FR-06 | Referrer and device tracking per click | P1 |
| FR-07 | Bulk URL creation via API (up to 1000 URLs per request) | P1 |
| FR-08 | URL deletion by owner | P1 |
| FR-09 | API key authentication for URL creation | P0 |
| FR-10 | Rate limiting per API key (100 creates/min default) | P0 |
| FR-11 | QR code generation for any short URL | P2 |
| FR-12 | Link preview metadata (title, description, image from target page) | P2 |
4. Non-Functional Requirements
| ID | Requirement | Target |
|---|---|---|
| NFR-01a | Redirect latency, cache-hit path (server-side) | p50 < 2ms / p99 < 5ms |
| NFR-01b | Redirect latency, cache-miss path (server-side) | p50 < 8ms / p99 < 15ms |
| NFR-01c | Redirect latency, end-to-end (client-observed, intra-region) | p50 5–10ms / p99 20–40ms |
| NFR-02 | Create latency (p50 / p99) | < 20ms / < 50ms |
| NFR-03 | Redirect throughput | 100K/sec sustained globally |
| NFR-04 | Create throughput | 1K/sec sustained (10K burst) |
| NFR-05 | Availability | 99.99% (52 min downtime/year) |
| NFR-06 | URL durability | Zero data loss for non-expired URLs |
| NFR-07 | Data retention | Expired URLs purged automatically via Scylla TTL; active URLs stored indefinitely |
| NFR-08 | Analytics freshness | < 5 second lag from click to dashboard |
| NFR-09 | Short code length | 7 characters (Base62 = 3.5 trillion combinations) |
NFR-01a/b are server-side — request arriving at the Redirect Service to response leaving it — and they're what we page on. NFR-01c is end-to-end and dominated by network round-trip, not server work (intra-region ~10–20ms, cross-region 40–150ms). CDN edge caching, not origin speed, is what keeps user-perceived time low worldwide.
5. Design Assumptions
A single box of non-negotiables. Every number and decision downstream inherits from here.
- Read-to-write ratio ~100:1. A 10:1 or 1000:1 workload changes cache/DB sizing materially.
- Zipf-shaped URL popularity. The top 1–2% of URLs get most clicks. If traffic is uniform, the cache collapses — see §11.2 for the failure math.
- 3 regions active-active. Virginia, Frankfurt, Tokyo. No China deployment (GFW/ICP out of scope).
- No file uploads, no OAuth, no SEO-optimized variant. Different storage and auth story.
- GDPR in scope; HIPAA/SOX out of scope.
- Numbers are design targets, not benchmark results.
6. High-Level Architecture
This is a read-heavy key-value lookup with an async analytics sidecar. Every choice downstream reinforces that shape.
6.1 Layers
One stack shown; an identical stack runs in every region. Scylla replicates across DCs via NetworkTopologyStrategy; Valkey and the app tier are regional. §13 covers the multi-region topology.
- Edge. CDN caches redirects with a 60s TTL so viral links don't reach origin. The L7 LB splits /api/create from /:shortCode.
- Application. Three stateless services. Redirect handles the hot path (local LRU → Valkey → Scylla). Create allocates IDs and writes to Scylla. Analytics queries ClickHouse for dashboards.
- Cache. Valkey holds ~150M hot URL mappings per region, cache-aside with a 24h TTL.
- Storage. Scylla is the primary store, RF=3 per DC. ClickHouse holds click events, partitioned by day.
- Async. Every redirect fires a click event into Kafka. Flink batches them and inserts into ClickHouse, keeping analytics off the hot path.
6.2 Store Selection
| Store | Technology | Role |
|---|---|---|
| Primary store | ScyllaDB (Cassandra API) | URL mappings, custom aliases, API keys |
| Cache | Valkey 8 Cluster | Hot URL lookups (~150M per region) |
| Analytics store | ClickHouse | Click event aggregations |
| Event bus | Kafka | Click events, async processing |
| Counter service | ScyllaDB counter_ranges table | LWT range allocation per region |
| CDN | CloudFront / Cloudflare | Edge redirect caching for viral URLs |
6.3 Why ScyllaDB
Three properties carry the choice for this specific workload:
- Shard-per-core (Seastar/C++). Each CPU core owns its partition range with no global locks — p99 SELECT by partition key stays sub-millisecond at our read rate.
- Cassandra-compatible multi-region. NetworkTopologyStrategy with RF=3 per DC gives active-active replication out of the box, with battle-tested hinted handoff, read repair, and anti-entropy repair.
- Storage-engine primitives we actually need. Native per-insert TTL + TimeWindowCompactionStrategy handle URL expiration with no cleanup job; Paxos-backed LWT handles custom-alias uniqueness and counter range allocation on the one or two paths that need it.
Cost. ~$3K/mo self-managed on 3-year RI (18 × i4i.2xlarge across 3 regions) + ~$1.5K cross-region transfer; ~$8–10K/mo on Scylla Cloud. Operational tax is real: repair schedules, compaction tuning, consistency-level discipline, tombstone awareness (§19). Single-region RDS Postgres at ~$500–1K/mo is a legitimate alternative if you don't need active-active multi-region and your team already owns Postgres — we don't pick it here because the 4 TB partitioned-PG write rate plus active-active makes Scylla the cleaner fit.
7. Back-of-the-Envelope
7.1 Throughput
Writes: 1K creates/sec globally → ~333/sec per region
Reads: 100K redirects/sec globally
CDN absorbs ~25% (viral URLs): 25K/sec at edge
Valkey absorbs ~80% of origin: 60K/sec from cache
ScyllaDB: ~15K reads/sec + 1K writes/sec globally
Per region (of 3): ~5K Scylla reads/sec
7.2 Storage
Bottom line: ~350 B per row × 10B rows × RF=3 per DC × 3 DCs = ~31.5 TB total replicated, landing at ~2.6 TB steady state per node (~70% of raw NVMe) after compaction overhead. The 70% ceiling is deliberate — it leaves headroom so recompaction doesn't hit 100% and stall reads.
Show full derivation
Per URL row in Scylla:
short_code 7 B + long_url 200 B + created_at 8 B + expires_at 8 B
+ user_id 16 B + flags 1 B + region 1 B + metadata ~50 B
+ Scylla row overhead (column names, write ts, TTL metadata) ~60 B
Effective: ~350 B per row
Primary table: 10B × 350 B ≈ 3.5 TB per DC before replication
RF=3 per DC: 3.5 TB × 3 ≈ 10.5 TB per DC on disk
Across 3 DCs: ≈ 31.5 TB total replicated
Per node (6 × i4i.2xlarge, 2×1.875 TB NVMe = 3.75 TB):
10.5 TB ÷ 6 ≈ 1.75 TB per node (~47% raw NVMe)
+ compaction overhead ~1.5× ≈ 2.6 TB steady state (~70%)
Compaction overhead comes from merging SSTables — during a compaction, both the old and new SSTables coexist briefly.
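The derivation above is simple enough to sanity-check in a few lines. All inputs below are the design targets from this section, not measured values:

```python
# Back-of-the-envelope storage check for §7.2.
ROW_BYTES = 350          # effective bytes per URL row, overhead included
ROWS = 10_000_000_000    # 10B URLs
RF = 3                   # replication factor per DC
DCS = 3                  # regions
NODES_PER_DC = 6         # i4i.2xlarge
NVME_PER_NODE_TB = 3.75  # 2 x 1.875 TB NVMe
COMPACTION_OVERHEAD = 1.5

per_dc_tb = ROW_BYTES * ROWS / 1e12           # before replication
on_disk_per_dc_tb = per_dc_tb * RF            # RF=3 within one DC
total_tb = on_disk_per_dc_tb * DCS            # all three DCs
per_node_tb = on_disk_per_dc_tb / NODES_PER_DC
steady_tb = per_node_tb * COMPACTION_OVERHEAD # old + new SSTables coexist
utilization = steady_tb / NVME_PER_NODE_TB

print(f"{per_dc_tb:.1f} TB/DC raw, {total_tb:.1f} TB replicated, "
      f"{steady_tb:.2f} TB/node steady state ({utilization:.0%} of NVMe)")
# 3.5 TB/DC raw, 31.5 TB replicated, 2.62 TB/node steady state (70% of NVMe)
```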
7.3 Cache Sizing
Top 150M URLs (1.5%) cover ~80% of origin traffic (Zipf).
150M × 250 B raw = 37.5 GB
Valkey overhead (hash table, pointers, replication) ~1.6×
= ~60 GB per region → 6 shards × 10 GB
7.4 Analytics
100K events/sec × 300 B = 30 MB/sec = 2.6 TB/day raw
ClickHouse compression ~10× → ~260 GB/day
1-year retention (per the ClickHouse TTL) → ~95 TB of raw events; hourly rollups add a small fraction on top
8. Data Model
8.1 ScyllaDB Schema (CQL)
Query-first design — each access pattern gets its own table instead of a secondary index. The primary urls table is inline below; the other five tables (click counts, custom aliases, counter ranges, user reverse-lookup, API keys) are collapsed for readability.
CREATE KEYSPACE url_shortener
WITH replication = {
'class': 'NetworkTopologyStrategy',
'us-east': 3,
'eu-central': 3,
'ap-northeast': 3
} AND durable_writes = true;
-- Primary URL mapping table. Immutable after create; TTL reaps expired rows.
CREATE TABLE url_shortener.urls (
short_code text,
long_url text,
user_id uuid,
created_at timestamp,
expires_at timestamp,
is_custom_alias boolean,
region_id tinyint,
metadata map<text, text>,
PRIMARY KEY (short_code)
) WITH gc_grace_seconds = 864000 -- 10 days, must exceed repair interval
AND compaction = {
'class': 'TimeWindowCompactionStrategy',
'compaction_window_unit': 'DAYS',
'compaction_window_size': 30
};
Show the other 5 tables (click counts, custom aliases, counter ranges, user reverse-lookup, API keys)
-- Counter columns must be isolated from non-counter columns in Cassandra.
CREATE TABLE url_shortener.url_click_counts (
short_code text PRIMARY KEY,
clicks counter
) WITH compaction = {'class': 'SizeTieredCompactionStrategy'};
-- Custom alias uniqueness table. Lowercased alias as partition key.
-- Writes go through the leader region (§13.3) to avoid cross-DC LWT cost.
CREATE TABLE url_shortener.custom_aliases (
alias text PRIMARY KEY,
short_code text,
owner_user_id uuid,
created_at timestamp
);
-- Regional counter rows for ID allocation. Accessed once per ~100K URLs per pod.
CREATE TABLE url_shortener.counter_ranges (
region_id tinyint PRIMARY KEY,
current_value bigint
);
-- Reverse lookup: list a user's URLs, newest first.
CREATE TABLE url_shortener.urls_by_user (
user_id uuid,
created_at timestamp,
short_code text,
PRIMARY KEY ((user_id), created_at, short_code)
) WITH CLUSTERING ORDER BY (created_at DESC);
-- API key metadata. key_hash is SHA-256 of the raw key.
CREATE TABLE url_shortener.api_keys (
key_hash blob PRIMARY KEY,
user_id uuid,
name text,
rate_limit int,
created_at timestamp,
revoked_at timestamp
);
Key schema decisions:
- urls_by_user is a separate table, not a secondary index — Cassandra secondary indexes don't scale for high-cardinality columns.
- TimeWindowCompactionStrategy on urls because rows are immutable. Whole SSTables drop when their time window ages out, so TTL reaping doesn't cost extra compaction work.
- gc_grace_seconds = 10 days so scheduled repair can propagate tombstones before physical deletion. Must exceed the repair interval.
- TTL is per-insert, not default_time_to_live, because most URLs don't expire.
8.2 Valkey Key Patterns
url:{short_code} → {long_url}|{expires_at} (24h TTL)
ratelimit:{api_key}:{minute} → count (sliding window, 60s TTL)
counter:{pod_id}:current → current counter value
counter:{pod_id}:max → range end
8.3 ClickHouse Schema
Partitioned daily, MergeTree-ordered by (short_code, clicked_at), 1-year TTL. A SummingMergeTree materialized view rolls up hourly counts and unique-country counts per short code.
Show full DDL
CREATE TABLE click_events (
event_id UUID DEFAULT generateUUIDv4(),
short_code String,
clicked_at DateTime64(3),
referrer String,
user_agent String,
ip_country LowCardinality(String),
ip_city String,
device_type LowCardinality(String),
browser LowCardinality(String),
os LowCardinality(String)
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(clicked_at)
ORDER BY (short_code, clicked_at)
TTL clicked_at + INTERVAL 1 YEAR;
CREATE MATERIALIZED VIEW click_counts_mv
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMMDD(period_start)
ORDER BY (short_code, period_start)
AS SELECT
short_code,
toStartOfHour(clicked_at) AS period_start,
count() AS clicks,
uniqExact(ip_country) AS unique_countries
FROM click_events
GROUP BY short_code, period_start;
9. API Design
9.1 Create Short URL
POST /api/v1/urls
Authorization: Bearer {api_key}
Content-Type: application/json
X-Idempotency-Key: {uuid}
Request:
{
"url": "https://example.com/very/long/path/to/some/resource?param=value",
"custom_alias": null,
"expires_in": "30d",
"metadata": {"campaign": "spring_sale", "source": "email"}
}
Response 201 Created:
{
"short_code": "Ab3xK9f",
"short_url": "https://sho.rt/Ab3xK9f",
"preview_url": "https://sho.rt/Ab3xK9f+",
"long_url": "https://example.com/very/long/path/to/some/resource?param=value",
"created_at": "2026-03-25T10:30:00Z",
"expires_at": "2026-04-24T10:30:00Z",
"qr_code_url": "https://sho.rt/api/v1/qr/Ab3xK9f"
}
9.2 Redirect
GET /{short_code}
HTTP/1.1 302 Found
Location: https://example.com/very/long/path/...
Cache-Control: private, max-age=60
X-Short-Code: Ab3xK9f
We default to 302, not 301. A 301 tells browsers to cache the redirect forever, breaking future edits/expirations and killing click tracking (the browser never hits our server again). 301 is available as an opt-in per URL for users who want SEO link juice.
9.3 Preview Mode
GET /{short_code}+
A + suffix returns an HTML preview (destination, owner, created date, click count) without redirecting. Same convention as Bitly. For URLs flagged as borderline by the safety scanner, the preview page is served instead of the 302 and requires a "Continue to destination" click-through.
9.4 Other Endpoints
GET /api/v1/urls/{short_code}/analytics?period=7d&granularity=hour
DELETE /api/v1/urls/{short_code} -- owner only
GET /api/v1/urls?user_id={id}&page=1 -- paginated list
POST /api/v1/urls/bulk -- up to 1000
GET /api/v1/qr/{short_code} -- SVG QR code
PATCH /api/v1/urls/{short_code} -- update expiration or metadata
10. ID Generation
The create path needs a unique, non-guessable 7-character short code with zero per-write coordination. Random + collision-check needs a DB read per write. Hash truncation collides at 42 bits (birthday paradox, nearly certain at 10B). A region-prefixed counter with a bijective shuffle is collision-free by construction and mints new codes from local memory. That's the choice.
10.1 Why 42 Bits
We target 42 bits of ID space because 62⁷ ≈ 3.5 × 10¹² ≈ 2⁴¹·⁷ — roughly the full capacity of a 7-character Base62 ([0-9a-zA-Z]) code, and we want every bit usable. (Strictly, 62⁷ is a little under 2⁴², so the shuffle cycle-walks any output that lands at or above 62⁷ back through itself, keeping every code at exactly 7 characters.) The 42 bits split as:
┌────────────┬────────────────────────────────────┐
│ 4 bits │ 38 bits │
│ region_id │ per-region counter │
└────────────┴────────────────────────────────────┘
42 bits → shuffle → Base62 → 7 chars
- 4-bit region prefix supports 16 regions — 3 today, 13 slots of headroom.
- 38-bit counter gives ~275B IDs per region. At 10B URLs across 16 regions we'd use ~625M per region: 440× headroom.
- Regional counters are disjoint by construction (different top bits), so a bijective shuffle on the full 42 bits always produces disjoint outputs. No two regions can ever mint the same short code even without talking to each other.
10.2 Create Flow in One Region
- Range allocation (once per ~100K URLs per pod, ~every 100 seconds at 1K creates/sec). The Create Service pod runs a Scylla LWT compare-and-set against its regional counter_ranges row:

  SELECT current_value FROM counter_ranges WHERE region_id = ?;   -- LOCAL_SERIAL

  UPDATE counter_ranges SET current_value = ?    -- old + 100000
  WHERE region_id = ? IF current_value = ?;      -- LOCAL_SERIAL CAS

  On success, the pod owns [old+1, old+100000]. On conflict it retries with the fresh value. Contention is rare because the row is touched ~once every 100s per pod.
- Local increment. Each create bumps the local counter and constructs the 42-bit value: (region_id << 38) | counter. No DB call.
- Bijective shuffle. The value is passed through a keyed bijection so sequential counters don't produce sequential codes (which would enable enumeration attacks). The shuffle is reversible with the server's key, so operators can decode a short code back to its counter and origin region. Pseudocode in Appendix A.
- Base62 encode. The shuffled value maps to 7 characters from [0-9a-zA-Z]. Base64 includes URL-unsafe + and /; hex would need 11 characters.
- Persist. Write the row to Scylla at LOCAL_QUORUM; warm the local Valkey entry with a 24h TTL.
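The create steps above can be sketched end-to-end in a few dozen lines. The Feistel round function, key, and cycle-walking details here are illustrative stand-ins for the real shuffle in Appendix A; the bit layout and Base62 alphabet follow §10.1:

```python
import hashlib

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
CODE_SPACE = 62 ** 7          # ~3.5e12 seven-character codes
KEY = b"demo-key"             # illustrative; the real key lives in a KMS

def _round(half: int, rnd: int) -> int:
    # Toy Feistel round function: keyed hash truncated to 21 bits.
    digest = hashlib.sha256(KEY + bytes([rnd]) + half.to_bytes(3, "big")).digest()
    return int.from_bytes(digest[:3], "big") & 0x1FFFFF

def shuffle(value: int) -> int:
    # 4-round balanced Feistel on 42 bits (two 21-bit halves), then
    # cycle-walk until the output falls inside the 7-character space.
    while True:
        left, right = value >> 21, value & 0x1FFFFF
        for rnd in range(4):
            left, right = right, left ^ _round(right, rnd)
        value = (left << 21) | right
        if value < CODE_SPACE:
            return value

def mint(region_id: int, counter: int) -> str:
    value = shuffle((region_id << 38) | counter)   # steps 2 + 3
    chars = []
    for _ in range(7):                             # step 4: fixed-width Base62
        value, rem = divmod(value, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))
```

Because the shuffle is a keyed bijection, distinct (region, counter) pairs can never collide, and the key holder can invert a code back to its counter and origin region.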
10.3 Counter Durability
The counter_ranges row is RF=3 in the local DC — a single node loss still achieves LOCAL_SERIAL quorum. Each pod pre-fetches two ranges on startup (one active, one spare). When the active range is exhausted, the spare takes over and a new spare is fetched in the background. A pod keeps creating for ~200K URLs (~3 minutes at 1K/sec) even if Scylla is briefly unreachable. A full regional Scylla outage only blocks creates in that region — other regions keep minting codes from their own counters.
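The active-plus-spare range scheme can be sketched with an in-process stand-in for the LWT — a lock-guarded dict plays the counter_ranges row, and names are illustrative:

```python
import threading

_table = {"current_value": 0}   # stand-in for the regional counter_ranges row
_table_lock = threading.Lock()

def cas_allocate(size: int = 100_000) -> range:
    # LWT analog: read, then conditional update. The lock serializes here;
    # the real Scylla CAS retries with the fresh value on conflict.
    with _table_lock:
        old = _table["current_value"]
        _table["current_value"] = old + size
    return range(old + 1, old + size + 1)

class RangeManager:
    """Per-pod ID source: one active range, one pre-fetched spare."""

    def __init__(self) -> None:
        self._active = iter(cas_allocate())
        self._spare = cas_allocate()        # fetched on startup

    def next_id(self) -> int:
        try:
            return next(self._active)
        except StopIteration:
            # Active range exhausted: promote the spare, fetch a new one.
            self._active = iter(self._spare)
            self._spare = cas_allocate()
            return next(self._active)
```

A production version would fetch the replacement spare asynchronously so exhaustion never blocks a create.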
10.4 Lifecycle and Exhaustion Plan
At 275B IDs/region and ~625M needed per region, exhaustion is a decade-plus problem. The escape hatch triggers at ~90% utilization in any region:
- Rotate to 8-character codes. Extend the counter to 44 bits (48 total with the 4-bit region prefix), shuffle across the larger space (cycle-walking as before), and Base62-encode to 8 characters. 62⁸ ≈ 2.2 × 10¹⁴ — roughly 60× the 7-character code space.
- Old 7-character codes remain valid forever. The Redirect Service handles both lengths; no retroactive migration.
- Mixed-length minting during the rollover window — new URLs get 8-char codes while the 7-char space drains.
The same mechanism covers adding regions: reserve another prefix bit (5 bits → 32 regions), provision the new counter_ranges row, deploy. The ID space is append-only — no data migration.
Check yourself
Why does the design rotate to 8-char codes at ~90% counter utilization, not at 99% or 100%?
Answer
Two reasons. First, the last 10% of counter range buys years of runway at the current rate, so there's no urgency cost to rotating early. Second, running a rollover at 99%+ means any unexpected traffic spike (a batch importer, a sudden viral week) can exhaust the range before the rollover pipeline reaches steady state. 90% is the sweet spot: comfortable headroom, still a decade off, and the rollover can be rehearsed without emergency pressure.
11. Caching
The redirect path must serve 100K req/sec at sub-5ms server-side p99 on cache hit. A three-layer cache makes this possible.
11.1 Three Layers
- CDN edge (~25% hit). Popular URLs cached at edge with Cache-Control: private, max-age=60. Short enough that deletions propagate quickly; long enough to absorb viral spikes.
- Valkey (~80% of what reaches origin). Redirect Service checks the per-pod LRU, then Valkey. Cache-aside with a 24h TTL. ~150M hot URLs per region.
- ScyllaDB. Whatever gets through reads at LOCAL_ONE (rows are immutable, stale reads don't matter). Shard-per-core keeps the long tail sub-millisecond.
Effective origin load:
100K req/sec total
- CDN 25% → 25K/sec at edge
- Valkey 80% of remainder → 60K/sec from cache
- ScyllaDB → ~15K SELECT/sec + 1K INSERT/sec globally (~5K reads/sec per region)
11.2 Cache Hit Rate Is a Zipf Assumption
⚠️ Design risk. The 80% Valkey and 25% CDN hit rates only hold if URL popularity follows a Zipf distribution. If traffic is uniform — for example, a batch importer creates 100M URLs that all get a burst of clicks — the top 150M URLs cover only ~30% of reads, origin traffic jumps from ~15K to ~52K Scylla reads/sec, and tail latency climbs from sub-millisecond to tens of ms. Monitor cache_hit_rate continuously. Page on sustained drops below 60% and scale Valkey (or Scylla) immediately.
Zipf is well-established for social and news URLs, but measure it, don't assume it. A change in traffic mix is the fastest way to push the working set outside the cache.
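A quick way to see where the ~80% figure comes from: under Zipf with exponent 1, the fraction of traffic hitting the top k of n items is roughly H(k)/H(n), and harmonic numbers are well approximated by ln(x) + γ. The exponent and the closed-form approximation are assumptions; §11.2's ~30% uniform figure additionally credits LRU recency effects that this static calculation ignores:

```python
import math

def zipf_coverage(top_k: int, n: int) -> float:
    # Traffic share of the top_k most popular of n URLs, Zipf exponent 1.
    gamma = 0.5772156649          # Euler–Mascheroni constant
    h = lambda x: math.log(x) + gamma  # harmonic number approximation
    return h(top_k) / h(n)

cached = zipf_coverage(150_000_000, 10_000_000_000)   # §7.3 cache size
uniform = 150_000_000 / 10_000_000_000                # static uniform bound
print(f"Zipf: top 1.5% covers {cached:.0%}; uniform static bound: {uniform:.1%}")
```

The Zipf result lands right around the 80% the cache sizing assumes, which is exactly why a shift toward uniform traffic is the design's most fragile point.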
11.3 Hot Key Problem
One viral link can take 1M req/sec, and the entire load lands on one Valkey shard — one key, one CPU, instant saturation. This is the most common production failure mode for URL shorteners.
Four mitigations, in order of effectiveness:
- Local in-process LRU. The Redirect Service keeps a ~10K-entry LRU with a 10s TTL per pod. Viral keys are served from process memory and never touch Valkey. Across 20 pods, this alone absorbs most of a viral burst.
- CDN TTL with jitter. Base 60s TTL plus ±10s jitter so edge entries don't all expire at the same millisecond and stampede origin.
- Request coalescing. On a cache miss, the first request acquires a short-lived lock (SET lock:url:{code} 1 NX EX 5) while fetching from Scylla; concurrent requests briefly wait-and-retry instead of all hitting the DB. "Coalescing" just means collapsing a thundering herd into one backend query. Pseudocode in Appendix B.
- Hot-key replication. If a single Valkey shard is still overloaded, replicate known hot keys across all shards with a client-side aliasing scheme (url:{code}:{shard_hint}).
These activate automatically on the first 10–100 requests to a new URL — no prior knowledge of "which keys are hot" is needed.
11.4 Server-Side Latency Budget (Redirect Service only)
End-to-end latency is dominated by network, not server work. The numbers below are the server's contribution. §13 covers how multi-region deployment handles the network half of NFR-01c.
Cache-hit path (target: p99 < 5ms server-side):
LB routing: <1 ms
Local LRU: ~0.01 ms
or Valkey GET: 0.1–0.3 ms
Response build: <0.1 ms
Total: 1–2 ms
Cache-miss path (target: p99 < 15ms server-side):
LB routing: <1 ms
Valkey GET (miss): 0.1 ms
Scylla SELECT LOCAL_ONE: 0.5–4 ms (shard-per-core, in-region)
Valkey SET: 0.1 ms
Response build: <0.1 ms
Total: 2–6 ms
Check yourself
If URL popularity were uniform instead of Zipf, which number in this design collapses first?
Answer
The Valkey hit rate. With uniform traffic, the top 150M URLs cover only ~30% of reads instead of ~80%, which pushes Scylla reads from ~15K/sec to ~52K/sec globally and blows past the shard-per-core budget. p99 climbs from sub-millisecond into the tens of milliseconds and the cache-hit SLO starts burning error budget immediately. This is why §11.2 names the assumption explicitly and pages on hit rate dropping below 60%.
12. Click Analytics Pipeline
Every redirect emits a click event. 100K events/sec without touching redirect latency requires a fully async pipeline.
Design decisions:
- Async Kafka producer, best-effort delivery. producer.send() without waiting for ack; buffered in 32 MB of producer memory; overflow drops silently. Analytics gap during a Kafka outage is acceptable — extra redirect latency is not. This is the right call for click analytics only. Billing, audit, and compliance streams need synchronous acks=all and at-least-once delivery.
- Flink 5s tumbling windows. ClickHouse wants large batch inserts, not row-by-row. 100K–500K rows per batch keeps merge pressure low.
- Enrichment in Flink. MaxMind GeoIP and user-agent parsing happen in the consumer, not on the redirect path.
Write amplification control — a naive pipeline overwhelms Kafka and ClickHouse:
- Partition Kafka by event, not by short_code — keying on the short code would land every event for a viral link on one partition (instant hot spot). Per-code ordering isn't needed; Flink re-keys in its own state.
- Producer batching: linger.ms=50, batch.size=64KB. 100K events/sec becomes ~500 Kafka requests/sec at the broker.
- LZ4 compression on topics cuts wire and disk ~4× at negligible CPU cost.
- One ClickHouse part per batch — infrequent large inserts keep merges cheap.
Raw: 100K events/sec × 300 B = 30 MB/sec
After LZ4: ~7.5 MB/sec on the wire
Producer batching: ~500 Kafka requests/sec
ClickHouse inserts: ~1–2/sec of 100K-row parts
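The producer-side decisions translate into a handful of librdkafka settings (the config dialect used by confluent-kafka-python). A sketch — the broker address is illustrative, the rest mirror the bullets above:

```python
# Fire-and-forget producer config for click analytics ONLY.
# Billing/audit streams would instead use acks="all".
click_producer_config = {
    "bootstrap.servers": "kafka:9092",    # illustrative address
    "acks": 0,                  # best-effort: never block a redirect on Kafka
    "linger.ms": 50,            # batch up to 50ms of events per request
    "batch.size": 65536,        # 64 KB batches -> ~500 broker requests/sec
    "compression.type": "lz4",  # ~4x smaller on wire and disk
    "queue.buffering.max.kbytes": 32768,  # 32 MB producer buffer; overflow drops
}
```

The same intent maps onto the Java client as `acks=0`, `linger.ms=50`, `batch.size=65536`, `compression.type=lz4`, `buffer.memory=33554432`.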
13. Multi-Region Writes
Every region runs the full stack. Reads and creates both serve from the nearest region; there is no primary region. Regional counters make this safe for short codes, and leader-region routing handles the one operation that can't parallelize (custom aliases).
13.1 Regional Topology
A Tokyo user's create hits Tokyo's Create Service, allocates from Tokyo's counter row, writes to Tokyo's Scylla DC at LOCAL_QUORUM (2 of 3 local replicas), and replicates to other DCs asynchronously. A Tokyo user's redirect hits Tokyo's Valkey first, then Tokyo's Scylla at LOCAL_ONE on miss.
13.2 Why LOCAL_QUORUM Writes Are Safe
LOCAL_QUORUM + async cross-DC replication is the standard Cassandra multi-DC pattern, but it only works because this workload makes cross-region conflicts impossible:
- Short codes can't conflict. Region-prefixed counters make regional write sets disjoint.
- Custom aliases can't conflict. All custom-alias writes are routed to a leader region (see §13.3), so LWT Paxos stays local.
- Click counts are in a counter table that's approximate-by-design; any drift is reconciled against ClickHouse.
The trade-off is explicit: we give up instant global consistency (a newly-created URL is visible in other regions within ~1s, not immediately) in exchange for zero cross-region latency on every write. For a URL shortener, that's the right call — creators don't share-and-click within one second.
13.3 Custom Alias Coordination
Custom aliases are the only operation that can't parallelize — two users could race on sho.rt/summer-sale across regions. All custom-alias writes go through a leader region (Virginia) via an L7 LB rule keyed on the custom_alias field. Non-US users pay ~100ms of extra WAN latency when creating custom aliases, acceptable because (a) creates aren't latency-sensitive and (b) custom aliases are a small fraction of traffic. Redirects still hit the nearest region — the custom_aliases table replicates to all DCs and reads are regional.
An alternative (cross-DC SERIAL LWT) exists for teams that want to avoid the single-region dependency; it's in Appendix C.
13.4 Cross-DC Replication Mechanics
- Async cross-DC replication — writes commit at LOCAL_QUORUM; the coordinator fans out to other DCs in the background. Typical lag <1s.
- Hinted handoff — brief unavailability captured as hints that replay on recovery (up to 3h default window).
- Read repair — any read finding divergent replicas triggers async repair.
- Scheduled anti-entropy — weekly nodetool repair guarantees convergence; details in §17.3.
13.5 Replication Lag Window
A URL created in Tokyo is visible in Virginia/Frankfurt within ~1 second. The narrow window where a just-created link 404s in a remote region is acceptable for the normal share-and-click flow. If a specific client needs stronger guarantees (e.g., a QR-code generator rendering immediately after creation), the Create Service can warm the originating region's Valkey entry synchronously and include a cache-warm hint in the response.
Check yourself
Why can we safely use LOCAL_QUORUM for short-code creates without cross-region coordination, but not for custom aliases?
Answer
Short-code creates can't conflict across regions because the region-prefixed counter makes every region's ID space disjoint — Virginia and Tokyo literally cannot mint the same code, so `LOCAL_QUORUM` is enough. Custom aliases share a single global key space (`my-brand` is the same string everywhere), so two users in different regions can race on the same alias within the ~1s replication lag. That race needs a single serialization point — either a leader region with `LOCAL_SERIAL` LWT (what we picked) or cross-DC `SERIAL` LWT (Appendix C). There's no way to keep it `LOCAL_QUORUM` and also guarantee uniqueness.
14. Custom Alias & Expiration
14.1 Custom Aliases
Custom aliases need global uniqueness. Two users requesting the same alias simultaneously is the race condition; without proper handling, one silently overwrites the other.
Case-insensitive conflicts. my-brand, My-Brand, and MY-BRAND are the same alias. The Create Service lowercases input before the uniqueness check and stores it lowercase. Normalize once at the edge — never rely on downstream components to handle casing consistently.
Solution: Scylla LWT in the leader region.
INSERT INTO custom_aliases (alias, short_code, owner_user_id, created_at)
VALUES (?, ?, ?, ?)
IF NOT EXISTS;
-- LOCAL_SERIAL consistency
On [applied=false], the Create Service returns 409 Conflict. Because the write is routed to the leader region (§13.3), the Paxos round stays local and cheap (~5–10ms).
Reserved words. A blocklist (api, admin, login, help, about, static, …) is checked before the INSERT so users can't claim paths that collide with application routes.
Validation. 3–30 characters, alphanumeric plus hyphens, no leading/trailing hyphens.
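The normalization and validation rules collected above can be sketched as one edge-side function; the blocklist here is the sample from this post, not the full production list:

```python
import re

RESERVED = {"api", "admin", "login", "help", "about", "static"}
# 3-30 chars enforced separately; pattern forbids edge hyphens.
ALIAS_RE = re.compile(r"[a-z0-9](?:[a-z0-9-]{1,28})?[a-z0-9]")

def normalize_alias(raw: str) -> str:
    alias = raw.strip().lower()        # My-Brand == my-brand == MY-BRAND
    if alias in RESERVED:
        raise ValueError("reserved word")
    if not (3 <= len(alias) <= 30) or not ALIAS_RE.fullmatch(alias):
        raise ValueError("3-30 chars, alphanumeric + hyphens, no edge hyphens")
    return alias                       # this value goes into the LWT INSERT
```

Everything downstream — the LWT, the cache key, the redirect lookup — sees only the normalized form.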
14.2 URL Expiration
- Soft expiration (instant). On every cache hit and Scylla read, the Redirect Service checks expires_at. If past, it returns 410 Gone. The row stays in the database until TTL reaping.
- Hard deletion via TTL. URLs with a finite lifetime use INSERT ... USING TTL <seconds>. TimeWindowCompactionStrategy drops expired rows as whole SSTables age out — no cleanup job.
- Cache entry carries expires_at. The Redirect Service checks it on every read; no proactive eviction needed.
- Deletion propagation window. CDN entries use Cache-Control: max-age=60, so a deleted or expired URL stops redirecting worldwide within ~60 seconds. During that window, the CDN may still serve the cached 302 even though origin now returns 410 Gone — the 60s ceiling is deliberate, trading a brief stale window for the traffic absorption a longer TTL would give. A stronger SLA would require an active CDN purge on delete, which is extra operational surface we don't need here.
15. Security & Abuse
URL shorteners are abused for phishing, malware, and spam, and they handle user data under GDPR. The design-specific work is abuse prevention on create and the GDPR deletion flow. Everything else — TLS, encryption at rest, KMS, audit logging — is standard baseline covered in one paragraph at the end of this section.
15.1 Abuse Prevention on Create
Three clusters of controls:
Upfront validation.
- URLs must be HTTP/HTTPS, max 2048 chars. Custom aliases: 3–30 chars, alphanumeric + hyphens, reserved-word checked, lowercased.
- Destinations are checked against Google Safe Browsing, an internal blocklist, and homograph/suspicious-pattern matchers. Adds ~50ms per create, acceptable because creates aren't latency-sensitive.
- The safety scanner issues up to 3 HEAD requests following any redirect the destination returns, and rejects if the final target is flagged or points back at us.
Rate limiting and authentication.
- 100 creates/minute per API key; 10/minute per IP for anonymous creates (sliding-window counters in Valkey).
- No API key → CAPTCHA required.
Chain prevention.
- Reject destinations pointing at our own shortener domains (direct, CNAME, or IP literal).
- Reject destinations pointing at known external shorteners (`bit.ly`, `tinyurl.com`, `t.co`, `goo.gl`, `is.gd`, `ow.ly`, etc.). Chain hops across independent shorteners make safety scanning unreliable and are a classic phishing vector.
For URLs flagged as borderline, the `/{code}+` preview page (§9.3) is served instead of the 302 and requires an explicit click-through.
Retroactive scanning. A background job re-scans existing URLs weekly; if a destination that was clean at creation turns malicious, the URL is disabled and the owner is notified.
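The sliding-window counters from the rate-limiting bullet can be sketched in-process. This is an in-memory stand-in for the Valkey counters, not the production client; `SlidingWindowLimiter` is a hypothetical helper:

```python
import time

class SlidingWindowLimiter:
    """Sliding-window rate limiter: allow at most `limit` hits per key
    in any trailing `window_s` seconds. In production this state lives
    in Valkey (e.g. a sorted set per key), not in process memory."""

    def __init__(self, limit, window_s):
        self.limit = limit
        self.window_s = window_s
        self.hits = {}  # key -> timestamps of accepted hits

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        cutoff = now - self.window_s
        recent = [t for t in self.hits.get(key, []) if t > cutoff]
        if len(recent) >= self.limit:
            self.hits[key] = recent
            return False  # over budget: reject, don't record
        recent.append(now)
        self.hits[key] = recent
        return True

anon_limiter = SlidingWindowLimiter(limit=10, window_s=60)  # 10/min per IP
```

The same structure with `limit=100` covers the per-API-key tier.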
15.2 GDPR and PII Minimization
- Click-event IPs are salted-hashed (HMAC-SHA-256 with a rotating salt) before ClickHouse. Raw IPs are never stored; the hash still supports unique-visitor counting.
- Device/browser/OS bucketed — no user-agent fingerprinting.
- Delete-by-user flow (Articles 15/17):
  1. Query `urls_by_user` for all `short_code`s owned by the user.
  2. `DELETE FROM urls WHERE short_code = ?` (app-level fanout also cleans `urls_by_user`).
  3. `DELETE FROM url_click_counts WHERE short_code = ?`.
  4. `ALTER DELETE` on the matching ClickHouse click events.
  5. Audit the deletion with the request ID.
- Tombstone awareness. Scylla deletes are logical — rows are physically removed after `gc_grace_seconds` (10 days). Disclosed in the GDPR response; still satisfies the law.
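The IP pseudonymization in the first bullet is one keyed-hash call. A sketch, assuming the rotating salt is delivered to the service out of band:

```python
import hashlib
import hmac

def pseudonymize_ip(ip: str, salt: bytes) -> str:
    """HMAC-SHA-256 of the client IP under the current rotating salt.
    Same IP + same salt -> same digest, so unique-visitor counting
    still works; rotating the salt severs linkability across periods.
    The raw IP never reaches ClickHouse."""
    return hmac.new(salt, ip.encode(), hashlib.sha256).hexdigest()
```

Rotation cadence is a privacy/analytics trade-off: a daily salt gives daily uniques but makes week-over-week visitor tracking impossible by construction.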
15.3 Standard Security Baselines
TLS 1.3 for client-facing traffic; mTLS service-to-service via the service mesh; Scylla internode and cross-DC gossip encryption; Transparent Data Encryption on SSTables and commit logs with a KMS-backed master key per DC; ClickHouse disk-level encryption; Kafka broker-side encryption; API keys SHA-256-hashed at rest; service secrets in Vault with automatic 90-day rotation; Scylla and application audit logs shipped to an immutable S3 bucket with 1-year retention. These are industry-standard, not design-specific — applied like any other production system.
One special case: the Feistel shuffle key (Appendix A) can't be rotated in place without changing future short-code outputs. Rotation requires a new epoch bit reserved from the counter prefix so old and new keys coexist until old URLs age out.
16. Failure Scenarios
16.1 Valkey Cluster Node Failure
SCENARIO. One of 6 Valkey shards in a region crashes.
| Time | Event |
|---|---|
| T+0s | Shard 3 crashes. ~25M cached URLs unavailable on that shard. |
| T+0s | Redirect Service gets connection errors; circuit breaker opens; requests for shard 3 keys fall through to Scylla. |
| T+5s | Valkey Cluster promotes replica to primary. |
| T+10s | New primary available; circuit breaker closes; cache cold for shard 3. |
| T+60s | Cache warms organically from redirect traffic. |
Impact. ~16% of regional redirects hit Scylla for ~60s. Regional Scylla reads spike from ~5K to ~10K/sec — comfortably within 6-node capacity. No data loss, no user-visible errors, slightly slower redirects (2–5ms vs <1ms) during warm-up.
Lesson. The whole point of sizing Scylla for 2× the cache-miss load is to absorb cold-cache moments like this silently. If a cache shard crash were enough to page on-call, you sized the DB wrong, not the cache.
16.2 ScyllaDB Node Failure
SCENARIO. One Scylla node fails in a 6-node DC.
| Time | Event |
|---|---|
| T+0s | Node 3 fails. Gossip marks it down within ~2s. |
| T+2s | LOCAL_QUORUM writes require 2-of-2 remaining replicas. Reads at LOCAL_ONE stay fast. |
| T+2s | Coordinator starts writing hints for the down node. |
| T+Xs | Operator replaces node (nodetool removenode or bootstrap replacement). |
| T+X+min | Replacement streams its data share; hint replay catches it up. |
| T+X+hr | nodetool repair -pr guarantees full convergence. |
Impact. Zero write loss (hinted handoff). Zero read impact. Write p99 may bump slightly during the handoff window. No user or on-call action needed during the hours it takes to replace the node.
Lesson. Hinted handoff makes single-node loss invisible to writes. If you're not seeing errors during a node outage, the system is working as designed — don't mistake the silence for a problem that needs investigating.
16.3 Regional DC Outage
SCENARIO. Full Scylla DC outage in us-east-1.
| Time | Event |
|---|---|
| T+0s | us-east-1 Scylla DC unavailable. |
| T+0s | us-east-1 Redirect Service serves Valkey/LRU hits; cache misses fail. |
| T+0s | us-east-1 Create Service fails (can't reach counter row). |
| T+5s | Health checks detect unhealthy DC. |
| T+30s | Route 53 DNS failover removes us-east-1 from the redirect record set. |
| T+60s | Global traffic redistributes to eu-central-1 and ap-northeast-1. |
| T+hours | DC returns; Scylla Manager runs full repair to reconcile divergence. |
Impact. us-east-1 clients see 30–60s of cache-miss errors during DNS failover. Creates route to the nearest healthy region. No data loss — other DCs have full copies via NetworkTopologyStrategy. Recovery cost is IO-heavy repair, scheduled off-peak.
Lesson. DNS is the slowest part of failover — the 30–60s floor comes from TTL propagation, not Scylla recovery. If you need faster DC failover, invest in client-side endpoint discovery (e.g., service mesh with active health checks), not in making the DB recover any faster.
16.4 Kafka Unavailable
SCENARIO. Kafka unavailable for 5 minutes.
| Time | Event |
|---|---|
| T+0s | Brokers unreachable. Async producer buffers to 32 MB. |
| T+0s | Redirects continue — Kafka is off the critical path. |
| T+2m | Producer buffer fills; new events drop silently. |
| T+5m | Kafka recovers; producer flushes buffered events. |
Impact. Zero impact on redirects. Analytics has a 5-minute gap; some events during buffer-overflow are permanently lost. Counter-column lag from ClickHouse reconciliation catches up within minutes.
Lesson. A best-effort async producer is a feature, not a bug. The moment you flip Kafka to acks=all to "not lose events," you've handed Kafka a veto over redirect availability — and redirect availability is the SLO that matters. Durability choices follow from what the downstream consumer actually needs, not from a vague desire for correctness.
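In Java-client config terms, that best-effort choice looks roughly like the following. Values are illustrative (only the 32 MB buffer comes from the timeline above); the key names are the Kafka Java producer's:

```python
# Sketch of the producer settings implied by the "best-effort" stance.
producer_config = {
    "acks": "1",                        # leader-only ack: no ISR veto over redirects
    "buffer.memory": 32 * 1024 * 1024,  # 32 MB in-process buffer from the timeline
    "max.block.ms": 0,                  # never block the redirect thread; fail/drop instead
    "linger.ms": 5,                     # small batching, still off the critical path
}
```

The inverse configuration (`acks=all`, a large `max.block.ms`) is exactly the "Kafka veto" the lesson warns against.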
16.5 Cross-DC Network Partition
SCENARIO. 10-minute partition between us-east-1 and the other two DCs.
| Time | Event |
|---|---|
| T+0m | Cross-DC links drop. us-east-1 not partitioned from its clients. |
| T+0m | All DCs continue serving LOCAL_QUORUM writes and LOCAL_ONE reads independently. |
| T+0–10m | Each DC accumulates hints for the unreachable peers (3h default window). |
| T+10m | Network restores; gossip reconnects; hint replay pushes accumulated writes. |
| T+15–30m | Read repair and background anti-entropy clean up anything that didn't fit in hints. |
Impact. No data loss — partition shorter than the hint window. Clients in each region saw their own writes immediately; other regions converged after recovery. This is the point of multi-region active-active: a full cross-DC partition doesn't degrade service for anyone.
Lesson. The 3-hour hint window is the real SLA on partition tolerance. A partition longer than that starts dropping hints and requires full anti-entropy repair to recover — so "multi-region active-active" isn't a free lunch, it's a bet that network partitions stay short. Monitor scylla_hints_pending and page before it approaches the hint-log capacity, not after.
17. Operational Playbook
A design that doesn't document its ops story isn't production-grade.
17.1 Deployment
- Services are stateless Kubernetes containers. Rolling update with `maxUnavailable: 0`, `maxSurge: 25%`.
- Canary uses LB weight: 5% for 10 min → 25% for 10 min → 100%. Auto-rollback if cache-hit p99 climbs >2ms, cache-miss p99 >5ms, or error rate >0.1%. Symmetric rollback takes ~5 minutes.
- Flink job blue-green: new job joins the same consumer group, old job stops after the new one catches up.
17.2 Key Metrics and Alerts
Only design-specific alerts live in the main dashboard. Standard infrastructure metrics (error rate, throughput, pod CPU) live in the platform dashboard and aren't duplicated here. The three numbers that matter most are below; the full alert set is collapsed.
| Metric | Alert Threshold |
|---|---|
| `cache_hit_rate` | < 60% for 10 minutes |
| `redirect_latency_cache_hit_p99` | > 10ms for 5 minutes |
| `counter_range_remaining` | < 10K remaining IDs |
All 8 design-specific alerts:
| Metric | Alert Threshold |
|---|---|
| `cache_hit_rate` | < 60% for 10 minutes |
| `redirect_latency_cache_hit_p99` | > 10ms for 5 minutes |
| `redirect_latency_cache_miss_p99` | > 20ms for 5 minutes |
| `counter_range_remaining` | < 10K remaining IDs |
| `scylla_load_avg_per_shard` | > 0.7 for 15 minutes |
| `scylla_hints_pending` | > 100K for 10 minutes |
| `scylla_cross_dc_latency_p99` | > 2 seconds |
| `kafka_consumer_lag` | > 1M events for 5 minutes |
17.3 Repair and Compaction
Scylla has three anti-entropy layers: read repair runs constantly on every divergent read (free), hinted handoff captures brief unavailability and replays on recovery, and scheduled repair is the backstop that catches anything the other two missed and reaps tombstones before gc_grace_seconds expires.
Schedule. Scylla Manager runs incremental repair daily and full repair weekly during off-peak hours. Full repair on the 18-node cluster with ~30 TB replicated takes 4–8 hours of background IO, capped at 30% of disk budget so foreground reads stay fast. Repairs must complete within gc_grace_seconds (10 days) or tombstones start resurrecting deleted data.
Compaction per table:
- `urls` — `TimeWindowCompactionStrategy`, 30-day windows. Immutable rows; whole SSTables drop at once as TTLs age out.
- `custom_aliases`, `counter_ranges` — `SizeTieredCompactionStrategy`. Small tables, low write rate; compaction is irrelevant.
- `url_click_counts` — STCS with `min_threshold=2` so counter updates consolidate quickly.
Throttle knobs: `compaction_throughput_mb_per_sec = 64` baseline, dropped to 16 during traffic peaks via `nodetool setcompactionthroughput`. `concurrent_compactors = 4` per node, one per NVMe drive.
Write amplification. With TWCS and 30-day windows, Scylla write amplification stays in the normal 3–5× range. Monitor scylla_commitlog_writes against client writes; a large delta signals heavy merges that may need a throttle adjustment.
Compaction storm (worked example). If multiple nodes run major compactions simultaneously and SSD IO saturates, coordinator read p99 climbs from <1ms to 5–8ms (still under NFR-01b 15ms). Mitigation: nodetool setcompactionthroughput 16 to throttle harder, then rebalance the schedule so future compactions stagger across nodes.
17.4 Backup and Recovery
- Daily snapshots via Scylla Manager to S3 (hard-linked at the SSTable level, cheap).
- Incremental SSTable backups every 4 hours between snapshots.
- Cross-region S3 replication on backup artifacts.
- Quarterly restore drill on a random node — no restore is real until you've done it.
Recovery Objectives by Failure Type:
| Failure | RPO | RTO | Data Loss |
|---|---|---|---|
| Valkey node | N/A | <1 min | No (replicated) |
| Scylla node | 0 | 2–6 h | No (hinted handoff + LOCAL_QUORUM) |
| Single DC | 4 h | 2–6 h | No (replicated to other DCs) |
| Kafka outage | 5 min | <5 min | Analytics only (acceptable) |
| Full cluster rebuild from backup | 4 h | 6 h | No |
That covers the running-it story.
§17.1–§17.4 cover what every engineer on the project should know. What follows (capacity planning, schema migrations, top 5 alerts) is on-call and senior-ops territory — skim unless you're paged.
17.5 Capacity Planning
When you need this: sizing the cluster for next quarter, or deciding whether a node needs adding.
Three leading indicators:
- Disk per node. Warn 60%, scale 70%, page 80%. Scylla rebuild takes time — earlier warnings = calmer scaling.
- `scylla_load_avg_per_shard`. Target <0.5 average, <0.7 peak. Sustained >0.7 signals a hot partition or that the DC needs more nodes.
- Cross-DC replication lag. Target p99 <1s. Sustained lag means cross-region links are saturated.
Adding a node takes 2–6 hours to bootstrap (data share streams from replicas). Adding a DC is a days-long operation: provision, bootstrap, `nodetool rebuild`, update `NetworkTopologyStrategy`.
17.6 Schema Migrations
When you need this: shipping a schema change to the urls or custom_aliases table.
CQL DDL is online, but the reality at 10B rows is subtler. `ALTER TABLE ... ADD` is metadata-only and cheap. `ALTER TABLE ... DROP` rewrites SSTables and stalls reads on hot nodes — schedule it off-peak. Secondary indexes are avoided — we use query-pattern tables instead. All schema changes live in `schema/migrations/` with a `schema_versions` table tracking applied IDs. Forward-only: a mistake requires a new compensating migration, never a rollback.
17.7 Top 5 Alerts and Mitigations
When you need this: 3 AM page.
Every on-call engineer should know these cold:
- Hot partition (`scylla_load_avg_per_shard > 0.7`) — identify the hot `short_code` via query tracing; if legit (viral link), rely on LRU + CDN jitter; if bot, rate-limit the source.
- Compaction starvation (`scylla_compactions_pending > 10`) — run `nodetool compactionstats`; temporarily bump throughput if traffic is off-peak; investigate strategy mismatch.
- Hint pile-up (`scylla_hints_pending > 100K`) — a replica is dropping writes; check `nodetool status`; if flapping, check NICs and disks.
- Cross-DC lag (`scylla_cross_dc_latency_p99 > 2s`) — cross-region link saturated or remote DC overloaded; check network graphs and recent deploys.
- Disk space warning (node >70%) — start capacity planning; short-term, check runaway tombstones with `nodetool cfstats`, force compaction on the worst offender, or clean old snapshots.
18. SLOs and Error Budgets
SLOs make the quality target concrete. Error budgets turn "be more careful" into "freeze deploys this week."
| SLI | SLO | Monthly Error Budget |
|---|---|---|
| Redirect cache-hit p99 ≤ 5ms (server-side) | 99.95% | 21.6 min |
| Redirect availability (non-5xx) | 99.99% | 4.3 min |
| URL durability (non-expired) | 100% over rolling year | Budget-less — any loss is a major postmortem |
3 more SLOs (cache-miss latency, create latency, analytics freshness):
| SLI | SLO | Monthly Error Budget |
|---|---|---|
| Redirect cache-miss p99 ≤ 15ms (server-side) | 99.5% | 3.6 h |
| Create latency p99 ≤ 50ms (server-side) | 99% | 7.2 h |
| Analytics freshness ≤ 5s | 99% | 7.2 h |
Error budget policy. Normal burn (≤1× over 30d): business as usual. Fast burn (>2× over 7d): freeze non-critical deploys, director sign-off for launches. Very fast burn (>4× over 1d): page on-call, freeze all non-rollback deploys, launch IR. Exhausted: next sprint goes to reliability.
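The budget column in the SLO tables is pure arithmetic; a one-liner reproduces it, assuming a 30-day month:

```python
def monthly_error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed minutes of SLO violation per month: (1 - SLO) * month length."""
    return (1 - slo) * days * 24 * 60

# 99.95% -> 21.6 min, 99.99% -> ~4.3 min, 99.5% -> 216 min (3.6 h), 99% -> 432 min (7.2 h)
```

Burn rate is then observed violation minutes divided by this number over the same window, which is what the 1×/2×/4× policy thresholds refer to.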
Alert tiering.
- Page-now. Availability burn, error-rate spike, Scylla `UN→DN`, hint pile-up, `counter_range_remaining < 10K`. 5-minute response.
- Page-business-hours. Latency budget burn, cache hit rate dropping, compaction starvation, cross-DC lag.
- Ticket-only. Capacity warnings, disk >60%, pending migrations, individual repair failures.
19. Appendix
A. Bijective Shuffle (Feistel)
The shuffle is a 2-round Feistel cipher on the 42-bit ID. Feistel is bijective by construction: every input maps to a unique output, and the mapping is reversible with the key. Two rounds on a 42-bit block provide enough diffusion to make sequential counter values produce non-sequential codes, without cryptographic-strength requirements (we're avoiding enumeration, not defending against a state actor).
Pseudocode:
```text
shuffle(x, key):
    L = x >> 21                  # high 21 bits
    R = x & ((1 << 21) - 1)      # low 21 bits
    for i in [0, 1]:
        F = hash(R, key[i]) & ((1 << 21) - 1)
        L, R = R, L XOR F
    return (L << 21) | R
```

`hash()` is a fast non-cryptographic mixer (e.g., xxHash) truncated to 21 bits. The inverse runs the same structure in reverse, so any operator with the key can decode a short code back to its counter value (and therefore its origin region) for admin tooling.
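A runnable version of the pseudocode, with the inverse. The multiply-xor mixer is an arbitrary stand-in for xxHash, and the round keys are illustrative; any deterministic 21-bit round function gives a bijection:

```python
MASK21 = (1 << 21) - 1

def _mix(r: int, k: int) -> int:
    # Non-cryptographic 21-bit mixer standing in for the xxHash round function.
    x = ((r ^ k) * 0x9E3779B1) & 0xFFFFFFFF
    x ^= x >> 15
    return x & MASK21

def shuffle(x: int, keys) -> int:
    """2-round Feistel over a 42-bit ID: (L, R) -> (R, L ^ F(R, k))."""
    L, R = x >> 21, x & MASK21
    for k in keys:
        L, R = R, L ^ _mix(R, k)
    return (L << 21) | R

def unshuffle(y: int, keys) -> int:
    """Inverse: run the rounds backwards with the keys reversed."""
    L, R = y >> 21, y & MASK21
    for k in reversed(keys):
        L, R = R ^ _mix(L, k), L
    return (L << 21) | R
```

Note the round function itself never needs to be invertible — only the Feistel structure does, which is why a lossy hash works.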
B. Request Coalescing (Cache Miss)
The first request that misses both local LRU and Valkey acquires a short-lived Valkey lock, fetches from Scylla, and writes back to both caches. Concurrent requests wait briefly and retry against Valkey instead of stampeding Scylla.
Python implementation:
```python
import time

def get_url(short_code):
    # Layer 1: in-process LRU.
    if val := local_cache.get(short_code):
        return val
    # Layer 2: regional Valkey.
    if val := valkey.get(f"url:{short_code}"):
        local_cache.set(short_code, val, ttl=10)
        return val
    # Layer 3: Scylla, behind a short-lived lock so only one request
    # per key hits the database on a cold miss.
    lock_key = f"lock:url:{short_code}"
    if valkey.set(lock_key, "1", nx=True, ex=5):
        try:
            row = scylla_session.execute(
                SELECT_URL_PS, [short_code],
                consistency_level=ConsistencyLevel.LOCAL_ONE,
            ).one()
            if row:
                valkey.set(f"url:{short_code}", row.long_url, ex=86400)
                local_cache.set(short_code, row.long_url, ttl=10)
            return row.long_url if row else None
        finally:
            valkey.delete(lock_key)  # release even if the Scylla read fails
    else:
        # Another request holds the lock: wait briefly, then re-check Valkey.
        time.sleep(0.01)
        return valkey.get(f"url:{short_code}")
```

C. Cross-DC LWT Alternative for Custom Aliases
Instead of leader-region routing, Scylla LWT can run at SERIAL (not LOCAL_SERIAL), forcing Paxos rounds across DCs. Per-write cost is ~100–150ms vs ~5–10ms local, but there's no single-region dependency and custom-alias throughput isn't bottlenecked by the leader region. Pick this when either (a) the extra WAN latency is unacceptable for custom-alias creates or (b) eliminating the single-region dependency matters more than the latency cost.
If you only remember six things
- ID. 42-bit region-prefixed counter → bijective shuffle → Base62 (7 chars). No cross-region coordination, 440× headroom.
- Storage. ScyllaDB RF=3 per DC, active-active across Virginia, Frankfurt, Tokyo. `LOCAL_QUORUM` writes, `LOCAL_ONE` reads.
- Cache. CDN → local LRU → Valkey → Scylla. Each layer absorbs what the previous missed; sub-5ms p99 server-side on hit.
- Analytics. Kafka (best-effort async) → Flink (5s batches) → ClickHouse. Off the critical path.
- Biggest risk. Zipf cache assumption. Monitor hit rate; page on sustained drops below 60%.
- Escape hatch. At 90% counter utilization, rotate to 8-char codes. Old codes stay valid forever.
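The Base62 step from the first bullet, fixed-width so every code is exactly 7 characters. The alphabet order is an assumption (any fixed order works, as long as encode and decode agree), and the counter-range allocator is what guarantees IDs stay below 62⁷:

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def base62_encode(n: int, width: int = 7) -> str:
    """Fixed-width Base62 for a shuffled ID; assumes n < 62**width."""
    assert 0 <= n < 62 ** width
    chars = []
    for _ in range(width):
        n, r = divmod(n, 62)
        chars.append(ALPHABET[r])
    return "".join(reversed(chars))

def base62_decode(code: str) -> int:
    n = 0
    for c in code:
        n = n * 62 + ALPHABET.index(c)
    return n
```

Fixed width (leading "0" padding) keeps every code the same length, which simplifies routing and the eventual 7-to-8-character rotation.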
Explore the Technologies
Dive deeper into the technologies and infrastructure patterns used in this design:
Core Technologies
| Technology | Role in This Design | Learn More |
|---|---|---|
| ScyllaDB | Primary URL storage, Cassandra-compatible shard-per-core, active-active DC replication, native TTL | ScyllaDB |
| Valkey | Regional redirect cache (150M hot URLs), rate limit counters | Redis/Valkey |
| ClickHouse | Click analytics storage, real-time materialized views | ClickHouse |
| Kafka | Async click event pipeline, decouples redirect from analytics | Kafka |
| Flink | Click event enrichment and batched ClickHouse inserts | Apache Flink |
Infrastructure Patterns
| Pattern | Relevance to This Design | Learn More |
|---|---|---|
| Caching Strategies | Three-layer caching (CDN, Valkey, Scylla) for sub-5ms redirects | Caching Strategies |
| Rate Limiting and Throttling | Per-API-key and per-IP rate limits using Valkey sliding windows | Rate Limiting |
| Message Queues and Event Streaming | Kafka decouples analytics from the redirect hot path | Event Streaming |
| CDN and Edge Computing | Edge caching for viral URL redirects | CDN and Edge Computing |
| Multi-Region Active-Active | Region-prefixed counter IDs + Scylla NetworkTopologyStrategy for worldwide writes | Multi-Region |
Practice this design: Try the URL Shortener interview question to test your understanding with hints and structured guidance.