System Design: URL Shortener (10B Short URLs, 100K Redirects/sec)
Goal. Build a URL shortener for 10 billion stored URLs, 100K redirect lookups/sec, and 1K creates/sec. Custom aliases, expiration, click analytics, and cache-hit redirect latency under 5ms p99 server-side.
TL;DR. This is a read-heavy key-value lookup with an async analytics sidecar. Each region mints IDs from its own counter with zero cross-region coordination, shuffled and Base62-encoded into 7-character codes (62⁷ ≈ 2⁴² IDs). ScyllaDB stores the mappings, replicated active-active across three regions. The read path is CDN → local LRU → Valkey → Scylla, with each layer absorbing what the previous one missed. Kafka + Flink + ClickHouse run analytics asynchronously, off the critical path.
Pick your path
| Time | Read this | Covers |
|---|---|---|
| 2 min | TL;DR + §1 + §11.2 Zipf callout | The shape of the system and its most fragile assumption |
| 15 min | §1–§13 | Every core design decision, interview-grade |
| 30 min | Full post | Production detail, ops, and appendices |
1. Final Architecture
Three independent paths. The read path is fast (sub-5ms server-side on cache hit), the write path is regional and not latency-sensitive, the analytics path is fully off the critical path.
Write path:
Client → Nearest region → Create Service
→ Regional counter (Scylla LWT) → shuffle → Base62
→ ScyllaDB (NetworkTopologyStrategy, RF=3) → Valkey warm
Read path:
Client → CDN → Nearest region → Redirect Service
→ Local LRU → Valkey → ScyllaDB LOCAL_ONE (on miss) → 302
Analytics path:
Redirect Service → Kafka (async) → Flink (5s batches) → ClickHouse
Every section below zooms into one piece of this picture.
2. Problem Statement
A URL shortener is an afternoon prototype and a months-long production system. Three things make it hard at scale:
- Unique IDs with no coordination. At 1K creates/sec across regions, a single auto-increment row is a bottleneck. Random IDs need a DB read per write. Hash truncation collides at 42 bits (the birthday paradox bites hard around 10B rows).
- Redirect latency under skewed load. 100K/sec of reads, a Zipf-shaped hot set, and a sub-5ms p99 target on cache hit. You can't touch the DB on the hot path.
- Analytics without coupling. Every redirect produces a click event. Writing those synchronously adds 10–50ms to every 302. They have to go through a queue.
Scale numbers.
- 10B URLs stored
- 1K creates/sec (~86M/day)
- 100K redirects/sec (~8.6B/day), distributed across regions
- 100:1 read-to-write ratio
- 200 B average long URL; 7-character short code
3. Functional Requirements
| ID | Requirement | Priority |
|---|---|---|
| FR-01 | Create a short URL from a long URL, returning a unique 7-character code | P0 |
| FR-02 | Redirect short URL to original long URL via HTTP 301/302 | P0 |
| FR-03 | Support custom aliases (user-chosen short codes) | P0 |
| FR-04 | URL expiration: optional TTL (1 day, 7 days, 30 days, 1 year, never) | P0 |
| FR-05 | Click analytics: total clicks, clicks over time, geographic distribution | P1 |
| FR-06 | Referrer and device tracking per click | P1 |
| FR-07 | Bulk URL creation via API (up to 1000 URLs per request) | P1 |
| FR-08 | URL deletion by owner | P1 |
| FR-09 | API key authentication for URL creation | P0 |
| FR-10 | Rate limiting per API key (100 creates/min default) | P0 |
| FR-11 | QR code generation for any short URL | P2 |
| FR-12 | Link preview metadata (title, description, image from target page) | P2 |
4. Non-Functional Requirements
| ID | Requirement | Target |
|---|---|---|
| NFR-01a | Redirect latency, cache-hit path (server-side) | p50 < 2ms / p99 < 5ms |
| NFR-01b | Redirect latency, cache-miss path (server-side) | p50 < 8ms / p99 < 15ms |
| NFR-01c | Redirect latency, end-to-end (client-observed, intra-region) | p50 5–10ms / p99 20–40ms |
| NFR-02 | Create latency (p50 / p99) | < 20ms / < 50ms |
| NFR-03 | Redirect throughput | 100K/sec sustained globally |
| NFR-04 | Create throughput | 1K/sec sustained (10K burst) |
| NFR-05 | Availability | 99.99% (52 min downtime/year) |
| NFR-06 | URL durability | Zero data loss for non-expired URLs |
| NFR-07 | Data retention | Expired URLs purged automatically via Scylla TTL; active URLs stored indefinitely |
| NFR-08 | Analytics freshness | < 5 second lag from click to dashboard |
| NFR-09 | Short code length | 7 characters (Base62 = 3.5 trillion combinations) |
NFR-01a/b are server-side — request arriving at the Redirect Service to response leaving it — and they're what we page on. NFR-01c is end-to-end and dominated by network round-trip, not server work (intra-region ~10–20ms, cross-region 40–150ms). CDN edge caching, not origin speed, is what keeps user-perceived time low worldwide.
5. Design Assumptions
A single box of non-negotiables. Every number and decision downstream inherits from here.
- Read-to-write ratio ~100:1. A 10:1 or 1000:1 workload changes cache/DB sizing materially.
- Zipf-shaped URL popularity. The top 1–2% of URLs get most clicks. If traffic is uniform, the cache collapses — see §11.2 for the failure math.
- 3 regions active-active. Virginia, Frankfurt, Tokyo. No China deployment (GFW/ICP out of scope).
- No file uploads, no OAuth, no SEO-optimized variant. Different storage and auth story.
- GDPR in scope; HIPAA/SOX out of scope.
- Numbers are design targets, not benchmark results.
6. High-Level Architecture
This is a read-heavy key-value lookup with an async analytics sidecar. Every choice downstream reinforces that shape.
6.1 Layers
One stack shown; an identical stack runs in every region. Scylla replicates across DCs via NetworkTopologyStrategy; Valkey and the app tier are regional. §13 covers the multi-region topology.
- Edge. CDN caches redirects with a 60s TTL so viral links don't reach origin. The L7 LB splits /api/create from /:shortCode.
- Application. Three stateless services. Redirect handles the hot path (local LRU → Valkey → Scylla). Create allocates IDs and writes to Scylla. Analytics queries ClickHouse for dashboards.
- Cache. Valkey holds ~150M hot URL mappings per region, cache-aside with a 24h TTL.
- Storage. Scylla is the primary store, RF=3 per DC. ClickHouse holds click events, partitioned by day.
- Async. Every redirect fires a click event into Kafka. Flink batches them and inserts into ClickHouse, keeping analytics off the hot path.
6.2 Store Selection
| Store | Technology | Role |
|---|---|---|
| Primary store | ScyllaDB (Cassandra API) | URL mappings, custom aliases, API keys |
| Cache | Valkey 8 Cluster | Hot URL lookups (~150M per region) |
| Analytics store | ClickHouse | Click event aggregations |
| Event bus | Kafka | Click events, async processing |
| Counter service | ScyllaDB counter_ranges table | LWT range allocation per region |
| CDN | CloudFront / Cloudflare | Edge redirect caching for viral URLs |
6.3 Why ScyllaDB
Three properties carry the choice for this specific workload:
- Shard-per-core (Seastar/C++). Each CPU core owns its partition range with no global locks — p99 SELECT by partition key stays sub-millisecond at our read rate.
- Cassandra-compatible multi-region. NetworkTopologyStrategy with RF=3 per DC gives active-active replication out of the box, with battle-tested hinted handoff, read repair, and anti-entropy repair.
- Storage-engine primitives we actually need. Native per-insert TTL + TimeWindowCompactionStrategy handle URL expiration with no cleanup job; Paxos-backed LWT handles custom-alias uniqueness and counter range allocation on the one or two paths that need it.
Cost. ~$3K/mo self-managed on 3-year RI (18 × i4i.2xlarge across 3 regions) + ~$1.5K cross-region transfer; ~$8–10K/mo on Scylla Cloud. Operational tax is real: repair schedules, compaction tuning, consistency-level discipline, tombstone awareness (§19). Single-region RDS Postgres at ~$500–1K/mo is a legitimate alternative if you don't need active-active multi-region and your team already owns Postgres — we don't pick it here because the 4 TB partitioned-PG write rate plus active-active makes Scylla the cleaner fit.
7. Back-of-the-Envelope
7.1 Throughput
Writes: 1K creates/sec globally → ~333/sec per region
Reads: 100K redirects/sec globally
CDN absorbs ~25% (viral URLs): 25K/sec at edge
Valkey absorbs ~80% of origin: 60K/sec from cache
ScyllaDB: ~15K reads/sec + 1K writes/sec globally
Per region (of 3): ~5K Scylla reads/sec
7.2 Storage
Bottom line: ~350 B per row × 10B rows × RF=3 per DC × 3 DCs = ~31.5 TB total replicated, landing at ~2.6 TB steady state per node (~70% of raw NVMe) after compaction overhead. The 70% ceiling is deliberate — it leaves headroom so recompaction doesn't hit 100% and stall reads.
Show full derivation
Per URL row in Scylla:
short_code 7 B + long_url 200 B + created_at 8 B + expires_at 8 B
+ user_id 16 B + flags 1 B + region 1 B + metadata ~50 B
+ Scylla row overhead (column names, write ts, TTL metadata) ~60 B
Effective: ~350 B per row
Primary table: 10B × 350 B ≈ 3.5 TB per DC before replication
RF=3 per DC: 3.5 TB × 3 ≈ 10.5 TB per DC on disk
Across 3 DCs: ≈ 31.5 TB total replicated
Per node (6 × i4i.2xlarge, 2×1.875 TB NVMe = 3.75 TB):
10.5 TB ÷ 6 ≈ 1.75 TB per node (~47% raw NVMe)
+ compaction overhead ~1.5× ≈ 2.6 TB steady state (~70%)
Compaction overhead comes from merging SSTables — during a compaction, both the old and new SSTables coexist briefly.
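The derivation above is simple enough to sanity-check in a few lines. All inputs below are the design targets from this section, not measured values:

```python
# Back-of-the-envelope storage check for §7.2.
ROW_BYTES = 350          # effective bytes per URL row, overhead included
ROWS = 10_000_000_000    # 10B URLs
RF = 3                   # replication factor per DC
DCS = 3                  # regions
NODES_PER_DC = 6         # i4i.2xlarge
NVME_PER_NODE_TB = 3.75  # 2 x 1.875 TB NVMe
COMPACTION_OVERHEAD = 1.5

per_dc_tb = ROW_BYTES * ROWS / 1e12           # before replication
on_disk_per_dc_tb = per_dc_tb * RF            # RF=3 within one DC
total_tb = on_disk_per_dc_tb * DCS            # all three DCs
per_node_tb = on_disk_per_dc_tb / NODES_PER_DC
steady_tb = per_node_tb * COMPACTION_OVERHEAD # old + new SSTables coexist
utilization = steady_tb / NVME_PER_NODE_TB

print(f"{per_dc_tb:.1f} TB/DC raw, {total_tb:.1f} TB replicated, "
      f"{steady_tb:.2f} TB/node steady state ({utilization:.0%} of NVMe)")
# 3.5 TB/DC raw, 31.5 TB replicated, 2.62 TB/node steady state (70% of NVMe)
```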
7.3 Cache Sizing
Top 150M URLs (1.5%) cover ~80% of origin traffic (Zipf).
150M × 250 B raw = 37.5 GB
Valkey overhead (hash table, pointers, replication) ~1.6×
= ~60 GB per region → 6 shards × 10 GB
7.4 Analytics
100K events/sec × 300 B = 30 MB/sec = 2.6 TB/day raw
ClickHouse compression ~10× → ~260 GB/day
1-year retention (per the ClickHouse TTL) → ~95 TB of raw events; hourly rollups add a small fraction on top
8. Data Model
8.1 ScyllaDB Schema (CQL)
Query-first design — each access pattern gets its own table instead of a secondary index. The primary urls table is inline below; the other five tables (click counts, custom aliases, counter ranges, user reverse-lookup, API keys) are collapsed for readability.
CREATE KEYSPACE url_shortener
WITH replication = {
'class': 'NetworkTopologyStrategy',
'us-east': 3,
'eu-central': 3,
'ap-northeast': 3
} AND durable_writes = true;
-- Primary URL mapping table. Immutable after create; TTL reaps expired rows.
CREATE TABLE url_shortener.urls (
short_code text,
long_url text,
user_id uuid,
created_at timestamp,
expires_at timestamp,
is_custom_alias boolean,
region_id tinyint,
metadata map<text, text>,
PRIMARY KEY (short_code)
) WITH gc_grace_seconds = 864000 -- 10 days, must exceed repair interval
AND compaction = {
'class': 'TimeWindowCompactionStrategy',
'compaction_window_unit': 'DAYS',
'compaction_window_size': 30
};
Show the other 5 tables (click counts, custom aliases, counter ranges, user reverse-lookup, API keys)
-- Counter columns must be isolated from non-counter columns in Cassandra.
CREATE TABLE url_shortener.url_click_counts (
short_code text PRIMARY KEY,
clicks counter
) WITH compaction = {'class': 'SizeTieredCompactionStrategy'};
-- Custom alias uniqueness table. Lowercased alias as partition key.
-- Writes go through the leader region (§13.3) to avoid cross-DC LWT cost.
CREATE TABLE url_shortener.custom_aliases (
alias text PRIMARY KEY,
short_code text,
owner_user_id uuid,
created_at timestamp
);
-- Regional counter rows for ID allocation. Accessed once per ~100K URLs per pod.
CREATE TABLE url_shortener.counter_ranges (
region_id tinyint PRIMARY KEY,
current_value bigint
);
-- Reverse lookup: list a user's URLs, newest first.
CREATE TABLE url_shortener.urls_by_user (
user_id uuid,
created_at timestamp,
short_code text,
PRIMARY KEY ((user_id), created_at, short_code)
) WITH CLUSTERING ORDER BY (created_at DESC);
-- API key metadata. key_hash is SHA-256 of the raw key.
CREATE TABLE url_shortener.api_keys (
key_hash blob PRIMARY KEY,
user_id uuid,
name text,
rate_limit int,
created_at timestamp,
revoked_at timestamp
);
Key schema decisions:
- urls_by_user is a separate table, not a secondary index — Cassandra secondary indexes don't scale for high-cardinality columns.
- TimeWindowCompactionStrategy on urls because rows are immutable. Whole SSTables drop when their time window ages out, so TTL reaping doesn't cost extra compaction work.
- gc_grace_seconds = 10 days so scheduled repair can propagate tombstones before physical deletion. Must exceed the repair interval.
- TTL is per-insert, not default_time_to_live, because most URLs don't expire.
8.2 Valkey Key Patterns
url:{short_code} → {long_url}|{expires_at} (24h TTL)
ratelimit:{api_key}:{minute} → count (sliding window, 60s TTL)
counter:{pod_id}:current → current counter value
counter:{pod_id}:max → range end
8.3 ClickHouse Schema
Partitioned daily, MergeTree-ordered by (short_code, clicked_at), 1-year TTL. A SummingMergeTree materialized view rolls up hourly counts and unique-country counts per short code.
Show full DDL
CREATE TABLE click_events (
event_id UUID DEFAULT generateUUIDv4(),
short_code String,
clicked_at DateTime64(3),
referrer String,
user_agent String,
ip_country LowCardinality(String),
ip_city String,
device_type LowCardinality(String),
browser LowCardinality(String),
os LowCardinality(String)
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(clicked_at)
ORDER BY (short_code, clicked_at)
TTL clicked_at + INTERVAL 1 YEAR;
CREATE MATERIALIZED VIEW click_counts_mv
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMMDD(period_start)
ORDER BY (short_code, period_start)
AS SELECT
short_code,
toStartOfHour(clicked_at) AS period_start,
count() AS clicks,
uniqExact(ip_country) AS unique_countries
FROM click_events
GROUP BY short_code, period_start;
9. API Design
9.1 Create Short URL
POST /api/v1/urls
Authorization: Bearer {api_key}
Content-Type: application/json
X-Idempotency-Key: {uuid}
Request:
{
"url": "https://example.com/very/long/path/to/some/resource?param=value",
"custom_alias": null,
"expires_in": "30d",
"metadata": {"campaign": "spring_sale", "source": "email"}
}
Response 201 Created:
{
"short_code": "Ab3xK9f",
"short_url": "https://sho.rt/Ab3xK9f",
"preview_url": "https://sho.rt/Ab3xK9f+",
"long_url": "https://example.com/very/long/path/to/some/resource?param=value",
"created_at": "2026-03-25T10:30:00Z",
"expires_at": "2026-04-24T10:30:00Z",
"qr_code_url": "https://sho.rt/api/v1/qr/Ab3xK9f"
}
9.2 Redirect
GET /{short_code}
HTTP/1.1 302 Found
Location: https://example.com/very/long/path/...
Cache-Control: private, max-age=60
X-Short-Code: Ab3xK9f
We default to 302, not 301. A 301 tells browsers to cache the redirect forever, breaking future edits/expirations and killing click tracking (the browser never hits our server again). 301 is available as an opt-in per URL for users who want SEO link juice.
9.3 Preview Mode
GET /{short_code}+
A + suffix returns an HTML preview (destination, owner, created date, click count) without redirecting. Same convention as Bitly. For URLs flagged as borderline by the safety scanner, the preview page is served instead of the 302 and requires a "Continue to destination" click-through.
9.4 Other Endpoints
GET /api/v1/urls/{short_code}/analytics?period=7d&granularity=hour
DELETE /api/v1/urls/{short_code} -- owner only
GET /api/v1/urls?user_id={id}&page=1 -- paginated list
POST /api/v1/urls/bulk -- up to 1000
GET /api/v1/qr/{short_code} -- SVG QR code
PATCH /api/v1/urls/{short_code} -- update expiration or metadata
10. ID Generation
The create path needs a unique, non-guessable 7-character short code with zero per-write coordination. Random + collision-check needs a DB read per write. Hash truncation collides at 42 bits (birthday paradox, nearly certain at 10B). A region-prefixed counter with a bijective shuffle is collision-free by construction and mints new codes from local memory. That's the choice.
10.1 Why 42 Bits
We target 42 bits of ID space because 62⁷ ≈ 3.5 × 10¹² ≈ 2⁴¹·⁷ — roughly the full capacity of a 7-character Base62 ([0-9a-zA-Z]) code, and we want every bit usable. (Strictly, 62⁷ is a little under 2⁴², so the shuffle cycle-walks any output that lands at or above 62⁷ back through itself, keeping every code at exactly 7 characters.) The 42 bits split as:
┌────────────┬────────────────────────────────────┐
│ 4 bits │ 38 bits │
│ region_id │ per-region counter │
└────────────┴────────────────────────────────────┘
42 bits → shuffle → Base62 → 7 chars
- 4-bit region prefix supports 16 regions — 3 today, 13 slots of headroom.
- 38-bit counter gives ~275B IDs per region. At 10B URLs across 16 regions we'd use ~625M per region: 440× headroom.
- Regional counters are disjoint by construction (different top bits), so a bijective shuffle on the full 42 bits always produces disjoint outputs. No two regions can ever mint the same short code even without talking to each other.
10.2 Create Flow in One Region
- Range allocation (once per ~100K URLs per pod, ~every 100 seconds at 1K creates/sec). The Create Service pod runs a Scylla LWT compare-and-set against its regional counter_ranges row:

  SELECT current_value FROM counter_ranges WHERE region_id = ?;   -- LOCAL_SERIAL

  UPDATE counter_ranges SET current_value = ?    -- old + 100000
  WHERE region_id = ? IF current_value = ?;      -- LOCAL_SERIAL CAS

  On success, the pod owns [old+1, old+100000]. On conflict it retries with the fresh value. Contention is rare because the row is touched ~once every 100s per pod.
- Local increment. Each create bumps the local counter and constructs the 42-bit value: (region_id << 38) | counter. No DB call.
- Bijective shuffle. The value is passed through a keyed bijection so sequential counters don't produce sequential codes (which would enable enumeration attacks). The shuffle is reversible with the server's key, so operators can decode a short code back to its counter and origin region. Pseudocode in Appendix A.
- Base62 encode. The shuffled value maps to 7 characters from [0-9a-zA-Z]. Base64 includes URL-unsafe + and /; hex would need 11 characters.
- Persist. Write the row to Scylla at LOCAL_QUORUM; warm the local Valkey entry with a 24h TTL.
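The create steps above can be sketched end-to-end in a few dozen lines. The Feistel round function, key, and cycle-walking details here are illustrative stand-ins for the real shuffle in Appendix A; the bit layout and Base62 alphabet follow §10.1:

```python
import hashlib

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
CODE_SPACE = 62 ** 7          # ~3.5e12 seven-character codes
KEY = b"demo-key"             # illustrative; the real key lives in a KMS

def _round(half: int, rnd: int) -> int:
    # Toy Feistel round function: keyed hash truncated to 21 bits.
    digest = hashlib.sha256(KEY + bytes([rnd]) + half.to_bytes(3, "big")).digest()
    return int.from_bytes(digest[:3], "big") & 0x1FFFFF

def shuffle(value: int) -> int:
    # 4-round balanced Feistel on 42 bits (two 21-bit halves), then
    # cycle-walk until the output falls inside the 7-character space.
    while True:
        left, right = value >> 21, value & 0x1FFFFF
        for rnd in range(4):
            left, right = right, left ^ _round(right, rnd)
        value = (left << 21) | right
        if value < CODE_SPACE:
            return value

def mint(region_id: int, counter: int) -> str:
    value = shuffle((region_id << 38) | counter)   # steps 2 + 3
    chars = []
    for _ in range(7):                             # step 4: fixed-width Base62
        value, rem = divmod(value, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))
```

Because the shuffle is a keyed bijection, distinct (region, counter) pairs can never collide, and the key holder can invert a code back to its counter and origin region.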
10.3 Counter Durability
The counter_ranges row is RF=3 in the local DC — a single node loss still achieves LOCAL_SERIAL quorum. Each pod pre-fetches two ranges on startup (one active, one spare). When the active range is exhausted, the spare takes over and a new spare is fetched in the background. A pod keeps creating for ~200K URLs (~3 minutes at 1K/sec) even if Scylla is briefly unreachable. A full regional Scylla outage only blocks creates in that region — other regions keep minting codes from their own counters.
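The active-plus-spare range scheme can be sketched with an in-process stand-in for the LWT — a lock-guarded dict plays the counter_ranges row, and names are illustrative:

```python
import threading

_table = {"current_value": 0}   # stand-in for the regional counter_ranges row
_table_lock = threading.Lock()

def cas_allocate(size: int = 100_000) -> range:
    # LWT analog: read, then conditional update. The lock serializes here;
    # the real Scylla CAS retries with the fresh value on conflict.
    with _table_lock:
        old = _table["current_value"]
        _table["current_value"] = old + size
    return range(old + 1, old + size + 1)

class RangeManager:
    """Per-pod ID source: one active range, one pre-fetched spare."""

    def __init__(self) -> None:
        self._active = iter(cas_allocate())
        self._spare = cas_allocate()        # fetched on startup

    def next_id(self) -> int:
        try:
            return next(self._active)
        except StopIteration:
            # Active range exhausted: promote the spare, fetch a new one.
            self._active = iter(self._spare)
            self._spare = cas_allocate()
            return next(self._active)
```

A production version would fetch the replacement spare asynchronously so exhaustion never blocks a create.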
10.4 Lifecycle and Exhaustion Plan
At 275B IDs/region and ~625M needed per region, exhaustion is a decade-plus problem. The escape hatch triggers at ~90% utilization in any region:
- Rotate to 8-character codes. Extend the counter to 44 bits (48 total with the 4-bit region prefix), shuffle across the larger space (cycle-walking as before), and Base62-encode to 8 characters. 62⁸ ≈ 2.2 × 10¹⁴ — roughly 60× the 7-character code space.
- Old 7-character codes remain valid forever. The Redirect Service handles both lengths; no retroactive migration.
- Mixed-length minting during the rollover window — new URLs get 8-char codes while the 7-char space drains.
The same mechanism covers adding regions: reserve another prefix bit (5 bits → 32 regions), provision the new counter_ranges row, deploy. The ID space is append-only — no data migration.
Check yourself
Why does the design rotate to 8-char codes at ~90% counter utilization, not at 99% or 100%?
Answer
Two reasons. First, the last 10% of counter range buys years of runway at the current rate, so there's no urgency cost to rotating early. Second, running a rollover at 99%+ means any unexpected traffic spike (a batch importer, a sudden viral week) can exhaust the range before the rollover pipeline reaches steady state. 90% is the sweet spot: comfortable headroom, still a decade off, and the rollover can be rehearsed without emergency pressure.
11. Caching
The redirect path must serve 100K req/sec at sub-5ms server-side p99 on cache hit. A three-layer cache makes this possible.
11.1 Three Layers
- CDN edge (~25% hit). Popular URLs cached at edge with Cache-Control: private, max-age=60. Short enough that deletions propagate quickly; long enough to absorb viral spikes.
- Valkey (~80% of what reaches origin). Redirect Service checks the per-pod LRU, then Valkey. Cache-aside with a 24h TTL. ~150M hot URLs per region.
- ScyllaDB. Whatever gets through reads at LOCAL_ONE (rows are immutable, stale reads don't matter). Shard-per-core keeps the long tail sub-millisecond.
Effective origin load:
100K req/sec total
- CDN 25% → 25K/sec at edge
- Valkey 80% of remainder → 60K/sec from cache
- ScyllaDB → ~15K SELECT/sec + 1K INSERT/sec globally (~5K reads/sec per region)
11.2 Cache Hit Rate Is a Zipf Assumption
⚠️ Design risk. The 80% Valkey and 25% CDN hit rates only hold if URL popularity follows a Zipf distribution. If traffic is uniform — for example, a batch importer creates 100M URLs that all get a burst of clicks — the top 150M URLs cover only ~30% of reads, origin traffic jumps from ~15K to ~52K Scylla reads/sec, and tail latency climbs from sub-millisecond to tens of ms. Monitor cache_hit_rate continuously. Page on sustained drops below 60% and scale Valkey (or Scylla) immediately.
Zipf is well-established for social and news URLs, but measure it, don't assume it. A change in traffic mix is the fastest way to push the working set outside the cache.
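A quick way to see where the ~80% figure comes from: under Zipf with exponent 1, the fraction of traffic hitting the top k of n items is roughly H(k)/H(n), and harmonic numbers are well approximated by ln(x) + γ. The exponent and the closed-form approximation are assumptions; §11.2's ~30% uniform figure additionally credits LRU recency effects that this static calculation ignores:

```python
import math

def zipf_coverage(top_k: int, n: int) -> float:
    # Traffic share of the top_k most popular of n URLs, Zipf exponent 1.
    gamma = 0.5772156649          # Euler–Mascheroni constant
    h = lambda x: math.log(x) + gamma  # harmonic number approximation
    return h(top_k) / h(n)

cached = zipf_coverage(150_000_000, 10_000_000_000)   # §7.3 cache size
uniform = 150_000_000 / 10_000_000_000                # static uniform bound
print(f"Zipf: top 1.5% covers {cached:.0%}; uniform static bound: {uniform:.1%}")
```

The Zipf result lands right around the 80% the cache sizing assumes, which is exactly why a shift toward uniform traffic is the design's most fragile point.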
11.3 Hot Key Problem
One viral link can take 1M req/sec, and the entire load lands on one Valkey shard — one key, one CPU, instant saturation. This is the most common production failure mode for URL shorteners.
Four mitigations, in order of effectiveness:
- Local in-process LRU. The Redirect Service keeps a ~10K-entry LRU with a 10s TTL per pod. Viral keys are served from process memory and never touch Valkey. Across 20 pods, this alone absorbs most of a viral burst.
- CDN TTL with jitter. Base 60s TTL plus ±10s jitter so edge entries don't all expire at the same millisecond and stampede origin.
- Request coalescing. On a cache miss, the first request acquires a short-lived lock (SET lock:url:{code} 1 NX EX 5) while fetching from Scylla; concurrent requests briefly wait-and-retry instead of all hitting the DB. "Coalescing" just means collapsing a thundering herd into one backend query. Pseudocode in Appendix B.
- Hot-key replication. If a single Valkey shard is still overloaded, replicate known hot keys across all shards with a client-side aliasing scheme (url:{code}:{shard_hint}).
These activate automatically on the first 10–100 requests to a new URL — no prior knowledge of "which keys are hot" is needed.
11.4 Server-Side Latency Budget (Redirect Service only)
End-to-end latency is dominated by network, not server work. The numbers below are the server's contribution. §13 covers how multi-region deployment handles the network half of NFR-01c.
Cache-hit path (target: p99 < 5ms server-side):
LB routing: <1 ms
Local LRU: ~0.01 ms
or Valkey GET: 0.1–0.3 ms
Response build: <0.1 ms
Total: 1–2 ms
Cache-miss path (target: p99 < 15ms server-side):
LB routing: <1 ms
Valkey GET (miss): 0.1 ms
Scylla SELECT LOCAL_ONE: 0.5–4 ms (shard-per-core, in-region)
Valkey SET: 0.1 ms
Response build: <0.1 ms
Total: 2–6 ms
Check yourself
If URL popularity were uniform instead of Zipf, which number in this design collapses first?
Answer
The Valkey hit rate. With uniform traffic, the top 150M URLs cover only ~30% of reads instead of ~80%, which pushes Scylla reads from ~15K/sec to ~52K/sec globally and blows past the shard-per-core budget. p99 climbs from sub-millisecond into the tens of milliseconds and the cache-hit SLO starts burning error budget immediately. This is why §11.2 names the assumption explicitly and pages on hit rate dropping below 60%.
12. Click Analytics Pipeline
Every redirect emits a click event. 100K events/sec without touching redirect latency requires a fully async pipeline.
Design decisions:
- Async Kafka producer, best-effort delivery. producer.send() without waiting for ack; buffered in 32 MB of producer memory; overflow drops silently. Analytics gap during a Kafka outage is acceptable — extra redirect latency is not. This is the right call for click analytics only. Billing, audit, and compliance streams need synchronous acks=all and at-least-once delivery.
- Flink 5s tumbling windows. ClickHouse wants large batch inserts, not row-by-row. 100K–500K rows per batch keeps merge pressure low.
- Enrichment in Flink. MaxMind GeoIP and user-agent parsing happen in the consumer, not on the redirect path.
Write amplification control — a naive pipeline overwhelms Kafka and ClickHouse:
- Partition Kafka by event, not by short_code — keying on the short code would land every event for a viral link on one partition (instant hot spot). Per-code ordering isn't needed; Flink re-keys in its own state.
- Producer batching: linger.ms=50, batch.size=64KB. 100K events/sec becomes ~500 Kafka requests/sec at the broker.
- LZ4 compression on topics cuts wire and disk ~4× at negligible CPU cost.
- One ClickHouse part per batch — infrequent large inserts keep merges cheap.
Raw: 100K events/sec × 300 B = 30 MB/sec
After LZ4: ~7.5 MB/sec on the wire
Producer batching: ~500 Kafka requests/sec
ClickHouse inserts: ~1–2/sec of 100K-row parts
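The producer-side decisions translate into a handful of librdkafka settings (the config dialect used by confluent-kafka-python). A sketch — the broker address is illustrative, the rest mirror the bullets above:

```python
# Fire-and-forget producer config for click analytics ONLY.
# Billing/audit streams would instead use acks="all".
click_producer_config = {
    "bootstrap.servers": "kafka:9092",    # illustrative address
    "acks": 0,                  # best-effort: never block a redirect on Kafka
    "linger.ms": 50,            # batch up to 50ms of events per request
    "batch.size": 65536,        # 64 KB batches -> ~500 broker requests/sec
    "compression.type": "lz4",  # ~4x smaller on wire and disk
    "queue.buffering.max.kbytes": 32768,  # 32 MB producer buffer; overflow drops
}
```

The same intent maps onto the Java client as `acks=0`, `linger.ms=50`, `batch.size=65536`, `compression.type=lz4`, `buffer.memory=33554432`.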
13. Multi-Region Writes
Every region runs the full stack. Reads and creates both serve from the nearest region; there is no primary region. Regional counters make this safe for short codes, and leader-region routing handles the one operation that can't parallelize (custom aliases).
13.1 Regional Topology
A Tokyo user's create hits Tokyo's Create Service, allocates from Tokyo's counter row, writes to Tokyo's Scylla DC at LOCAL_QUORUM (2 of 3 local replicas), and replicates to other DCs asynchronously. A Tokyo user's redirect hits Tokyo's Valkey first, then Tokyo's Scylla at LOCAL_ONE on miss.
13.2 Why LOCAL_QUORUM Writes Are Safe
LOCAL_QUORUM + async cross-DC replication is the standard Cassandra multi-DC pattern, but it only works because this workload makes cross-region conflicts impossible:
- Short codes can't conflict. Region-prefixed counters make regional write sets disjoint.
- Custom aliases can't conflict. All custom-alias writes are routed to a leader region (see §13.3), so LWT Paxos stays local.
- Click counts are in a counter table that's approximate-by-design; any drift is reconciled against ClickHouse.
The trade-off is explicit: we give up instant global consistency (a newly-created URL is visible in other regions within ~1s, not immediately) in exchange for zero cross-region latency on every write. For a URL shortener, that's the right call — creators don't share-and-click within one second.
13.3 Custom Alias Coordination
Custom aliases are the only operation that can't parallelize — two users could race on sho.rt/summer-sale across regions. All custom-alias writes go through a leader region (Virginia) via an L7 LB rule keyed on the custom_alias field. Non-US users pay ~100ms of extra WAN latency when creating custom aliases, acceptable because (a) creates aren't latency-sensitive and (b) custom aliases are a small fraction of traffic. Redirects still hit the nearest region — the custom_aliases table replicates to all DCs and reads are regional.
An alternative (cross-DC SERIAL LWT) exists for teams that want to avoid the single-region dependency; it's in Appendix C.
13.4 Cross-DC Replication Mechanics
- Async cross-DC replication — writes commit at LOCAL_QUORUM; the coordinator fans out to other DCs in the background. Typical lag <1s.
- Hinted handoff — brief unavailability captured as hints that replay on recovery (up to 3h default window).
- Read repair — any read finding divergent replicas triggers async repair.
- Scheduled anti-entropy — weekly nodetool repair guarantees convergence; details in §17.3.
13.5 Replication Lag Window
A URL created in Tokyo is visible in Virginia/Frankfurt within ~1 second. The narrow window where a just-created link 404s in a remote region is acceptable for the normal share-and-click flow. If a specific client needs stronger guarantees (e.g., a QR-code generator rendering immediately after creation), the Create Service can warm the originating region's Valkey entry synchronously and include a cache-warm hint in the response.
Check yourself
Why can we safely use LOCAL_QUORUM for short-code creates without cross-region coordination, but not for custom aliases?
Answer
Short-code creates can't conflict across regions because the region-prefixed counter makes every region's ID space disjoint — Virginia and Tokyo literally cannot mint the same code, so `LOCAL_QUORUM` is enough. Custom aliases share a single global key space (`my-brand` is the same string everywhere), so two users in different regions can race on the same alias within the ~1s replication lag. That race needs a single serialization point — either a leader region with `LOCAL_SERIAL` LWT (what we picked) or cross-DC `SERIAL` LWT (Appendix C). There's no way to keep it `LOCAL_QUORUM` and also guarantee uniqueness.
14. Custom Alias & Expiration
14.1 Custom Aliases
Custom aliases need global uniqueness. Two users requesting the same alias simultaneously is the race condition; without proper handling, one silently overwrites the other.
Case-insensitive conflicts. my-brand, My-Brand, and MY-BRAND are the same alias. The Create Service lowercases input before the uniqueness check and stores it lowercase. Normalize once at the edge — never rely on downstream components to handle casing consistently.
Solution: Scylla LWT in the leader region.
INSERT INTO custom_aliases (alias, short_code, owner_user_id, created_at)
VALUES (?, ?, ?, ?)
IF NOT EXISTS;
-- LOCAL_SERIAL consistency
On [applied=false], the Create Service returns 409 Conflict. Because the write is routed to the leader region (§13.3), the Paxos round stays local and cheap (~5–10ms).
Reserved words. A blocklist (api, admin, login, help, about, static, …) is checked before the INSERT so users can't claim paths that collide with application routes.
Validation. 3–30 characters, alphanumeric plus hyphens, no leading/trailing hyphens.
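The normalization and validation rules collected above can be sketched as one edge-side function; the blocklist here is the sample from this post, not the full production list:

```python
import re

RESERVED = {"api", "admin", "login", "help", "about", "static"}
# 3-30 chars enforced separately; pattern forbids edge hyphens.
ALIAS_RE = re.compile(r"[a-z0-9](?:[a-z0-9-]{1,28})?[a-z0-9]")

def normalize_alias(raw: str) -> str:
    alias = raw.strip().lower()        # My-Brand == my-brand == MY-BRAND
    if alias in RESERVED:
        raise ValueError("reserved word")
    if not (3 <= len(alias) <= 30) or not ALIAS_RE.fullmatch(alias):
        raise ValueError("3-30 chars, alphanumeric + hyphens, no edge hyphens")
    return alias                       # this value goes into the LWT INSERT
```

Everything downstream — the LWT, the cache key, the redirect lookup — sees only the normalized form.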
14.2 URL Expiration
- Soft expiration (instant). On every cache hit and Scylla read, the Redirect Service checks expires_at. If past, it returns 410 Gone. The row stays in the database until TTL reaping.
- Hard deletion via TTL. URLs with a finite lifetime use INSERT ... USING TTL <seconds>. TimeWindowCompactionStrategy drops expired rows as whole SSTables age out — no cleanup job.
- Cache entry carries expires_at. The Redirect Service checks it on every read; no proactive eviction needed.
- Deletion propagation window. CDN entries use Cache-Control: max-age=60, so a deleted or expired URL stops redirecting worldwide within ~60 seconds. During that window, the CDN may still serve the cached 302 even though origin now returns 410 Gone — the 60s ceiling is deliberate, trading a brief stale window for the traffic absorption a longer TTL would give. A stronger SLA would require an active CDN purge on delete, which is extra operational surface we don't need here.
15. Security & Abuse
URL shorteners are abused for phishing, malware, and spam, and they handle user data under GDPR. The design-specific work is abuse prevention on create and the GDPR deletion flow. Everything else — TLS, encryption at rest, KMS, audit logging — is standard baseline covered in one paragraph at the end of this section.
15.1 Abuse Prevention on Create
Three clusters of controls:
Upfront validation.
- URLs must be HTTP/HTTPS, max 2048 chars. Custom aliases: 3–30 chars, alphanumeric + hyphens, reserved-word checked, lowercased.
- Destinations are checked against Google Safe Browsing, an internal blocklist, and homograph/suspicious-pattern matchers. Adds ~50ms per create, acceptable because creates aren't latency-sensitive.
- The safety scanner issues up to 3 HEAD requests following any redirect the destination returns, and rejects if the final target is flagged or points back at us.
Rate limiting and authentication.
- 100 creates/minute per API key; 10/minute per IP for anonymous creates (sliding-window counters in Valkey).
- No API key → CAPTCHA required.
Chain prevention.
- Reject destinations pointing at our own shortener domains (direct, CNAME, or IP literal).
- Reject destinations pointing at known external shorteners (`bit.ly`, `tinyurl.com`, `t.co`, `goo.gl`, `is.gd`, `ow.ly`, etc.). Chain hops across independent shorteners make safety scanning unreliable and are a classic phishing vector.
For URLs flagged as borderline, the `/{code}+` preview page (§9.3) is served instead of the 302 and requires an explicit click-through.
Retroactive scanning. A background job re-scans existing URLs weekly; if a destination that was clean at creation turns malicious, the URL is disabled and the owner is notified.
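The sliding-window counters from the rate-limiting bullet can be sketched in-process. This is an in-memory stand-in for the Valkey counters, not the production client; `SlidingWindowLimiter` is a hypothetical helper:

```python
import time

class SlidingWindowLimiter:
    """Sliding-window rate limiter: allow at most `limit` hits per key
    in any trailing `window_s` seconds. In production this state lives
    in Valkey (e.g. a sorted set per key), not in process memory."""

    def __init__(self, limit, window_s):
        self.limit = limit
        self.window_s = window_s
        self.hits = {}  # key -> timestamps of accepted hits

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        cutoff = now - self.window_s
        recent = [t for t in self.hits.get(key, []) if t > cutoff]
        if len(recent) >= self.limit:
            self.hits[key] = recent
            return False  # over budget: reject, don't record
        recent.append(now)
        self.hits[key] = recent
        return True

anon_limiter = SlidingWindowLimiter(limit=10, window_s=60)  # 10/min per IP
```

The same structure with `limit=100` covers the per-API-key tier.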
15.2 GDPR and PII Minimization
- Click-event IPs are salted-hashed (HMAC-SHA-256 with a rotating salt) before ClickHouse. Raw IPs are never stored; the hash still supports unique-visitor counting.
- Device/browser/OS bucketed — no user-agent fingerprinting.
- Delete-by-user flow (Articles 15/17):
  1. Query `urls_by_user` for all `short_code`s owned by the user.
  2. `DELETE FROM urls WHERE short_code = ?` (app-level fanout also cleans `urls_by_user`).
  3. `DELETE FROM url_click_counts WHERE short_code = ?`.
  4. `ALTER DELETE` on the matching ClickHouse click events.
  5. Audit the deletion with the request ID.
- Tombstone awareness. Scylla deletes are logical — rows are physically removed after `gc_grace_seconds` (10 days). Disclosed in the GDPR response; still satisfies the law.
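The IP pseudonymization in the first bullet is one keyed-hash call. A sketch, assuming the rotating salt is delivered to the service out of band:

```python
import hashlib
import hmac

def pseudonymize_ip(ip: str, salt: bytes) -> str:
    """HMAC-SHA-256 of the client IP under the current rotating salt.
    Same IP + same salt -> same digest, so unique-visitor counting
    still works; rotating the salt severs linkability across periods.
    The raw IP never reaches ClickHouse."""
    return hmac.new(salt, ip.encode(), hashlib.sha256).hexdigest()
```

Rotation cadence is a privacy/analytics trade-off: a daily salt gives daily uniques but makes week-over-week visitor tracking impossible by construction.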
15.3 Standard Security Baselines
TLS 1.3 for client-facing traffic; mTLS service-to-service via the service mesh; Scylla internode and cross-DC gossip encryption; Transparent Data Encryption on SSTables and commit logs with a KMS-backed master key per DC; ClickHouse disk-level encryption; Kafka broker-side encryption; API keys SHA-256-hashed at rest; service secrets in Vault with automatic 90-day rotation; Scylla and application audit logs shipped to an immutable S3 bucket with 1-year retention. These are industry-standard, not design-specific — applied like any other production system.
One special case: the Feistel shuffle key (Appendix A) can't be rotated in place without changing future short-code outputs. Rotation requires a new epoch bit reserved from the counter prefix so old and new keys coexist until old URLs age out.
16. Failure Scenarios
16.1 Valkey Cluster Node Failure
SCENARIO. One of 6 Valkey shards in a region crashes.
| Time | Event |
|---|---|
| T+0s | Shard 3 crashes. ~25M cached URLs unavailable on that shard. |
| T+0s | Redirect Service gets connection errors; circuit breaker opens; requests for shard 3 keys fall through to Scylla. |
| T+5s | Valkey Cluster promotes replica to primary. |
| T+10s | New primary available; circuit breaker closes; cache cold for shard 3. |
| T+60s | Cache warms organically from redirect traffic. |
Impact. ~16% of regional redirects hit Scylla for ~60s. Regional Scylla reads spike from ~5K to ~10K/sec — comfortably within 6-node capacity. No data loss, no user-visible errors, slightly slower redirects (2–5ms vs <1ms) during warm-up.
Lesson. The whole point of sizing Scylla for 2× the cache-miss load is to absorb cold-cache moments like this silently. If a cache shard crash were enough to page on-call, you sized the DB wrong, not the cache.
16.2 ScyllaDB Node Failure
SCENARIO. One Scylla node fails in a 6-node DC.
| Time | Event |
|---|---|
| T+0s | Node 3 fails. Gossip marks it down within ~2s. |
| T+2s | LOCAL_QUORUM writes require 2-of-2 remaining replicas. Reads at LOCAL_ONE stay fast. |
| T+2s | Coordinator starts writing hints for the down node. |
| T+Xs | Operator replaces node (nodetool removenode or bootstrap replacement). |
| T+X+min | Replacement streams its data share; hint replay catches it up. |
| T+X+hr | nodetool repair -pr guarantees full convergence. |
Impact. Zero write loss (hinted handoff). Zero read impact. Write p99 may bump slightly during the handoff window. No user or on-call action needed during the hours it takes to replace the node.
Lesson. Hinted handoff makes single-node loss invisible to writes. If you're not seeing errors during a node outage, the system is working as designed — don't mistake the silence for a problem that needs investigating.
16.3 Regional DC Outage
SCENARIO. Full Scylla DC outage in us-east-1.
| Time | Event |
|---|---|
| T+0s | us-east-1 Scylla DC unavailable. |
| T+0s | us-east-1 Redirect Service serves Valkey/LRU hits; cache misses fail. |
| T+0s | us-east-1 Create Service fails (can't reach counter row). |
| T+5s | Health checks detect unhealthy DC. |
| T+30s | Route 53 DNS failover removes us-east-1 from the redirect record set. |
| T+60s | Global traffic redistributes to eu-central-1 and ap-northeast-1. |
| T+hours | DC returns; Scylla Manager runs full repair to reconcile divergence. |
Impact. us-east-1 clients see 30–60s of cache-miss errors during DNS failover. Creates route to the nearest healthy region. No data loss — other DCs have full copies via NetworkTopologyStrategy. Recovery cost is IO-heavy repair, scheduled off-peak.
Lesson. DNS is the slowest part of failover — the 30–60s floor comes from TTL propagation, not Scylla recovery. If you need faster DC failover, invest in client-side endpoint discovery (e.g., service mesh with active health checks), not in making the DB recover any faster.
16.4 Kafka Unavailable
SCENARIO. Kafka unavailable for 5 minutes.
| Time | Event |
|---|---|
| T+0s | Brokers unreachable. Async producer buffers to 32 MB. |
| T+0s | Redirects continue — Kafka is off the critical path. |
| T+2m | Producer buffer fills; new events drop silently. |
| T+5m | Kafka recovers; producer flushes buffered events. |
Impact. Zero impact on redirects. Analytics has a 5-minute gap; some events during buffer-overflow are permanently lost. Counter-column lag from ClickHouse reconciliation catches up within minutes.
Lesson. A best-effort async producer is a feature, not a bug. The moment you flip Kafka to acks=all to "not lose events," you've handed Kafka a veto over redirect availability — and redirect availability is the SLO that matters. Durability choices follow from what the downstream consumer actually needs, not from a vague desire for correctness.
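In Java-client config terms, that best-effort choice looks roughly like the following. Values are illustrative (only the 32 MB buffer comes from the timeline above); the key names are the Kafka Java producer's:

```python
# Sketch of the producer settings implied by the "best-effort" stance.
producer_config = {
    "acks": "1",                        # leader-only ack: no ISR veto over redirects
    "buffer.memory": 32 * 1024 * 1024,  # 32 MB in-process buffer from the timeline
    "max.block.ms": 0,                  # never block the redirect thread; fail/drop instead
    "linger.ms": 5,                     # small batching, still off the critical path
}
```

The inverse configuration (`acks=all`, a large `max.block.ms`) is exactly the "Kafka veto" the lesson warns against.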
16.5 Cross-DC Network Partition
SCENARIO. 10-minute partition between us-east-1 and the other two DCs.
| Time | Event |
|---|---|
| T+0m | Cross-DC links drop. us-east-1 not partitioned from its clients. |
| T+0m | All DCs continue serving LOCAL_QUORUM writes and LOCAL_ONE reads independently. |
| T+0–10m | Each DC accumulates hints for the unreachable peers (3h default window). |
| T+10m | Network restores; gossip reconnects; hint replay pushes accumulated writes. |
| T+15–30m | Read repair and background anti-entropy clean up anything that didn't fit in hints. |
Impact. No data loss — partition shorter than the hint window. Clients in each region saw their own writes immediately; other regions converged after recovery. This is the point of multi-region active-active: a full cross-DC partition doesn't degrade service for anyone.
Lesson. The 3-hour hint window is the real SLA on partition tolerance. A partition longer than that starts dropping hints and requires full anti-entropy repair to recover — so "multi-region active-active" isn't a free lunch, it's a bet that network partitions stay short. Monitor scylla_hints_pending and page before it approaches the hint-log capacity, not after.
17. Operational Playbook
A design that doesn't document its ops story isn't production-grade.
17.1 Deployment
- Services are stateless Kubernetes containers. Rolling update with `maxUnavailable: 0`, `maxSurge: 25%`.
- Canary uses LB weight: 5% for 10 min → 25% for 10 min → 100%. Auto-rollback if cache-hit p99 climbs >2ms, cache-miss p99 >5ms, or error rate >0.1%. Symmetric rollback takes ~5 minutes.
- Flink job blue-green: new job joins the same consumer group, old job stops after the new one catches up.
17.2 Key Metrics and Alerts
Only design-specific alerts live in the main dashboard. Standard infrastructure metrics (error rate, throughput, pod CPU) live in the platform dashboard and aren't duplicated here. The three numbers that matter most are below; the full alert set is collapsed.
| Metric | Alert Threshold |
|---|---|
| `cache_hit_rate` | < 60% for 10 minutes |
| `redirect_latency_cache_hit_p99` | > 10ms for 5 minutes |
| `counter_range_remaining` | < 10K remaining IDs |
All 8 design-specific alerts:
| Metric | Alert Threshold |
|---|---|
| `cache_hit_rate` | < 60% for 10 minutes |
| `redirect_latency_cache_hit_p99` | > 10ms for 5 minutes |
| `redirect_latency_cache_miss_p99` | > 20ms for 5 minutes |
| `counter_range_remaining` | < 10K remaining IDs |
| `scylla_load_avg_per_shard` | > 0.7 for 15 minutes |
| `scylla_hints_pending` | > 100K for 10 minutes |
| `scylla_cross_dc_latency_p99` | > 2 seconds |
| `kafka_consumer_lag` | > 1M events for 5 minutes |
17.3 Repair and Compaction
Scylla has three anti-entropy layers: read repair runs constantly on every divergent read (free), hinted handoff captures brief unavailability and replays on recovery, and scheduled repair is the backstop that catches anything the other two missed and reaps tombstones before gc_grace_seconds expires.
Schedule. Scylla Manager runs incremental repair daily and full repair weekly during off-peak hours. Full repair on the 18-node cluster with ~30 TB replicated takes 4–8 hours of background IO, capped at 30% of disk budget so foreground reads stay fast. Repairs must complete within gc_grace_seconds (10 days) or tombstones start resurrecting deleted data.
Compaction per table:
- `urls` — `TimeWindowCompactionStrategy`, 30-day windows. Immutable rows; whole SSTables drop at once as TTLs age out.
- `custom_aliases`, `counter_ranges` — `SizeTieredCompactionStrategy`. Small tables, low write rate; compaction is irrelevant.
- `url_click_counts` — STCS with `min_threshold=2` so counter updates consolidate quickly.
Throttle knobs: `compaction_throughput_mb_per_sec = 64` baseline, dropped to 16 during traffic peaks via `nodetool setcompactionthroughput`. `concurrent_compactors = 4` per node, one per NVMe drive.
Write amplification. With TWCS and 30-day windows, Scylla write amplification stays in the normal 3–5× range. Monitor scylla_commitlog_writes against client writes; a large delta signals heavy merges that may need a throttle adjustment.
Compaction storm (worked example). If multiple nodes run major compactions simultaneously and SSD IO saturates, coordinator read p99 climbs from <1ms to 5–8ms (still under NFR-01b 15ms). Mitigation: nodetool setcompactionthroughput 16 to throttle harder, then rebalance the schedule so future compactions stagger across nodes.
17.4 Backup and Recovery
- Daily snapshots via Scylla Manager to S3 (hard-linked at the SSTable level, cheap).
- Incremental SSTable backups every 4 hours between snapshots.
- Cross-region S3 replication on backup artifacts.
- Quarterly restore drill on a random node — no restore is real until you've done it.
Recovery Objectives by Failure Type:
| Failure | RPO | RTO | Data Loss |
|---|---|---|---|
| Valkey node | N/A | <1 min | No (replicated) |
| Scylla node | 0 | 2–6 h | No (hinted handoff + LOCAL_QUORUM) |
| Single DC | 4 h | 2–6 h | No (replicated to other DCs) |
| Kafka outage | 5 min | <5 min | Analytics only (acceptable) |
| Full cluster rebuild from backup | 4 h | 6 h | No |
That covers the running-it story.
§17.1–§17.4 cover what every engineer on the project should know. What follows (capacity planning, schema migrations, top 5 alerts) is on-call and senior-ops territory — skim unless you're paged.
17.5 Capacity Planning
When you need this: sizing the cluster for next quarter, or deciding whether a node needs adding.
Three leading indicators:
- Disk per node. Warn 60%, scale 70%, page 80%. Scylla rebuild takes time — earlier warnings = calmer scaling.
- `scylla_load_avg_per_shard`. Target <0.5 average, <0.7 peak. Sustained >0.7 signals a hot partition or that the DC needs more nodes.
- Cross-DC replication lag. Target p99 <1s. Sustained lag means cross-region links are saturated.
Adding a node takes 2–6 hours to bootstrap (data share streams from replicas). Adding a DC is a days-long operation: provision, bootstrap, `nodetool rebuild`, update `NetworkTopologyStrategy`.
17.6 Schema Migrations
When you need this: shipping a schema change to the urls or custom_aliases table.
CQL DDL is online, but the reality at 10B rows is subtler. `ALTER TABLE ... ADD` is metadata-only and cheap. `ALTER TABLE ... DROP` rewrites SSTables and stalls reads on hot nodes — schedule it off-peak. Secondary indexes are avoided — we use query-pattern tables instead. All schema changes live in `schema/migrations/` with a `schema_versions` table tracking applied IDs. Forward-only: a mistake requires a new compensating migration, never a rollback.
17.7 Top 5 Alerts and Mitigations
When you need this: 3 AM page.
Every on-call engineer should know these cold:
- Hot partition (`scylla_load_avg_per_shard > 0.7`) — identify the hot `short_code` via query tracing; if legit (viral link), rely on LRU + CDN jitter; if bot, rate-limit the source.
- Compaction starvation (`scylla_compactions_pending > 10`) — run `nodetool compactionstats`; temporarily bump throughput if traffic is off-peak; investigate strategy mismatch.
- Hint pile-up (`scylla_hints_pending > 100K`) — a replica is dropping writes; check `nodetool status`; if flapping, check NICs and disks.
- Cross-DC lag (`scylla_cross_dc_latency_p99 > 2s`) — cross-region link saturated or remote DC overloaded; check network graphs and recent deploys.
- Disk space warning (node >70%) — start capacity planning; short-term, check runaway tombstones with `nodetool cfstats`, force compaction on the worst offender, or clean old snapshots.
18. SLOs and Error Budgets
SLOs make the quality target concrete. Error budgets turn "be more careful" into "freeze deploys this week."
| SLI | SLO | Monthly Error Budget |
|---|---|---|
| Redirect cache-hit p99 ≤ 5ms (server-side) | 99.95% | 21.6 min |
| Redirect availability (non-5xx) | 99.99% | 4.3 min |
| URL durability (non-expired) | 100% over rolling year | Budget-less — any loss is a major postmortem |
3 more SLOs (cache-miss latency, create latency, analytics freshness):
| SLI | SLO | Monthly Error Budget |
|---|---|---|
| Redirect cache-miss p99 ≤ 15ms (server-side) | 99.5% | 3.6 h |
| Create latency p99 ≤ 50ms (server-side) | 99% | 7.2 h |
| Analytics freshness ≤ 5s | 99% | 7.2 h |
Error budget policy. Normal burn (≤1× over 30d): business as usual. Fast burn (>2× over 7d): freeze non-critical deploys, director sign-off for launches. Very fast burn (>4× over 1d): page on-call, freeze all non-rollback deploys, launch IR. Exhausted: next sprint goes to reliability.
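The budget column in the SLO tables is pure arithmetic; a one-liner reproduces it, assuming a 30-day month:

```python
def monthly_error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed minutes of SLO violation per month: (1 - SLO) * month length."""
    return (1 - slo) * days * 24 * 60

# 99.95% -> 21.6 min, 99.99% -> ~4.3 min, 99.5% -> 216 min (3.6 h), 99% -> 432 min (7.2 h)
```

Burn rate is then observed violation minutes divided by this number over the same window, which is what the 1×/2×/4× policy thresholds refer to.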
Alert tiering.
- Page-now. Availability burn, error-rate spike, Scylla `UN→DN`, hint pile-up, `counter_range_remaining < 10K`. 5-minute response.
- Page-business-hours. Latency budget burn, cache hit rate dropping, compaction starvation, cross-DC lag.
- Ticket-only. Capacity warnings, disk >60%, pending migrations, individual repair failures.
19. Appendix
A. Bijective Shuffle (Feistel)
The shuffle is a 2-round Feistel cipher on the 42-bit ID. Feistel is bijective by construction: every input maps to a unique output, and the mapping is reversible with the key. Two rounds on a 42-bit block provide enough diffusion to make sequential counter values produce non-sequential codes, without cryptographic-strength requirements (we're avoiding enumeration, not defending against a state actor).
Pseudocode:
```text
shuffle(x, key):
    L = x >> 21                  # high 21 bits
    R = x & ((1 << 21) - 1)      # low 21 bits
    for i in [0, 1]:
        F = hash(R, key[i]) & ((1 << 21) - 1)
        L, R = R, L XOR F
    return (L << 21) | R
```

`hash()` is a fast non-cryptographic mixer (e.g., xxHash) truncated to 21 bits. The inverse runs the same structure in reverse, so any operator with the key can decode a short code back to its counter value (and therefore its origin region) for admin tooling.
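A runnable version of the pseudocode, with the inverse. The multiply-xor mixer is an arbitrary stand-in for xxHash, and the round keys are illustrative; any deterministic 21-bit round function gives a bijection:

```python
MASK21 = (1 << 21) - 1

def _mix(r: int, k: int) -> int:
    # Non-cryptographic 21-bit mixer standing in for the xxHash round function.
    x = ((r ^ k) * 0x9E3779B1) & 0xFFFFFFFF
    x ^= x >> 15
    return x & MASK21

def shuffle(x: int, keys) -> int:
    """2-round Feistel over a 42-bit ID: (L, R) -> (R, L ^ F(R, k))."""
    L, R = x >> 21, x & MASK21
    for k in keys:
        L, R = R, L ^ _mix(R, k)
    return (L << 21) | R

def unshuffle(y: int, keys) -> int:
    """Inverse: run the rounds backwards with the keys reversed."""
    L, R = y >> 21, y & MASK21
    for k in reversed(keys):
        L, R = R ^ _mix(L, k), L
    return (L << 21) | R
```

Note the round function itself never needs to be invertible — only the Feistel structure does, which is why a lossy hash works.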
B. Request Coalescing (Cache Miss)
The first request that misses both local LRU and Valkey acquires a short-lived Valkey lock, fetches from Scylla, and writes back to both caches. Concurrent requests wait briefly and retry against Valkey instead of stampeding Scylla.
Python implementation:
```python
import time

def get_url(short_code):
    # Layer 1: in-process LRU.
    if val := local_cache.get(short_code):
        return val
    # Layer 2: regional Valkey.
    if val := valkey.get(f"url:{short_code}"):
        local_cache.set(short_code, val, ttl=10)
        return val
    # Layer 3: Scylla, behind a short-lived lock so only one request
    # per key hits the database on a cold miss.
    lock_key = f"lock:url:{short_code}"
    if valkey.set(lock_key, "1", nx=True, ex=5):
        try:
            row = scylla_session.execute(
                SELECT_URL_PS, [short_code],
                consistency_level=ConsistencyLevel.LOCAL_ONE,
            ).one()
            if row:
                valkey.set(f"url:{short_code}", row.long_url, ex=86400)
                local_cache.set(short_code, row.long_url, ttl=10)
            return row.long_url if row else None
        finally:
            valkey.delete(lock_key)  # release even if the Scylla read fails
    else:
        # Another request holds the lock: wait briefly, then re-check Valkey.
        time.sleep(0.01)
        return valkey.get(f"url:{short_code}")
```

C. Cross-DC LWT Alternative for Custom Aliases
Instead of leader-region routing, Scylla LWT can run at SERIAL (not LOCAL_SERIAL), forcing Paxos rounds across DCs. Per-write cost is ~100–150ms vs ~5–10ms local, but there's no single-region dependency and custom-alias throughput isn't bottlenecked by the leader region. Pick this when either (a) the extra WAN latency is unacceptable for custom-alias creates or (b) eliminating the single-region dependency matters more than the latency cost.
If you only remember six things
- ID. 42-bit region-prefixed counter → bijective shuffle → Base62 (7 chars). No cross-region coordination, 440× headroom.
- Storage. ScyllaDB RF=3 per DC, active-active across Virginia, Frankfurt, Tokyo. `LOCAL_QUORUM` writes, `LOCAL_ONE` reads.
- Cache. CDN → local LRU → Valkey → Scylla. Each layer absorbs what the previous missed; sub-5ms p99 server-side on hit.
- Analytics. Kafka (best-effort async) → Flink (5s batches) → ClickHouse. Off the critical path.
- Biggest risk. Zipf cache assumption. Monitor hit rate; page on sustained drops below 60%.
- Escape hatch. At 90% counter utilization, rotate to 8-char codes. Old codes stay valid forever.
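The Base62 step from the first bullet, fixed-width so every code is exactly 7 characters. The alphabet order is an assumption (any fixed order works, as long as encode and decode agree), and the counter-range allocator is what guarantees IDs stay below 62⁷:

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def base62_encode(n: int, width: int = 7) -> str:
    """Fixed-width Base62 for a shuffled ID; assumes n < 62**width."""
    assert 0 <= n < 62 ** width
    chars = []
    for _ in range(width):
        n, r = divmod(n, 62)
        chars.append(ALPHABET[r])
    return "".join(reversed(chars))

def base62_decode(code: str) -> int:
    n = 0
    for c in code:
        n = n * 62 + ALPHABET.index(c)
    return n
```

Fixed width (leading "0" padding) keeps every code the same length, which simplifies routing and the eventual 7-to-8-character rotation.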
Explore the Technologies
Dive deeper into the technologies and infrastructure patterns used in this design:
Core Technologies
| Technology | Role in This Design | Learn More |
|---|---|---|
| ScyllaDB | Primary URL storage, Cassandra-compatible shard-per-core, active-active DC replication, native TTL | ScyllaDB |
| Valkey | Regional redirect cache (150M hot URLs), rate limit counters | Redis/Valkey |
| ClickHouse | Click analytics storage, real-time materialized views | ClickHouse |
| Kafka | Async click event pipeline, decouples redirect from analytics | Kafka |
| Flink | Click event enrichment and batched ClickHouse inserts | Apache Flink |
Infrastructure Patterns
| Pattern | Relevance to This Design | Learn More |
|---|---|---|
| Caching Strategies | Three-layer caching (CDN, Valkey, Scylla) for sub-5ms redirects | Caching Strategies |
| Rate Limiting and Throttling | Per-API-key and per-IP rate limits using Valkey sliding windows | Rate Limiting |
| Message Queues and Event Streaming | Kafka decouples analytics from the redirect hot path | Event Streaming |
| CDN and Edge Computing | Edge caching for viral URL redirects | CDN and Edge Computing |
| Multi-Region Active-Active | Region-prefixed counter IDs + Scylla NetworkTopologyStrategy for worldwide writes | Multi-Region |
Practice this design: Try the URL Shortener interview question to test your understanding with hints and structured guidance.