System Design: Ad Exchange (Real-Time Bidding, Sub-100ms Auctions, DSP/SSP, Impression Serving)
Goal: An ad exchange running real-time bidding for 1M ad requests/sec, top-5-of-50 smart DSP selection, first-price auctions inside 100ms, publisher floor prices + supply-chain rules (ads.txt, sellers.json, schain), CDN-served creatives, independent impression tracking, and $4B/month of spend reconciled between DSPs and publishers. That's ~25B impressions a day across three regions (us-east, eu-west, ap-south).
Reading guide:
- §1 — one ad served end-to-end, all 14 actors named
- §2–§4 — problem and requirements
- §5 — ecosystem + components inside the exchange
- §6–§7 — architecture and sizing
- §8–§9 — data model and APIs
- §10 — deep dives: auction flow, DSP selection, bid optimization, ad serving, spend tracking, fraud
- §11–§15 — bottlenecks, failures, deployment, observability, security
TL;DR: A user in Austin abandons $130 Nike shoes in their cart Saturday night. Sunday morning they open ESPN and see those exact shoes in a banner within ~110ms of page load. Fourteen companies pull that off: CDP, identity graph, campaign manager, DSP with an ML bidder, header bidder, four SSPs, an exchange, a publisher ad server, a CDN, verification, attribution. The exchange is the marketplace. It owns no campaigns, budgets, or creatives — those all live in DSPs. Its job: take supply from SSPs, pick the right DSPs to ask, run a fair first-price auction, return markup, track impressions on its own books, reconcile settlement. At 1M requests/sec the things that matter are: don't fan out to all 50 DSPs (pick the top 5), don't put a network hop in front of every enrichment lookup (cache in-process), don't log every losing bid to Kafka (sample), and never trust a DSP's own impression count for billing.
1. How One Ad Actually Gets Served
A user in Austin looked at $130 Nike running shoes on Saturday night, added them to the cart, closed the tab, and forgot. Sunday morning they open ESPN on their phone, tap into a Cowboys-Giants recap, and ~110ms later those exact shoes show up in a banner ad. They tap and buy.
Fourteen companies pull that off in the tenth of a second between page load and first paint. This post designs one of them — the ad exchange. The table below names the other thirteen so the rest of the post has something to point at.
The cast
| Phase | Actor | Role |
|---|---|---|
| Pre-auction setup | Nike | The advertiser |
| Pre-auction setup | WPP | Nike's agency, runs day-to-day media buying |
| Pre-auction setup | CM360 | Stores creatives, budgets, flight dates, Floodlight tracking pixels |
| Pre-auction setup | Segment (CDP) | Captures added_to_cart on nike.com; builds the cart_abandoners_shoes_7d audience |
| Pre-auction setup | LiveRamp (identity graph) | Ties the user's hashed email to a RampID that links laptop cookie ↔ iPhone IDFA |
| Pre-auction setup | The Trade Desk (DSP) | Ingests the audience; bids in the auction |
| Runtime — publisher | ESPN | Owns the page and the 300×250 slot |
| Runtime — publisher | Prebid.js | Header-bidding wrapper in the browser, calls 4 SSPs in parallel |
| Runtime — publisher | Magnite, PubMatic, Index, OpenX | The four SSPs Prebid hits |
| Runtime — publisher | Google Ad Manager | ESPN's ad server; picks between direct deals and the winning Prebid bid |
| Runtime — exchange | Google AdX | The exchange — this system |
| Runtime — delivery | CloudFront | Serves the banner image from an Austin edge POP |
| Measurement | IAS | Viewability beacon fired from inside the creative |
| Measurement | GA4 + Floodlight | Attribution pixels on the order-confirmation page |
Google shows up under five names (Ads, DV360, AdX, GAM, CM360), but the products are distinct.
How the request actually flows
The night before
- Segment's JavaScript on nike.com sees the cart-add and the eventual session timeout, and writes the user into the cart_abandoners_shoes_7d audience.
- Every fifteen minutes Segment reverse-ETLs new audience members into The Trade Desk. By the time the user goes to bed, TTD already knows about them.
- LiveRamp attaches a RampID to the user's hashed email so the laptop cookie and the iPhone IDFA resolve to the same identity the next morning.
Sunday morning, the page loads
- The browser requests espn.com/nfl/cowboys-giants-recap. ESPN returns HTML and JavaScript.
- Prebid.js initializes and identifies the 300×250 slot in the article body:
<article>Cowboys quarterback Dak Prescott threw for 301 yards...</article>
<div id="div-gpt-ad-midarticle"></div>
<article>In the second half, the Giants...</article>
<script>
  pbjs.addAdUnits([{
    code: 'div-gpt-ad-midarticle',
    mediaTypes: { banner: { sizes: [[300, 250]] } },
    bids: [
      { bidder: 'magnite', params: { accountId: 'espn-001' } },
      { bidder: 'pubmatic', params: { publisherId: 'espn-002' } },
      { bidder: 'ix', params: { siteId: 'espn-003' } },
      { bidder: 'openx', params: { unit: 'espn-004' } }
    ]
  }]);
</script>

- Prebid fans out to all four SSPs in parallel.
- Each SSP enforces ESPN's $3 sports-vertical floor, attaches an schain object, and forwards into one or more exchanges. Magnite forwards into AdX.
Inside AdX, the auction
- Enrichment: an in-process LRU lookup returns consent state, a cookie-sync map ({ttd: ttd_abc, dv360: dv_xyz, amzn: amzn_123}), and the user's RampID. Cache hit, no network.
- Pre-bid fraud filter: IP reputation, ASN check, user-agent sanity. All clean.
- Smart DSP selection: 50 DSPs are registered but only five get asked. AdX scores each one on geo, format, vertical, win rate, and capacity, and picks TTD, DV360, Amazon DSP, Criteo, and Xandr.
- Fan-out: bid requests go out in parallel over HTTP/2 with a 60ms timeout.
TTD's bid (the winner)
- Audience match: user is in cart_abandoners_shoes_7d.
- Campaign match: Pegasus 41 Retargeting is eligible.
- Frequency cap: 0/3 today, allowed.
- ML model: pCTR 8% (very high — fresh cart abandoner looking at the same product), pCVR 12%, expected value ~$1.25 per impression, maximum CPM well over $1,000.
- Shading model: ESPN sports impressions usually clear around $8, so TTD bids $8.50.
- Response carries the price plus ready-to-render HTML for the Pegasus banner with a Floodlight pixel embedded.
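The valuation figures in the ML-model step follow from a plain expected-value product. A minimal Python sketch, assuming the conversion is valued at the $130 order price (the post does not say which value TTD's model actually uses, and the function names here are illustrative):

```python
def expected_value_per_impression(p_ctr: float, p_cvr: float, order_value: float) -> float:
    """Expected revenue from one impression: P(click) * P(convert | click) * value."""
    return p_ctr * p_cvr * order_value


def break_even_cpm(ev_per_impression: float) -> float:
    """CPM is priced per thousand impressions."""
    return ev_per_impression * 1000


ev = expected_value_per_impression(p_ctr=0.08, p_cvr=0.12, order_value=130.0)
ceiling = break_even_cpm(ev)
print(f"EV per impression: ${ev:.2f}")        # $1.25
print(f"Break-even CPM:    ${ceiling:,.0f}")  # $1,248
```

The distance between that ceiling and the $8.50 actually submitted is what the shading model contributes: the DSP bids near the expected clearing price rather than near its own valuation.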
The other four DSPs
- DV360: $5.00 (generic shoe retargeting).
- Criteo: $6.50 (also tracks nike.com).
- Amazon DSP: $4.00 (Nike sells on Amazon).
- Xandr: 204 no-bid.
Resolution and render
- AdX picks TTD at $8.50, substitutes the ${AUCTION_PRICE} macro with an encrypted price token, validates the creative against ESPN's blocked-categories list, and returns the markup to Magnite.
- Prebid compares all four SSPs (PubMatic $7.00, Index $5.00, OpenX $3.50) and picks Magnite as the overall winner.
- GAM checks direct deals: Ford has a homepage sponsorship that doesn't apply to NFL articles, Progressive's guaranteed deal is desktop-only. Prebid's $8.50 beats the line-item stack.
- The browser fetches the banner from CloudFront's Austin POP (~12ms cache hit). IAS's IntersectionObserver beacon starts watching the slot.
- Ad becomes visible roughly 110ms after the page started rendering.
After the impression
- Exchange impression pixel, Floodlight pixel, and IAS viewability beacon all fire. IAS confirms 60% of pixels in view for over a second, counts it as viewable per MRC.
- The user taps the banner two seconds later. The click hits the exchange's /t/click endpoint, gets logged to Kafka, and 302-redirects to nike.com/pegasus-41?utm_source=ttd&utm_medium=retargeting.
- The cart is still there. The user buys.
- The order-confirmation page fires the Floodlight conversion pixel. CM360 attributes the $130 purchase back to the TTD click on ESPN.
Where the $8.50 actually goes
Of the $8.50 the advertiser pays, only $5.30 makes it to ESPN. The rest splits among the intermediaries:
| Actor | Cut |
|---|---|
| Nike (advertiser, gross) | $8.50 |
| The Trade Desk (DSP fee) | $1.00 |
| LiveRamp (identity match) | $0.15 |
| Google AdX (exchange take rate) | $0.80 |
| Magnite (SSP fee) | $1.15 |
| IAS (verification) | $0.10 |
| ESPN (publisher net) | $5.30 |
CloudFront bandwidth is rounded into ESPN's hosting bill. About 38% of the gross ($3.20 of $8.50) goes to ad-tech middlemen, the number publishers point at when they argue for Supply Path Optimization (collapsing redundant SSP and exchange hops).
Not counted: WPP's agency commission (separate, ~10–15% of media spend), Segment subscription, CM360 license, and any flat fees. Those are negotiated outside the auction.
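The split in the table can be checked mechanically. This Python sketch just re-adds the rows from the table above; `Decimal` is used so the cents stay exact, and the variable names are illustrative:

```python
from decimal import Decimal

GROSS = Decimal("8.50")
FEES = {
    "The Trade Desk (DSP fee)": Decimal("1.00"),
    "LiveRamp (identity match)": Decimal("0.15"),
    "Google AdX (exchange take rate)": Decimal("0.80"),
    "Magnite (SSP fee)": Decimal("1.15"),
    "IAS (verification)": Decimal("0.10"),
}

intermediary_total = sum(FEES.values(), Decimal("0"))
publisher_net = GROSS - intermediary_total
intermediary_share = intermediary_total / GROSS

print(f"Intermediaries: ${intermediary_total}")
print(f"Publisher net:  ${publisher_net}")
print(f"Share of gross: {float(intermediary_share):.1%}")
```

Removing one hop (for example, a reseller SSP) removes one row from `FEES`, which is the whole argument for Supply Path Optimization in one line.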
Things people get wrong
| Myth | Reality |
|---|---|
| "Google Ads", DV360, AdX, GAM, and CM360 are the same product | Five different Google products. Ads is the SMB DSP, DV360 the enterprise DSP, AdX the exchange (this post), GAM the publisher ad server, CM360 the campaign manager. |
| The SSP runs the auction | It doesn't. The SSP packages publisher inventory and forwards to the exchange. GAM is both an SSP and an ad server; some SSPs (Magnite) run internal sub-auctions before forwarding downstream. |
| Campaigns, budgets, and frequency caps live in the exchange | They live in DSPs. The exchange only sees bids and no-bids. Creatives, pacing, and ML bid optimization all sit on the DSP side. |
| Prebid.js is an SSP | It's a header-bidding wrapper that runs in the browser and calls multiple SSPs in parallel. |
| A CDP (Segment) is the same as an identity graph (LiveRamp) | CDP = first-party events on a single property. Identity graph = cross-device resolution and (historically) third-party segments. |
2. Problem Statement
Online advertising pays for most of the open web. Every page load triggers an auction that has to resolve before content finishes rendering — roughly 100ms from when the request leaves the publisher to when winning markup comes back. Miss the budget and the slot stays empty: publisher loses revenue, advertiser loses reach.
The exchange sits in the middle. It takes supply from SSPs, picks DSPs to ask, runs a first-price auction, returns the winner, tracks the impression independently, and settles money at the end of the day. What makes it hard is the combination of latency, fan-out, and real money on every auction.
Latency is the dominant constraint. Naive fan-out to all 50 DSPs costs 50 requests × ~1 KB × 1M QPS = 50 GB/sec outbound, and every auction waits on the slowest of fifty bidders. The realistic baseline is scoring each DSP per request and fanning out only to the 5 most likely to bid competitively (see §10.2).
Scale matters even after that trim. 1M QPS sustained, 2M peak during US prime time, 5M outbound bid requests per second, ~100 auction-server pods across three regions. Every hot-path lookup must be sub-millisecond.
The money invariant: bids are quoted as CPM, so an auction bug that lets a $0.10 bid beat a $2.00 bid leaks $1.90 per thousand impressions. At 300K imps/sec that's $570/sec, roughly $50M a day. Auction integrity is the product, not a nice-to-have. Winning bids log at 100%, losing bids are sampled, and DSP-reported numbers are reconciled nightly.
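The $570/sec figure only works if the bids are read as CPM (price per thousand impressions), so the bid gap has to be divided by 1,000 before multiplying by the impression rate. A short Python check, assuming the worst case where the bug affects every impression:

```python
SECONDS_PER_DAY = 86_400


def leak_per_second(correct_cpm: float, wrong_cpm: float, imps_per_sec: float) -> float:
    """Revenue lost per second when a lower CPM bid wrongly wins every auction."""
    gap_per_impression = (correct_cpm - wrong_cpm) / 1000  # CPM -> dollars per impression
    return gap_per_impression * imps_per_sec


per_sec = leak_per_second(correct_cpm=2.00, wrong_cpm=0.10, imps_per_sec=300_000)
per_day = per_sec * SECONDS_PER_DAY
print(f"${per_sec:,.0f}/sec, ${per_day / 1e6:,.1f}M/day")
```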
Underneath all of that sit the long-running concerns. Each has to be handled in under 2ms of combined pre-bid overhead:
- Bot traffic, click farms, datacenter-hosted "users": pre-bid IP/ASN filters
- Domain spoofing: ads.txt, sellers.json, and the schain object close this loop
- Headless browsers: WebGL/canvas fingerprints are the giveaway
- Consent (GDPR, CCPA): PII stripped on demand; a single GDPR violation can cost up to 4% of global annual revenue
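To make the domain-spoofing check concrete, here is a rough Python sketch of the ads.txt half of it: parse the publisher's file and confirm the seller account on the incoming request is listed. The record layout (ad system domain, seller account ID, DIRECT or RESELLER, optional certification ID) follows the IAB ads.txt format; the sample file, domains, and function names are made up for illustration, and a real check would also walk sellers.json and the schain nodes.

```python
def parse_ads_txt(text: str) -> set[tuple[str, str, str]]:
    """Parse ads.txt records into (ad_system_domain, seller_account_id, relationship)."""
    records = set()
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()          # drop comments
        if not line or "=" in line.split(",", 1)[0]:
            continue                                  # blank line or variable (contact=, subdomain=)
        fields = [f.strip() for f in line.split(",")]
        if len(fields) < 3:
            continue                                  # malformed record
        domain, account_id, relationship = fields[0].lower(), fields[1], fields[2].upper()
        if relationship in ("DIRECT", "RESELLER"):
            records.add((domain, account_id, relationship))
    return records


def seller_is_authorized(records, ad_system_domain: str, seller_id: str) -> bool:
    """True if the publisher's ads.txt lists this seller account on this ad system."""
    return any(d == ad_system_domain.lower() and a == seller_id for d, a, _ in records)


# Hypothetical ads.txt content, for illustration only.
SAMPLE = """
# ads.txt for a publisher
magnite.example, espn-001, DIRECT, abc123
pubmatic.example, espn-002, RESELLER
contact=adops@publisher.example
"""
records = parse_ads_txt(SAMPLE)
print(seller_is_authorized(records, "magnite.example", "espn-001"))     # True
print(seller_is_authorized(records, "magnite.example", "spoofed-999"))  # False
```

In the hot path this lookup would run against a cached, pre-parsed copy of the file rather than fetching it per request.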
Quick numbers to anchor the rest of the post:
| Metric | Target |
|---|---|
| Ad requests per second (sustained) | 1,000,000 |
| Ad requests per second (peak) | 2,000,000 |
| Impressions per second (30% fill) | ~300,000 |
| Impressions per day | ~25 billion |
| DSPs registered | 50+ |
| DSPs per auction (top-N) | 5 |
| Auction latency p99 | < 100 ms |
| DSP response timeout | 60 ms |
| Monthly spend through exchange | ~$4 billion |
| Exchange take rate | 10–15% |
What we're deliberately avoiding
- Don't fan out to every DSP. Top-5 smart selection cuts outbound traffic 10× for a <2% fill-rate hit.
- Don't put Valkey in the hot path for every enrichment lookup. An in-process LRU fronts it and absorbs >95% of reads.
- Don't log every losing bid to Kafka. Sample losing bids at 1%; tee the full stream to S3/Iceberg for cheap durable storage.
- Don't call DSPs sequentially. Parallel HTTP/2 fan-out with a 60ms deadline.
- Don't trust DSP-reported impression counts for money. The exchange tracks its own and reconciles nightly.
3. Functional Requirements
| ID | Requirement | Priority |
|---|---|---|
| FR-01 | Accept bid requests from SSPs via OpenRTB 2.6 and run first-price sealed-bid auctions in < 100 ms p99 | P0 |
| FR-02 | Smart-select top 5 eligible DSPs per auction and fan out in parallel with 60 ms timeout | P0 |
| FR-03 | Enforce publisher floor prices and block-list (categories/advertisers) per publisher config | P0 |
| FR-04 | Return winning ad markup (HTML for banner, VAST XML for video) to the SSP within budget | P0 |
| FR-05 | Serve ad creatives through CDN edge nodes with cache headers for efficient delivery | P0 |
| FR-06 | Track impressions via server-side pixel (1×1 GIF) with deduplication | P0 |
| FR-07 | Track clicks via redirect URL with destination validation | P0 |
| FR-08 | Enforce exchange-level creative dedup (max N impressions of same creative per user per hour) as an ad-quality measure. Per-campaign frequency capping is a DSP responsibility. | P1 |
| FR-09 | Track per-DSP spend in near-real-time via Flink streaming for credit-limit enforcement and settlement | P0 |
| FR-10 | Validate supply chain: ads.txt, sellers.json, and schain object on every request | P0 |
| FR-11 | Check consent (TCF / US Privacy string) and strip PII from bid requests when required | P0 |
| FR-12 | Publish tracking events (impressions, clicks, viewability, auction results) to Kafka for billing and analytics | P0 |
| FR-13 | Reconcile exchange-tracked impressions with DSP-reported impressions daily; flag discrepancies > 0.01% | P1 |
| FR-14 | Provide publisher and DSP management APIs (floor prices, DSP onboarding, settlement reports) | P1 |
| FR-15 | Support banner, video (VAST 4.2), and native ad formats | P0 |
| FR-16 | Pre-bid fraud filtering: IP reputation, user-agent signature, datacenter detection, ASN reputation | P0 |
4. Non-Functional Requirements
| Dimension | Target |
|---|---|
| Auction latency (p50) | < 50 ms |
| Auction latency (p99) | < 100 ms |
| Fill rate | > 30% (varies by publisher and market) |
| Availability | 99.95% (4.4 hours/year planned + unplanned downtime) |
| Tracking pipeline loss | < 0.01% event loss end-to-end |
| Billing accuracy (reconciled) | ±0.01% of DSP-reported impressions |
| CDN cache hit rate | > 95% |
| DSP connection pool warm starts | All DSPs kept warm via periodic health pings |
| Multi-region failover | < 60 seconds (DNS-based geo failover) |
| Deployment rollback | < 5 minutes for any component |
5. High-Level Approach & Technology Selection
5.1 The full ecosystem
| Layer | Role | Examples |
|---|---|---|
| Advertiser | Pays for ads. Sets goals, budgets, targeting. | Nike, P&G, a local dentist |
| Agency | Runs media buying on behalf of advertisers. Contracts with DSPs. | WPP/GroupM, Publicis, Omnicom |
| Campaign Manager | Stores creatives, flight dates, budget rules, attribution tags. Publishes campaigns to DSPs. | Google Campaign Manager 360 (CM360), Adobe Advertising |
| CDP | Captures first-party events from advertiser sites. Builds audiences. Syncs to DSPs. | Segment, mParticle, Treasure Data |
| Identity graph / DMP | Resolves user identity across devices and cookies. Provides stable cross-device IDs. | LiveRamp (RampID), Neustar Fabrick, ID5 |
| DSP | Receives bid requests from exchanges. Runs bid optimization ML. Decides to bid and at what price. Owns campaign budgets and frequency caps. | The Trade Desk, DV360, Amazon DSP, Criteo, Xandr |
| Ad Exchange | This system. Runs the auction. Receives supply from SSPs, selects and fans out to DSPs, picks a winner. | Google AdX, OpenX, Index Exchange, PubMatic, Magnite |
| SSP | Packages publisher inventory. Enforces floor prices, brand safety rules. Forwards bid requests to exchanges and direct DSPs. | Magnite, PubMatic, Index Exchange, OpenX, Xandr Monetize |
| Header bidder | Client-side JavaScript that calls multiple SSPs in parallel from the browser, then picks the highest bid. | Prebid.js, Amazon TAM |
| Publisher ad server | Owns the ad slot. Decides between direct deals, guaranteed deals, and programmatic (Prebid) bids. | Google Ad Manager (GAM), Kevel, FreeWheel |
| Publisher | Owns the website or app. Gets paid per impression. | ESPN, CNN, NYT, mobile game developers |
| Verification | Measures viewability, brand safety, invalid traffic. Runs JavaScript beacons. | Integral Ad Science (IAS), DoubleVerify, MOAT |
| Attribution & analytics | Tracks conversions. Attributes them to impressions and clicks. | Google Analytics 4, Floodlight (CM360), Adjust (mobile), AppsFlyer (mobile) |
| CDN | Serves creative assets from edge POPs. | CloudFront, Fastly, Akamai, Cloudflare |
Boundary this post uses: campaigns, budgets, creatives, bid optimization, frequency capping, and conversion attribution all live inside DSPs. The exchange is a stateless marketplace that sees only bid requests, bids, wins, impressions, and clicks. It earns a 10–15% take rate on each cleared auction.
5.2 First-price auctions
The industry shifted from second-price to first-price around 2017–2019 because second-price invited "last look" abuse: exchanges that saw all bids could let favored buyers re-bid one cent above the clearing price. First-price killed that ambiguity — the winner pays what the winner said, and there's nothing for the exchange to manipulate.
The one tradeoff: DSPs now have to bid below their true valuation (bid shading) to avoid overpaying. Shading lives inside the DSP; the exchange doesn't see it.
| | Second-price (legacy) | First-price (current) |
|---|---|---|
| Winner pays | Second-highest + $0.01 | Their own bid |
| DSP strategy | Bid truthfully | Bid shade (0.5–0.85 × true value) |
| Exchange complexity | Higher (track top-2 bids) | Lower (track max bid) |
| Transparency | Low (exchanges could manipulate) | High |
| 2024+ adoption | Declining | Dominant |
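The "winner pays" row is the only rule that differs between the two auction types, which a few lines of Python make visible. This is a sketch under simple assumptions: prices are integer cents of CPM, bids below the floor are discarded, and a lone second-price bidder pays the floor. The function name is illustrative.

```python
def clearing_price_cents(bids_cents: list[int], floor_cents: int, first_price: bool = True):
    """Return what the winner pays in cents, or None when nothing clears the floor."""
    eligible = sorted((b for b in bids_cents if b >= floor_cents), reverse=True)
    if not eligible:
        return None
    if first_price:
        return eligible[0]                    # winner pays its own bid
    if len(eligible) == 1:
        return floor_cents                    # no runner-up: pay the floor
    return min(eligible[0], eligible[1] + 1)  # second-highest + $0.01


bids = [850, 500, 650, 400]  # the four bids from the §1 walkthrough, in cents CPM
print(clearing_price_cents(bids, floor_cents=300, first_price=True))   # 850
print(clearing_price_cents(bids, floor_cents=300, first_price=False))  # 651
```

The same bids clear at $8.50 under first-price and $6.51 under second-price, which is why DSPs that used to bid truthfully now shade.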
5.3 OpenRTB 2.6
Every DSP, SSP, and exchange speaks OpenRTB 2.6. The protocol defines the wire format for bid requests, bid responses, win notices, and the supporting objects (Site, App, User, Device, Imp, Bid). Real payloads in §9.1 and §9.2.
One detail that matters downstream: the adm field in a bid response carries render-ready markup (HTML for banner, VAST XML for video), not a URL. The DSP provides executable markup; the exchange substitutes a few macros (auction price, click URL, impression URL) before forwarding.
| Object | Purpose | Key Fields |
|---|---|---|
| BidRequest | Top-level request from exchange to DSP | id, imp[], site/app, user, device, regs, tmax |
| Imp | One ad slot | id, banner/video/native, bidfloor, pmp |
| Site/App | Publisher context | domain, page, cat[], publisher |
| User | User targeting | id, buyeruid, geo, data[], consent |
| Device | Device info | ua, ip, geo, devicetype, os |
| BidResponse | DSP's response | id, seatbid[], cur |
| Bid | Individual bid | id, impid, price, adm (creative markup), crid, adomain[] |
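Macro substitution on the adm field is plain string replacement. The Python sketch below handles two standard OpenRTB macros, `${AUCTION_ID}` and `${AUCTION_PRICE}`; the markup, the DSP URL, and the price token are placeholders, since the post does not show the exchange's actual token scheme.

```python
def substitute_macros(adm: str, values: dict[str, str]) -> str:
    """Replace ${NAME} macros in DSP-supplied markup; unknown macros are left untouched."""
    for name, value in values.items():
        adm = adm.replace("${" + name + "}", value)
    return adm


# Hypothetical adm payload from a DSP.
adm = '<img src="https://dsp.example/win?id=${AUCTION_ID}&p=${AUCTION_PRICE}">'
rendered = substitute_macros(adm, {
    "AUCTION_ID": "auc_01HXYZ123",
    "AUCTION_PRICE": "ENCRYPTED_PRICE_TOKEN",  # placeholder for the encrypted clearing price
})
print(rendered)
```

Values that land inside URLs should be URL-safe before substitution; an encrypted price token is typically emitted in a URL-safe encoding for that reason.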
5.4 The components inside the exchange
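Eight components. The auction server, with the ad server and the DSP-selection library alongside it, owns the auction hot path; the tracking endpoint has its own latency-sensitive request path; everything else is asynchronous.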
| Service | Role | Hot path? | Stack |
|---|---|---|---|
| Auction server | Accepts OpenRTB from SSPs, enriches, runs DSP selection + fan-out + first-price auction, returns winning markup | ✅ | Go + in-process LRU |
| Ad server | Macro substitution on winning markup; assembles VAST XML for video | ✅ | Go (same binary or sidecar) |
| Tracking endpoint | Impression pixels, click redirects, viewability beacons. Dedupes against Valkey, produces to Kafka async, returns the pixel fast | ✅ | Go (separate pod group) |
| Creative dedup | Valkey counter that caps the same creative to N/user/hour (ad-quality, not per-campaign freq capping) | 🟡 | Valkey |
| Flink spend aggregator | Consumes impressions keyed by DSP; writes running spend totals to Valkey for credit-limit checks; computes rolling win rates feeding DSP selection | ❌ | Flink + Valkey |
| Daily reconciliation | Aggregates ClickHouse per DSP and per publisher; reconciles against DSP-reported numbers; writes settlement records | ❌ | Batch job |
| Management API | CRUD for publisher configs, DSP onboarding, settlement queries | ❌ | Go + Postgres |
| DSP selection | Not a separate service — a library inside the auction server that scores DSPs on geo, format, vertical, win rate, capacity | ✅ | Go library |
Per-campaign frequency capping and ML bid optimization live in DSPs, not here.
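For a feel of what the DSP-selection library does, here is a deliberately simplified Python sketch: hard-filter on eligibility, then keep the top five by score. The score (rolling win rate times remaining QPS headroom) and every field name are assumptions made for illustration; §10.2 describes the actual scoring.

```python
import heapq
from dataclasses import dataclass


@dataclass
class DspStats:
    name: str
    geos: set
    formats: set
    win_rate: float        # rolling win rate for the request's vertical, 0..1
    qps_headroom: float    # fraction of max_qps still unused, 0..1
    circuit_open: bool = False


def select_dsps(dsps, geo: str, ad_format: str, top_n: int = 5):
    """Hard-filter on eligibility, then keep the top_n DSPs by a simple score."""
    eligible = [
        d for d in dsps
        if not d.circuit_open and geo in d.geos and ad_format in d.formats and d.qps_headroom > 0
    ]
    # Hypothetical score: likely winners first, discounted as they approach their QPS cap.
    return heapq.nlargest(top_n, eligible, key=lambda d: d.win_rate * d.qps_headroom)


dsps = [
    DspStats("ttd", {"US"}, {"banner"}, 0.30, 0.8),
    DspStats("dv360", {"US", "DE"}, {"banner", "video"}, 0.25, 0.9),
    DspStats("eu_only", {"DE"}, {"banner"}, 0.90, 1.0),
    DspStats("tripped", {"US"}, {"banner"}, 0.95, 1.0, circuit_open=True),
]
picked = select_dsps(dsps, geo="US", ad_format="banner")
print([d.name for d in picked])  # ['ttd', 'dv360']
```

Filtering before scoring keeps the per-request work proportional to the number of eligible DSPs, not the number registered.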
5.5 Storage
| Store | Technology | Rationale |
|---|---|---|
| Enrichment hot-path cache | In-process LRU (ristretto) | 5-second TTL. Serves > 95% of enrichment reads with zero network hops. |
| Enrichment cold-path | Valkey Cluster | Sub-ms reads on cache miss. Sharded by user ID. Background-synced from PostgreSQL + event streams. |
| Auction event log (sampled) | Kafka | Durable event stream. 100% of impressions and winning bids; 1% sample of losing bids. |
| Real-time analytics & billing | ClickHouse | Columnar analytics on billions of rows. Sub-second aggregation for dashboards. |
| Exchange configuration | PostgreSQL | DSP configs, publisher settings, SSP registrations, settlement records. Read replicas for auction servers. |
| Bid-level archive | S3 / Iceberg (Parquet) | Long-term storage of winning bids and sampled losses. For billing disputes and ML training. |
| Creative assets | S3 + CloudFront | Originless serving via CDN with > 95% cache hit rate. |
5.6 Why Go
Go fits the fan-out pattern: each DSP call is a goroutine, the timeout is a context, and the auction starts as soon as the last bid arrives or the deadline fires. Sub-millisecond GC pauses (Go 1.22+) stay under the auction budget. The stdlib HTTP/2 client has built-in connection pooling and multiplexing — no third-party dependency for the hot path.
Rust gives zero GC and slightly better tails but costs hiring velocity. Java with Netty and virtual threads runs fine at scale; G1GC tuning at tail percentiles is fiddly but several large exchanges ship on the JVM.
5.7 Why ClickHouse
At 25 billion impressions a day, dashboard queries aggregate over 100B+ rows ("spend by DSP by publisher by hour for the last 7 days"). ClickHouse handles that in single-digit seconds. Druid is the other serious option but adds operational complexity; BigQuery's per-query cost gets painful at this scale; Postgres can't move enough rows.
Kicker: ClickHouse's native Kafka table engine makes ingestion nearly zero-code: a Kafka-engine table plus a materialized view streams the topic into a MergeTree table, with no separate consumer service to run.
6. High-Level Architecture
6.1 Multi-region bird's eye
Three regions: us-east-1, eu-west-1, ap-south-1. Geo-DNS or Anycast routes each SSP request to the nearest region. Inside a region everything is stateless or regionally-sharded. Cross-region state (DSP configs, publisher settings, billing records) lives in PostgreSQL with logical replication from a single primary in us-east-1.
6.2 Load-bearing decisions
| Decision | Why | What breaks without it |
|---|---|---|
| Stateless auction servers | Config cached with a 30s refresh, plus an async-refreshed LRU. Any pod serves any request. | Horizontal scaling becomes session-affinity hell. |
| L4 balancing, TLS in-pod | TLS termination at the LB is too expensive at 1M QPS. L4 hands TCP via consistent hashing; TLS decodes inside the pod, parallelized across cores. | LB becomes the single hottest box. |
| In-process LRU in front of Valkey | >95% hit rate at peak (same users appear in many simultaneous auctions). 5s TTL keeps staleness in check. Cache misses fall through to Valkey. | Valkey would have to handle ~3M ops/sec (3 lookups per auction × 1M QPS; see §7.4) and wire latency alone would blow the auction budget. |
| Fan-out inside the auction server | Persistent HTTP/2 pools to every registered DSP. Saves a network hop vs. a separate fan-out tier; tighter control over per-DSP timeouts and circuit breakers. | Extra hop eats 2–5ms of the budget. |
| Kafka as the universal event bus | Winning bids, impressions, clicks, viewability, config updates all flow through Kafka. Auction server produces async, never waits for ack on the hot path. | Mixed sync/async paths multiply failure modes. |
| Sampled bid logging | 100% of impressions + winning bids; 1% of losing bids. Full stream tees to S3/Iceberg via Kafka Connect. | Kafka ingests ~15 GB/sec (3× replication); operationally painful. |
| CDN-first creative delivery | Creatives live on S3 + CloudFront. Auction server never touches creative bytes; it returns markup pointing at a CDN URL. | Origin melts during creative rotation. |
The load-bearing row is the LRU. Without it, the auction budget is spent entirely on Valkey round-trips — everything else in the table is cheaper to fix than that.
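The cache pattern itself is small. Below is a single-threaded Python sketch of a bounded LRU with a per-entry TTL that falls through to a loader on miss or expiry; the loader stands in for the Valkey read, and the class and function names are illustrative, not the library the exchange uses.

```python
import time
from collections import OrderedDict


class TtlLruCache:
    """Bounded LRU whose entries also expire after ttl_seconds."""

    def __init__(self, capacity: int, ttl_seconds: float, clock=time.monotonic):
        self.capacity, self.ttl, self.clock = capacity, ttl_seconds, clock
        self._data = OrderedDict()  # key -> (expires_at, value)

    def get(self, key, loader):
        """Return the cached value, or call loader(key) on a miss or an expired entry."""
        now = self.clock()
        hit = self._data.get(key)
        if hit is not None and hit[0] > now:
            self._data.move_to_end(key)      # mark as most recently used
            return hit[1]
        value = loader(key)                  # cold path: one network round-trip
        self._data[key] = (now + self.ttl, value)
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # evict the least recently used entry
        return value


# Usage with a fake clock and a loader that stands in for the Valkey read.
now = [0.0]
loads = []


def load_from_valkey(key):
    loads.append(key)
    return {"consent": "granted", "uid": key}


cache = TtlLruCache(capacity=2, ttl_seconds=5.0, clock=lambda: now[0])
cache.get("u1", load_from_valkey)  # miss: loader runs
cache.get("u1", load_from_valkey)  # hit: served in-process
now[0] = 6.0
cache.get("u1", load_from_valkey)  # expired after 5 s: loader runs again
print(loads)                       # ['u1', 'u1']
```

A production version needs locking or sharding per core; the point here is only that a hit costs a dictionary lookup while a miss costs a network hop.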
6.3 Auction flow, happy path
6.4 Auction flow, timeout
When the 60ms DSP timeout fires, the auction server runs the auction with whatever bids have arrived. Zero bids above floor means a no-fill response back to the SSP. Slow DSPs feed into a per-DSP circuit breaker, covered in §12.2.
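The timeout behaviour can be sketched in a few lines. The production server described here is Go, but the same shape in Python's asyncio is: start every DSP call, wait until the deadline, cancel the stragglers, and run first-price over whatever came back. The fake DSP calls, names, and prices are illustrative only.

```python
import asyncio


async def collect_bids(dsp_calls: dict, timeout_s: float) -> dict:
    """Run all DSP calls in parallel; return only the bids that arrive before the deadline."""
    tasks = {asyncio.ensure_future(coro): name for name, coro in dsp_calls.items()}
    done, pending = await asyncio.wait(tasks.keys(), timeout=timeout_s)
    for task in pending:
        task.cancel()                         # late bidders are dropped, not awaited
    bids = {}
    for task in done:
        if task.exception() is None and task.result() is not None:
            bids[tasks[task]] = task.result()
    return bids


def resolve_first_price(bids: dict, floor: float):
    """Highest bid at or above the floor wins and pays its own bid; None means no-fill."""
    eligible = {name: price for name, price in bids.items() if price >= floor}
    if not eligible:
        return None
    winner = max(eligible, key=eligible.get)
    return winner, eligible[winner]


async def fake_dsp(price, delay_s):
    """Stand-in for an HTTP bid call; price=None models a 204 no-bid."""
    await asyncio.sleep(delay_s)
    return price


async def demo():
    calls = {
        "fast": fake_dsp(8.50, 0.01),
        "nobid": fake_dsp(None, 0.01),
        "slow": fake_dsp(20.00, 0.50),        # would have won, but misses the deadline
    }
    bids = await collect_bids(calls, timeout_s=0.06)
    return resolve_first_price(bids, floor=3.00)


print(asyncio.run(demo()))  # ('fast', 8.5)
```

The slow DSP's $20 bid never counts: once the deadline fires, the auction is resolved with the bids in hand, which is exactly why persistent slowness has to feed a circuit breaker.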
7. Back-of-the-Envelope Sizing
7.1 Request volume
Sustained: 1,000,000 QPS
Peak: 2,000,000 QPS (US evening prime time)
Design for: 1,500,000 QPS with headroom
Per day: 1M × 86,400 ≈ 86 billion bid requests/day
Fill rate: 30%
Impressions: 86B × 0.30 ≈ 26 billion/day
≈ 300,000 impressions/sec
7.2 DSP fan-out
Naive (fan out to all 50 DSPs): 1M × 50 = 50M bid requests/sec
Top-5 smart selection: 1M × 5 = 5M bid requests/sec
Bid request: ~1 KB (OpenRTB JSON, gzipped on the wire)
Bid response: ~0.5 KB
Outbound: 5M × 1 KB = 5 GB/sec
Inbound: 5M × 0.5 KB = 2.5 GB/sec
Total: ~7.5 GB/sec across all auction servers
The top-5 selection is what makes the bandwidth (and the per-DSP cost) tractable. Without it the exchange is wasting an order of magnitude on requests no DSP would have bid on anyway. §10.2 has the scoring algorithm.
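The fan-out numbers are easy to reproduce and to re-run with different assumptions. This Python sketch uses the payload sizes stated above (~1 KB request, ~0.5 KB response) and decimal units (1 GB = 1e6 KB):

```python
QPS = 1_000_000
REQUEST_KB, RESPONSE_KB = 1.0, 0.5


def fanout_gb_per_sec(dsps_per_auction: int) -> tuple[float, float]:
    """Outbound and inbound GB/sec for a given fan-out width."""
    calls = QPS * dsps_per_auction
    return calls * REQUEST_KB / 1e6, calls * RESPONSE_KB / 1e6


for n in (50, 5):
    out_gb, in_gb = fanout_gb_per_sec(n)
    print(f"fan-out {n:>2}: {out_gb:.1f} GB/s out, {in_gb:.1f} GB/s in, {out_gb + in_gb:.1f} GB/s total")
```

The inbound figure is an upper bound, since it assumes every selected DSP answers with a full bid rather than a 204 no-bid.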
7.3 Auction server sizing
Per-auction latency budget:
LRU cache hit (95%): 0.1 ms
Valkey cold (5%): 1.0 ms (amortized 0.05 ms)
Pre-bid filter: 1.0 ms
DSP selection: 1.0 ms
DSP fan-out (parallel, 60 ms timeout): 40 ms avg
Auction logic: 0.5 ms
Macro sub + response: 1.0 ms
Total p50: ~45 ms
Total p99: ~80 ms
Per pod (c6g.4xlarge, 16 vCPU, 32 GB RAM):
Concurrent in-flight auctions: ~2,000
QPS per pod: ~25,000
Pods needed:
Sustained 1M / 25K = 40 pods
Peak 2M / 25K = 80 pods
Deploy 100 pods across 3 regions (40 US + 30 EU + 30 APAC) with HPA to 2x
7.4 Cache and Valkey
Enrichment lookups per auction: 3 logical keys
- user consent + cookie-sync (1 hash)
- IP/UA fraud flags (1 set membership)
- DSP credit-limit flags (1 hash)
At 1M QPS:
LRU hits (95%): 2.85M logical lookups/sec in-process, zero network
Valkey cold (5%): 150K ops/sec, trivial for a Valkey cluster
Valkey working set:
Active users (30-day): ~200 million
Per-user entry: ~150 bytes (consent + cookie sync)
Total users: 200M × 150 = 30 GB
Fraud lists (IPs + UAs): ~1 GB
DSP credit state: negligible
Creative dedup counters: ~10 GB
Total: ~42 GB
Valkey cluster: 3 primaries (16 GB each) + 3 replicas = 6 nodes per region.
7.5 Kafka (with sampling)
Event topics:
impressions: 300K/sec × 500 bytes = 150 MB/sec
clicks: 3K/sec × 300 bytes = 1 MB/sec
viewability: 300K/sec × 200 bytes = 60 MB/sec
winning_bids: 300K/sec × 800 bytes = 240 MB/sec
losing_bids (1%): 50K/sec × 800 bytes = 40 MB/sec
Total: ~490 MB/sec
× replication 3 = 1.5 GB/sec write throughput
Per day:
490 MB/sec × 86,400 = 42 TB/day ingested
÷ 4 (zstd compression) ≈ 10 TB/day on disk
Hot retention 3 days = 30 TB
Kafka cluster (per region): 10 brokers × 4 TB NVMe = 40 TB
~500 MB/sec ingest per region fits at <30% capacity.
The unsampled bid stream gets teed directly to S3/Iceberg through Kafka Connect, so durable long-term storage doesn't sit on hot Kafka brokers.
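The Kafka estimate can be recomputed from the per-topic rates and sizes listed above. The Python sketch below uses the same inputs, with decimal units and the 4x zstd ratio as stated; change one row to see how a different sampling rate moves the totals.

```python
TOPICS = {                       # events/sec, bytes/event, from the estimate above
    "impressions":  (300_000, 500),
    "clicks":       (3_000, 300),
    "viewability":  (300_000, 200),
    "winning_bids": (300_000, 800),
    "losing_bids":  (50_000, 800),   # 1% sample of ~5M bid responses/sec
}
REPLICATION, COMPRESSION, RETENTION_DAYS = 3, 4, 3

ingest_mb_s = sum(rate * size for rate, size in TOPICS.values()) / 1e6
replicated_gb_s = ingest_mb_s * REPLICATION / 1e3
raw_tb_day = ingest_mb_s * 86_400 / 1e6
disk_tb_day = raw_tb_day / COMPRESSION
retained_tb = disk_tb_day * RETENTION_DAYS

print(f"ingest {ingest_mb_s:.0f} MB/s, replicated {replicated_gb_s:.2f} GB/s")
print(f"{raw_tb_day:.1f} TB/day raw, {disk_tb_day:.1f} TB/day compressed, {retained_tb:.1f} TB retained")
```

The retained figure is for a single copy of the data; with replication factor 3 the cluster-wide disk footprint is three times that.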
7.6 ClickHouse
Ingest rate:
impressions: 300K rows/sec
clicks: 3K rows/sec
winning_bids: 300K rows/sec (separate table)
Total: ~600K rows/sec
Row sizes (after compression):
impression row: ~60 bytes compressed
Per day: 26B × 60 = 1.5 TB/day compressed
90-day hot: 135 TB
Cluster (per region):
4 shards × 3 replicas = 12 nodes
Each: r6g.4xlarge, 4 TB NVMe
Total: 48 TB per region
TTL moves > 30-day data to S3 tiered storage.
7.7 CDN
300K impressions/sec × 200 KB avg creative = 60 GB/sec egress
CDN cache hit rate > 95% → origin pulls < 3 GB/sec
Daily egress: 60 GB/sec × 86,400 ≈ 5 PB/day
Unique creatives: ~500K (top 1% serve 60% of requests, heavy head and long tail)
Creative total storage on origin: 500K × 200 KB = 100 GB
CDN POP cache: ~10 GB hot working set per POP
7.8 Summary
| Resource | Number |
|---|---|
| Auction server pods (global) | 100 |
| Valkey nodes (global, 3 × 6) | 18 |
| Kafka brokers (global, 3 × 10) | 30 |
| ClickHouse nodes (global, 3 × 12) | 36 |
| Outbound DSP bandwidth | ~7.5 GB/sec |
| CDN egress | ~60 GB/sec |
| Monthly AWS + CDN bill (rough) | $8–12M |
| Monthly revenue at 10% take rate | ~$400M |
A reasonable check on the economics: at $4B/month gross spend through the exchange and a 10% take rate, that's about $400M/month in revenue against $8–12M/month in infrastructure. Around 3% of revenue going to compute and bandwidth is what makes the business work. Smart DSP selection, in-process caching, and Kafka sampling are the three things that keep the cost line that low.
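The same check in Python, with one extra sanity number: the average CPM implied by $4B/month over roughly 26 billion impressions a day. The 30-day month is an assumption for the arithmetic.

```python
MONTHLY_SPEND = 4_000_000_000
TAKE_RATE = 0.10
INFRA_LOW, INFRA_HIGH = 8_000_000, 12_000_000
IMPRESSIONS_PER_DAY = 26_000_000_000
DAYS_PER_MONTH = 30  # assumption

revenue = MONTHLY_SPEND * TAKE_RATE
low_share, high_share = INFRA_LOW / revenue, INFRA_HIGH / revenue
implied_cpm = MONTHLY_SPEND / (IMPRESSIONS_PER_DAY * DAYS_PER_MONTH) * 1000

print(f"revenue ${revenue / 1e6:,.0f}M/month, infra {low_share:.0%} to {high_share:.0%} of revenue")
print(f"implied average CPM: ${implied_cpm:.2f}")
```

An implied average around $5 CPM sits below the $8.50 premium sports impression in §1, which is the direction one would expect for a blended average across all inventory.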
8. Data Model
8.1 Auction state machine
8.2 Core tables (PostgreSQL)
The exchange stores publisher settings, DSP configs, SSP registrations, and billing records. Campaigns, budgets, creatives, and frequency caps don't appear here; those live in DSPs.
CREATE TABLE dsp_configurations (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
dsp_name VARCHAR(100) NOT NULL UNIQUE,
bid_endpoint TEXT NOT NULL,
win_notice_endpoint TEXT,
max_qps INT NOT NULL DEFAULT 100000,
timeout_ms INT NOT NULL DEFAULT 60,
allowed_categories TEXT[],
allowed_geos TEXT[],
allowed_formats TEXT[],
seat_id VARCHAR(50),
circuit_breaker JSONB NOT NULL DEFAULT '{"err_threshold": 0.5, "timeout_threshold": 0.3, "window_sec": 60, "cooldown_sec": 30}',
historical_win_rate DECIMAL(5,4) DEFAULT 0,
enabled BOOLEAN NOT NULL DEFAULT true,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE TABLE publishers (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name VARCHAR(255) NOT NULL,
domain VARCHAR(255) NOT NULL UNIQUE,
ssp_id UUID REFERENCES ssp_configurations(id),
floor_price_cents INT NOT NULL DEFAULT 50,
blocked_categories TEXT[],
blocked_advertisers TEXT[],
ads_txt_verified BOOLEAN NOT NULL DEFAULT false,
revenue_share_pct DECIMAL(5,2) NOT NULL DEFAULT 85.00,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE TABLE billing_settlements (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
settlement_date DATE NOT NULL,
dsp_id UUID NOT NULL REFERENCES dsp_configurations(id),
publisher_id UUID REFERENCES publishers(id),
impressions BIGINT NOT NULL,
clicks BIGINT NOT NULL,
gross_spend_cents BIGINT NOT NULL,
exchange_fee_cents BIGINT NOT NULL,
publisher_payout_cents BIGINT NOT NULL,
dsp_reported_impressions BIGINT,
discrepancy_pct DECIMAL(5,4),
status VARCHAR(20) NOT NULL DEFAULT 'PENDING',
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_settlements_dsp ON billing_settlements(dsp_id, settlement_date);
CREATE INDEX idx_settlements_pub ON billing_settlements(publisher_id, settlement_date);
ssp_configurations and creative_audit_log follow the same pattern.
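As a rough illustration of how one billing_settlements row could be derived, the Python sketch below splits gross spend by the publisher's revenue_share_pct and computes discrepancy_pct against the DSP's reported count. It assumes the exchange fee is simply the remainder after the publisher payout, and the FLAGGED status is a hypothetical value; only PENDING appears in the schema.

```python
from decimal import Decimal

DISCREPANCY_THRESHOLD_PCT = Decimal("0.01")  # FR-13: flag discrepancies above 0.01%


def settle(gross_spend_cents: int, revenue_share_pct: Decimal,
           exchange_impressions: int, dsp_reported_impressions: int) -> dict:
    """Build one settlement record from exchange-tracked totals."""
    payout = int(gross_spend_cents * revenue_share_pct / 100)  # truncates to whole cents
    fee = gross_spend_cents - payout                           # payout + fee == gross, always
    if exchange_impressions:
        diff = abs(exchange_impressions - dsp_reported_impressions)
        discrepancy_pct = Decimal(diff) / Decimal(exchange_impressions) * 100
    else:
        discrepancy_pct = Decimal(0)
    return {
        "gross_spend_cents": gross_spend_cents,
        "publisher_payout_cents": payout,
        "exchange_fee_cents": fee,
        "discrepancy_pct": discrepancy_pct,
        "status": "FLAGGED" if discrepancy_pct > DISCREPANCY_THRESHOLD_PCT else "PENDING",
    }


record = settle(100_000_000, Decimal("85.00"), 20_000_000, 19_996_000)
print(record["publisher_payout_cents"], record["exchange_fee_cents"], record["status"])
```

Money stays in integer cents and the percentage in Decimal, so the payout and fee always add back to the gross with no floating-point drift.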
8.3 Event schemas (Kafka → ClickHouse)
The impressions table is the billing source of truth.
CREATE TABLE impressions (
impression_id String,
auction_id String,
timestamp DateTime64(3),
dsp_id String,
publisher_id String,
publisher_domain String,
creative_id String,
advertiser_domain String,
price_cpm Float64,
user_id Nullable(String),
device_type Enum8('desktop'=1, 'mobile'=2, 'tablet'=3, 'ctv'=4),
geo_country LowCardinality(String),
geo_region LowCardinality(String),
viewable Nullable(UInt8),
viewability_pct Nullable(Float32),
time_in_view_ms Nullable(UInt32),
is_click UInt8 DEFAULT 0,
click_timestamp Nullable(DateTime64(3))
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (dsp_id, publisher_id, timestamp)
TTL timestamp + INTERVAL 90 DAY;
The auction_results table follows the same shape with winning_price_cpm, num_bids_received, num_dsps_selected, and num_dsps_timeout. Writes are 100% for winners and 1% for losers.
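One way to apply the 100% / 1% write policy is to hash the auction ID, so that every pod makes the same keep-or-drop decision for a given auction. This is a minimal sketch, not the exchange's actual sampler; `should_log` and `LOSER_SAMPLE_RATE` are hypothetical names.

```python
import hashlib

LOSER_SAMPLE_RATE = 0.01  # 1% of losing records, per the write policy above


def should_log(auction_id: str, is_winner: bool, rate: float = LOSER_SAMPLE_RATE) -> bool:
    """Return True if this auction record should be written to Kafka."""
    if is_winner:
        return True  # winners are always logged
    # Map the auction ID to a stable bucket in [0, 10000).
    digest = hashlib.sha256(auction_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < round(rate * 10_000)
```

Hash-based sampling keeps the decision reproducible, which helps when a sampled losing bid later has to be matched against the full stream archived in S3.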
8.4 Valkey keyspace
| Key | Structure | TTL | Purpose |
|---|---|---|---|
| user:{uid}:consent | Hash | 30d | TCF string + consent status |
| user:{uid}:cookiesync | Hash | 30d | Exchange UID ↔ DSP buyer UIDs |
| user:{uid}:creative:{crid} | Counter | 1h | Exchange-level creative dedup |
| dsp:{dsp_id}:spend_today | Hash | until midnight | Spend, credit limit, credit remaining |
| dsp:{dsp_id}:circuit | Hash | 5m | Circuit breaker state |
| dsp:{dsp_id}:winrate:{vertical} | Float | 1h | Rolling win rate feeding smart selection |
| ivt:ip_blocklist | Set | 1h | Known bot IPs |
| ivt:asn_reputation | Hash | 1h | ASN reputation scores (datacenter, residential) |
| ivt:ua_patterns | Set | 1h | Suspicious UA regex hits |
| pub:{pub_id}:config | Hash | 5m | Publisher config cache |
| pub:{domain}:adstxt | Hash | 24h | Cached ads.txt entries |
Per-campaign frequency caps, campaign budgets, and advertiser targeting rules don't appear here. Those live in DSPs.
9. API Design
9.1 Bid request (exchange to DSP)
POST /openrtb/2.6/bid
Content-Type: application/json
X-OpenRTB-Version: 2.6
{
"id": "auc_01HXYZ123",
"imp": [{
"id": "imp_001",
"banner": {"w": 300, "h": 250, "pos": 1},
"bidfloor": 0.50,
"bidfloorcur": "USD"
}],
"site": {
"domain": "espn.com",
"page": "https://espn.com/nfl/story/cowboys-giants-recap",
"cat": ["IAB17"],
"publisher": {"id": "pub_espn", "domain": "espn.com"}
},
"user": {
"id": "uid_user_abc",
"buyeruid": "ttd_user_xyz",
"geo": {"country": "USA", "region": "TX", "city": "Austin"},
"consent": "CPXxRfAPXxRfAAfKAB..."
},
"device": {
"ua": "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X)...",
"ip": "198.51.100.42",
"devicetype": 4,
"os": "iOS"
},
"regs": {"coppa": 0, "gdpr": 0},
"tmax": 60,
"at": 1,
"cur": ["USD"],
"source": {
"ext": {
"schain": {
"ver": "1.0",
"complete": 1,
"nodes": [{"asi": "adx.google.com", "sid": "pub_espn", "hp": 1}]
}
}
}
}
DSP bid response (200 OK):
{
"id": "auc_01HXYZ123",
"seatbid": [{
"bid": [{
"id": "bid_ttd_001",
"impid": "imp_001",
"price": 8.50,
"adm": "<div id='ad-${AUCTION_ID}'><a href='${CLICK_URL}https://nike.com/pegasus-41'><img src='https://cdn.nike.com/cr/pegasus_300x250.jpg' width='300' height='250'/></a><img src='${IMPRESSION_URL}' style='display:none'/></div>",
"crid": "cr_nike_pegasus_01",
"w": 300,
"h": 250,
"adomain": ["nike.com"],
"cat": ["IAB18"]
}],
"seat": "seat_nike"
}],
"cur": "USD"
}
DSP no-bid: 204 No Content.
9.2 Win notice (exchange to DSP)
POST /win
{
"auction_id": "auc_01HXYZ123",
"bid_id": "bid_ttd_001",
"imp_id": "imp_001",
"price": 8.50,
"currency": "USD",
"timestamp": "2026-05-29T14:32:00.045Z"
}
9.3 Impression tracking
GET /t/imp?auc=auc_01HXYZ123&imp=imp_001&price=enc_xyz&dsp=ttd&pub=pub_espn&crid=cr_nike_pegasus_01
→ 200 OK, Content-Type: image/gif, 43-byte transparent GIF
The endpoint parses query params, decrypts the price token, writes to Valkey for dedup, produces to Kafka asynchronously, and returns the pixel. Target p99 under 5ms.
9.4 Click redirect
GET /t/click?auc=auc_01HXYZ123&imp=imp_001&dest=https%3A%2F%2Fnike.com%2Fpegasus-41
→ 302 Found
Location: https://nike.com/pegasus-41
Destination URLs are validated against a whitelist pattern (allowed schemes, no open-redirect loops) before the 302 is emitted.
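The destination check can be sketched as below. This is an illustration under stated assumptions, not the exchange's actual rule set: `is_safe_destination`, `ALLOWED_SCHEMES`, and `EXCHANGE_HOSTS` are hypothetical names, the tracking host is a placeholder, and the allowed domains are assumed to come from the winning bid's `adomain`.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"https"}              # assumption: HTTPS-only landing pages
EXCHANGE_HOSTS = {"t.exchange.example"}  # hypothetical tracking host


def is_safe_destination(dest: str, allowed_domains: set[str]) -> bool:
    """Return True if the click destination may be used in the 302."""
    try:
        parts = urlsplit(dest)
        host = (parts.hostname or "").lower()
    except ValueError:
        return False  # malformed URL
    if parts.scheme not in ALLOWED_SCHEMES or not host:
        return False
    if host in EXCHANGE_HOSTS:
        return False  # would redirect back into the tracker (loop)
    # Accept the advertiser domain itself or any of its subdomains.
    return any(host == d or host.endswith("." + d) for d in allowed_domains)
```

Matching on the parsed hostname rather than a substring of the URL is what rejects lookalikes such as `nike.com.evil.io`.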
9.5 Publisher config API
PUT /v1/publishers/{publisher_id}/config
{
"floor_price_cents": 75,
"blocked_categories": ["IAB25", "IAB26"],
"blocked_advertiser_domains": ["competitor.com"],
"revenue_share_pct": 85.00
}
9.6 DSP onboarding
POST /v1/dsps
{
"dsp_name": "Example DSP",
"bid_endpoint": "https://dsp.example.com/bid",
"win_notice_endpoint": "https://dsp.example.com/win",
"max_qps": 100000,
"timeout_ms": 60,
"allowed_categories": ["IAB1", "IAB17"],
"allowed_geos": ["US", "CA"],
"seat_id": "seat_example_001"
}
→ 201 Created { "id": "...", "status": "SANDBOX", "mtls_cert_url": "..." }
9.7 Settlement report
GET /v1/settlements?dsp_id=ttd&start=2026-06-01&end=2026-06-07
{
"dsp_id": "ttd",
"period": {"start": "2026-06-01", "end": "2026-06-07"},
"totals": {
"impressions": 1750000000,
"clicks": 17500000,
"gross_spend_cents": 14000000000,
"exchange_fee_cents": 1400000000,
"publisher_payout_cents": 12600000000,
"dsp_reported_impressions": 1749640000,
"discrepancy_pct": 0.0002
}
}
10. Deep Dives
10.1 RTB auction flow, end to end
The example from §1 maps onto this timing breakdown. The exchange's share is steps 4–8 — 43ms of actual work between receiving the request from Magnite and returning the winning markup.
| Step | Component | Time | Running |
|---|---|---|---|
| 1 | ESPN page HTML loads, Prebid.js executes | 20 ms | 20 ms |
| 2 | Prebid → Magnite SSP bid request | 3 ms | 23 ms |
| 3 | Magnite → AdX network hop | 3 ms | 26 ms |
| 4 | LRU cache hit: consent, cookie-sync, DSP flags | 0.1 ms | 26.1 ms |
| 5 | Pre-bid fraud filter + ads.txt validate | 1 ms | 27.1 ms |
| 6 | Smart DSP selection (top 5 of 50) | 1 ms | 28.1 ms |
| 7 | DSP fan-out parallel (60 ms timeout, arrives ~40 ms) | 40 ms | 68.1 ms |
| 8 | First-price auction + floor + macro sub | 1 ms | 69.1 ms |
| 9 | AdX → Magnite network hop | 3 ms | 72.1 ms |
| 10 | Magnite → Prebid (selected as winner across SSPs) | 3 ms | 75.1 ms |
| 11 | GAM direct-deal check + render decision | 5 ms | 80.1 ms |
| 12 | CloudFront creative fetch (Austin POP cache hit) | 12 ms | 92.1 ms |
| 13 | Browser renders 300×250 | 15 ms | 107.1 ms |
| 14 | Impression pixel fires (async) | 2 ms | 109.1 ms |
Page-load-to-ad-visible is ~110ms. Only 43ms of that is the exchange; the other 67ms is browser, SSP round-trips, GAM decisioning, and rendering.
The Go fan-out code:
func (e *Exchange) runAuction(ctx context.Context, req *openrtb.BidRequest) (*AuctionResult, error) {
ctx, cancel := context.WithTimeout(ctx, 60*time.Millisecond)
defer cancel()
// Top-5 smart selection (see §10.2)
dsps := e.dspSelector.SelectTopN(req, 5)
bidChan := make(chan *DSPBid, len(dsps))
for _, dsp := range dsps {
go func(d *DSPConfig) {
bid, err := e.sendBidRequest(ctx, d, req)
if err != nil {
e.metrics.DSPError(d.ID, err)
bidChan <- nil
return
}
bidChan <- bid
}(dsp)
}
var bids []*DSPBid
received := 0
Loop:
for received < len(dsps) {
select {
case bid := <-bidChan:
received++
if bid != nil && bid.Price > 0 {
bids = append(bids, bid)
}
case <-ctx.Done():
e.metrics.AuctionTimeout(len(bids), len(dsps)-received)
break Loop
}
}
if len(bids) == 0 {
return &AuctionResult{Filled: false}, nil
}
return e.firstPriceAuction(bids, req.Imp[0].BidFloor), nil
}
func (e *Exchange) firstPriceAuction(bids []*DSPBid, floor float64) *AuctionResult {
var winner *DSPBid
for _, b := range bids {
if b.Price < floor {
continue
}
if winner == nil || b.Price > winner.Price {
winner = b
}
}
if winner == nil {
return &AuctionResult{Filled: false}
}
return &AuctionResult{Filled: true, Winner: winner, Price: winner.Price}
}
10.2 Smart DSP selection
The 10× trick: picking the 5 DSPs most likely to bid competitively, instead of asking all 50, cuts outbound bandwidth 10× for a <2% fill-rate hit. This single decision is what makes the economics work.
The scoring combines hard filters (geo, format, category, capacity, circuit-breaker state) with two continuous signals (historical win rate for this segment, and a pacing factor based on current spend velocity). Hard filters drop any DSP that can't or shouldn't bid. The continuous score ranks the survivors.
type DSPScore struct {
    DSPID string
    Score float64
}

// SelectTopN drops DSPs that fail a hard filter, ranks the survivors by
// score, and returns at most n of them.
func (s *DSPSelector) SelectTopN(req *openrtb.BidRequest, n int) []*DSPConfig {
    geo := req.User.Geo.Country
    format := req.Imp[0].Format()
    vertical := req.Site.Cat[0]
    var candidates []DSPScore
    for _, dsp := range s.registry.All() {
        // Hard filters
        if !dsp.AcceptsGeo(geo) { continue }
        if !dsp.AcceptsFormat(format) { continue }
        if !dsp.AcceptsCategory(vertical) { continue }
        if !dsp.CapacityAvailable() { continue }
        if dsp.CircuitBreakerOpen() { continue }
        // Continuous score
        winRate := dsp.HistoricalWinRate(geo, vertical, format)
        pacing := dsp.PacingFactor()
        candidates = append(candidates, DSPScore{
            DSPID: dsp.ID,
            Score: winRate * pacing,
        })
    }
    sort.Slice(candidates, func(i, j int) bool {
        return candidates[i].Score > candidates[j].Score
    })
    if len(candidates) > n {
        candidates = candidates[:n]
    }
    // 10% exploration: occasionally swap the lowest-ranked slot for a non-top DSP
    // to discover new demand. Only swap when all n slots are filled; with fewer
    // survivors, candidates[n-1] would be out of range.
    if n > 0 && len(candidates) == n && rand.Float64() < 0.10 {
        if explorer := s.registry.RandomExploration(candidates); explorer != nil {
            candidates[n-1] = DSPScore{DSPID: explorer.ID, Score: 0}
        }
    }
    return s.registry.Resolve(candidates)
}
The exploration bonus matters more than it looks. Without it, any DSP that starts with a zero win rate stays at zero forever. The fix: replace the lowest-ranked top-5 slot with a random non-top DSP 10% of the time. New DSPs get a fair shot, win rates update, the feedback loop promotes them when they earn it.
Win rates come from a Flink job (or a plain Kafka consumer) that aggregates the last 60 minutes keyed by (dsp_id, geo, vertical, format) and writes to Valkey. Auction servers cache it in-process with a 1-minute TTL.
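The 60-minute aggregation can be illustrated with a small in-memory sliding window. This is a sketch of the idea only: `RollingWinRate` is a hypothetical name, and a production Flink job would more likely keep per-minute bucket counters than individual events.

```python
from collections import defaultdict, deque

WINDOW_SEC = 3600  # 60-minute window from the text


class RollingWinRate:
    def __init__(self, window_sec: int = WINDOW_SEC):
        self.window_sec = window_sec
        # key -> deque of (timestamp, won) in arrival order
        self.events = defaultdict(deque)

    def record(self, key: tuple, ts: float, won: bool) -> None:
        """key is (dsp_id, geo, vertical, format)."""
        self.events[key].append((ts, won))

    def win_rate(self, key: tuple, now: float) -> float:
        q = self.events[key]
        while q and q[0][0] <= now - self.window_sec:
            q.popleft()  # evict events older than the window
        if not q:
            return 0.0
        return sum(1 for _, won in q if won) / len(q)
```

The value returned by `win_rate` is what would be written to the `dsp:{dsp_id}:winrate:{vertical}` key and then cached in-process by the auction servers.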
| Strategy | DSP requests/sec | Outbound bandwidth | DSPs touched |
|---|---|---|---|
| All 50 (naive) | 50M | 50 GB/sec | 50 per auction |
| Top 5 (smart) | 5M | 5 GB/sec | 5 per auction |
| Fill-rate delta | — | — | < 2% drop |
10.3 Auction types
First-price is what the industry runs now. Winner pays their bid, the math is trivial, nothing for the exchange to manipulate.
The earlier second-price model (winner pays $0.01 above second-highest) encouraged truthful bidding in theory but invited "last look" abuse — exchanges knowing all bids could let favored buyers re-bid one cent over the clearing price. First-price killed that ambiguity.
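The difference between the two pricing rules is easiest to see side by side. The sketch below is illustrative; `clearing_price` is a hypothetical helper, and the rule that a lone bidder pays the floor under second-price is an assumption, not something the text specifies.

```python
from typing import Optional


def clearing_price(bids: list[float], floor: float, second_price: bool) -> Optional[float]:
    """Return what the winner pays (CPM, dollars), or None if nothing clears the floor."""
    eligible = sorted((b for b in bids if b >= floor), reverse=True)
    if not eligible:
        return None
    if not second_price:
        return eligible[0]  # first-price: winner pays its own bid
    if len(eligible) == 1:
        return floor  # assumption: a lone bidder pays the floor
    # Second-price: one cent above the runner-up, capped at the winner's bid.
    return min(eligible[0], round(eligible[1] + 0.01, 2))
```

With bids of $8.50, $6.50, $5.00, and $4.00, first-price clears at $8.50 and second-price at $6.51. That gap is the margin bid shading tries to recover under first-price.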
Header bidding vs. waterfall is a separate question. Waterfall called exchanges sequentially (A, then B, then C) — slow, and higher bids in later exchanges never saw daylight. Header bidding (Prebid.js in the browser, or server-side) calls all exchanges in parallel and picks the highest bid overall. Mature publishers run server-side header bidding today: parallel competition without browser-side latency cost.
10.4 DSP bid optimization
This is opaque to the exchange but shapes fill rate and per-auction latency, so it's worth seeing the shape. A typical DSP bidder runs something like:
from typing import Optional

class BidOptimizer:
    def compute_bid(self, req: BidRequest, campaign: Campaign) -> Optional[float]:
        """Return a CPM bid in dollars, or None to no-bid."""
        features = self.extract_features(req, campaign)
        # ML: LightGBM or deep learning trained on historical data
        pctr = self.ctr_model.predict(features)
        pcvr = self.cvr_model.predict(features)
        # Expected value of a single impression, in dollars
        if campaign.bid_strategy == "CPA":
            expected_value = pctr * pcvr * campaign.target_cpa
        elif campaign.bid_strategy == "CPC":
            expected_value = pctr * campaign.max_cpc
        else:  # CPM
            expected_value = campaign.max_cpm / 1000
        # Bid shading for first-price auctions
        shading = self.shading_model.predict(features)  # 0.5-0.85
        # Bids and floors are quoted per 1,000 impressions, so convert to CPM
        bid_cpm = expected_value * 1000 * shading
        # Internal checks invisible to the exchange
        if not self.budget_allows(campaign, bid_cpm): return None
        if not self.frequency_allows(req.user_id, campaign.id): return None
        if bid_cpm < req.imp[0].bidfloor: return None
        return bid_cpm
For the §1 example, TTD's pCTR was ~8% (very high: a fresh cart abandoner looking at the same product), pCVR ~12%, expected value $1.25/impression, maximum CPM well over $1,000. The shading model knew ESPN sports impressions clear around $8 and shaded down to $8.50, comfortably above the predicted clearing price and far below the maximum.
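The numbers in the §1 example can be checked with a few lines of arithmetic. The CPA target is not stated in the text; the sketch assumes it equals the $130 shoe price, which reproduces the quoted ~$1.25 per impression.

```python
pctr, pcvr = 0.08, 0.12
target_cpa = 130.0  # assumption: CPA target equal to the $130 shoe price

value_per_impression = pctr * pcvr * target_cpa  # ~1.25 dollars
max_cpm = value_per_impression * 1000            # ~1,248 dollars per 1,000 impressions
bid_cpm = 8.50                                   # the bid TTD actually submitted
implied_shading = bid_cpm / max_cpm              # fraction of full value that was bid

print(f"value/impression=${value_per_impression:.3f}, max CPM=${max_cpm:,.0f}")
print(f"implied shading factor={implied_shading:.4f}")
```

Under that assumption the $8.50 bid is well under 1% of the impression's full value, which is why the example describes it as far below the maximum.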
10.5 Ad server and creative delivery
Creatives are uploaded by advertisers into their DSP, not the exchange. DSPs push them to a creative CDN of their own choosing. The exchange runs an async malware + policy scan on first-seen creative_id values and caches the verdict — first-time creatives pass optimistically and get flagged for backfill scanning, with a fast block path on failure.
At runtime the DSP returns HTML (banner) or VAST XML (video) in the adm field. The exchange substitutes a small set of macros before forwarding:
| Macro | Replaced With |
|---|---|
| ${AUCTION_ID} | Auction identifier |
| ${AUCTION_PRICE} | Encrypted price token (AES-256-GCM) |
| ${CLICK_URL} | Exchange click-tracking URL |
| ${IMPRESSION_URL} | Exchange impression-pixel URL |
| ${CACHE_BUSTER} | Random number to defeat pixel caching |
Substitution is a single pass over the markup string — negligible cost. The encrypted price token is the load-bearing part for billing: AES-256-GCM ciphertext wrapping the clearing price + timestamp. Encryption stops the publisher, SSP, or any browser extension from reading or forging the price (which would let them reverse-engineer bid patterns or tamper with billing).
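A single-pass substitution can be sketched with one regular-expression sweep. `substitute_macros` is a hypothetical helper, and leaving unknown macros untouched is an assumption about the desired behavior rather than something the text specifies.

```python
import re

# Matches ${NAME} where NAME is upper-case letters and underscores.
MACRO_RE = re.compile(r"\$\{([A-Z_]+)\}")


def substitute_macros(adm: str, values: dict[str, str]) -> str:
    """Replace every known ${MACRO} in one pass over the markup."""
    # Unknown macros are left as-is rather than blanked.
    return MACRO_RE.sub(lambda m: values.get(m.group(1), m.group(0)), adm)
```

Because `re.sub` never rescans text it has already inserted, a replacement value that itself contains `${...}` is not expanded a second time.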
Viewability follows the MRC/IAB standard, measured with an IntersectionObserver beacon: ≥50% of pixels in the viewport for ≥1 continuous second (display) or ≥2 continuous seconds (video). The beacon writes to Kafka via the tracking endpoint. IAS and DoubleVerify run their own measurement; the exchange's beacon is a backup, not the primary measurement.
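The threshold rule can be expressed as a check over beacon samples. This is a sketch assuming the beacon reports `(timestamp_ms, visible_fraction)` pairs in time order and that the time requirement is continuous; `is_viewable` is a hypothetical name.

```python
def is_viewable(samples: list[tuple[int, float]], is_video: bool = False) -> bool:
    """samples: (timestamp_ms, visible_fraction) pairs in time order."""
    required_ms = 2000 if is_video else 1000
    run_start = None  # start of the current run with >= 50% of pixels visible
    for ts_ms, fraction in samples:
        if fraction >= 0.5:
            if run_start is None:
                run_start = ts_ms
            if ts_ms - run_start >= required_ms:
                return True
        else:
            run_start = None  # visibility dropped, so the run resets
    return False
```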
Video returns VAST 4.2 XML. The DSP supplies media-file URLs; the exchange wraps them with impression, click, and quartile events pointing at its own tracking endpoint; the video player fires those as the video plays.
10.6 Per-DSP spend tracking and settlement
The exchange is the financial middleman: money flows advertiser → DSP → exchange → publisher, and the exchange deducts its take rate at settlement rather than per auction. Real-time per-DSP spend tracking exists for two reasons:
- Credit limits. Many DSPs run on prepaid balances. A DSP over its balance has to be cut off from auctions within seconds, not hours.
- Anomaly detection. A sudden 10× spike in a DSP's spend velocity is usually a compromised account or a runaway campaign.
from datetime import datetime, timezone

class DSPSpendAggregator:
    def process_impression(self, dsp_id: str, imp: ImpressionEvent):
        state = self.get_state(dsp_id)
        # One impression costs price_cpm / 1000 dollars. Accumulate in micro-dollars
        # (spend_today_micros): whole cents would truncate an $8.50 CPM impression
        # (0.85 cents) to zero.
        state.spend_today_micros += round(imp.price_cpm * 1000)
        state.spend_today_cents = state.spend_today_micros // 10_000
        state.impressions_today += 1
        self.valkey.hset(
            f"dsp:{dsp_id}:spend_today",
            mapping={
                "spend_cents": state.spend_today_cents,
                "impressions": state.impressions_today,
                "credit_limit": state.credit_limit_cents,
                "credit_remaining": state.credit_limit_cents - state.spend_today_cents,
                "last_update": datetime.now(timezone.utc).isoformat(),
            },
        )
        if state.spend_today_cents >= state.credit_limit_cents:
            self.valkey.set(f"dsp:{dsp_id}:credit_blocked", "1", ex=3600)
            self.alert(f"DSP {dsp_id} exceeded credit limit")
        if state.spend_today_cents > state.expected_daily * 1.5:
            self.alert(f"DSP {dsp_id} spend anomaly")
The auction server's check is a single in-process cache lookup for dsp:{dsp_id}:credit_blocked. End-to-end lag is Flink tumbling window (1s) + Valkey write (1ms) + in-process cache TTL (5s) = ~6s worst case of unchecked spend after a DSP crosses its limit. At a $100K daily limit, spend averages about $1.16 per second ($100,000 / 86,400 s), so the 6s window allows roughly $7 of overshoot. That is acceptable in exchange for never blocking the auction path on a synchronous credit check.
Daily settlement runs at 02:00 UTC: query ClickHouse for spend/impressions/clicks grouped by (dsp_id, publisher_id), apply the take rate, write a row per pair to billing_settlements. Rows with discrepancy >0.01% go to manual review. Payment batches and DSP invoices send once review clears.
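The per-pair settlement math can be sketched in integer cents. `settlement_row` is a hypothetical helper; the 10% take rate (1,000 basis points) is taken from the sample report in §9.7 rather than stated as policy, and the `REVIEW` status name is an assumption (the schema only shows the `PENDING` default).

```python
REVIEW_THRESHOLD = 0.0001  # 0.01%, the manual-review cutoff from the text


def settlement_row(gross_spend_cents: int, exchange_impressions: int,
                   dsp_reported_impressions: int, take_rate_bps: int = 1000) -> dict:
    """Compute fee, payout, and discrepancy for one (dsp_id, publisher_id) pair."""
    fee = gross_spend_cents * take_rate_bps // 10_000  # integer cents, no float drift
    if exchange_impressions:
        discrepancy = abs(exchange_impressions - dsp_reported_impressions) / exchange_impressions
    else:
        discrepancy = 0.0
    return {
        "exchange_fee_cents": fee,
        "publisher_payout_cents": gross_spend_cents - fee,
        "discrepancy": discrepancy,  # stored as a fraction, e.g. 0.0002 = 0.02%
        "status": "REVIEW" if discrepancy > REVIEW_THRESHOLD else "PENDING",
    }
```

Feeding in the §9.7 totals (14,000,000,000 cents gross, 1,750,000,000 tracked versus 1,749,640,000 reported impressions) reproduces the fee and payout shown there and flags the row for review, since a 0.02% discrepancy exceeds the 0.01% cutoff.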
10.7 Impression and click tracking
The tracking pipeline is the source of truth for billing. Lose events → exchange under-bills (revenue gone). Double-count → over-bills (trust gone). Design rule: dedupe in Valkey, Kafka produce async, return the pixel fast, fail open rather than closed.
Pixels fire more than once in the wild — browsers retry on flaky connections, ad slots refresh and re-fire, buggy ad tags double-fire. Without dedupe, an impression gets billed twice and the advertiser (reasonably) gets upset.
Dedupe key: impression:{auction_id}:{imp_id} with a 1-hour TTL, written via SETNX. On Valkey errors the code intentionally fails open and treats the impression as new — over-counting by a fraction of a percent is less harmful than under-counting and losing real revenue.
func (t *TrackingService) handleImpression(w http.ResponseWriter, r *http.Request) {
auctionID := r.URL.Query().Get("auc")
impID := r.URL.Query().Get("imp")
encPrice := r.URL.Query().Get("price")
price, err := t.decryptPrice(encPrice)
if err != nil {
t.metrics.InvalidPrice.Inc()
t.servePixel(w); return
}
dedupKey := fmt.Sprintf("impression:%s:%s", auctionID, impID)
isNew, err := t.valkey.SetNX(r.Context(), dedupKey, "1", time.Hour).Result()
if err != nil {
// Fail open: better to slightly over-count than lose revenue
isNew = true
}
if isNew {
t.kafka.ProduceAsync("impressions", auctionID, &ImpressionEvent{
AuctionID: auctionID,
ImpID: impID,
Price: price,
DSPID: r.URL.Query().Get("dsp"),
PublisherID: r.URL.Query().Get("pub"),
CreativeID: r.URL.Query().Get("crid"),
Timestamp: time.Now(),
UserAgent: r.UserAgent(),
IP: extractIP(r),
})
}
t.servePixel(w)
}
Click handling is structurally identical, with a 302 Location header instead of a transparent GIF and a destination-URL whitelist check to stop open-redirect abuse.
10.8 Fraud detection beyond IP blocklists
Basic fraud detection (known bot IPs, obvious UA patterns) catches ~30% of invalid traffic on a good day. The rest needs a layered stack: sub-millisecond signals run pre-bid in the auction path; slower enrichment runs post-bid in a Flink job that feeds reputation scores back into the next round of pre-bid filters.
Pre-bid signals (must be sub-millisecond):
- IP reputation — Valkey set refreshed hourly from threat-intel feeds
- ASN / datacenter detection — MaxMind lookup; AWS/GCP/Azure/DO ASNs flagged as datacenter traffic, which catches server-rented bot pools
- User-agent entropy — UAs that are statistically too common for real browsers flag botnets running a single UA across thousands of requests
- Headless browser fingerprints — missing WebGL, missing canvas, Chrome's headless canary flags (when they survive schain.ext)
Post-bid signals (JavaScript beacon after the impression renders):
- Cursor entropy and scroll velocity — bots move in straight lines
- Time-in-view duration — catches ad stacking (multiple ads layered on top of each other)
- Click-after-impression delta — a click fired 200ms after impression is almost certainly automated
- Per-publisher rolling viewability rate — catches inventory that's quietly degrading
Backstop: supply-chain validation (ads.txt, sellers.json, schain — see §15.4) closes the domain-spoofing loop, and daily DSP reconciliation compares exchange-tracked impressions with DSP-reported numbers; large discrepancies go into the manual review queue.
| Layer | Share of fraud caught |
|---|---|
| Pre-bid filters | ~70% |
| Post-bid beacons | ~20% |
| Daily reconciliation | ~10% |
One more category that's easy to forget: malicious creative markup. A DSP can embed JavaScript in its adm field. The fix is the async creative scan from §10.5 — first-seen creatives are allowed optimistically, scanned in the background, blocked on failure, and the DSP's circuit breaker increments.
11. Bottlenecks
Eight candidates under load. Most are theoretical — the design already has the mitigation baked in. Three actually bite in practice and need deliberate attention.
| # | Bottleneck | Mitigation | Bites? |
|---|---|---|---|
| 1 | DSP fan-out (50M req/sec naive) | Top-5 smart selection cuts to 5M/sec (§10.2) | Theoretical |
| 2 | Valkey hot keys (celebrity/viral pages) | In-process LRU absorbs it; hit rate >95% even at peak | Theoretical |
| 3 | Kafka ingestion (15 GB/sec naive) | Sample losing bids at 1%; tee full stream to S3/Iceberg | ✅ Bites |
| 4 | ClickHouse query vs. ingest contention | Two clusters (ingest-only + query-only) + materialized views | Theoretical |
| 5 | Tracking endpoint (300K pixels/sec, 5ms budget) | Separate pod group, no DB writes, async Kafka produce | ✅ Bites |
| 6 | CDN origin pulls during creative rotation | DSPs pre-warm POPs; auction server deprioritizes uncached creative URLs for 60s | Theoretical |
| 7 | DSP spend tracking lag (~6s worst case) | Accept small credit-limit overshoot; alert if window >10s | Minor |
| 8 | Slowest DSP in the top-5 (bounds per-auction latency) | Adaptive timeouts + circuit breaker (§12.2) | ✅ Bites |
The three that actually bite:
Kafka ingestion. At 1M QPS, logging every bid (win + loss) is ~5 GB/sec, tripled by replication to 15 GB/sec written. Sampling (100% impressions + winning bids, 1% losing bids) brings it into the budget. The full bid stream tees directly to S3/Iceberg via Kafka Connect — much cheaper durable storage for billing disputes without burning hot Kafka capacity.
Tracking endpoint. ~300K pixels/sec with a 5ms p99 budget or page rendering blocks. Runs as a separate pod group with a deliberately tiny code path: parse query string → decrypt price → dedupe in Valkey → produce to Kafka async → return the GIF. No DB writes on the hot path, no synchronous downstream calls.
Slowest DSP in the top-5. Once everything else is healthy, per-auction latency is bounded by whichever DSP is slowest. Adaptive timeouts rebalance: fast DSPs get a generous 55ms budget, slow DSPs get a strict 35ms. Slow DSPs lose win rate and drop out of the top-5 naturally. Harder failure modes fall through to the circuit breaker in §12.2.
12. Failure Scenarios
12.1 Valkey cluster failure
Valkey going down is unpleasant but not fatal. The in-process LRU keeps serving for its 5-second TTL, buying breathing room. Cache misses fall through to degraded mode:
- Strip all PII (consent state is unknown)
- Skip fraud lookups (blocklists unreachable)
- Skip DSP credit checks
- Serve only from LRU until the TTL expires
Fill rate drops because DSPs see less data and bid lower. Revenue keeps flowing and the failure isn't customer-visible.
Detection uses a 3-second health check window. When the breaker opens, every auction server flips into degraded mode within seconds. On-call gets paged and the strict-PII flag is enabled as a belt-and-suspenders measure.
12.2 DSP unresponsive
Far more common than Valkey failure. The fix is a per-DSP circuit breaker with three states:
- Closed (normal) — requests flow
- Open — all requests rejected immediately; DSP excluded from selection
- Half-open — 1% probe requests to detect recovery
Transition rules:
- Closed → Open: error rate >50% or timeout rate >30% in a 60-second sliding window
- Open → Half-open: after 30 seconds of cooldown
- Half-open → Closed: 10 consecutive probes succeed
- Half-open → Open: any probe fails
Circuit-breaker state lives in Valkey under dsp:{dsp_id}:circuit and refreshes into the in-process cache once a second. When a DSP is open, smart selection skips it and the next-best DSP slides into the top-5 — the auction barely notices. The one alert that matters: if excluding a top-5-by-revenue DSP drops total revenue by >5%, on-call gets paged for a manual look.
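The three states and four transitions can be written as a small state machine. This is a sketch, not the exchange's implementation: `Breaker` and its method names are hypothetical, and the window rates are assumed to be computed elsewhere over the 60-second sliding window.

```python
from dataclasses import dataclass

ERR_THRESHOLD, TIMEOUT_THRESHOLD = 0.5, 0.3  # thresholds from the transition rules
COOLDOWN_SEC, PROBES_TO_CLOSE = 30, 10


@dataclass
class Breaker:
    state: str = "CLOSED"
    opened_at: float = 0.0
    probe_successes: int = 0

    def on_window(self, err_rate: float, timeout_rate: float, now: float) -> None:
        """CLOSED -> OPEN when either windowed rate exceeds its threshold."""
        if self.state == "CLOSED" and (err_rate > ERR_THRESHOLD or timeout_rate > TIMEOUT_THRESHOLD):
            self.state, self.opened_at = "OPEN", now

    def tick(self, now: float) -> None:
        """OPEN -> HALF_OPEN once the cooldown has elapsed."""
        if self.state == "OPEN" and now - self.opened_at >= COOLDOWN_SEC:
            self.state, self.probe_successes = "HALF_OPEN", 0

    def on_probe(self, ok: bool, now: float) -> None:
        """HALF_OPEN -> CLOSED after 10 straight successes, or back to OPEN on any failure."""
        if self.state != "HALF_OPEN":
            return
        if not ok:
            self.state, self.opened_at = "OPEN", now
            return
        self.probe_successes += 1
        if self.probe_successes >= PROBES_TO_CLOSE:
            self.state = "CLOSED"
```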
12.3 Kafka degradation
Slow or partially unavailable Kafka means the auction server can't produce events at the normal rate. Each pod buffers up to 100K events in a 150 MB in-memory ring and drains it on recovery; if the ring fills, events spill to local disk as WAL files and replay after Kafka comes back.
DSP spend tracking falls back to the last-known Valkey values during the outage, so some DSPs may slightly exceed their credit limits — corrected on daily reconciliation.
12.4 CDN origin failure
S3 origin going down means CDN edges can't pull cache misses. stale-while-revalidate headers keep already-cached creatives serving past their TTL, which covers most demand. The auction server checks a per-creative health flag before returning a bid pointing at an uncached creative — if origin has been down >5 minutes, that creative is excluded and the auction picks the next-best bid. New campaigns launching during the outage are delayed.
12.5 Flink spend aggregator crash
Flink restarts lose in-memory per-DSP spend state. S3 checkpoints every 30 seconds keep most restarts cheap — the latest checkpoint comes back almost immediately. If the checkpoint is >5 minutes stale, the bootstrap path rebuilds the day's totals from ClickHouse:
SELECT dsp_id, sum(price_cpm)/1000 AS spend_usd
FROM impressions
WHERE timestamp >= today()
GROUP BY dsp_id
During the ~30-second bootstrap window, auction servers rely on the last DSP credit flags Valkey had. Some DSPs may overspend by a few dollars; daily reconciliation catches and bills it.
12.6 Auction server OOM
Traffic spikes (breaking news, a viral sports moment) can drive QPS past pod capacity. Each pod enforces max_concurrent_requests = 5,000 and returns 503 with Retry-After: 1 past that. HPA scales on CPU (target 60%) and a custom concurrent_auctions_per_pod metric. SSPs retry with exponential backoff and route to other exchanges if 503s persist.
Pre-provisioned headroom (100 pods at 60% utilization) absorbs ~67% spikes without scaling at all.
12.7 Shedding load when things back up
Shed early, shed 204. Drop low-value work before the queue backs up. Avoid 5xx for shed traffic: SSPs interpret 5xx as "exchange broken" and route away, whereas 204 No Content is "no fill, normal outcome" and the SSP just moves on. The 503 in §12.6 is the hard cap of last resort, not a shedding tool.
The shedding order drops the work that matters least first:
- Floor price < $0.50 CPM — can't produce meaningful revenue; latency budget better spent elsewhere
- Tier-3 publishers (lowest revenue-share contracts) — premium publishers stay protected
- Kill the 10% DSP exploration bonus — fan out to top-N in pure rank order only
- Reduce fan-out from top-5 to top-3 — only under sustained overload
DSP timeouts adapt the same way. A worker checks p95 response time per DSP every 60 seconds: DSPs with a p95 under 30ms get a 55ms budget, ones over 50ms get 35ms. Slow DSPs drop out of the top-5 naturally as their win rate decays.
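The shedding order in this section can be sketched as a plan that tightens one step per overload level. The level numbering and the names `shed_plan` and `should_shed` are hypothetical; how the overload level is derived (queue depth, CPU, concurrency) is left open.

```python
def shed_plan(overload_level: int) -> dict:
    """overload_level 0..4: each level enables one more step of the shedding order."""
    return {
        "min_floor_cpm": 0.50 if overload_level >= 1 else 0.0,    # step 1
        "drop_tier3": overload_level >= 2,                        # step 2
        "exploration_rate": 0.0 if overload_level >= 3 else 0.10, # step 3
        "fan_out": 3 if overload_level >= 4 else 5,               # step 4
    }


def should_shed(floor_cpm: float, publisher_tier: int, plan: dict) -> bool:
    """True means answer with a no-fill response instead of running the auction."""
    if floor_cpm < plan["min_floor_cpm"]:
        return True
    return plan["drop_tier3"] and publisher_tier == 3
```

Requests that survive `should_shed` still run the auction, but with the reduced exploration rate and fan-out from the same plan.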
13. Deployment
13.1 Multi-region layout
| Region | Auction pods | Valkey | Kafka | ClickHouse | Purpose |
|---|---|---|---|---|---|
| us-east-1 | 40 | 6 | 10 | 12 | Primary, North America |
| eu-west-1 | 30 | 6 | 10 | 12 | Europe (GDPR strict mode) |
| ap-south-1 | 30 | 6 | 10 | 12 | Asia-Pacific |
| Global (S3, CDN, PG) | — | — | — | — | Shared storage, config, creative CDN |
- Routing. Geo-DNS does latency-based routing per SSP request. Cross-region failover triggers when the regional health check goes red and takes effect within the 30s DNS TTL.
- Postgres. One global primary in us-east-1 with read replicas in each region. Config changes propagate via logical replication with ~200ms lag; auction servers always read from their local replica.
- Settlement & billing. Runs only out of us-east-1, consuming all three regions' Kafka topics via MirrorMaker. Financial reports have a single source of truth.
13.2 Pipeline
Canary fail thresholds (any one rolls the canary back):
- Auction p99 > 110ms
- Fill-rate drop > 2%
- DSP timeout rate up > 5 percentage points
- 5xx rate > 0.1%
- RPM down > 3%
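The canary gate can be sketched as a function that returns every threshold the canary violates. The metric key names are hypothetical, and the sketch assumes the fill-rate and RPM thresholds are relative drops against the baseline, while the DSP timeout threshold is an absolute change in percentage points as stated.

```python
def canary_violations(baseline: dict, canary: dict) -> list[str]:
    """Return the names of failed checks; any non-empty result rolls the canary back."""
    failed = []
    if canary["p99_ms"] > 110:
        failed.append("p99")
    # Assumption: fill-rate drop is measured relative to the baseline.
    if (baseline["fill_rate"] - canary["fill_rate"]) / baseline["fill_rate"] > 0.02:
        failed.append("fill_rate")
    # Absolute change: 5 percentage points.
    if canary["dsp_timeout_rate"] - baseline["dsp_timeout_rate"] > 0.05:
        failed.append("dsp_timeout_rate")
    if canary["rate_5xx"] > 0.001:
        failed.append("5xx")
    # Assumption: RPM drop is measured relative to the baseline.
    if (baseline["rpm"] - canary["rpm"]) / baseline["rpm"] > 0.03:
        failed.append("rpm")
    return failed
```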
13.3 Rollback
| Component | Method | Time |
|---|---|---|
| Auction server code | k8s rolling update to previous image | < 5 min |
| DSP config | Revert in PG, publish Kafka config event | < 30 sec |
| Flink spend job | Redeploy previous JAR from S3 checkpoint | < 2 min |
| Tracking endpoint | k8s rolling update | < 3 min |
| ClickHouse schema | Forward-only, columns added backward-compatible | N/A |
| Publisher config | Revert via API | Immediate |
14. Observability
14.1 Key metrics
| Metric | Type | Alert threshold |
|---|---|---|
| auction.qps | Counter | < 700K (30% below baseline) or > 1.8M (spike) |
| auction.latency.p50 | Histogram | > 50 ms |
| auction.latency.p99 | Histogram | > 100 ms |
| auction.fill_rate | Gauge | < 25% |
| auction.revenue_per_1k | Gauge | > 10% drop from 1h MA |
| dsp_selection.top_n_time | Histogram | > 2 ms |
| dsp.{id}.response_time.p99 | Histogram | > 55 ms |
| dsp.{id}.nobid_rate | Gauge | > 95% |
| dsp.{id}.circuit_breaker | Gauge | state = OPEN |
| dsp.{id}.spend_today | Gauge | > 90% of credit limit |
| tracking.impression.qps | Counter | < 250K |
| tracking.dedup.rate | Gauge | > 5% |
| lru.hit_rate | Gauge | < 90% |
| valkey.ops_per_sec | Counter | > 1M (capacity alarm) |
| valkey.latency.p99 | Histogram | > 2 ms |
| kafka.consumer_lag.impressions | Gauge | > 100K events |
| clickhouse.query.p99 | Histogram | > 10 s |
| cdn.cache_hit_rate | Gauge | < 90% |
| settlement.discrepancy | Gauge | > 0.01% |
| ivt.blocked_rate | Gauge | > 15% |
| load_shed.rate | Gauge | > 1% (load-shedding kicking in) |
14.2 Dashboard
┌────────────────────────────────────────────────────────┐
│ Auction QPS │ Latency (p50/p99) │
│ 1.05M [live] │ 42ms / 78ms │
├────────────────────────────────────────────────────────┤
│ Fill Rate │ Revenue $/hour │
│ 31.8% [24h] │ $4.2M [24h] │
├────────────────────────────────────────────────────────┤
│ DSP Response Matrix (top 10) │
│ TTD: 32ms ok │ DV360: 38ms ok │ Amazon: 44ms ok │
│ Criteo: 41ms ok │ Xandr: OPEN │ Magnite: 28ms ok │
├────────────────────────────────────────────────────────┤
│ DSP Spend & Credit │
│ TTD: $1.2M / $5M (24%) [healthy] │
│ DV360: $4.8M / $5M (96%) [approaching limit] │
├────────────────────────────────────────────────────────┤
│ Tracking: imp 310K/s, click 3.1K/s, dedup 1.1% │
│ LRU hit rate: 96.3% │ Load shed rate: 0.0% │
└────────────────────────────────────────────────────────┘
14.3 Distributed tracing
Every auction carries a trace ID through the whole lifecycle. OTel spans:
Trace: auc_01HXYZ123 (47ms total)
├── LRU enrichment (0.1ms) [hit]
├── Pre-bid filter (1ms)
├── DSP selection (0.8ms) → [ttd, dv360, amazon, criteo, xandr]
├── DSP fan-out (44ms)
│ ├── ttd req (32ms) ok bid $8.50
│ ├── dv360 req (38ms) ok bid $5.00
│ ├── amazon req (44ms) ok bid $4.00
│ ├── criteo req (41ms) ok bid $6.50
│ └── xandr req (circuit_open)
├── Auction logic (0.5ms)
├── Macro sub + response (1ms)
├── Kafka publish (async, 2ms after response)
└── [later] Impression pixel received (t+112ms)
14.4 Alerting tiers
| Tier | Trigger | Action |
|---|---|---|
| P0 (page now) | QPS drop > 50%, fill rate drop > 50%, all DSP circuits open | Page on-call + eng lead |
| P1 (page 15m) | p99 > 150 ms for 5 m, top-5 DSP circuit open, Kafka lag > 1M | Page on-call |
| P2 (Slack) | DSP credit > 90%, IVT rate > 15%, CDN hit < 85%, discrepancy > 0.005% | #exchange-ops |
| P3 (daily) | DSP no-bid rate shift > 10%, fill drop > 5%, creative rejection > 3% | Daily ops review |
15. Security
15.1 Data classification
| Data | Class | At Rest | In Transit |
|---|---|---|---|
| User IDs (exchange) | Pseudonymous PII | AES-256 | TLS 1.3 |
| IP addresses | PII | AES-256, hashed after 30d | TLS 1.3 |
| Consent strings (TCF) | Regulated PII | AES-256 | TLS 1.3 |
| Auction bid data | Confidential | AES-256 | TLS 1.3 |
| Clearing prices (in markup) | Confidential | AES-256-GCM | TLS 1.3 |
| Creative assets | Public | S3 SSE | TLS 1.3 |
| Configurations | Internal | PG TDE | TLS 1.3 |
| Billing records | Restricted | AES-256 | TLS 1.3 + mTLS |
15.2 Authentication and authorization
| Actor | Auth | Scope |
|---|---|---|
| SSPs | mTLS + API key | Bid requests |
| DSPs | mTLS certificates | Receive bid requests, submit bids, receive win notices |
| Publishers | OAuth 2.0 + MFA | Config API, revenue dashboards |
| Internal services | mTLS | Service-to-service |
| Ops | SSO + MFA (Okta) | Dashboards, DSP config, incident response |
| Billing / Finance | SSO + MFA + role restriction | Settlement reports, payments |
15.3 Price encryption
The clearing price embedded in the impression pixel is encrypted with AES-256-GCM. Without that, intermediaries (publisher, SSP, a browser extension) could read or forge it — letting them reverse-engineer bid patterns or tamper with billing.
The plaintext is the price plus a Unix timestamp. The timestamp lets the tracking endpoint reject any token older than 24 hours as a replay. Each token is encrypted under a fresh nonce, base64-encodes to ~40 characters, and is decrypted and authenticated inside the tracking endpoint before the impression event hits Kafka.
15.4 Supply chain: ads.txt, sellers.json, schain
Domain spoofing is an old ad-tech fraud: a fraudster claims to sell nytimes.com inventory while actually owning a parked low-quality domain. Three standards close the gap:
- ads.txt — published at the root of every legitimate publisher domain, listing authorized sellers. The exchange crawls these daily and refuses requests from sellers not in the matching ads.txt.
- sellers.json — the exchange's own published list of every seller it accepts, used by DSPs to verify the exchange's claims.
- schain object — attached to every bid request, listing every hop from publisher to exchange. Any unauthorized node in the chain causes a reject.
Validation is straightforward: cached ads.txt entries (refreshed daily) are keyed by publisher domain. Each request is checked for seller ID + a DIRECT or RESELLER relationship. A cache miss falls back to the configured policy — strict (reject) or permissive (allow with a flag) depending on publisher tier.
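The check described above can be sketched as a parser plus a lookup. The names (`parse_ads_txt`, `validate`, `Decision`) and the `exchange.example` ad-system domain are hypothetical. The cache is assumed to map publisher domain to the seller IDs authorized for this exchange.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    ALLOW_FLAGGED = "allow_flagged"  # permissive fallback on a cache miss
    REJECT = "reject"


def parse_ads_txt(text: str, ad_system: str) -> dict[str, str]:
    """Return {seller_id: relationship} for records naming our ad system."""
    entries: dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()       # drop comments
        fields = [f.strip() for f in line.split(",")]
        if len(fields) < 3:                        # blank lines, variables
            continue
        domain, seller_id, rel = fields[0].lower(), fields[1], fields[2].upper()
        if domain == ad_system and rel in ("DIRECT", "RESELLER"):
            entries[seller_id] = rel
    return entries


def validate(cache: dict[str, dict[str, str]], publisher_domain: str,
             seller_id: str, strict: bool) -> Decision:
    """Check a request's seller against the cached ads.txt for its domain."""
    entries = cache.get(publisher_domain)
    if entries is None:
        # Cache miss: fall back to the publisher tier's configured policy.
        return Decision.REJECT if strict else Decision.ALLOW_FLAGGED
    return Decision.ALLOW if seller_id in entries else Decision.REJECT
```

Note the asymmetry: a seller missing from a cached file is always rejected, while the strict or permissive policy only applies when there is no cached file for the domain at all.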
15.5 Network policy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: auction-server-policy
spec:
  podSelector:
    matchLabels: { app: auction-server }
  policyTypes: [Ingress, Egress]
  ingress:
    - from: [{ namespaceSelector: { matchLabels: { name: load-balancer } } }]
      ports: [{ port: 8080 }]
  egress:
    - to: [{ namespaceSelector: { matchLabels: { name: data-plane } } }]
      ports: [{ port: 6379 }, { port: 9092 }]  # Valkey + Kafka
    - to: [{ ipBlock: { cidr: 0.0.0.0/0 } }]   # DSPs (external)
      ports: [{ port: 443 }]
15.6 Audit logging
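Every auction result, DSP bid, impression, and billing event writes to Kafka with append-only semantics. Audit topics use min.insync.replicas=3 and producers use acks=all, so an acknowledged write exists on at least three replicas and survives broker failures. This assumes a replication factor above 3; with exactly three replicas, losing one broker would block audit writes until it rejoins the in-sync set.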
The audit log archives to a separate AWS account with S3 Object Lock enabled — data is write-once and can't be altered or deleted by anyone in the main account, admins included. Retention is 7 years for billing records, 2 years for bid-level logs.
Explore the Technologies
Dive deeper into the technologies and infrastructure patterns used in this design:
Core Technologies
| Technology | Role in This Design | Learn More |
|---|---|---|
| Valkey | Cold-path enrichment, DSP credit state, fraud lists, creative dedup, circuit breakers | Redis/Valkey |
| Kafka | Impressions, clicks, winning bids, sampled losing bids, DSP config distribution | Kafka |
| ClickHouse | Real-time spend dashboards, billing aggregation, analytics | ClickHouse |
| PostgreSQL | DSP config, publisher settings, SSP registrations, settlement records | PostgreSQL |
| Flink | Per-DSP spend aggregation, rolling win-rate computation for smart selection | Flink |
Infrastructure Patterns
| Pattern | Relevance | Learn More |
|---|---|---|
| CDN and edge caching | Creative delivery via CloudFront with > 95% hit rate | CDN |
| Circuit breaker | Per-DSP isolation of slow or failing demand partners | Circuit Breakers |
| Message queues | Kafka as universal event bus, sampled writes | Message Queues |
| Load balancing | L4 LB across auction pods at 1M QPS | Load Balancing |
| Load shedding | Drop low-value auctions to preserve core path | Load Shedding |
Further Reading
- OpenRTB 2.6 Specification (IAB Tech Lab) — Industry-standard protocol for programmatic ad bidding
- VAST 4.2 Specification — Video Ad Serving Template
- MRC Viewability Standards — Viewability measurement
- ads.txt and sellers.json (IAB) — Supply chain transparency
- Google Ad Manager Architecture — Reference architecture for large-scale ad serving