System Design: Online Auction (50K Bids/sec, Effectively-Once Settlement, Anti-Sniping)
Goal
A real-time online auction platform.
Scale:
- 10M active listings
- 50K bids/sec at peak, 1.7K bids/sec average
- 1M concurrent WebSocket watchers
- Sub-200ms regional p99 bid confirmation and broadcast, 99.99% availability (cross-region readers see +100 ms)
Features:
- English, Dutch, and sealed-bid auction types
- Proxy (auto) bidding
- Anti-sniping extension
- Effectively-once settlement converging on a single committed winner
TL;DR
- The API validates cheaply and writes bids to Kafka partitioned by `auction_id`, returning 202.
- A per-partition bid processor runs an atomic Valkey Lua CAS that accepts or rejects the bid, dedups by `bid_id`, and assigns a per-auction sequence number.
- Accepted bids fan out over Valkey sharded Pub/Sub to WebSocket gateways in under 200 ms.
- Flink keyed-timers fire at auction end into a settlement consumer that uses a fencing token plus a stable payment idempotency key to make settlement effectively-once.
- Postgres is the source of truth; Valkey is the hot-path coordinator; Kafka is the delivery bus.
Pick a path
| Time | Read | Covers |
|---|---|---|
| ~10 min | TL;DR + §4 | End-to-end flow, who enforces what, where the race conditions live |
| ~30 min | TL;DR, §4, §5, §9, §10 | Stack tradeoffs, effectively-once settlement, bid processing model |
| ~60 min | Full post | Every decision plus anti-sniping, proxy resolution, multi-region, ops |
Architecture at a glance
Four flows share one state plane. The write side serializes per auction in Valkey. The read side fans out over Pub/Sub. The timer side settles at auction end.
Correctness lives in two places only: the Valkey CAS (who wins a bid) and the fencing token on the auction row (whose settlement commits). Everything else is delivery and fan-out.
1. Problem Statement
An auction platform sounds simple until the last ten seconds of a popular listing.
A rare sneaker auction is ending. 500 users are watching. In the final 30 seconds, 50 bids arrive within a 2-second window. Each must be validated against a price that is changing bid by bid, processed in strict arrival order, and broadcast to all 500 watchers within 200 ms. If two users bid $105 when the current price is $100, only one commits and the other is told "outbid, the price is now $105." Not both. Not neither.
That is the central challenge: strongly serialized writes per auction combined with low-latency fan-out to thousands of readers, all while settlement at auction end is idempotent in the face of crashes.
Four problems drive the design.
Concurrent bids on the same auction. Two users click "Bid $105" at the same instant when the current price is $100. Without concurrency control, both bids pass the "$105 > $100" check and both get accepted. The fix is optimistic concurrency: every bid carries the price it expected to see, and acceptance is conditional on that value still being current. Under Valkey's single-threaded execution, an atomic Lua script gives per-auction serialization for free.
Bid sniping. A user places a bid in the final second, leaving no one time to respond. Some platforms accept this as legitimate strategy. Fairer-outcome platforms (eBay Live, Catawiki) extend the auction by a short window when a bid lands near the end. Anti-sniping is configurable per auction.
Settlement must be effectively-once. When time runs out, the system must converge on a single committed winner after retries settle, and the payment provider must end up with a single captured charge once it acks. A settlement job that crashes between "write SOLD" and "call Stripe" must restart without double-charging. The solution is a fencing token plus target-side idempotency. Standard pattern for any job that terminates with an external side effect.
Real-time broadcast at scale. 1M concurrent WebSocket connections across a fleet of stateless gateway pods. Every accepted bid must reach every watcher of that auction within 200 ms. Polling is not an option. The fan-out path is Valkey Pub/Sub, with each gateway pod only subscribing to channels for auctions its users care about.
Scale targets.
- 10M active listings at any time
- 50K bids/sec at peak, 1.7K bids/sec average (30× ratio driven by evening prime-time)
- 1M concurrent WebSocket watchers
- Average auction duration 7 days; minimum 1 hour; maximum 30 days
- Bid confirmation and broadcast latency: <200 ms p99
- 99.99% availability for bid processing
2. Functional Requirements
| ID | Requirement | Priority |
|---|---|---|
| FR-01 | Create auction listings: title, description, images, starting price, reserve price, bid increment, start and end times, auction type | P0 |
| FR-02 | Place bids on active auctions with real-time validation against current highest | P0 |
| FR-03 | Real-time bid updates pushed to watchers over WebSocket within 200 ms | P0 |
| FR-04 | Anti-sniping: extend auction end time by a configurable amount when a bid arrives within the final window | P0 |
| FR-05 | Effectively-once settlement: winner determination, reserve check, payment capture | P0 |
| FR-06 | English auction: ascending bids, highest wins | P0 |
| FR-07 | Dutch auction: price drops on a schedule, first to accept wins | P1 |
| FR-08 | Sealed-bid auction: blind bids, revealed at close, highest wins | P1 |
| FR-09 | Proxy bidding: user sets a max, system auto-bids the minimum increment on their behalf | P1 |
| FR-10 | Watchlist: users subscribe to auctions and receive notifications on key events | P1 |
| FR-11 | Bid history: full audit trail of bids per auction | P0 |
| FR-12 | Reserve price: sale only completes if final bid meets the seller's hidden minimum | P0 |
| FR-13 | Search and browse by category, price range, ending soon, newly listed | P1 |
| FR-14 | Bid retraction within policy window | P2 |
3. Non-Functional Requirements
| ID | Requirement | Target |
|---|---|---|
| NFR-01 | Bid processing throughput | 50K bids/sec peak, 1.7K average |
| NFR-02 | Bid confirmation latency (regional p50 / p99) | 60 ms / 200 ms (cross-region readers see +100 ms) |
| NFR-03 | Bid broadcast latency (regional p99, acceptance to watcher frame) | <200 ms |
| NFR-04 | Active concurrent auctions | 10M |
| NFR-05 | Concurrent WebSocket connections | 1M |
| NFR-06 | Bid processing availability | 99.99% (52 min/year) |
| NFR-07 | Settlement guarantee | Effectively-once (one SOLD row, one captured charge) |
| NFR-08 | Bid data durability | Zero loss once the API returns 202 |
| NFR-09 | Anti-sniping timer precision | <1 s drift |
| NFR-10 | Recovery Time Objective | <30 s for bid processor partition rebalance |
| NFR-11 | Recovery Point Objective | 0 for accepted bids |
| NFR-12 | Retention | Bids: hot 90 days in Postgres, archive to S3, drop after 2 years |
| NFR-13 | Geography | Multi-region active reads; bid writes pinned per-auction to a single region |
| NFR-14 | Search latency | <500 ms p99 |
[3.1] Traffic and workload assumptions
- Median bids per auction ~15; mean ~105. The distribution is long-tailed: most listings end quiet, a small fraction of hot listings pull the mean up sharply. Downstream math (§6.1) uses the mean.
- 3% of auctions end in any given hour during evening prime time.
- Hot auctions (top 0.01%) can take 100-500 bids/sec in the final minute.
- Payment provider (Stripe-equivalent) supports idempotency keys and 2xx/4xx responses within 1 s p99.
- Watchers per auction: average 30, hot auction up to 5K.
- Clients resolve a regional endpoint via DNS; the chosen region processes the bid (auction is pinned to its region).
4. End-to-End Architecture
Shape: a per-key-serialized transactional write path with an event-driven fan-out sidecar for reads. Four flows:
- Submit (bid write path)
- Process (per-auction serial consumer)
- Broadcast (WebSocket fan-out)
- Settle (auction end to payment)
Each part does one thing. Correctness lives in two places only: the Valkey CAS (who wins a bid) and the fencing token on the auction row (whose settlement commits).
Each flow gets its own diagram under the subsection that describes it. Start with the write path.
[4.1] Submit (write path)
Client → API → Kafka → Bid Processor → Valkey + Postgres → Kafka bids.accepted
When a bid request arrives, the API does a small set of cheap checks and gets out of the way:
- Auth, per-user rate limits (10/sec, 200/min), and risk-tier check with payment hold sized to item value (§9.11).
- Load the auction summary from Valkey: `HMGET auction:{id} status current_end_time auction_type`. If missing, fall back to Postgres.
- Reject early if `status != ACTIVE` or `now > current_end_time`. These are fast rejects that do not enter Kafka.
- Produce to Kafka topic `bids.incoming`, partition key = `auction_id`. Message body: `{bid_id, auction_id, bidder_id, amount, expected_price, idempotency_key, client_ts, server_ts}`.
- Return `202 Accepted` with `{bid_id, status: "QUEUED"}`.
Important: the API does not validate the bid amount. It does not read the current price. That check happens inside the bid processor, under the Valkey CAS. Validating at the API would introduce a race window: by the time the bid reaches the processor, the price may already have moved.
Kafka is not the source of truth. Bid acceptance is decided by Valkey; durability lives in Postgres.
[4.2] Process (bid processor fleet)
Per-partition consumers. Under steady state, one active processor instance owns each Kafka partition. During a rebalance the assignment can briefly overlap; the Valkey CAS makes the overlap safe (the second attempt sees a moved price and rejects). With auction_id as the partition key, every bid for a given auction lands on the same partition and is processed in arrival order.
For each Kafka message:
- Read the message. Do not commit the offset yet.
- Run an atomic Lua script against Valkey with two keys: `auction:{id}` (state hash) and `bid_result:{bid_id}` (dedup cache). Full script in Appendix A. In outline:
  - `SET bid_result:{bid_id} <placeholder> NX EX <ttl>`. If the key already exists, return the cached result. That is the redelivery dedup.
  - Check auction status and end time; reject `AUCTION_CLOSED` if closed.
  - Check `expected_price` matches current; reject `STALE_EXPECTED_PRICE` if not.
  - Check bid ≥ current + min_increment; reject `BID_TOO_LOW` if not.
  - On acceptance: `HINCRBY` the `sequence_num`, update `current_price` and `high_bidder`, and if inside the anti-snipe window extend `current_end_time`.
  - Cache the final result into `bid_result:{bid_id}` before returning.

  The script is atomic under Valkey's single-threaded execution. No two scripts race on the same key.
- Accepted path.
  a. Write the bid row to Postgres with `status = 'ACCEPTED'` and the assigned `sequence_num`. The partial unique index on accepted bids (§7.2) keeps `sequence_num` gap-free.
  b. Publish to `bids.accepted` on Kafka. Downstream consumers are the broadcast gateway, proxy-bid resolver, search indexer, analytics pipe, and notification service.
  c. If the script returned `EXTENDED`, also publish to `auctions.end_time_changed` so the Flink timer service re-arms.
- Rejected path. Write the bid row with `status = 'REJECTED'`, `sequence_num = NULL`, and the rejection reason. Emit a `bid_result` event over the client's WebSocket carrying `{bid_id, status: "REJECTED", reason, current_price, end_time}`. The HTTP POST already returned 202 at ingress (§8.1); the terminal outcome always rides the WebSocket.
- Commit the Kafka offset. Only after the Postgres write succeeds.
Redelivery dedup. The `bid_id` and `idempotency_key` come from the API. If Kafka redelivers the same message after the CAS already ran, the `bid_result:{bid_id}` NX check short-circuits the script and returns the cached outcome. Without it, the second attempt would see a moved `current_price` and reject a bid that was actually accepted. TTL is auction_end + 48 h so the cache outlives settlement retries (§17.1).
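The accept/reject decision is small enough to model as a pure function. Below is a minimal Python sketch of the same logic for illustration only; the authoritative version is the atomic Lua script in Appendix A, and the `AuctionState` type and function names here are invented:

```python
from dataclasses import dataclass

@dataclass
class AuctionState:
    status: str
    current_price: float
    min_increment: float
    end_time: float          # epoch seconds
    sequence_num: int
    anti_snipe_window: float  # seconds before end that triggers extension
    anti_snipe_extend: float  # seconds added on a late bid

def process_bid(state, amount, expected_price, now, result_cache, bid_id):
    """Mirror of the CAS script: dedup, validate, accept, anti-snipe extend."""
    if bid_id in result_cache:            # redelivery dedup (SET NX in Valkey)
        return result_cache[bid_id]
    if state.status != "ACTIVE" or now > state.end_time:
        outcome = ("REJECTED", "AUCTION_CLOSED")
    elif expected_price != state.current_price:
        outcome = ("REJECTED", "STALE_EXPECTED_PRICE")
    elif amount < state.current_price + state.min_increment:
        outcome = ("REJECTED", "BID_TOO_LOW")
    else:
        state.sequence_num += 1
        state.current_price = amount
        extended = (state.end_time - now) <= state.anti_snipe_window
        if extended:
            state.end_time += state.anti_snipe_extend
        outcome = ("ACCEPTED", state.sequence_num, extended)
    result_cache[bid_id] = outcome        # cache the result before returning
    return outcome
```

Two $105 bids at a $100 price illustrate the race from §1: the first is accepted, the second rejects with `STALE_EXPECTED_PRICE`, and a redelivery of the first returns the cached acceptance instead of re-running validation.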
[4.3] Broadcast (WebSocket fan-out)
Accepted bids have to reach watchers in under 200 ms. Pipeline:
- Bid processor publishes `bids.accepted` to Kafka.
- A small fan-out service consumes `bids.accepted` and does a `PUBLISH auction:{id}:updates <payload>` on Valkey Pub/Sub. Payload is the bid summary: `{sequence_num, current_price, high_bidder_masked, end_time, time_remaining}`.
- WebSocket gateway pods subscribe to `auction:{id}:updates` only for auctions their connected users are watching. Each pod keeps a `SUBSCRIBE` per active auction in its connection pool.
- On `PUBLISH`, each subscribed pod pushes a frame to every local connection watching that auction.
Why Valkey Pub/Sub and not Kafka consumers per pod? Pub/Sub is sub-millisecond per hop, and each gateway pod only subscribes to the ~500-5000 auctions its users actually care about. With Kafka, every pod would consume the full bids.accepted stream and filter client-side, burning CPU and bandwidth.
Cluster-mode note. On Valkey Cluster, plain PUBLISH broadcasts to every node in the cluster, which defeats the point. Use sharded pub/sub (SPUBLISH/SSUBSCRIBE, Valkey 7+) so the message stays on the shard that owns auction:{id}. In a multi-cluster deployment, the pub/sub bus can also run on a separate single-shard Valkey instance to decouple broadcast load from the CAS cluster.
Reconnection story: every push carries sequence_num. On reconnect, the client sends last_seen_seq and the gateway fetches any missing bids from Postgres (SELECT ... WHERE auction_id = ? AND sequence_num > ?) before resuming the live stream. No bids skipped, no duplicates at the client.
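The reconnect merge is just a sequence-number watermark. A sketch of the gateway-side logic under the assumption that `backfill` holds the Postgres rows with `sequence_num > last_seen_seq` and `live` is the buffered Pub/Sub feed (function and field names are illustrative):

```python
def resume_stream(last_seen_seq, backfill, live):
    """Merge DB backfill with the live feed after a reconnect.

    Drops frames the client already saw and any overlap between the
    backfill query and the live subscription, so the client receives
    each sequence number exactly once, in order.
    """
    out, seen = [], last_seen_seq
    for frame in sorted(backfill, key=lambda f: f["seq"]):
        if frame["seq"] > seen:
            out.append(frame)
            seen = frame["seq"]
    for frame in live:                 # live frames arrive already ordered
        if frame["seq"] > seen:        # skip frames the backfill covered
            out.append(frame)
            seen = frame["seq"]
    return out
```

For example, a client that last saw seq 4 while the backfill returns 5-6 and the live buffer holds 6-7 receives exactly 5, 6, 7.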
Presence and stale subscriptions. Each WebSocket gateway pod tracks its active subscriptions in ws:{pod_id}:subscriptions (§7.5) with a 60 s TTL refreshed on every heartbeat. A reaper runs every 30 s per pod and issues SUNSUBSCRIBE for any channel with no live local connection. On pod crash, the TTL on the pod's subscription set expires within a minute and the broadcast gateway stops publishing to channels no pod serves. Without this, idle SPUBLISH traffic grows unbounded as users navigate away without clean disconnects.
[4.4] Settle (auction end)
Settlement is where duplicates hurt: a double settlement charges the winner twice or picks two winners. The full guarantee chain is covered in §9.
Flink runs a keyed timer service. The key is auction_id. When an auction is accepted or its end time changes, a corresponding timer is re-armed in Flink state. At firing time, Flink emits an auctions.ending event. A settlement consumer picks it up and:
- Atomically increment the fencing token in Valkey: `token = INCR fence:auction:{id}`.
- Read the winning bid from Postgres: the highest-amount `ACCEPTED` bid with the lowest `sequence_num` as tiebreaker.
- Validate reserve price. If not met, mark the auction `UNSOLD` and stop.
- Conditional Postgres write, guarded by the fencing token:

  ```sql
  UPDATE auctions
  SET status = 'SOLD',
      winner_id = $winner,
      final_price = $price,
      settlement_fence = $token
  WHERE id = $auction_id
    AND (settlement_fence IS NULL OR settlement_fence < $token)
    AND status = 'CLOSED';
  ```

  If zero rows update, a later attempt has already won. Stop.
- Call the payment provider with `Idempotency-Key: settle-{auction_id}`. The key is deliberately tied to the auction, not the attempt: a stable key is what lets Stripe / Adyen return the original response on retry. See §9.2 for why including the fencing token in the key breaks the guarantee.
- Update `settlement_status = 'PAYMENT_CAPTURED'`. Emit `auctions.sold`.
A crashed settlement re-fires. The fencing token blocks stale writes. The idempotency key blocks duplicate charges. Both together give effectively-once settlement.
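A toy model of the two guards, reduced to in-memory dictionaries (no Postgres, no real payment API; names are invented for illustration), shows why the combination works: the fence rejects the stale writer, and the stable key makes a retried capture return the original result instead of charging again.

```python
def try_commit(row, token):
    """Conditional write: only the newest fencing token may flip CLOSED -> SOLD."""
    if row["status"] == "CLOSED" and (row["fence"] is None or row["fence"] < token):
        row["status"] = "SOLD"
        row["fence"] = token
        return True            # this attempt owns the settlement
    return False               # zero rows updated: a later attempt already won

def capture(charges, auction_id):
    """Payment call keyed by the stable idempotency key settle-{auction_id}.

    The provider replays the stored response for a known key, so a
    retried capture never produces a second charge.
    """
    key = f"settle-{auction_id}"
    if key not in charges:
        charges[key] = "CAPTURED"   # first call actually charges
    return charges[key]             # every later call replays the result
```

With two racing settlement attempts holding tokens 1 and 2, whichever commits first wins the row; the other sees zero rows updated; and no matter how many times `capture` retries, exactly one charge exists.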
[4.5] Trace a bid
To anchor the abstract flow, here is one real bid in wall-clock time. A rare watch auction, current price $12,400. User jan clicks "Bid $12,450" at t=0.
| Time | Layer | Event |
|---|---|---|
| 0 ms | Browser | Client sends POST /auctions/a1b2/bids with {amount: 12450, expected_price: 12400, Idempotency-Key: bid-...}. |
| 4 ms | API gateway | Auth cache hit, rate-limit OK, risk-tier A, preauth hold fires asynchronously (bid is $12,450, above the $1K threshold). Gateway does not block on it for Tier A. |
| 5 ms | API gateway | Produces to bids.incoming, partition 137 (hash("a1b2") % 400 = 137). Returns 202 {bid_id, status: QUEUED}. |
| 18 ms | Bid processor (partition 137) | Consumes message. Runs Lua CAS on auction:a1b2. Script reads current_price=12400, status=ACTIVE, expected_price matches. Increments sequence_num to 847, writes new current_price=12450, high_bidder=jan. Returns {1, 847, 12450, end_time, OK}. Time remaining 42 s, outside anti-snipe window. |
| 22 ms | Bid processor | Inserts into bids table. Partial unique index (auction_id, sequence_num) WHERE status='ACCEPTED' (§7.2) confirms first write. |
| 28 ms | Bid processor | Produces to bids.accepted. Commits Kafka offset. |
| 30 ms | Broadcast gateway | Consumes bids.accepted. PUBLISH auction:a1b2:updates with payload {seq: 847, price: 12450, high: "jan", end_time, time_left: 42}. |
| 32 ms | Valkey Pub/Sub | Fans out to 17 WebSocket gateway pods that have subscribed to this auction's channel. |
| 35 ms | Each gateway pod | Writes a frame to every local connection watching this auction. ~500 total watchers, ~30 per pod avg. |
| 65 ms | Watcher client | Receives frame, updates UI. "Outbid" notification fires for the previous high bidder. |
| 140 ms | Client (jan) | Browser receives bid_result: ACCEPTED, seq: 847 on its WebSocket. UI confirms the bid is live. |
The p99 path is 200 ms. This one was 65 ms end-to-end because Valkey, Postgres, and Kafka were all warm and the user was in the auction's home region.
[4.6] Correctness guarantees
Postgres is where truth lives. Valkey is the hot-path coordinator; if it vanishes, a new Valkey is hydrated from Postgres (current_price, high_bidder, current_end_time, sequence_num are derivable from the bids table with a MAX). Hot-start hydrate takes minutes for 10M auctions and is gated by Postgres scan throughput; during that window, new bids reject with 503 and watchers stay on the last cached state. Kafka is the delivery layer and holds no state that isn't also in Postgres.
Protection layers, in order of the bid's lifetime:
- API rate limit prevents one user from burying a partition.
- Valkey CAS script serializes bids per auction and rejects stale `expected_price`.
- Postgres UNIQUE `(auction_id, sequence_num)` dedupes Kafka redelivery.
- Fencing token on `auctions.settlement_fence` prevents duplicate settlement commits.
- Idempotency key at the payment provider prevents duplicate charges.
The result: effectively-once settlement. Exactly-once is not guaranteed across the payment boundary (the payment provider is the authority on that). What is guaranteed is that only one SOLD row exists per auction and only one capture call is ever committed as "charged."
[4.7] Retraction and cancellation (cross-cutting)
Bid retraction is a legal requirement on many platforms (eBay allows it within rules). It invalidates an ACCEPTED bid and, if that bid is the current highest, forces the auction state to recompute.
Flow:
- API writes `UPDATE bids SET status = 'RETRACTED' WHERE id = ? AND bidder_id = ?` and emits `bids.retracted`.
- A retraction handler runs an atomic Lua script on the auction's Valkey state (same per-key serialization as bid acceptance), reads the top two bids from Postgres, and if the retracted bid was the current high, rolls `current_price` and `high_bidder` back within the same script.
- Broadcast the correction: `PUBLISH auction:{id}:updates <retraction+new_high>`.
Retraction rules are business policy, not infrastructure: time windows, max retractions per auction, mandatory reason. Enforcement lives in the API validator.
Auction cancellation (seller withdraws a listing before bids arrive):
- `UPDATE auctions SET status = 'CANCELLED'`
- Valkey state hash deleted
- Flink timer cancelled
- Any in-flight bids reject on the next CAS attempt (the Lua script checks status)
[4.8] What is a "bid"?
A bid is just an intent to pay a price. The system does not care whether it came from a human clicking a button or a proxy agent cascading an auto-bid. From Postgres's view every bid has the same row shape: (auction_id, bidder_id, amount, sequence_num, status, bid_type, created_at).
bid_type routes the bid to one of three origin modes the processor knows:
| Mode | Origin | Notes |
|---|---|---|
| manual (default) | Human click via API | Carries expected_price from client UI |
| proxy | Auto-bid fired by proxy resolver | Triggered by another user's bid crossing a standing max |
| dutch_accept | Dutch-auction "accept current price" click | No expected_price; price is taken from the scheduled drop |
All three modes go through the same Kafka topic, the same CAS script, and the same Postgres writes. Only the caller differs.
Proxy bid cascade. When a bid is accepted, a proxy-bid-resolver consumer reads the proxy_bids table for the auction. If another user's standing max is above the new price, the resolver submits the next bid (current_price + min_increment) on that user's behalf through the same API path. Cascading proxies terminate when only one active max remains above the current price.
From the bid processor's view, every bid is the same row shape. The origin mode only decides who issued it.
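The cascade described above terminates in a small loop. A sketch under simplifying assumptions (in reality each proxy bid travels through the normal API/Kafka path; here the whole cascade is resolved in memory, and all names are illustrative):

```python
def resolve_proxies(current_price, high_bidder, increment, maxes):
    """Cascade standing proxy maxes until no one can top the current price.

    `maxes` maps bidder -> standing maximum. Each round, the strongest
    remaining contender bids exactly current_price + increment, matching
    the 'minimum increment on their behalf' rule. Returns the final
    (price, high_bidder, bids_fired).
    """
    fired = []
    while True:
        contenders = {b: m for b, m in maxes.items()
                      if b != high_bidder and m >= current_price + increment}
        if not contenders:
            return current_price, high_bidder, fired
        bidder = max(contenders, key=contenders.get)
        current_price += increment          # proxy bids the minimum increment
        high_bidder = bidder
        fired.append((bidder, current_price))
```

Starting at $100 with a $5 increment and standing maxes A=$120, B=$111, the cascade fires A@105, B@110, A@115 and stops: B's max is exhausted, A holds the high bid at roughly the second-highest max plus one increment.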
[4.9] What this design intentionally avoids
Every system-design deep dive picks a scope. Being explicit about what's out of scope sharpens what's in:
- Sub-50 ms bid confirmation. Not an SLO. 200 ms p99 regional is the bar. Going lower means sacrificing durable bid persistence or crossing to a HFT-style architecture, neither of which fits the product.
- Global real-time search. Browse is eventually consistent (Elasticsearch via CDC, <60 s lag). A user who places a bid and immediately searches by title may not find the listing for up to a minute. Acceptable.
- Peer-to-peer bidding or on-chain settlement. Escrow, KYC, regulatory obligations require a central authority. The platform is the party to every transaction.
- Exactly-once across the payment boundary. The payment provider is an independent authority. The platform sends idempotency keys; the provider's contract is what makes captures effectively-once.
- Cross-region active-active writes. Auctions are pinned to one region for their lifetime. Multi-region failover is a 5-10 min manual operation, not automatic.
- Live bid streaming to anonymous browsers. Watchers authenticate. No WebSocket without an account. Keeps abuse and bot scraping bounded.
[4.10] Store roles
| Store | Technology | What it holds | Why it fits |
|---|---|---|---|
| Source of truth | Postgres 17 | Auctions, bids, settlements, users | ACID, partitioning, proven at this write volume |
| Hot state | Valkey 8 | Per-auction state hash, CAS target, sequence counter, fencing counter, Pub/Sub channels | Single-threaded Lua = free per-key serialization, sub-ms latency |
| Event bus | Kafka 4.0 (KRaft) | bids.incoming, bids.accepted, auctions.ending, auctions.sold | Partition ordering per auction_id, durable replay, mature ecosystem |
| Timer service | Flink 1.19 | Keyed-state timers per auction, settlement pipeline | Exactly-once for internal state and timer firing; payment side effects made effectively-once via fencing + idempotency, not by Flink |
| Coordination | Postgres advisory lock | Settlement coordinator leader election | No extra service; etcd is the upgrade path if multi-region coordination is needed |
| Analytics | ClickHouse | Bid history aggregations, seller dashboards, trending | Columnar, fast over billions of rows |
| Search | Elasticsearch | Auction browse, faceted search, ending-soon lists | Full-text, geo, faceting |
| Objects | S3 | Auction images, archived bid logs | Durable, cheap, CDN-friendly |
5. Technology Selection
[5.1] What shape is this system?
The workload is a real-time transactional system with event-driven fan-out. The write path needs strong serialization per auction. The read path needs horizontal broadcast to thousands of watchers. Both together map naturally to CQRS: one write model (bid processor) owns the canonical state; many read models (WebSocket, search, analytics) derive from a single event stream.
"Serialized per auction" does not mean "serialized globally." With 10M auctions active, the global bid rate is 50K/sec, but a given auction sees at most 500/sec. Per-auction serialization is cheap; per-auction database locks across a shared row are not. Valkey's single-threaded execution is the right primitive.
[5.2] The simpler version (don't skip this)
Before building Kafka + Valkey + Flink, ask whether the scale requires it.
Postgres-only variant. Works up to ~500 bids/sec across all auctions.
- Accept bids through an API that does `SELECT ... FOR UPDATE` on the auction row.
- Validate the bid against `current_price` and `current_end_time`.
- Insert into `bids`, update `auctions.current_price`, commit.
- Broadcast via Postgres `LISTEN/NOTIFY` to a small fan-out service that pushes over WebSocket.
- Settlement via `pg_cron` firing a SQL function at `current_end_time`.
Everything in one database. No Valkey. No Kafka. No Flink. The FOR UPDATE row lock serializes per-auction the same way Valkey's single-threadedness does, just with higher latency and a cap on concurrency.
When to graduate. The Postgres-only path falls over when:
- Hot auctions exceed ~50 bids/sec (lock contention + connection pool saturation).
- WebSocket watchers exceed ~10K (NOTIFY fan-out isn't designed for this).
- Peak bid rate exceeds ~500/sec total (database becomes the bottleneck).
At that point, the staged path makes sense: add Valkey for the hot path first, keep Postgres as source of truth. Add Kafka to decouple API latency from processor throughput. Add Flink when settlement complexity outgrows pg_cron.
The rest of this post describes the full-scale version. Most teams building this will not need it on day one.
[5.3] Store selection
| Concern | Chosen | Rejected |
|---|---|---|
| Source of truth | Postgres 17 | CockroachDB (unnecessary global consistency overhead), MySQL (weaker partitioning story) |
| Hot auction state | Valkey 8 | DynamoDB conditional write (5-10 ms vs sub-ms), Redis (license + Valkey is the forked OSS continuation) |
| Event bus | Kafka 4.0 KRaft | RabbitMQ (no partition ordering at this scale), Pulsar (viable alternative; see note below) |
| Timer service | Flink 1.19 | Quartz (single-node, doesn't survive a crash), pg_cron (doesn't scale past the simpler variant) |
| Settlement coordinator leader | Postgres advisory lock | ZooKeeper (heavier), etcd (great, but unnecessary second service for single-region) |
Pulsar as alternative to Kafka. Pulsar's per-message ack and shared subscriptions remove the per-partition bottleneck. Any number of consumers can share a single topic. Per-auction ordering is still required, which Pulsar's Key_Shared subscription provides without the partition count constraint. The cost is operational weight: BookKeeper dependency, smaller ecosystem, thinner managed offerings. Kafka wins on ecosystem maturity and production track record.
[5.4] Build vs buy
- API gateway: build. Off-the-shelf gateways do not enforce the exact validation + idempotency + Kafka produce semantics required here.
- Bid processor: build. Core of the system; no vendor substitute exists.
- WebSocket gateway: build on a proven framework (Go + gorilla/websocket, or Rust + tokio-tungstenite). Do not hand-roll TCP framing.
- Payment: buy. Stripe, Adyen, or equivalent. Never build a card-data system unless payments is the product.
- Search: buy. Elasticsearch managed (Elastic Cloud, Opensearch on AWS).
- Analytics: buy-or-self-host ClickHouse. ClickHouse Cloud for pure pain-avoidance; self-host when cost dominates.
6. Back-of-the-Envelope
[6.1] Throughput
Active auctions: 10,000,000
Avg auction duration: 7 days
Completed per day: 10M / 7 ≈ 1.43M
Mean bids per auction: ~105 (median is ~15; hot listings pull the mean up)
Avg bid rate: 1.43M × 105 / 86400 ≈ 1,740 bids/sec → round to 1,700 bids/sec
Peak: 30× average driven by evening end-of-auction clustering.
Peak bid rate: 50,000 bids/sec
Hot-auction rate: top 0.01% of auctions in their final minute = 100-500 bids/sec per auction.
Daily volume: 1,700 × 86400 ≈ 147M bids/day, matching the 55B rows/year used in §6.3.
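The arithmetic above, reproduced as a quick sanity check:

```python
active = 10_000_000
avg_duration_days = 7
completed_per_day = active / avg_duration_days      # ~1.43M auctions end per day
mean_bids = 105
avg_rate = completed_per_day * mean_bids / 86_400   # ~1,736 bids/sec, rounded to 1,700
daily_bids = 1_700 * 86_400                         # ~147M bids/day at the rounded rate
```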
[6.2] Bid processor sizing
One consumer per Kafka partition. Per-message work, sequential: Kafka consume (1 ms) + Valkey Lua script (0.5 ms) + Postgres insert (3 ms) + Kafka produce (1 ms) = 5.5 ms, or ~180 bids/sec per consumer.
Target peak: 50,000 bids/sec
Per-consumer capacity (sequential): 180 bids/sec
Consumers needed: 50,000 / 180 ≈ 280
Round up for headroom and rebalance buffer: 400 partitions, 400 consumers
With batched Postgres inserts (§15.5, 10-50 bids per batch), per-consumer throughput rises past 500/sec. The 400-partition count is a hedge for partition-bound parallelism (§15.4) and rebalance tolerance, not the raw throughput floor. Each pod is tiny: 1 vCPU, 512 MB RAM. KEDA scales on consumer lag.
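The same sizing, as arithmetic:

```python
import math

per_msg_ms = 1 + 0.5 + 3 + 1                   # consume + Lua CAS + insert + produce
per_consumer = math.floor(1000 / per_msg_ms)   # 181 bids/sec, quoted above as ~180
needed = math.ceil(50_000 / 180)               # 278 consumers at the quoted rate
# Rounded up to 400 partitions/consumers for headroom and rebalance buffer.
```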
[6.3] Postgres storage
bids table:
150M bids/day × 365 days = ~55B rows/year
Row size: ~250 bytes (ids, amount, fencing/seq, timestamps, status)
Hot 90 days: 13.5B rows × 250 B = 3.4 TB
With indexes (2.5×): ~8.5 TB
Monthly partitions: ~1 TB each
Archive to S3 after 90 days; drop partition after 2 years
auctions table:
10M active + ~50M archived per quarter = 60M rows
Row size: ~2 KB with images JSONB reference (not content)
Total: ~120 GB. Small relative to bids.
settlements table:
1M settlements/day × 365 = 365M/year
Row size: ~400 B
Annual: 150 GB
users table: 50M × 500 B = 25 GB.
[6.4] Valkey memory
Active auction hash (per auction):
Fields: current_price, high_bidder, min_increment, current_end_time, status, bid_count, reserve_price, auction_type, anti_snipe_*, sequence_num
Size: ~500 B per hash
10M × 500 B = 5 GB
Fencing counters: ~16 B per auction × 10M = 160 MB
Proxy sorted sets: ~500K auctions with active proxies × ~1 KB = 500 MB
Pub/Sub: ephemeral, negligible memory for idle channels
Other overhead: 500 MB
Total working set: ~7 GB
Cluster: 3 primaries + 3 replicas × 16 GB = 96 GB raw, 48 GB primary-side. ~6× headroom on primaries; replicas give the availability target.
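The working-set math, spelled out (the ~7 GB figure above is this sum rounded up):

```python
hashes  = 10_000_000 * 500 / 1e9     # 5.0 GB of per-auction state hashes
fences  = 10_000_000 * 16 / 1e9      # 0.16 GB of fencing counters
proxies = 500_000 * 1_000 / 1e9      # 0.5 GB of proxy sorted sets
overhead = 0.5                       # GB, misc
total = hashes + fences + proxies + overhead   # ~6.2 GB, quoted as ~7 GB
```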
[6.5] Kafka
bids.incoming:
50K msg/sec × 500 B = 25 MB/sec
400 partitions (one per processor)
Retention: 24 hours
Daily volume: 2.2 TB (pre-replication)
bids.accepted:
~40K msg/sec × 400 B = 16 MB/sec
100 partitions (consumed in parallel by broadcast, search, analytics)
Retention: 7 days
Weekly volume: ~10 TB
auctions.ending, auctions.sold:
~1K msg/sec peak, low volume
10 partitions each, 24 h retention
Cluster: 6 brokers, 3 TB NVMe each, RF=3
[6.6] WebSocket sizing
1M concurrent connections
Per-pod capacity: 50K connections. The 200K+ figure often quoted for Go + epoll
assumes light payloads, kernel tuning (somaxconn, file descriptors, tcp_mem),
terminated TLS at a sidecar, and ~10-20 KB memory per idle connection. Real
headroom depends on TLS in-process, frame size, and per-connection subscription
fan-out.
Pods needed: 1M / 50K = 20
Round up for headroom and rolling deploys: 40 pods
Per-bid fan-out cost:
Avg 30 watchers per auction
Accepted bid → Valkey PUBLISH → 1-40 subscribed pods → local push to watchers on each pod
~40K accepted/sec × 30 avg = 1.2M WebSocket frames/sec across the fleet
Per pod: 1.2M / 40 = 30K frames/sec. Well within Go's easy envelope.
Hot auction edge case:
5K watchers on one auction, distributed across all 40 pods
→ 125 watchers/pod avg
→ each pod pushes 125 frames per bid
At 500 bids/sec on that one auction, each pod pushes 62.5K frames/sec from it.
Total per pod: 100K frames/sec. Approaching the limit. See §15.2.
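The fan-out numbers above, as a check:

```python
accepted_per_sec = 40_000
avg_watchers = 30
frames_total = accepted_per_sec * avg_watchers    # 1.2M frames/sec fleet-wide
pods = 40
baseline_per_pod = frames_total / pods            # 30K frames/sec per pod

hot_watchers_per_pod = 5_000 / pods               # 125 watchers of one hot auction
hot_frames_per_pod = 500 * hot_watchers_per_pod   # 62.5K frames/sec from that auction
# baseline + hot ~= 92.5K frames/sec, quoted above as ~100K: near the limit.
```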
[6.7] Growth projections
The design above is sized for today. Three growth scenarios worth planning for:
| Horizon | Multiplier | What breaks first | Mitigation |
|---|---|---|---|
| 18 months | 2× (100K bids/sec peak) | Kafka partition count (400 saturated). Valkey hot-key CPU on top 10 auctions. | Double partitions to 800 (provision up-front, rebalance painful). Shard Valkey cluster from 3 nodes to 6. |
| 3 years | 5× (250K bids/sec peak) | Postgres single-primary write rate on bids insert. WebSocket fan-out at 5M concurrent connections. | Shard bids by auction_id range across 4 Postgres primaries. Move WebSocket gateway to edge runtimes that hold stateful TCP (Cloudflare Durable Objects, Fly.io regional VMs). Standard CDN workers do not hold long-lived WebSockets. |
| 5 years | 10× (500K bids/sec peak) | The CQRS architecture itself: managing 10+ Postgres shards, 1600 partitions, 10M+ watchers exceeds what a single team can operate. | Split the platform by auction category (electronics, collectibles, vehicles), each a semi-autonomous deployment. Shared user and payment layer. |
What to build in now vs later.
- Build in: partition count hedge (800 partitions instead of 400), Valkey Cluster (not single-node), structured logging and tracing across all stages.
- Defer: Postgres sharding, edge-deployed WebSocket, category splits. All three are >12 months of work; do them when the pain shows up, not on speculation.
The scariest graph in capacity planning is peak bid rate over time. Track it weekly. When the 90th percentile of weekly peaks crosses 70% of current capacity, the 18-month mitigations need to be in flight.
7. Data Model
[7.1] auctions (source of truth)
CREATE TABLE auctions (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
seller_id UUID NOT NULL REFERENCES users(id),
title VARCHAR(255) NOT NULL,
description TEXT,
category_id INT NOT NULL,
auction_type VARCHAR(20) NOT NULL DEFAULT 'english',
-- Pricing
starting_price DECIMAL(12,2) NOT NULL,
reserve_price DECIMAL(12,2),
min_bid_increment DECIMAL(12,2) NOT NULL DEFAULT 1.00,
current_price DECIMAL(12,2) NOT NULL,
high_bidder_id UUID,
bid_count INT NOT NULL DEFAULT 0,
-- Timing
start_time TIMESTAMPTZ NOT NULL,
original_end_time TIMESTAMPTZ NOT NULL,
current_end_time TIMESTAMPTZ NOT NULL,
anti_snipe_seconds INT NOT NULL DEFAULT 30,
anti_snipe_extend INT NOT NULL DEFAULT 120,
-- Status + settlement
status VARCHAR(20) NOT NULL DEFAULT 'DRAFT',
settlement_status VARCHAR(20) DEFAULT 'PENDING',
settlement_fence BIGINT, -- fencing token of latest settlement attempt
winner_id UUID,
final_price DECIMAL(12,2),
-- Dutch-specific
dutch_start_price DECIMAL(12,2),
dutch_decrement DECIMAL(12,2),
dutch_interval_sec INT,
-- Metadata
region VARCHAR(16) NOT NULL, -- write-pinned region
currency CHAR(3) NOT NULL DEFAULT 'USD', -- pinned at creation; no cross-currency bids
image_urls JSONB DEFAULT '[]'::JSONB,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
CONSTRAINT valid_auction_type CHECK (auction_type IN ('english','dutch','sealed_bid')),
CONSTRAINT valid_status CHECK (status IN (
'DRAFT','SCHEDULED','ACTIVE','ENDING_SOON','EXTENDED',
'CLOSED','SETTLING','SOLD','UNSOLD','CANCELLED')),
CONSTRAINT valid_settlement CHECK (settlement_status IN (
'PENDING','IN_PROGRESS','COMPLETED','FAILED','NO_SALE')),
CONSTRAINT valid_time_range CHECK (start_time < original_end_time)
) PARTITION BY RANGE (created_at);
CREATE INDEX idx_auctions_status_end ON auctions (status, current_end_time)
WHERE status IN ('ACTIVE','ENDING_SOON','EXTENDED');
CREATE INDEX idx_auctions_settlement ON auctions (settlement_status)
WHERE settlement_status = 'PENDING' AND status = 'CLOSED';
CREATE INDEX idx_auctions_seller ON auctions (seller_id, status);
[7.2] bids (one row per bid, partitioned)
CREATE TABLE bids (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
auction_id UUID NOT NULL, -- FK enforced in app layer (cross-partition cost)
bidder_id UUID NOT NULL,
amount DECIMAL(12,2) NOT NULL,
previous_price DECIMAL(12,2) NOT NULL, -- price seen at CAS time
sequence_num BIGINT, -- per-auction monotonic; NULL for rejected bids
status VARCHAR(20) NOT NULL DEFAULT 'ACCEPTED',
bid_type VARCHAR(20) NOT NULL DEFAULT 'manual',
rejection_reason VARCHAR(32),
idempotency_key VARCHAR(128), -- client-supplied
server_ts TIMESTAMPTZ NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
is_proxy BOOLEAN NOT NULL DEFAULT false,
proxy_max DECIMAL(12,2),
CONSTRAINT valid_bid_status CHECK (status IN ('ACCEPTED','REJECTED','RETRACTED')),
CONSTRAINT valid_bid_type CHECK (bid_type IN ('manual','proxy','dutch_accept')),
CONSTRAINT positive_amount CHECK (amount > 0),
-- Partition key included so the constraint holds across partitions (Postgres requirement).
CONSTRAINT uq_seq UNIQUE (auction_id, sequence_num, created_at),
CONSTRAINT uq_idem UNIQUE (bidder_id, idempotency_key, created_at)
) PARTITION BY RANGE (created_at);
CREATE INDEX idx_bids_auction ON bids (auction_id, sequence_num DESC);
CREATE INDEX idx_bids_bidder ON bids (bidder_id, created_at DESC);
CREATE INDEX idx_bids_accepted ON bids (auction_id, amount DESC)
WHERE status = 'ACCEPTED';
-- Per-partition partial unique index on accepted bids. A single auction never straddles
-- more than two weekly partitions in practice (max 30-day duration), so gap-free
-- sequence_num is enforced at the app layer by the Valkey CAS and verified by this index.
CREATE UNIQUE INDEX idx_bids_accepted_seq ON bids (auction_id, sequence_num)
WHERE status = 'ACCEPTED' AND sequence_num IS NOT NULL;
Weekly partitions. Drop old partitions to the archive path after 90 days. Rejected bids carry sequence_num = NULL (the Valkey CAS only assigns a sequence on acceptance).
[7.3] proxy_bids
CREATE TABLE proxy_bids (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
auction_id UUID NOT NULL,
bidder_id UUID NOT NULL,
max_amount DECIMAL(12,2) NOT NULL,
is_active BOOLEAN NOT NULL DEFAULT true,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
deactivated_at TIMESTAMPTZ,
CONSTRAINT positive_max CHECK (max_amount > 0)
);
-- Partial unique index: one active proxy per (auction, bidder). Withdrawn proxies
-- (is_active = false) do not block the user from setting a new one.
CREATE UNIQUE INDEX uq_active_proxy ON proxy_bids (auction_id, bidder_id)
WHERE is_active = true;
CREATE INDEX idx_proxy_active ON proxy_bids (auction_id)
WHERE is_active = true;
[7.4] auction_settlements
CREATE TABLE auction_settlements (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
auction_id UUID NOT NULL UNIQUE,
winner_id UUID,
final_price DECIMAL(12,2),
reserve_met BOOLEAN NOT NULL DEFAULT false,
fencing_token BIGINT NOT NULL, -- INCR fence:auction:{id}
status VARCHAR(20) NOT NULL DEFAULT 'INITIATED',
payment_id VARCHAR(128),
payment_status VARCHAR(20) DEFAULT 'PENDING',
idempotency_key VARCHAR(128) NOT NULL, -- settle-{auction_id} (stable across retries; see §9.6)
auction_ended_at TIMESTAMPTZ NOT NULL,
settled_at TIMESTAMPTZ,
payment_at TIMESTAMPTZ,
CONSTRAINT valid_settlement_status CHECK (status IN (
'INITIATED','WINNER_CONFIRMED','PAYMENT_AUTHORIZED',
'PAYMENT_CAPTURED','COMPLETED','FAILED','NO_SALE'))
);
CREATE INDEX idx_settlements_status ON auction_settlements (status)
WHERE status NOT IN ('COMPLETED','NO_SALE');
[7.5] Valkey key patterns
auction:{id} HASH current_price, high_bidder, min_increment,
current_end_time, status, bid_count,
reserve_price, auction_type,
anti_snipe_seconds, anti_snipe_extend,
sequence_num
fence:auction:{id} INT settlement fencing counter
bid_result:{bid_id} STRING cached CAS result for Kafka-redelivery dedup;
TTL = auction_end + 48 h (see §17.1)
auction:{id}:proxies ZSET score=max_amount, member=bidder_id
auction:{id}:updates PUB/SUB channel for bid broadcast
(SPUBLISH/SSUBSCRIBE on Valkey Cluster)
user:{id}:watching SET auction_ids the user is watching
ws:{pod_id}:subscriptions SET auction_ids this gateway pod is actively subscribed to
(written by the gateway on subscribe, read by ops tooling)
rate:bid:{user_id} STRING counter with TTL (rate limit)
[7.6] Entity-relationship diagram
[7.7] Auction lifecycle
8. API Design
[8.1] Place a bid
POST /api/v1/auctions/{auction_id}/bids
Authorization: Bearer <token>
Idempotency-Key: bid-20260419-user123-a1b2c3d4-10500
Content-Type: application/json
{ "amount": 105.00, "expected_price": 100.00 }
The API returns immediately with {bid_id, status: "QUEUED"}. The client subscribes to its WebSocket to receive the final ACCEPTED or REJECTED event referenced by bid_id. Acceptance also gates on bidder risk tier and required hold (§9.11); failure returns REQUIRES_DEPOSIT.
On rejection:
{ "bid_id": "...", "status": "REJECTED",
"reason": "STALE_EXPECTED_PRICE",
"current_price": 107.00, "min_next_bid": 108.00,
"end_time": "2026-04-19T20:00:30Z" }[8.2] Set a proxy bid
POST /api/v1/auctions/{auction_id}/proxy-bids
{ "max_amount": 500.00 }
DELETE /api/v1/auctions/{auction_id}/proxy-bids # withdraw
Proxy bid submission immediately fires a real bid if the current price + min_increment is at or below max_amount.
[8.3] Create auction
POST /api/v1/auctions
{
"title": "...", "description": "...", "category_id": 42,
"auction_type": "english",
"starting_price": 1.00, "reserve_price": 100.00,
"min_bid_increment": 1.00,
"start_time": "...", "end_time": "...",
"anti_snipe_seconds": 30, "anti_snipe_extend": 120,
"images": ["..."]
}
[8.4] WebSocket protocol
Client connects to wss://<region>.auction.example/v1/ws, authenticates, and subscribes to channels:
→ { "op": "subscribe", "auction_ids": ["a1", "a2"] }
→ { "op": "unsubscribe", "auction_ids": ["a1"] }
→ { "op": "resume", "auction_id": "a1", "last_seen_seq": 23 }
← { "type": "bid", "auction_id": "a1", "seq": 24, "price": 105.00,
"high_bidder": "jan", "end_time": "...", "extended": false }
← { "type": "bid_result", "bid_id": "b1", "status": "ACCEPTED", "seq": 24 }
← { "type": "bid_result", "bid_id": "b1", "status": "REJECTED",
"reason": "STALE_EXPECTED_PRICE", "current_price": 107.00 }
← { "type": "auction_closed", "auction_id": "a1", "result": "SOLD",
"winner": "jan", "final_price": 240.00 }
On reconnect, the client sends resume with the last sequence it saw. The gateway fetches missing bids from Postgres and replays them before resuming the live stream.
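The client side of this contract is small: apply each sequence number at most once, in order, and hand last_seen_seq back on resume. A minimal sketch (class and field names are assumptions, not the real client):

```python
# Client-side resume bookkeeping implied by §8.4. Frames replayed after a
# reconnect, or redelivered by Pub/Sub, are dropped by sequence number.
class AuctionView:
    def __init__(self):
        self.last_seen_seq = 0      # sent back in the "resume" op on reconnect
        self.price = None

    def apply(self, event):
        """Apply a bid frame; False means duplicate or already-seen replay."""
        if event["seq"] <= self.last_seen_seq:
            return False
        self.last_seen_seq = event["seq"]
        self.price = event["price"]
        return True

view = AuctionView()
assert view.apply({"seq": 24, "price": 105.00}) is True
assert view.apply({"seq": 24, "price": 105.00}) is False   # redelivered frame
assert view.apply({"seq": 25, "price": 108.00}) is True
```

This is the same dedupe-by-sequence_num defense listed in §9.8 for WebSocket push redelivery.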
[8.5] Ops endpoints
GET /api/v1/auctions/{id} current state (served from Valkey with Postgres fallback)
GET /api/v1/auctions/{id}/bids?cursor=... paged bid history (Postgres)
GET /api/v1/auctions?category=&ending=soon browse via Elasticsearch
POST /api/v1/auctions/{id}/bids/{bid_id}/retract policy-gated
[8.6] Image and media pipeline
Auction images dominate the object-store footprint and the CDN bill. Pipeline:
- Seller uploads direct to S3 via presigned PUT. The API only returns the URL; bytes never touch application servers.
- On s3:ObjectCreated, a Lambda generates three thumbnail sizes (200px, 600px, 1200px WebP) and a blurhash string. Thumbnails write to a public bucket behind CloudFront.
- Moderation runs in parallel: an async worker calls a vision model (AWS Rekognition or equivalent) for NSFW + weapon + known-counterfeit signals. Flagged images block the auction from publishing until human review.
- A perceptual hash (pHash) is computed and indexed. On create, the hash is compared against a stolen-listing denylist and against the seller's own past listings (duplicate-image reuse is a common fraud signal).
auctions.image_urls stores the S3 keys; the client builds CDN URLs with a signed short-TTL token for listings under legal hold.
Retention: auction images are kept 7 years to satisfy dispute windows; archived to Glacier after 90 days post-settlement.
9. Settlement, Payouts, and Risk
Settlement correctness (§9.1-9.10), deposit policy (§9.11), account lifecycle (§9.12), and seller payouts (§9.13) all hang off the same auction-end event.
[9.1] Core idea
Retries are fine. Double charging is not. Settlement is allowed to run more than once; it is not allowed to commit more than once. One SOLD row after retries settle, one captured payment once the provider acks. Three layers make duplicate runs harmless: fencing tokens, conditional writes, and provider-side idempotency keys.
[9.2] Real-world duplicate scenario
A settlement consumer fires on auctions.ending. It:
1. Acquires fencing token 42 via INCR fence:auction:{id}.
2. Writes the winner + settlement_fence = 42 to Postgres.
3. Calls Stripe with Idempotency-Key: settle-abc-42. Stripe captures $240.
4. Updates settlement_status = 'PAYMENT_CAPTURED'.
Between step 3 and step 4, the consumer pod gets evicted. Kafka redelivers. A second consumer picks up:
1. Acquires fencing token 43.
2. Writes the winner + settlement_fence = 43 to Postgres (42 < 43, so the conditional UPDATE succeeds).
3. Calls Stripe with Idempotency-Key: settle-abc-43. New idempotency key. Stripe would capture again.
This is the subtle trap. Using the fencing token in the idempotency key breaks the guarantee. The key must be stable across retries, tied to the auction, not the attempt.
[9.3] Why this isn't exactly-once
Exactly-once across the payment boundary is impossible: the payment provider is an independent system with its own retry semantics. The platform can guarantee that only one capture is ever considered settled from its side of the boundary, and that the idempotency key is stable so Stripe returns the original charge on re-attempt.
[9.4] Settlement flow (order matters)
Correct version of the flow:
1. token = INCR fence:auction:{id} in Valkey. This is the tiebreaker.
2. Read the winning bid from Postgres: SELECT id, bidder_id, amount FROM bids WHERE auction_id = ? AND status = 'ACCEPTED' ORDER BY amount DESC, sequence_num ASC LIMIT 1.
3. If no winning bid or amount < reserve_price, mark UNSOLD and stop. No payment call.
4. INSERT INTO auction_settlements (auction_id, winner_id, final_price, fencing_token, status, idempotency_key) VALUES (?, ?, ?, $token, 'INITIATED', 'settle-{auction_id}') with ON CONFLICT (auction_id) DO UPDATE SET fencing_token = EXCLUDED.fencing_token, status = 'INITIATED' WHERE auction_settlements.fencing_token < EXCLUDED.fencing_token. The idempotency key is derived from auction_id only. Stable across retries.
5. Conditional UPDATE on auctions:
   UPDATE auctions
   SET status = 'SOLD', winner_id = $winner, final_price = $price, settlement_fence = $token
   WHERE id = $auction_id
     AND (settlement_fence IS NULL OR settlement_fence < $token)
     AND status = 'CLOSED';
   Zero rows updated means a newer attempt has already committed. Stop immediately.
6. Call the payment provider with Idempotency-Key: settle-{auction_id}. Stripe returns the original capture on repeat.
7. On provider success, UPDATE auction_settlements SET status = 'PAYMENT_CAPTURED', payment_id = ?, payment_at = NOW() WHERE auction_id = ? AND fencing_token = $token.
8. Emit auctions.sold.
The fencing token orders the writes. The idempotency key is independent of the token.
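The flow compresses to surprisingly little logic once the stores are stubbed out. A sketch with in-memory stand-ins for Valkey, Postgres, and the payment provider (every name here is illustrative, not production code):

```python
# In-memory stand-ins: fence = Valkey counter, settlements = Postgres rows,
# captured = provider-side idempotency cache.
fence = {}
settlements = {}
captured = {}

def incr_fence(auction_id):
    fence[auction_id] = fence.get(auction_id, 0) + 1
    return fence[auction_id]

def charge(idempotency_key):
    # Provider-side idempotency (§9.6): a repeated key returns the original
    # charge instead of capturing a second time.
    if idempotency_key not in captured:
        captured[idempotency_key] = f"pay_{len(captured) + 1}"
    return captured[idempotency_key]

def settle(auction_id, winner_id, amount):
    token = incr_fence(auction_id)           # step 1: monotonic tiebreaker
    idem_key = f"settle-{auction_id}"        # stable across retries, no token
    row = settlements.get(auction_id)
    if row is None or row["fencing_token"] < token:   # upsert, later token wins
        settlements[auction_id] = row = {
            "fencing_token": token, "status": "INITIATED",
            "winner_id": winner_id, "final_price": amount,
            "idempotency_key": idem_key}
    payment_id = charge(row["idempotency_key"])       # provider call
    if row["fencing_token"] == token:                 # fenced status update
        row["status"], row["payment_id"] = "PAYMENT_CAPTURED", payment_id
    return payment_id

# Crash-and-retry: Kafka redelivers, a second consumer re-runs settlement.
first = settle("abc", "jan", 240.00)
second = settle("abc", "jan", 240.00)
assert first == second and len(captured) == 1   # one capture, not two
```

Swapping the idempotency key to include the token (settle-abc-42, settle-abc-43) makes the second `charge` call miss the cache, which is exactly the duplicate-capture trap from §9.2.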
[9.5] Why the CAS alone isn't enough
Settlement runs across Postgres and the payment provider, not inside Valkey. Between the "read winner" and "write SOLD" steps, an earlier frozen coordinator can wake up and race the current one. The Valkey CAS protects the auction state hash, not the Postgres settlement state. Fencing is the monotonic counter that makes "later wins" deterministic across systems.
[9.6] External idempotency (payment provider)
Stripe, Adyen, and similar systems accept an Idempotency-Key HTTP header and remember responses for 24 hours. Repeated calls with the same key return the original response, including the PaymentIntent ID.
Key pattern: settle-{auction_id} where the auction_id is a UUID. Never includes timestamps, tokens, or retry counts. Must be stable so a second attempt receives the response the first attempt produced.
[9.7] Retraction mid-settlement
If a high bidder retracts their bid after the auction closes but before settlement writes SOLD, the settlement query (§9.4 step 2) picks up the next-highest ACCEPTED bid. Retractions during the CLOSED → SETTLING window are rejected by the API to close the race between INCR fence and the settlement's winner read; once the auction enters SETTLING, retraction is not possible. Retraction is also not allowed after the settlement state reaches PAYMENT_AUTHORIZED or later.
[9.8] Where duplicates are handled
| Stage | Duplicate source | Defense |
|---|---|---|
| Kafka bids.incoming delivery | At-least-once + rebalance | bid_result:{bid_id} NX cache inside Lua (§17.1); partial unique index on (auction_id, sequence_num) WHERE status = 'ACCEPTED' |
| Bid client retry | Network blip | UNIQUE (bidder_id, idempotency_key, created_at) on bids |
| Settlement retry | Coordinator crash | Fencing token + conditional UPDATE |
| Payment retry | Settlement retry | Idempotency-Key: settle-{auction_id} (stable across retries) |
| WebSocket push redelivery | Pub/Sub fan-out bug | Client dedupes by sequence_num |
[9.9] Settlement is not atomic
The settlement pipeline touches Valkey (fence), Postgres (auction + settlement rows), Kafka (auctions.sold), and the payment provider. No distributed transaction binds them. A crash between any two produces a known recoverable state:
- Crash after INCR, before Postgres write → next attempt acquires a higher token, same outcome.
- Crash after Postgres write, before payment → next attempt sees a committed settlement with status = 'INITIATED', sends the payment call (same idempotency key), updates status.
- Crash after payment, before Kafka emit → next attempt re-reads the settlement row, sees status = 'PAYMENT_CAPTURED', and emits.
Every stage of the pipeline is idempotent in terms of its side effect.
[9.10] Guarantees and non-guarantees
Guaranteed:
- A single auctions row ends in status SOLD per auction (after retries settle).
- A single auction_settlements row per auction, carrying the highest fencing token observed.
- At most one captured charge at the payment provider (assuming the provider honors its idempotency contract).
- No accepted bid is ever lost.
Not guaranteed:
- Strict wall-clock bound on settlement time. Settlement is best-effort low-latency (sub-second under normal conditions) but can take minutes under backpressure.
- Atomicity across the payment boundary. If the payment provider captures the charge but the platform fails to persist PAYMENT_CAPTURED, a reconciliation job corrects it.
[9.11] Bidder risk tiers and payment holds
Correctness mechanisms above stop double-charges. They don't stop a winner from walking away. The defense is a hold or deposit taken before the bid is even accepted, sized to the auction value and the bidder's history.
Hold by item value.
| Value | Pre-bid action |
|---|---|
| < $100 | None. Card on file at signup is enough. Charge post-win. |
| $100-$1000 | Card verification + saved payment token. No hold. |
| $1000-$10K | Preauth hold for the bid amount. Released on outbid. |
| > $10K | Refundable deposit (5-10%) + KYC. Manual review for new accounts. |
Hold by bidder tier.
| Tier | Who | Action |
|---|---|---|
| A | >10 settled wins, no chargebacks, account >90 days | Default policy by item value (above) |
| B | New account or <3 wins | One tier stricter than the value table says |
| C | Past chargeback, dispute, or fraud flag | Deposit always required; manual approval over $1K |
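The two tables compose into a single policy lookup. A sketch with the thresholds above (function name and action labels are assumptions for illustration):

```python
# Pre-bid hold policy from §9.11, strictest action last. The "one tier stricter"
# rule for tier B is a band shift; tier C short-circuits to a deposit.
VALUE_POLICY = ["none", "verify_card", "preauth_hold", "deposit_kyc"]

def value_band(amount):
    if amount < 100:
        return 0
    if amount < 1000:
        return 1
    if amount <= 10_000:
        return 2
    return 3

def required_action(amount, tier):
    if tier == "C":                        # past chargeback: deposit always
        return "deposit_kyc"
    band = value_band(amount)
    if tier == "B":                        # new account: one tier stricter
        band = min(band + 1, len(VALUE_POLICY) - 1)
    return VALUE_POLICY[band]

assert required_action(50, "A") == "none"
assert required_action(50, "B") == "verify_card"    # one tier stricter
assert required_action(5000, "A") == "preauth_hold"
assert required_action(5000, "C") == "deposit_kyc"
```

The real check additionally consults manual-review flags (tier C over $1K, new accounts over $10K), which this sketch omits.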
users.risk_tier is recomputed nightly from a small model over payment + dispute history. The tier check happens at the API before the bid hits Kafka. A failed hold returns REQUIRES_DEPOSIT and the UI prompts the user.
Tradeoff. Every dollar of friction costs conversion. New-user deposit thresholds are tuned quarterly against chargeback rate, not set once.
[9.12] Account deletion mid-auction
User deletion requests interact badly with an append-only bid log and a settlement that happens days later. Two cases.
Bidder deletes account while holding high bid.
Hard-delete would orphan the auction (no winner_id to pay). Solution:
- Deletion is soft while the user has active bids or pending settlements: users.status = 'PENDING_DELETION'. The account cannot log in and cannot place new bids, but the user_id remains valid for settlement.
- Pending settlements proceed normally. The charge fires against the last-known payment method.
- After the final open settlement completes, a janitor job anonymizes the record (email, name, address → hashed or nulled) but keeps user_id for audit. The bid row keeps the anonymized bidder_id. Audit log preserved, PII gone.
- GDPR Article 17 "right to erasure" is satisfied by the anonymization step. Article 17(3)(e) exempts data retained for the establishment, exercise, or defence of legal claims; bid history qualifies.
Seller deletes account mid-auction.
Active auctions are cancelled with status = 'CANCELLED_BY_SELLER_DELETE'. Bidders are refunded any held funds and notified. No settlement runs. Completed auctions where payout is pending continue to the seller's bank account on file (payout is a separate post-settlement flow; deleting the user does not cancel money owed).
Banned users. users.is_banned = true short-circuits bid acceptance at the API. Mid-auction bans are rare; when they happen, the user's current high bids are retracted (§4.7 retraction flow) and the auction recomputes.
[9.13] Seller payout flow
The post so far stops at "charge the winner." The seller's side of the ledger needs its own lifecycle.
| Stage | Trigger | What happens |
|---|---|---|
| Pending | auctions.sold emitted | seller_payouts row created: amount = final_price - fees, status = 'PENDING_HOLD', available_at = now + chargeback_window (7 days default, 30 days for new sellers). |
| Held | Continuous | Funds sit in the platform's Stripe Connect balance. Winner has dispute rights during the hold. |
| Disbursed | Cron job after available_at | Transfer via Stripe Connect transfers.create to seller's connected account. Idempotency key: payout-{auction_id}. |
| Clawback | Chargeback within hold window | Payout cancelled. seller_payouts.status = 'CLAWED_BACK'. Seller account balance shows negative if funds were partially released. |
| Settled | Payout completed + no dispute | status = 'SETTLED'. 1099-K tax form generated at year-end if US seller crosses threshold. |
Failure: seller bank account closed. Stripe Connect transfers.create returns account_closed. The payout is marked FAILED, the seller is emailed, and the platform holds the balance until a new account is added.
Failure: seller deletes account after sale. Funds owed are held for 12 months per most platform T&Cs; if unclaimed, they are escheated per state law (US) or become platform revenue per EU terms.
Tax. Sales tax / VAT is computed at settlement by a tax engine (Avalara or TaxJar) based on buyer location and item category. The tax portion is captured into a separate platform ledger and remitted quarterly. The seller does not see the tax on their payout.
Scale. At 10M active auctions and ~1M settlements/day, payouts are batched. One transfers.create per auction is too chatty for Stripe's rate limits. Batch payouts daily per seller: one transfers.create per seller per day covering all settled auctions. Cuts payout API calls by ~100×.
10. Bid Processing Model
[10.1] Why optimistic concurrency, not pessimistic
Pessimistic locking (SELECT ... FOR UPDATE on the auction row) serializes bids by holding a database row lock for the duration of the transaction. At 500 bids/sec on a hot auction, the lock wait queue grows, connection pool saturates, p99 latency spikes to seconds. PostgreSQL is not designed to hold a hot row for hundreds of concurrent transactions.
Why auctions.current_price is not written on every bid. The bids table is append-only: every accepted bid is a new row on a partitioned table, which Postgres handles cleanly at tens of thousands of inserts per second. Updating auctions.current_price on every bid would move the same contention back onto one row: 500 UPDATEs/sec on one hot row produces lock queueing, MVCC bloat, and index hot-pages, plus a flood of CDC events through Debezium. The design treats auctions.current_price as a hint refreshed occasionally (or not at all), and treats MAX(amount) FROM bids WHERE auction_id = ? AND status = 'ACCEPTED' as the authoritative value when it matters (settlement, hydrate, search indexing).
Optimistic concurrency (Valkey CAS via Lua) flips the model: each bid checks the current price as a precondition and succeeds or fails atomically in sub-millisecond time. Valkey's single-threaded execution means no lock manager is involved. Serialization is implicit in the event loop.
[10.2] The CAS script, step by step
The script at §4.2 step 2 is the entire acceptance logic in one atomic operation. Steps in order:
1. Read current state from the auction hash.
2. Check auction is ACTIVE and now ≤ current_end_time.
3. Check bid ≥ current + min_increment.
4. Check expected_price matches current (this is the stale-view rejection).
5. Increment sequence_num.
6. Update current_price, high_bidder, current_end_time (extension).
7. Return the accept/reject result with the authoritative state.
All seven steps execute without interruption. No other CAS can see a half-applied state.
Why the expected_price check matters: it is the precondition that lets the client UI show a fresh view without the server pretending the client's view is current. If a bid at $105 arrives and the price has moved to $108, the bid is not just "too low"; it's operating on a stale view. Returning STALE_EXPECTED_PRICE with the current value lets the UI update and the user decide whether to rebid.
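The decision logic is small enough to show whole. A Python stand-in for the Lua script (the atomicity comes from Valkey's single-threaded execution, which this function only fakes; the anti-snipe end_time update is omitted for brevity):

```python
# Stand-in for the CAS script in §10.2; field names mirror the auction hash
# from §7.5. Check order follows the numbered steps above.
def cas_bid(auction, bid, now):
    if auction["status"] != "ACTIVE" or now > auction["current_end_time"]:
        return {"ok": False, "reason": "AUCTION_CLOSED"}           # steps 1-2
    if bid["amount"] < auction["current_price"] + auction["min_increment"]:
        return {"ok": False, "reason": "BID_TOO_LOW",              # step 3
                "current_price": auction["current_price"]}
    if bid["expected_price"] != auction["current_price"]:
        return {"ok": False, "reason": "STALE_EXPECTED_PRICE",     # step 4
                "current_price": auction["current_price"]}
    auction["sequence_num"] += 1                                   # step 5
    auction["current_price"] = bid["amount"]                       # step 6
    auction["high_bidder"] = bid["bidder_id"]
    return {"ok": True, "seq": auction["sequence_num"],            # step 7
            "current_price": auction["current_price"]}

a = {"status": "ACTIVE", "current_end_time": 100, "current_price": 100.0,
     "min_increment": 1.0, "sequence_num": 23, "high_bidder": None}
r1 = cas_bid(a, {"bidder_id": "jan", "amount": 105.0, "expected_price": 100.0}, now=50)
r2 = cas_bid(a, {"bidder_id": "ada", "amount": 110.0, "expected_price": 100.0}, now=51)
assert r1 == {"ok": True, "seq": 24, "current_price": 105.0}
assert r2["reason"] == "STALE_EXPECTED_PRICE" and r2["current_price"] == 105.0
```

The second bid clears the increment check but fails the precondition: it was priced against a view that no longer exists, and the response carries the authoritative price for the UI.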
[10.3] Proxy bid resolution
Proxy bidding is an auto-bid-until-max agent. The resolver computes a jump-to price that ends the race in a single bid rather than cascading $1 at a time.
Algorithm on every bids.accepted for auction A:
1. Load active proxy_bids for A where max_amount > current_price.
2. Exclude the current high_bidder's own proxy.
3. If none remain, stop.
4. Let winner = the proxy with the highest max_amount; runner_up = second highest (or the non-proxy current_price if only one proxy remains).
5. jump_price = min(winner.max_amount, runner_up.max_amount + min_increment).
6. Submit one bid at jump_price on winner's behalf through the same Kafka → CAS → Postgres path.
Worked example. Two proxies with maxes $200 and $500 on an auction starting at $10:
1. A user bids $10 to enter. bids.accepted fires.
2. Resolver picks user-500 (highest max) vs user-200 (runner-up).
3. jump_price = min($500, $200 + $1) = $201.
4. User-500's proxy submits one bid at $201. Accepted. Done.
5. Final state: price $201, high_bidder = user-500. One Kafka message, one CAS, one broadcast.
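The jump-price computation is the whole resolver. A minimal sketch (tuple shape and function name are assumptions; the real path submits through Kafka and the CAS rather than returning directly):

```python
# Jump-to resolver from §10.3. proxies: list of (bidder_id, max_amount,
# created_at). Returns (bidder, jump_price) or None when the cascade stops.
def resolve_proxies(proxies, current_price, high_bidder, min_increment):
    live = [p for p in proxies if p[1] > current_price and p[0] != high_bidder]
    if not live:
        return None
    # Highest max wins; earliest created_at breaks ties.
    live.sort(key=lambda p: (-p[1], p[2]))
    winner = live[0]
    runner_up_max = live[1][1] if len(live) > 1 else current_price
    jump = min(winner[1], runner_up_max + min_increment)
    return winner[0], jump

# Worked example from the text: maxes $200 and $500, an entrant bids $10.
proxies = [("u200", 200.0, 1), ("u500", 500.0, 2)]
assert resolve_proxies(proxies, 10.0, "entrant", 1.0) == ("u500", 201.0)
# After u500's bid at $201 commits, no max exceeds the price; cascade stops.
assert resolve_proxies(proxies, 201.0, "u500", 1.0) is None
```

The second call is the termination condition: once the winner holds the high bid, their own proxy is excluded and every other max is below the price.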
Edge cases:
- Two proxies with the same max. Both compute the same jump_price = max. Tie is broken by earliest proxy_bids.created_at: that user's bid is submitted first, the later proxy's max no longer exceeds the new current_price, and resolution terminates.
- Proxy set mid-auction. If the new max is above the current price, the resolver runs with the new proxy as a candidate. Otherwise the proxy is dormant until someone else bids.
- Proxy withdrawal. UPDATE proxy_bids SET is_active = false. The resolver checks the partial index (§7.3) before firing.
- Late incoming manual bid during resolution. The CAS expected_price check rejects the resolver's bid if a manual bid landed first; the resolver re-runs on the next bids.accepted.
- Resolver racing the auction close. A proxy bid submitted just before current_end_time elapses: the Lua script checks now > current_end_time and rejects with AUCTION_CLOSED. The proxy does not fire a "would have won" bid after the close. The cascade terminates cleanly at the timer firing.
- Resolver in flight when timer fires. Flink's timer emits auctions.ending and the settlement consumer starts reading winners. If a resolver bid lands after status = SETTLING, the Lua script rejects on the status check (status is no longer ACTIVE). Serialization is implicit: bid acceptance and settlement read Postgres/Valkey state atomically via the CAS + fencing token.
[10.4] Dutch auction
Price starts high and drops on a schedule. First bidder to "accept" wins at the current price.
Base flow.
- Flink emits a price-drop tick every dutch_interval_sec that runs an atomic script to reduce current_price by dutch_decrement.
- The price-drop script also publishes to auction:{id}:updates so watchers see the drop.
- An "accept" bid goes through the same CAS path. The script's expected_price check prevents accepting a stale price (if the price dropped between the client's render and the click, the CAS rejects and the UI updates).
Dutch with proxy. Users can set a buy-at-or-below limit. When the scheduled price drop crosses the limit, the proxy resolver fires an "accept" on their behalf. If two users have the same limit, the one whose proxy record was created first wins (earliest proxy_bids.created_at). The CAS makes ties deterministic: exactly one accept commits, the other sees a moved price.
Dutch end behavior. If no one accepts before the floor price, the auction closes UNSOLD. Most Dutch auctions set a floor equal to the seller's reserve; it's unusual to drop below reserve.
Why this matters. Dutch is structurally different from English because the platform is the price-mover, not the bidders. That means the CAS sees many more writes (every price tick is a write) but fewer acceptances. Valkey CPU stays low because tick writes are simple HSETs, not full scripts.
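Because the platform is the price-mover, the current price is a pure function of elapsed time. A sketch assuming the linear schedule the dutch_* columns in §7.1 imply (function name and floor handling are assumptions):

```python
# Dutch price schedule: start high, drop dutch_decrement every
# dutch_interval_sec, never below the floor (typically the reserve).
def dutch_price(start_price, decrement, interval_sec, floor, elapsed_sec):
    ticks = elapsed_sec // interval_sec      # completed price-drop ticks
    return max(floor, start_price - decrement * ticks)

# $500 start, $10 drop every 60 s, $300 floor:
assert dutch_price(500, 10, 60, 300, 0) == 500
assert dutch_price(500, 10, 60, 300, 125) == 480        # two ticks elapsed
assert dutch_price(500, 10, 60, 300, 100 * 60) == 300   # clamped at the floor
```

This determinism is also why a missed Flink tick is recoverable: the next tick can compute the correct price from elapsed time rather than from the last write.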
[10.5] Sealed-bid variants
All sealed-bid auctions share one property: bids are blind until close. The winner-selection logic differs.
First-price sealed-bid (default). Highest bid wins at their bid amount. Simple. Encourages underbidding because bidders try to guess what others will bid.
Vickrey (second-price). Highest bid wins but pays the second-highest bid amount. Theoretically removes the underbidding incentive: truthful bidding becomes the dominant strategy. Used at Google for AdWords auctions and some government bond sales. Settlement logic changes by one line:
-- First-price: pay your own bid
SELECT amount FROM bids WHERE auction_id = ? AND status = 'ACCEPTED'
ORDER BY amount DESC LIMIT 1;
-- Vickrey (second-price): pay the runner-up's bid
SELECT amount FROM bids WHERE auction_id = ? AND status = 'ACCEPTED'
ORDER BY amount DESC LIMIT 1 OFFSET 1;
Hybrid sealed-English. Bids are blind for the first 90% of auction duration; the final 10% reveals the high bid and proceeds as a normal English auction. Combines sealed-bid's price discovery with English's competitive ending. Bid type transitions at now > reveal_time; the CAS script branches on auction_type IN ('sealed_bid', 'hybrid').
Storage and privacy. Sealed-bid amounts are encrypted client-side with a seller-provided key so the platform cannot observe bids pre-close. This is a trust-model choice; encryption adds ~20 ms per bid and is often skipped for low-stakes auctions. For high-value sealed bids (government tenders, rare items), encryption is non-negotiable.
Tiebreak. Identical high bids: earliest sequence_num wins. This is also how English handles ties, but sealed-bid sees them more often because bidders don't see each other's moves.
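The two SQL queries plus the tiebreak rule collapse into one sort. A sketch (list shape and function name assumed for illustration):

```python
# Winner selection for sealed-bid variants (§10.5). bids: list of
# (bidder, amount, sequence_num). Highest amount wins; earliest seq breaks ties.
def sealed_winner(bids, vickrey=False):
    ranked = sorted(bids, key=lambda b: (-b[1], b[2]))
    winner = ranked[0]
    # First-price: pay your own bid. Vickrey: pay the runner-up's bid.
    price = ranked[1][1] if vickrey and len(ranked) > 1 else winner[1]
    return winner[0], price

bids = [("ada", 240.0, 5), ("jan", 300.0, 3), ("eve", 250.0, 1)]
assert sealed_winner(bids) == ("jan", 300.0)                 # first-price
assert sealed_winner(bids, vickrey=True) == ("jan", 250.0)   # second-price
# Identical high bids: earliest sequence_num wins.
assert sealed_winner([("a", 100.0, 2), ("b", 100.0, 1)]) == ("b", 100.0)
```

Note the Vickrey winner is chosen identically; only the price paid changes, which is why the settlement SQL differs by one OFFSET.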
11. Timers and Anti-Sniping
Auction end is a distributed-clock problem. The timer service decides when to close an auction; anti-sniping decides whether to extend when a bid lands near the end. Both share the same Flink job.
[11.1] Timer service
Flink's keyed-timer service is the authoritative auction-end clock. It consumes auctions.created and auctions.end_time_changed, re-arms a timer per auction, and fires into auctions.ending when the timer elapses.
Why keyed timers at this scale. 10M active auctions means 10M timers live at any moment, with roughly 1.4M firing per day and thousands more re-arming every minute as anti-sniping extensions land. Flink partitions timer state by auction_id across operator instances, so 10M timers distribute across the cluster instead of piling up on one node. Alternatives were rejected for specific reasons: setTimeout in application code dies on pod restart, cron-polling WHERE current_end_time < NOW() + X hammers a hot index at 10M rows with sub-second precision needs, Quartz is single-node, and pg_cron does not handle the volume of moving deadlines. Flink's keyed state plus checkpoint-replay is the specific combination that fits.
Flink runs in high-availability mode with Kubernetes-native JobManager leader election (ZooKeeper is the legacy path). State is checkpointed to S3 every 10 seconds. Flink's exactly-once guarantee covers internal state and timer firing, not payment side effects, which are made effectively-once downstream via the fencing token + idempotency key (§9). On JobManager failure, a new leader recovers state from the latest checkpoint and resumes. Missed timers fire on recovery: Flink's event-time semantics replay any timer whose deadline passed during downtime.
Reconciliation via advisory lock. A Postgres advisory lock is a 64-bit named coordination primitive that auto-releases on session end. The settlement consumer fleet uses one only for the housekeeping loop (reconciliation, DLQ replay). Bid processing and settlement itself are partition-owned via Kafka consumer groups and do not need global leadership.
Reconciliation workers poll pg_try_advisory_lock(91823746). The winner runs the janitor loop (find settlements stuck in INITIATED > 60 s, re-drive them). Losers idle. On crash, the session ends, the lock auto-releases, and a standby wins within the poll interval (5 s).
SELECT pg_try_advisory_lock(91823746);
-- Held for the session lifetime; releases on disconnect.

Upgrade: etcd lease. For multi-region coordination, switch to an etcd lease (TTL = 15 s). Redlock's clock assumptions break under GC pauses; ZooKeeper is etcd-equivalent at heavier operational weight.
[11.2] Anti-sniping extension semantics
Anti-sniping extends the auction when a bid arrives within a configurable window of the end time. The extension is atomic with bid acceptance: the Lua script that accepts the bid also updates current_end_time in the same execution.
- `anti_snipe_seconds = 30`: the trigger window.
- `anti_snipe_extend = 120`: the extension.
- When a bid arrives and `(end_time - now) < 30`, set `end_time = end_time + 120`.
Extension is relative to the current end time, not the original. Rapid-fire bids near end can compound extensions until no bid arrives in the final window.
Infinite-extension attack. A bot placing one bid every 29 seconds (inside the 30 s window) keeps the auction open forever. Two caps prevent this:
- `max_extensions` (default 20). Once hit, anti-sniping stops firing. Bids are still accepted, but the auction ends at the current `current_end_time` regardless of how close the bid lands.
- `absolute_end_time = original_end + 30 minutes`. Hard ceiling. The CAS script checks `new_end > absolute_end_time` and clamps. Beyond 30 min past the original close, no more extensions.
Both live on the auctions row (extension_count, absolute_end_time), incremented atomically by the Lua script alongside current_end_time. 30 minutes is long enough to cover legitimate rapid-fire bidding on hot items; longer than that is almost always a bot.
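The extend-and-clamp decision the Lua script makes is small enough to model directly. A minimal Python sketch under the constants above; names (`AuctionClock`, `maybe_extend`) are illustrative, and the real logic runs inside the CAS script, atomically with bid acceptance:

```python
from dataclasses import dataclass

ANTI_SNIPE_SECONDS = 30   # trigger window before current end
ANTI_SNIPE_EXTEND = 120   # seconds added per extension
MAX_EXTENSIONS = 20       # cap 1: extension count

@dataclass
class AuctionClock:
    current_end: float    # epoch seconds
    absolute_end: float   # original_end + 30 min hard ceiling (cap 2)
    extension_count: int = 0

def maybe_extend(clock: AuctionClock, bid_ts: float) -> bool:
    """Mirror of the extension branch: extend only if the bid lands
    inside the trigger window and neither cap has been hit."""
    if clock.current_end - bid_ts >= ANTI_SNIPE_SECONDS:
        return False                      # not a snipe-window bid
    if clock.extension_count >= MAX_EXTENSIONS:
        return False                      # cap 1 hit: no more extensions
    new_end = clock.current_end + ANTI_SNIPE_EXTEND
    clock.current_end = min(new_end, clock.absolute_end)  # cap 2: clamp
    clock.extension_count += 1
    return True
```

Extension is relative to `current_end`, so rapid-fire bids compound until a cap engages, exactly as described above.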
[11.3] Timer re-arming
When the CAS script extends current_end_time, the bid processor emits auctions.end_time_changed. Flink's keyed-timer state consumes this, cancels the old timer, and arms a new one at the new end time.
Timer precision hinges on the bid → CAS → event → Flink timer update path. End-to-end latency from bid acceptance to timer re-arm is typically 50-100 ms. If Flink fires the old timer before the re-arm lands, settlement runs against a stale end time. The fencing token + conditional UPDATE (§9.4) handles it: settlement writes SETTLING, re-reads current_end_time from Postgres, sees it has moved, and aborts.
[11.4] Clock skew
All timestamps are assigned server-side at API ingress; clients are never trusted. The end-time check in the CAS uses the auction's home-region Valkey clock (§14 pins each auction to one write region). Flink timers fire on event time carried on auctions.end_time_changed events, which propagate the ingress server_ts. NTP drift across the API tier is at most ~10 ms, well inside the 1 s precision target.
[11.5] Testing the concurrency model
The CAS + fencing + idempotency stack is hard to read from code alone. Tests that run before any change ships:
- Kafka rebalance fault injection. Kill a bid processor mid-consume while a hot auction is taking 500 bids/sec; assert no double-accept, no dropped bid, no `sequence_num` gap.
- Valkey primary failover chaos. Force a Sentinel-promoted replica while CAS is in flight; assert the client sees either ACCEPTED or REJECTED, never both, and the `bid_result` cache survives.
- Settlement replay against prod snapshot. Copy a week of settled auctions, reset `auctions.status` and `auction_settlements`, re-run the pipeline with the payment provider in test mode; assert one `SOLD` row per auction and one capture per idempotency key.
- Redelivery torture. Force Kafka to redeliver every `bids.incoming` message 3×; assert the `bid_result:{bid_id}` dedup makes outcomes identical to a single-delivery run.
12. Hot Auctions and Fair Queueing
[12.1] The celebrity auction problem
99% of auctions take <1 bid/sec. A few take 100-500 bids/sec in their final minute. One Kafka partition per auction would mean 10M partitions, nearly all of them idle; one partition for all auctions would create a hot-key problem where one popular auction starves the rest.
Solution: hash partitioning with 400 partitions. partition = hash(auction_id) % 400. On average, each partition serves 25K auctions. A hot auction shares a partition with ~25K quiet ones. The hot auction's bids queue behind themselves (serialized by the partition owner), but other auctions on the same partition are only marginally slowed.
Burst handling. When a single partition sees a 10× spike:
- The bid processor for that partition cannot scale horizontally (one consumer per partition).
- p99 latency rises from 60 ms to ~500 ms for bids on that partition.
- Users see a slow confirmation, not a rejection.
If a hot auction consistently overwhelms its partition, the auction is moved to a dedicated partition. Rare; most celebrity-driven bursts are short-lived.
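The key-to-partition mapping can be sketched in a few lines. Kafka's default partitioner uses murmur2 on the message key; the hash function below is an illustrative stand-in with the same properties that matter here (stable, uniform):

```python
import hashlib

NUM_PARTITIONS = 400

def partition_for(auction_id: str) -> int:
    # Stable hash of the message key: every bid for a given auction lands
    # on the same partition, so the partition's single consumer serializes
    # that auction's bids for free.
    digest = hashlib.sha256(auction_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS
```

Determinism is the point: ordering within one auction comes entirely from "same key, same partition, one consumer."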
[12.2] Per-user rate limit
One bidder cannot fire more than 10 bids/sec on one auction. Valkey counter with TTL:
INCR rate:bid:{user_id}:{auction_id}
EXPIRE rate:bid:{user_id}:{auction_id} 1 NX
-- NX (Valkey / Redis >= 7.0): set the TTL only if the key has none,
-- so repeat bids don't keep extending the window
-- reject if INCR returned > 10
This protects the CAS script from a single misbehaving client. Legitimate rapid bidding (proxy cascades) comes from internal services that use a separate quota pool.
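A minimal in-memory model of the fixed-window counter, assuming the real state lives in Valkey as above (the dict, function name, and signature are illustrative):

```python
import time
from collections import defaultdict

LIMIT = 10     # bids per window per (user, auction)
WINDOW = 1.0   # seconds

# In production this counter is a Valkey key with a 1 s TTL; the dict here
# only models the fixed-window semantics.
_counters = defaultdict(lambda: (0.0, 0))  # key -> (window_start, count)

def allow_bid(user_id: str, auction_id: str, now: float = None) -> bool:
    now = time.monotonic() if now is None else now
    start, count = _counters[(user_id, auction_id)]
    if now - start >= WINDOW:
        start, count = now, 0          # window expired: start a fresh one
    count += 1
    _counters[(user_id, auction_id)] = (start, count)
    return count <= LIMIT
```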
[12.3] Noisy-neighbor protection layers
| Layer | Mechanism | Where it runs |
|---|---|---|
| Per-user API rate limit | Sliding window by user_id | API gateway |
| Per-auction burst control | Kafka partition quotas (quota.consumer_byte_rate) | Kafka broker |
| Bid processor scheduling | Single consumer per partition; fair scheduling within | Bid processor |
| Pub/Sub fan-out cost | Bounded subscriber count per pod | WebSocket gateway |
13. Auction Search and Ranking
Discovery is the bridge between 10M listings and a buyer who wants to find one. The browse experience uses Elasticsearch as the primary read index, populated from Postgres via CDC and from bids.accepted via the Kafka pipe.
[13.1] What buyers actually query
Three queries dominate traffic:
- Category browse. "Show me electronics, sorted by ending soon."
- Free-text search. "Vintage rolex submariner."
- Saved-search alerts. "Notify me when a Honda CB750 listing under $5,000 is posted."
Everything else (advanced filters, geo) is long-tail.
[13.2] Ranking signals
The ranking function combines static and live signals. Static signals come from Postgres CDC; live signals come from bids.accepted.
| Signal | Source | Update cadence | Weight |
|---|---|---|---|
| `time_remaining_sec` | Live (computed at query time) | per query | High for "ending soon" sort |
| `bid_count` | `bids.accepted` stream | seconds | Medium (proxy for interest) |
| `watcher_count` | Watchlist subscribe events | seconds | Medium |
| Text relevance (BM25) | Title + description tokens | At index time | High for free-text |
| Category match | category_id exact | At index time | High |
| Seller reputation | Postgres CDC | hourly | Medium |
| Promoted boost | auctions.promoted_until | hourly | High when active |
| Image quality score | Image moderation pipeline | At upload | Low (tiebreaker) |
[13.3] Ending-soon ranking
The most-clicked surface. Buyers want to see auctions ending in the next 1-24 hours, sorted by remaining time.
Naive approach. ORDER BY current_end_time ASC LIMIT 50 against Elasticsearch. Works but the head of the list is dominated by no-bid junk listings.
Production approach. Composite score: score = (1 / max(time_remaining_minutes, 1)) * (1 + log10(1 + bid_count)). Boosts auctions that are both ending soon and seeing real activity; the max(..., 1) floor avoids divide-by-zero at the instant of close. Cold listings drift to page 5+.
Refresh strategy. The time_remaining field is computed at query time from current_end_time minus now(), so the index does not need re-writes for the clock advancing. Anti-sniping extensions update current_end_time via the CDC pipe; freshness is sub-second.
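The composite score above translates directly to code. A sketch; the function name is an assumption:

```python
import math

def ending_soon_score(time_remaining_minutes: float, bid_count: int) -> float:
    # Urgency times log-damped activity. The max(..., 1) floor avoids
    # divide-by-zero at the instant of close.
    return (1 / max(time_remaining_minutes, 1)) * (1 + math.log10(1 + bid_count))
```

An active auction ending soon outranks both a cold one ending soon and an equally active one ending later, which is the intended sort behavior.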
[13.4] Free-text relevance
Standard Elasticsearch BM25 on title^3, description^1, brand^2. Title boost is highest because auction titles are short and dense.
Synonyms. A small synonym set per category: "rolex" / "rolex watch" / "submariner" / "sub" all match. Maintained by an editorial team; ~5K synonyms total.
Spell correction. phrase_suggester with a 1-edit-distance budget. Corrections are surfaced as "did you mean" without auto-replacing the query.
Personalization (optional). A small per-user vector of category affinities derived from past bid + watch history is mixed into the ranking with a low weight (10%). Not aggressive; users hate when search "guesses."
[13.5] Promoted listings
Sellers can pay to boost a listing in search results. Implementation:
- `auctions.promoted_until TIMESTAMPTZ` field; when present and > now, a fixed score boost is added.
- Promoted listings are clearly labeled in the UI (legal requirement in many regions).
- The boost saturates: at most 2 promoted slots per page, after which others rank organically. Prevents pay-to-rank-above-quality.
Revenue from promoted listings is tracked separately for billing.
[13.6] Index pipeline
Index lag target: <60 s p99 from auction creation to first searchable. Bid-count freshness target: <10 s p99.
[13.7] Failure modes
| What | Effect | Mitigation |
|---|---|---|
| Elasticsearch cluster down | Search returns 503; browse falls back to a static "popular" page from CDN | Multi-AZ ES cluster with 3 master nodes; fallback page refreshed hourly |
| Index lag spike (>5 min) | New listings invisible | Auto-page on indexer lag; pause promoted-listing billing during incident |
| Bad index update (mapping change) | Field type errors on query | Blue-green index strategy: build new index in parallel, swap alias atomically |
| Search abuse (scrapers) | Inflated query load, ranking distortion | Per-IP rate limit; bot-detection on User-Agent + click-through ratio |
14. Multi-Region
[14.1] Write locality
Each auction is pinned to one region at creation time (auctions.region). All bids for that auction route to that region's infrastructure. This avoids global coordination on the hot path and accepts slightly higher latency for cross-region bidders (~100 ms extra).
Routing: the L7 load balancer reads auction_id from the request path, resolves it to a region via a small lookup service backed by a cached auction_id → region table, and proxies the request accordingly. Auction creation (POST /api/v1/auctions, no auction_id in the path) routes to the user's home region as declared in their JWT; the created auction inherits that region and the auction_id → region cache is populated on create.
[14.2] Cross-region reads
Browse, search, and auction-detail reads are served from the nearest region. Postgres read replicas replicate across regions with typical lag of ~50 ms intra-region and 150-300 ms cross-region. Elasticsearch indexes are built per-region from the Kafka bids.accepted stream via MirrorMaker 2.
A reader in EU sees the auction detail from the EU Postgres replica. When they click "bid," the request routes to the auction's home region (say US-East) for processing.
[14.3] Region failover
If the auction's home region goes fully dark:
- Read side: other regions continue serving browse and detail traffic from their replicas.
- Write side: bids for that region's auctions reject with 503. There is no automated failover of write ownership; it would require durable cross-region state replication that would slow the hot path. Manual failover is a 5-10 min RTO: promote the replica to primary, update the `auction_id → region` table, resume traffic.
A fast hot path wins over automated cross-region write failover, which would require durable cross-region consensus on auction state. Settlement runs in the auction's home region; the advisory-lock reconciliation job (§11) runs globally only in the primary DR region.
15. Bottlenecks and Backpressure
[15.1] Hot auction CAS contention
One auction pulling 500 bids/sec hits the CAS script 500 times/sec. Each script is 0.5 ms, so the key spends 250 ms/sec executing scripts. One Valkey shard is single-threaded, so the hard ceiling for this script is ~2K/sec per key (0.5 ms × 2K = 1 CPU-second per wall second).
Mitigation beyond 2K/sec: Valkey Cluster with consistent hashing (the key for a given auction lands on one specific shard; distribute hot auctions across shards), or shard state within one auction (rarely necessary).
Saturation fallback. If the per-key CAS queue depth crosses a threshold (p99 script wait > 50 ms), the bid processor trips a per-auction circuit breaker. New bids for that auction queue with a SLOW_PATH hint, and the client UI shows a "high traffic, bids may be delayed" banner instead of failing. The fallback drains as the queue clears. Ordering is preserved under stress and perceived latency stays honest.
[15.2] WebSocket fan-out amplification
A hot auction with 5K watchers produces 5K frames for every accepted bid. At 500 bids/sec, that is 2.5M frames/sec for one auction. Across 40 pods, ~62.5K frames/sec per pod from this auction alone.
Mitigation:
- Batching: combine multiple bid updates within 100 ms into one WebSocket frame for quiet connections.
- Coalescing: if two bids arrive within 50 ms, the second overwrites the first in the outbound buffer (the client only needs the latest state).
- Shedding: slow connections (client ack lag > 5 s) are disconnected. The client will reconnect and resume from Postgres.
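The coalescing rule (latest state wins per auction) can be sketched as a per-connection outbound buffer; the class and method names are illustrative:

```python
class OutboundBuffer:
    """Per-connection coalescing buffer: within one flush interval, a newer
    bid update for an auction overwrites the older one, so the client only
    ever receives the latest state for each auction it watches."""

    def __init__(self):
        self._latest = {}  # auction_id -> most recent update payload

    def offer(self, auction_id: str, payload: dict) -> None:
        self._latest[auction_id] = payload  # overwrite, never append

    def flush(self) -> list:
        # One frame per auction, then reset for the next interval.
        frames, self._latest = list(self._latest.values()), {}
        return frames
```

Batching falls out of the same structure: the gateway calls `flush()` on a 100 ms tick and sends whatever accumulated as one frame.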
[15.3] Kafka consumer lag
400 partitions; a processor crash leaves its partition unassigned for ~2 s (rebalance time) + consumer group re-join (~1 s). Bids pile up in the partition during this window. For a quiet partition, lag is recovered in seconds. For a hot partition, ~1000 bids queue.
KEDA scales the processor fleet on lag metric, but lag-driven autoscaling is reactive. For predictable peaks (evening prime time), schedule a pre-warm scale-up via a cron-triggered deployment annotation.
[15.4] Partition-bound parallelism
400 Kafka partitions = 400 concurrent consumers, period. Adding a 401st pod does nothing. Growing past 50K bids/sec requires adding partitions, which rebalances the consumer group and migrates state.
Plan for 2× growth: provision 800 partitions up front. The extra ones cost little in Kafka, and the option to scale is worth the future pain avoided.
[15.5] Postgres bid insert rate
50K inserts/sec into a single bids table is beyond one Postgres primary (typical ceiling: 15-25K/sec for narrow rows on NVMe). Mitigation:
- Partitioning by `created_at` keeps the hot working set small: all current inserts land in one weekly partition, and older partitions see no writes.
- Batched inserts. COPY or multi-row INSERT from the bid processor, 10-50 bids per Postgres round trip.
- Async write. The processor can return CAS acceptance to the client before the Postgres insert completes; the insert is re-driven on failure from Kafka `bids.accepted`.
Alternative: shard bids by auction_id across multiple Postgres primaries. Operational complexity; avoided unless scale demands it.
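The batched-insert path can be sketched as SQL assembly. Column names are assumptions; the `ON CONFLICT DO NOTHING` mirrors the dedup behavior of the `UNIQUE (auction_id, sequence_num)` index, so a redelivered batch is a no-op:

```python
def batched_bid_insert(bids: list, max_batch: int = 50):
    """Build one parameterized multi-row INSERT per batch (<= max_batch
    rows): 50 bids cost one Postgres round trip instead of 50.
    Each bid is a tuple (auction_id, bidder_id, amount, sequence_num)."""
    batch = bids[:max_batch]
    values = ", ".join(["(%s, %s, %s, %s)"] * len(batch))
    sql = (
        "INSERT INTO bids (auction_id, bidder_id, amount, sequence_num) "
        f"VALUES {values} ON CONFLICT DO NOTHING"
    )
    params = [field for bid in batch for field in bid]
    return sql, params
```

The processor would hand `sql, params` to its driver (e.g. `cursor.execute(sql, params)`); COPY is the faster variant once batches grow past a few hundred rows.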
[15.6] Backpressure: three-layer admission control
Cause (§15.1-15.5) and response belong together. When a bottleneck flares, admission control sheds load before it becomes user-visible.
- Kafka consumer lag. If `bids.incoming` lag exceeds 30 s for 60 s, the API gateway returns 503 for new bids in that region. High-value auctions (top 1%) are still accepted; others shed.
- Bid processor loop time. If CAS + persist p99 exceeds 100 ms for 2 min, the API reduces per-user rate limits to 5 bids/sec (from 10).
- Postgres write latency. If `bids` insert p95 exceeds 20 ms, the processor drops to async-write mode and batches inserts at 50/batch.
Each response is reversible on recovery. None require a deploy.
16. Retries, Fraud, and Recovery
[16.1] Client retry semantics
Bid clients retry on network errors with exponential backoff and jitter. The Idempotency-Key header keys the retry: the bid processor's UNIQUE (bidder_id, idempotency_key) constraint makes the second attempt a no-op, returning the stored result.
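A sketch of the client-side backoff schedule, assuming full jitter (the exact policy is not specified in this design; the load-bearing property is that every attempt re-sends the same `Idempotency-Key`, so the server's unique constraint turns duplicates into a replay of the stored result):

```python
import random

def retry_schedule(max_attempts: int = 5, base: float = 0.25,
                   cap: float = 8.0, rng: random.Random = None) -> list:
    """Full-jitter exponential backoff: attempt n sleeps a uniform random
    delay in [0, min(cap, base * 2**n)]. Jitter decorrelates the thundering
    herd of clients retrying after the same transient failure."""
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap, base * (2 ** n))) for n in range(max_attempts)]
```

The retry loop itself is then trivial: generate one `Idempotency-Key` before the first attempt, sleep each delay in turn, and stop on any non-retryable status.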
[16.2] Fraud detection
Auction fraud is a category problem. Real platforms see all of these in production. Defenses run on two timescales: inline (millisecond-level checks at the API) and offline (ClickHouse jobs over 30-day bid histories, daily).
Pattern: shill bidding (seller pumps own auction).
Seller creates a sock-puppet account and bids on their own listing to drive the price up.
- Inline signals. Same device fingerprint, same browser font hash, same IP /24, same payment-method fingerprint between bidder and seller account.
- Offline signals. Account graph: bidder who only bids on one seller's auctions, never wins, never pays. Payment-method reuse across "different" accounts.
- Action. Inline: flag and route to `bids.quarantine`. Offline: account suspension + refund affected winners + ban payment method.
Pattern: bot bidding (high-frequency automated bids).
Bots that bid in the final 100 ms of an auction, often repeatedly across thousands of listings.
- Inline signals. Bid timing variance below human reaction floor (<150 ms between successive bids), inhuman click coordinates from the same client, missing or replayed CSRF tokens.
- Offline signals. Per-account bid-rate distribution clustering far above the cohort median.
- Action. Inline: rate-limit hard, then CAPTCHA challenge. Offline: account ban if pattern persists post-challenge.
Pattern: bid retraction abuse.
Bidder bids high to scare off competition, then retracts at the last moment to win at the previous lower price.
- Inline signals. Retraction within X minutes of placing a bid that became the high bid.
- Offline signals. Retraction frequency per bidder (>5% of bids retracted), retractions clustered near auction end.
- Action. Hard cap on retractions per bidder per month; flag-and-review if hit.
Pattern: collusion / bid cartels.
Coordinated group rotating winners across high-value categories to keep prices artificially low.
- Offline signals. Graph clustering on co-bidder pairs (accounts that consistently appear together but never outbid each other in the final stretch). Network analysis on shared shipping addresses, payment instruments, login geolocation.
- Action. Manual investigation; bans propagate across all linked accounts in one go.
Pattern: account takeover / fraudulent winning bids.
Compromised account places massive bid, abandons payment, leaves seller with no real winner.
- Inline signals. Login from new geolocation + first bid above N× the account's historical average + new payment method added in last 24 h.
- Action. Step-up authentication (SMS / email confirm) before the bid commits to Kafka.
Detection pipeline.
Operational reality. Inline rules cover ~80% of crude attacks at near-zero latency. Offline graph analysis catches the sophisticated 20% with 24 h lag. False-positive budget is the SLO: target <0.1% of legitimate bids flagged; human review capacity is the binding constraint. High-value items (>$10K) get tighter inline thresholds plus KYC before the first bid. Risk tiers gate inline action (§9.11).
Shill bidding by determined sellers using clean burner phones, residential proxies, and prepaid cards is hard to catch inline. The defense is offline payment-graph correlation over months plus selective audit of winners who never review the seller.
[16.3] Seller-side fraud and delivery disputes
Buyer-side fraud (§16.2) is only half the surface. Sellers can ship nothing, ship counterfeits, or list stolen goods. Defenses run at three points in the auction lifecycle.
At listing creation.
- Perceptual-hash check against the platform's stolen-goods denylist (§8.6).
- Seller account risk tier: new sellers cannot list above a category cap until they complete a first successful sale plus KYC.
- High-value categories (watches, electronics > $5K) require photo-with-serial or authenticator-partner verification before the listing goes live.
At settlement, before payout. Funds sit in a platform-held escrow for the chargeback window (7 days default, 30 days for new sellers; see §9.13). Payout releases only after the window closes and no dispute is open.
On buyer complaint (no-ship, not-as-described, counterfeit). A dispute opens a buyer_claims row in INVESTIGATING. The settlement state moves to DISPUTED and payout is frozen. Resolution paths:
- Seller uploads tracking proving delivery: claim closes, payout releases.
- Buyer returns the item with return-tracking: refund fires against the original `payment_intent_id` with `Idempotency-Key: refund-{auction_id}`.
- Counterfeit confirmed by authenticator partner: seller account banned, funds clawed back, refund fires, listing pHash added to the stolen-goods denylist.
The claim window is 30 days from settlement by default, extended to 90 days for items > $10K. All transitions write to audit_log with actor and reason.
[16.4] Payment failures
Winner's payment method declines. Flow:
- Settlement captures the auction but payment returns `card_declined`.
- `auction_settlements.status = 'PAYMENT_FAILED'`.
- Notification to the winner: "Your payment failed, please update your payment method within 48 h."
- The 48 h timer fires; if payment is still failing, the settlement is marked `ABANDONED` and the auction is offered to the second-highest bidder (if the reserve is still met).
- Second-chance offer via email + dashboard. If accepted, settlement restarts with a new fencing token.
17. Failure Scenarios
[17.1] Bid processor crashes mid-bid
Processor consumed a message, ran the CAS, but crashed before persisting to Postgres.
Effect. Valkey has the accepted bid reflected in current_price, and bid_result:{bid_id} holds the accept outcome with sequence_num = 24. Postgres does not have the bid row. Kafka offset was not committed.
Recovery. Kafka redelivers on partition rebalance. The new processor runs the CAS script again. The script's first action is SET bid_result:{bid_id} … NX; since the key already exists, the script returns the cached outcome (ACCEPTED, seq=24) without re-mutating state. The processor then writes the bid row to Postgres using the cached sequence_num, publishes bids.accepted, and commits the Kafka offset.
Why the bid_result cache is the correct mechanism. Without it, the second attempt sees current_price already at $105, the expected_price = $100 check fails, and the CAS returns STALE_EXPECTED_PRICE: a rejection for a bid that was actually accepted. The bid_id NX cache makes the CAS idempotent per bid, not per current-state view.
TTL. bid_result:{bid_id} lives for max(auction_end − now, 0) + 48 h. The 48 h buffer covers settlement retries and any reconciliation pass. Memory cost at 50K submissions/sec is bounded: by the time a 7-day auction's settlement completes, its bid_result keys are within the window of expiry.
Postgres-write ordering. The processor writes to Postgres after the CAS but before committing the Kafka offset. A second crash between Postgres write and offset commit is harmless. The redelivered message hits the cached result, and the UNIQUE (auction_id, sequence_num) partial index (§7.2) makes the second INSERT a no-op.
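The replay behavior in this scenario can be modeled without Valkey. A heavily simplified Python sketch of the CAS semantics (validation, anti-snipe, and TTLs omitted; the dict stands in for Valkey state):

```python
def cas_place_bid(store: dict, bid_id: str, amount: int, expected_price: int):
    """Model of the Lua CAS. The first action mirrors
    SET bid_result:{bid_id} ... NX: a redelivered bid (crash + Kafka
    redelivery) returns the cached outcome instead of being re-judged
    against the already-mutated price."""
    results = store.setdefault("bid_result", {})
    if bid_id in results:                    # dedup: idempotent per bid_id
        return results[bid_id]
    if store["current_price"] != expected_price or amount <= store["current_price"]:
        outcome = ("STALE_EXPECTED_PRICE", None)
    else:
        store["current_price"] = amount      # accept: mutate price
        store["seq"] += 1                    # assign per-auction sequence
        outcome = ("ACCEPTED", store["seq"])
    results[bid_id] = outcome                # cache outcome for replays
    return outcome
```

Replaying the accepted bid returns the same `(ACCEPTED, seq)` without touching state, while a genuinely new bid against the moved price is rejected, which is exactly the distinction the `bid_result` cache exists to make.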
[17.2] Valkey node failure during CAS
Valkey primary dies during a Lua script execution.
Effect. The script may or may not have committed its changes to memory. AOF fsync = everysec means up to 1 s of writes can be lost on hard failure.
Recovery. Valkey Sentinel or Cluster promotes a replica. The replica has the state as of the last replication lag (~1-10 ms for local cluster).
Bid correctness. The bid processor sees a connection error and retries. Same bid, same expected_price. If the CAS's effect was replicated, retry fails with STALE_EXPECTED_PRICE. If not, retry succeeds. Either way, no double-accept.
Hot-start hydrate. If Valkey loses state entirely (data center loss), re-hydrate from Postgres: SELECT auction_id, MAX(amount), COUNT(*), MAX(sequence_num) FROM bids WHERE status = 'ACCEPTED' GROUP BY auction_id. Takes minutes for 10M auctions; during that window, new bids are rejected with 503.
[17.3] Flink checkpoint failure during settlement
Flink checkpoint fails while a settlement job is mid-flight.
Effect. On recovery, Flink replays events from the last successful checkpoint. The settlement event for this auction may fire twice.
Recovery. Fencing token guards (§9.4). The second attempt gets a higher token, writes SOLD + fencing_token = 43 (was 42). Payment call uses Idempotency-Key: settle-{auction_id}. Same key as attempt 42, so Stripe returns the original response. No double charge.
[17.4] WebSocket gateway pod crashes
Pod holding 50K connections dies.
Effect. 50K clients see disconnect. They reconnect to another pod (sticky-session LB routes to healthy pods).
Recovery. On reconnect, each client sends resume with its last seen sequence. The new pod fetches missing bids from Postgres. Service restored in seconds per client.
[17.5] Postgres primary failover during peak bidding
Postgres primary dies. Streaming replica promotes via Patroni; failover takes 20-60 s depending on health-check cadence and connection drain.
Effect. Bid processors cannot persist during the failover window. Valkey CAS continues (Valkey is independent). Accepted bids pile up in a pending_persist queue in the bid processor.
Recovery. On Postgres return, processors drain the pending queue. Bid acknowledgements to clients are already sent (based on CAS result); clients see no degradation. WebSocket broadcasts also continue (Valkey Pub/Sub is independent).
Constraint. The pending_persist queue is bounded (~30 s × 50K = 1.5M messages). Each processor's queue is local memory. Hard cap: 100K per processor, after which the processor starts rejecting bids. With 400 processors, cap is 40M, plenty.
[17.6] Kafka broker outage
One broker out of six dies. Replication factor 3 means each partition has two survivors. Kafka rebalances leadership within seconds.
Effect. Sub-second p99 latency spike on produce for partitions whose leader was on the dead broker. No message loss.
Recovery. Automatic. Dead broker replaced and caught up via replica fetch.
[17.7] Region outage
Auction's home region goes dark (power, network, provider outage).
Effect. Bids for auctions pinned to that region fail with 503. Other regions unaffected for their auctions.
Recovery. Manual failover of auctions to a designated DR region: promote the Postgres replica, point the auction_id → region lookup to the new region, resume traffic. RTO 5-10 min. RPO is the Postgres replication lag at the moment of failure (typically <500 ms).
Settlements for in-flight auctions in the failed region are deferred until region recovery. If recovery exceeds the SLA, the DR region takes over settlement using the last-known auction_settlements state and fencing tokens.
Auction dying inside the final window. If the home region loses quorum in the final 30 seconds of an auction, the timer service in the DR region does not auto-fire the close. On manual failover (5-10 min RTO), the promoted Postgres replica carries the last durable current_end_time, and any end_time_changed events in Kafka replay on the new Flink job. If replication lag at the moment of failure was below the anti-snipe extension window (typical 120 s), no bids are lost and the auction closes on the extended end time. Winner and runner-up are determined from the highest accepted sequence_num in the replicated bids table. Settlement then runs normally on the DR region.
[17.8] Payment provider outage
Payment provider is down or degraded for 30-120 minutes. Rare but not theoretical: every major PSP has shipped multi-hour incidents at some point.
Effect. Settlements that reach the "capture" step fail with 5xx or timeout. Settlement state stays in INITIATED or PAYMENT_AUTHORIZED. Bidding is unaffected; only the last-mile charge is blocked.
Inline response.
- Circuit breaker on the payment client: open after 20% failure rate over 60 s, half-open retries every 30 s. Prevents piling up retries against a dead provider.
- Queue depth check: if `auction_settlements` in `INITIATED` grows past 1000, page ops. Capture is deferred but safe; winners wait longer for the "won" confirmation.
- Email winners: "Your payment is being processed. If it fails, we will retry automatically for 48 hours." (literal user-facing copy)
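The payment-client circuit breaker can be sketched as follows. The thresholds (20% over 60 s, 30 s half-open) come from the bullets above; the class name and bookkeeping are illustrative, not a specific library's API:

```python
import time
from collections import deque

class PaymentBreaker:
    """Open when the failure rate over a sliding 60 s window reaches 20%;
    allow a half-open probe after a 30 s cooldown; close on a successful
    probe. min_calls avoids tripping on a tiny sample."""

    def __init__(self, window=60.0, threshold=0.2, cooldown=30.0, min_calls=10):
        self.window, self.threshold = window, threshold
        self.cooldown, self.min_calls = cooldown, min_calls
        self.calls = deque()      # (timestamp, ok) pairs inside the window
        self.opened_at = None     # None = closed

    def _trim(self, now):
        while self.calls and now - self.calls[0][0] > self.window:
            self.calls.popleft()

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            return True
        return now - self.opened_at >= self.cooldown   # half-open probe

    def record(self, ok, now=None):
        now = time.monotonic() if now is None else now
        self.calls.append((now, ok))
        self._trim(now)
        if ok and self.opened_at is not None:
            self.opened_at = None                      # probe succeeded: close
            return
        failures = sum(1 for _, k in self.calls if not k)
        if len(self.calls) >= self.min_calls and failures / len(self.calls) >= self.threshold:
            self.opened_at = now                       # trip open
```

While the breaker is open, settlements simply stay in `INITIATED` and are drained later by the janitor loop, which is why deferring capture is safe here.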
Recovery. When the provider recovers, the reconciliation job (janitor loop, §11.1) drains the INITIATED queue in FIFO order. Each retry uses the same Idempotency-Key: settle-{auction_id}, so Stripe returns the original successful capture for any that slipped through before the outage.
Secondary provider. For a provider-wide outage (rare but catastrophic), a secondary PSP (Adyen as Stripe's backup) is wired as a configurable fallback. Switching providers changes the idempotency key namespace; any in-flight INITIATED settlements stay on the primary until it recovers. New settlements start on the secondary.
18. Operational Playbook
[18.1] Deployment
- Bid processor: rolling deploy, 10% of pods at a time. Partition reassignment on each pod rotation triggers a ~1 s rebalance. 40-pod fleet redeploys in 8 min with zero downtime.
- WebSocket gateway: rolling with 10 s connection drain. Clients reconnect automatically.
- Flink: savepoint-and-restart. 60-90 s window where timers don't fire; recovered timers fire on startup.
- Postgres schema changes: `ALTER` on live tables uses `pg_repack` or online-DDL techniques. Adding columns is free; changing types requires a shadow table migration.
[18.2] Metrics and alerts
Key metrics:
| Metric | Alert threshold | Reason |
|---|---|---|
| Bid acceptance p99 latency | >200 ms for 5 min | User-visible slowness |
| Kafka `bids.incoming` lag | >30 s for 60 s | Bid backlog growing |
| Valkey CAS p99 | >2 ms for 2 min | Hot-key saturation |
| Settlement p99 | >5 s for 5 min | Revenue-critical path |
| Payment success rate | <99% for 5 min | Payment provider issue |
| WebSocket frame drop rate | >1% for 2 min | Gateway overload |
| Postgres write p95 | >20 ms for 2 min | Approaching bottleneck |
[18.3] Backup and recovery
- Postgres continuous archiving to S3 with 5-min PITR granularity. Daily full backup.
- Valkey AOF with everysec fsync. Snapshot to S3 every hour.
- Kafka replication factor 3; no separate backup (the topic retention is the backup).
- ClickHouse daily backup to S3; analytics replayable from Kafka retention.
[18.4] Capacity planning
- Monitor ratio of peak to average bid rate weekly. If the 30× ratio grows, partitions need scaling.
- Monitor hot-auction distribution. If the top 0.1% regularly exceeds 500 bids/sec, consider dedicated partition assignment.
- Monitor Postgres bid table growth. Re-evaluate archive cadence quarterly.
[18.5] Top 5 alerts (3 AM on-call)
- Bid acceptance p99 >500 ms. Likely Valkey or Postgres degradation.
- Settlement latency >30 s. Payment provider or Flink issue.
- Payment success rate <95%. Upstream provider outage or fraud spike.
- WebSocket reconnection rate >10× baseline. Gateway pod crashes or LB misrouting.
- Kafka lag >2 min. Processor fleet under-provisioned or stuck.
[18.6] Observability stack
- Metrics: Prometheus scrapes all services; long-term retention in Mimir or Thanos. Grafana for dashboards. Red-golden-signal boards per subsystem (API, bid processor, Valkey, Postgres, Kafka, Flink, WebSocket gateway, settlement).
- Tracing: OpenTelemetry SDK in each service; traces exported to Tempo or Jaeger. The trace header threads from the API gateway through Kafka (producer-injected headers) into bid processor, broadcast, and settlement. The `bid_id` is tagged on every span so a single bid is end-to-end queryable.
- Logging: structured JSON to Loki; ten-minute hot retention, 30-day cold. Alert on error-rate anomalies, not absolute error counts.
- Profiling: continuous pprof for Go services (bid processor, WebSocket gateway) into Pyroscope. CPU flamegraphs are the fastest way to diagnose hot-key Valkey script regressions.
- Synthetic probes: a black-box tester places bids against a canary auction every 30 s from each region; SLO breach fires before real users notice.
[18.7] Lua script change management
The CAS script is a hot-path correctness change; treat it like a schema migration:
- EVALSHA versioning. Processors load `script_v<N>.sha` from config at startup. A deploy that flips the config to `script_v<N+1>.sha` is an atomic version swap.
- Canary auction. New scripts are shadow-run against a synthetic auction in staging, then enabled for 1% of live auctions (steered via a Valkey config flag) before global rollout.
- Instant rollback. Rollback is a config flip back to the previous SHA; old scripts are never deleted from Valkey until two deploy cycles have passed.
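A minimal sketch of the version-swap mechanics. In production the SHA comes from `SCRIPT LOAD` against Valkey (which returns the script's SHA-1); here it is computed locally, and the class and method names are illustrative assumptions.

```python
import hashlib

class ScriptRegistry:
    """Versioned CAS-script registry (illustrative names throughout).

    Processors EVALSHA whatever the active version's SHA is; activating
    a version is a single reference swap, so rollout and rollback are
    both config flips, never script edits in place.
    """
    def __init__(self):
        self._shas = {}           # version -> sha1 hex digest
        self.active_version = None

    def load(self, version: str, script_source: str) -> str:
        # Mirrors SCRIPT LOAD: the handle is the SHA-1 of the source.
        sha = hashlib.sha1(script_source.encode()).hexdigest()
        self._shas[version] = sha
        return sha

    def activate(self, version: str) -> str:
        if version not in self._shas:
            raise KeyError(f"version {version} never loaded")
        self.active_version = version
        return self._shas[version]

    def rollback(self, previous_version: str) -> str:
        # Old SHAs are retained, so rollback is just another activate.
        return self.activate(previous_version)
```

Because old versions stay loaded, rollback never races a `SCRIPT LOAD`; it only changes which SHA processors pass to `EVALSHA`.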
19. SLOs and Error Budgets
| SLO | Target | Error budget |
|---|---|---|
| Bid acceptance availability | 99.99% | 52 min/year |
| Bid confirmation latency p99 | <200 ms | 7.2 h/month outside bound |
| Bid broadcast latency p99 | <200 ms | 7.2 h/month outside bound |
| Settlement correctness (zero double-settlements) | 100% | 0 incidents/year |
| Settlement latency p99 | <5 s | 43 h/month |
| WebSocket connection success | 99.9% | 43 min/month |
| Search freshness (new auction visible in search) | <60 s | 10% of auctions/day |
Error budgets drive release cadence. If bid acceptance availability dips below 99.99% monthly, feature deploys halt and engineering focuses on stability until the budget recovers.
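The budgets in the table fall out of a one-line conversion from SLO to allowed downtime; a quick sketch for sanity-checking new targets:

```python
def error_budget_minutes(slo: float, period_days: int) -> float:
    """Allowed out-of-SLO time (minutes) over a period, e.g. 0.9999 -> ~52.6 min/year."""
    return (1.0 - slo) * period_days * 24 * 60

# 99.99% over a year: the table's "52 min/year" rounds this down.
print(round(error_budget_minutes(0.9999, 365), 1))  # -> 52.6
# 99.9% over a 30-day month matches the WebSocket row.
print(round(error_budget_minutes(0.999, 30), 1))    # -> 43.2
```

The latency budgets work the same way, except the "downtime" clock runs whenever the p99 is outside its bound rather than when the service is hard-down.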
20. Security
- Authentication: OAuth 2.0 for the API; session cookies for the web client. JWTs carry `user_id`, `account_status`, and `region`.
- Authorization: bidders cannot bid on their own auctions (enforced at the API). Sellers cannot modify auctions after the first bid (enforced by a status check).
- Payment data: never touches platform infrastructure. Payment provider handles PAN; the platform stores only a tokenized reference.
- Webhook signatures: all incoming provider webhooks (Stripe, Adyen) verified via HMAC before processing.
- Rate limits: per-user, per-IP, per-auction. Rate limit events logged for fraud analysis.
- PII: bid history visible to the bidder, seller, and platform ops. Watcher lists are private. High-bidder usernames are masked in public views, so competitors cannot directly identify each other from the live feed.
- Reserve price: stored in plaintext in `auctions.reserve_price` under Postgres row-level security. The value never appears in API responses, WebSocket frames, Kafka payloads (`bids.accepted`, `auctions.sold`), or analytics exports. Buyers see only a boolean `reserve_met`, and only after settlement. The seller, the settlement worker, and platform ops are the only readers.
- Admin actions audited: all admin overrides (bid cancellation, auction force-close, user ban) are logged to `audit_log` with actor, timestamp, and reason.
- Edge protection: WAF in front of the API with rules for SQL injection, path traversal, and known bad bot UAs. Anycast scrubbing (Cloudflare, Shield Advanced, or equivalent) absorbs volumetric DDoS.
- Proof-of-work challenges: the bid endpoint optionally requires a lightweight PoW token on suspicious sessions (new account, high bid amount, residential-proxy IP block). Cost is ~100 ms client-side, invisible to real users, expensive for scrapers at scale.
- CAPTCHA: adaptive challenge on account creation, password reset, and bid submission when the risk score crosses a threshold. Managed service (Turnstile, hCaptcha) rather than hand-rolled.
21. Key Takeaways
- Optimistic concurrency via Valkey CAS scales bid acceptance to 50K/sec with sub-ms latency. Pessimistic row locks cannot.
- Effectively-once settlement stacks three layers: fencing token, conditional UPDATE, and provider idempotency key. Any one alone is insufficient.
- Kafka partitioning by `auction_id` gives free per-auction serialization. Partition count is the hard parallelism ceiling; provision 2× growth up front.
- Anti-sniping belongs inside the CAS, not in a separate pipeline. Atomic extension is the only way to close the "accept bid" vs. "extend end_time" race.
- The Postgres-only variant (§5.2) is the right starting point up to ~500 bids/sec. Graduate to Kafka + Valkey + Flink only when scale demands it.
22. Appendix
A. Atomic CAS + anti-sniping Lua (with bid_id dedup)
```lua
-- KEYS[1] = auction:{id}
-- KEYS[2] = bid_result:{bid_id}
-- ARGV:
--   1 bid_amount
--   2 expected_price
--   3 bidder_id
--   4 min_increment
--   5 now (unix seconds)
--   6 anti_snipe_seconds
--   7 anti_snipe_extend
--   8 result_ttl_seconds (auction_end - now + 48h)

-- 1. Redelivery dedup. If bid_id has a cached outcome, return it verbatim.
local cached = redis.call('GET', KEYS[2])
if cached then
  return cjson.decode(cached)
end

local price = tonumber(redis.call('HGET', KEYS[1], 'current_price'))
local endt = tonumber(redis.call('HGET', KEYS[1], 'current_end_time'))
local status = redis.call('HGET', KEYS[1], 'status')

local function finish(result)
  redis.call('SET', KEYS[2], cjson.encode(result), 'EX', tonumber(ARGV[8]))
  return result
end

-- 2. Auction state checks (order matters; stale before too-low for UX).
-- Guard first: a missing or corrupt auction key must not be reported as CLOSED.
if not status or not price or not endt then
  return finish({0, 'AUCTION_NOT_FOUND', 0, 0})
end
if status ~= 'ACTIVE' or tonumber(ARGV[5]) > endt then
  return finish({0, 'AUCTION_CLOSED', price, endt})
end
if tonumber(ARGV[2]) ~= price then
  return finish({0, 'STALE_EXPECTED_PRICE', price, endt})
end
if tonumber(ARGV[1]) < price + tonumber(ARGV[4]) then
  return finish({0, 'BID_TOO_LOW', price, endt})
end

-- 3. Acceptance path.
local seq = redis.call('HINCRBY', KEYS[1], 'sequence_num', 1)
redis.call('HSET', KEYS[1],
  'current_price', ARGV[1],
  'high_bidder', ARGV[3])
local time_left = endt - tonumber(ARGV[5])
if time_left < tonumber(ARGV[6]) then
  local new_end = endt + tonumber(ARGV[7])
  redis.call('HSET', KEYS[1], 'current_end_time', new_end)
  return finish({1, seq, ARGV[1], new_end, 'EXTENDED'})
end
return finish({1, seq, ARGV[1], endt, 'OK'})
```

The `bid_result:{bid_id}` cache makes the script idempotent per bid. A Kafka redelivery after the processor crashed post-CAS returns the original outcome (same `sequence_num`) without re-mutating state (see §17.1).
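The dedup-plus-sequencing semantics can be modeled without Valkey at all. A pure-Python sketch (illustrative only; the Lua above is the real implementation, and anti-sniping is omitted here):

```python
def make_cas(state):
    """Model of the CAS script's dedup + sequencing behavior.

    state holds the auction hash ("price", "seq"); results plays the
    role of the bid_result:{bid_id} keys.
    """
    results = {}  # bid_id -> cached outcome

    def place_bid(bid_id, amount, expected_price, min_increment=1):
        if bid_id in results:            # redelivery: return cached outcome
            return results[bid_id]
        if expected_price != state["price"]:
            out = (0, "STALE_EXPECTED_PRICE", state["price"])
        elif amount < state["price"] + min_increment:
            out = (0, "BID_TOO_LOW", state["price"])
        else:
            state["seq"] += 1            # HINCRBY: gap-free per-auction sequence
            state["price"] = amount
            out = (1, state["seq"], amount)
        results[bid_id] = out            # cache before returning, like finish()
        return out

    return place_bid

bid = make_cas({"price": 100, "seq": 0})
print(bid("b1", 105, 100))  # -> (1, 1, 105)   accepted, seq 1
print(bid("b1", 105, 100))  # -> (1, 1, 105)   redelivery: same seq, no re-mutation
print(bid("b2", 105, 100))  # -> (0, 'STALE_EXPECTED_PRICE', 105)
```

The second `b1` call is the crash-then-redeliver case from §17.1: the cached outcome is returned verbatim, so the sequence counter and price are touched exactly once per unique bid.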
B. Fencing token flow sequence
C. Bid sequence invariants
For a given auction:
- `sequence_num` is monotonically increasing on accepted bids, assigned atomically by the Valkey CAS script.
- Gap-free within accepted bids (partial unique index on `(auction_id, sequence_num) WHERE status = 'ACCEPTED'`). Rejected bids carry `sequence_num = NULL` and do not consume a number.
- Highest `amount` among `status = 'ACCEPTED'` rows is the current winner. Ties are broken by the lowest `sequence_num` (earliest arrival).
- `auctions.current_price` is maintained by the bid processor's Postgres write; it is the displayed price. The authoritative ordering, however, lives in `bids`: highest accepted amount with the lowest `sequence_num` as tiebreak. For any read where exactness matters (e.g. settlement), derive from `bids` directly.
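The winner rule in the invariants above is a one-liner once expressed as a sort key. A sketch, assuming bids as `(sequence_num, amount, status)` tuples (the shape is an illustration, not the schema):

```python
def current_winner(bids):
    """Highest accepted amount wins; earliest sequence_num breaks ties.

    Rejected bids carry sequence_num = None and are excluded entirely.
    """
    accepted = [b for b in bids if b[2] == "ACCEPTED"]
    if not accepted:
        return None
    # max amount wins; negating seq makes the *lowest* sequence_num win ties
    return max(accepted, key=lambda b: (b[1], -b[0]))

bids = [(1, 100, "ACCEPTED"), (2, 105, "ACCEPTED"),
        (3, 105, "ACCEPTED"), (None, 110, "REJECTED")]
print(current_winner(bids))  # -> (2, 105, 'ACCEPTED')
```

This is the query settlement should run against `bids`; the rejected 110 bid never competes, and the tied 105 bids resolve to the earlier arrival.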
Explore the Technologies
| Technology | Role | Learn more |
|---|---|---|
| Postgres 17 | Source of truth for auctions, bids, settlements | PostgreSQL |
| Valkey 8 | Per-auction state, CAS target, Pub/Sub bus | Redis/Valkey |
| Kafka 4.0 | Bid delivery bus, per-auction ordering | Kafka |
| Apache Pulsar | Alternative dispatch bus (§5.3) | Pulsar |
| Flink 1.19 | Auction-end timers, settlement pipeline | Flink |
| etcd 3.6 | Leader lease (upgrade from advisory lock) | etcd |
| ClickHouse 24 | Analytics over bid history | ClickHouse |
| Elasticsearch 8 | Auction browse and search | Elasticsearch |
Patterns: Message Queues & Event Streaming, Circuit Breakers & Resilience, Auto-scaling, Replication & Consistency, Rate Limiting.
Practice this design: Online Auction interview question.