System Design: News Aggregator (100K Sources, Dedup, Personalized Ranking)
Goal: Build a news aggregation platform that crawls 100K sources (RSS/Atom feeds, news APIs), ingests 5 million articles per day, deduplicates near-identical stories, ranks articles by relevance and freshness, and serves personalized feeds to 50 million daily active users. Support breaking news detection with alerts, category classification, and multi-language content.
Mental model -- four ideas that make everything else click:
- Crawling is scheduling. Each source has a different publication frequency. A Valkey sorted set acts as a priority queue: pop the source whose next-poll-time is earliest, fetch, re-insert with updated time.
- Deduplication is set similarity. Two articles covering the same earthquake are near-duplicates. MinHash compresses each article into a 512-byte signature. LSH groups similar signatures into buckets without pairwise comparison.
- Ranking is decay. Every article starts with a score based on source authority and topic importance. Freshness decays exponentially. After 24 hours, even the best article fades from the feed.
- Breaking news is anomaly detection. Count articles per topic per time window. When a topic's volume exceeds 5x its rolling average, something happened.
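The last idea can be sketched in a few lines (illustrative Python, not the production Flink job; the window-history length and minimum-baseline floor are assumptions, the 5x factor comes from the bullet above):

```python
from collections import defaultdict, deque

WINDOW_HISTORY = 12    # baseline = rolling mean of the last 12 five-minute windows (assumption)
SPIKE_FACTOR = 5.0     # the 5x threshold from the bullet above
MIN_BASELINE = 2.0     # floor so near-silent topics do not alert on 2-3 articles (assumption)

history = defaultdict(lambda: deque(maxlen=WINDOW_HISTORY))

def close_window(topic: str, count: int) -> bool:
    """Called when a topic's 5-minute window closes; True signals a volume spike."""
    past = history[topic]
    baseline = max(sum(past) / len(past), MIN_BASELINE) if past else MIN_BASELINE
    history[topic].append(count)    # current window joins the baseline afterwards
    return count > SPIKE_FACTOR * baseline
```

A topic humming along at one article per window trips the detector the moment a window carries dozens.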
TL;DR: Adaptive RSS/Atom polling with per-source frequency (every 5 minutes for major outlets, every 6 hours for low-volume blogs). Articles flow through Kafka to a deduplication pipeline using MinHash + Locality-Sensitive Hashing to cluster near-duplicate stories. Ranking combines freshness (exponential decay), source authority, cluster size, and user interest vectors. Breaking news detection compares current-window ingestion volume per topic against rolling averages via Flink. Personalization blends globally ranked articles with per-user category weights and diversity constraints. Content served from Valkey cache with PostgreSQL as source of truth.
System invariant: Ingestion, processing, and serving can fail independently. Crawlers continue publishing to Kafka even if dedup is down. Feeds continue serving from Valkey cache even if PostgreSQL is unavailable. Breaking news detection runs as a parallel Flink consumer with no dependency on the main processing pipeline. This isolation is what makes the system resilient: any single layer can degrade without cascading into total outage.
The Three Problems
News aggregation appears straightforward until the constraints become concrete. Three problems shape every decision in this design.
Problem 1: Efficient source polling at scale. 100K RSS/Atom feeds need to be checked regularly. Fixed-interval polling at 5 minutes means 100K / 300s = 333 feed fetches per second. Each fetch is an HTTP request that may return zero to fifty new articles. Most fetches return nothing new because the source has not published since the last check. Polling too frequently wastes bandwidth and risks getting the crawler IP-banned. Polling too infrequently means missing breaking news by hours.
The New York Times publishes 200+ articles per day. A niche tech blog publishes 2 per week. Polling both every 5 minutes means 288 daily polls to the NYT (productive, finds new articles on most checks) and 288 daily polls to the blog (wasteful, finding something new only twice in 288 * 7 = 2,016 weekly polls).
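Adaptive polling closes this gap by deriving each source's interval from its observed publication rate. A minimal sketch (the 5-minute floor and 6-hour ceiling come from the TL;DR; the one-new-article-per-poll target is an assumption, and `avg_articles_per_day` mirrors the `avg_articles_day` column in the schema):

```python
MIN_INTERVAL = 300      # 5 minutes: floor for major outlets
MAX_INTERVAL = 21600    # 6 hours: ceiling for low-volume blogs

def next_poll_interval(avg_articles_per_day: float) -> int:
    """Aim for roughly one new article per poll, clamped to sane bounds."""
    if avg_articles_per_day <= 0:
        return MAX_INTERVAL
    ideal = 86400 / avg_articles_per_day   # average seconds between articles
    return int(min(max(ideal, MIN_INTERVAL), MAX_INTERVAL))
```

The NYT at 200 articles/day polls every ~7 minutes; the 2-a-week blog clamps to 6 hours, cutting its wasted polls roughly 70x.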
Problem 2: Deduplication is not exact matching. When a major event occurs, 500 news outlets publish stories about it within an hour. These stories cover the same event but have different headlines, different wording, and different details. Simple exact-match dedup (matching on title or URL) misses 90% of duplicates. "Trump wins election" and "Donald Trump elected president" describe the same event. The system needs fuzzy, near-duplicate detection that groups articles covering the same story into clusters.
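A minimal sketch of the MinHash + LSH approach this design relies on (the 128-value signature and 32-band split mirror numbers used elsewhere in the document; the shingle size and hash choice are illustrative):

```python
import hashlib

NUM_HASHES = 128                 # 128 x 4-byte values = the 512-byte signature
NUM_BANDS = 32                   # bands 0..31, as in the lsh:band:{band_id} keys
ROWS = NUM_HASHES // NUM_BANDS   # 4 hash values per band

def shingles(text: str, k: int = 3):
    """Word 3-grams; k=3 is an illustrative choice."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash(text: str) -> list:
    """One min over the shingle set per seeded hash function."""
    sig = []
    for seed in range(NUM_HASHES):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big")
            for s in shingles(text)
        ))
    return sig

def lsh_buckets(sig: list):
    """(band_id, bucket_hash) pairs; articles sharing any pair become dedup candidates."""
    return [(b, hash(tuple(sig[b * ROWS:(b + 1) * ROWS]))) for b in range(NUM_BANDS)]
```

Clustering then compares only articles that collide in at least one band, which is what keeps dedup O(N) instead of pairwise.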
Problem 3: Ranking without engagement signals. Unlike social media where likes, shares, and comments provide ranking signals, news articles arrive cold. There is no user engagement data for a brand-new article. Ranking must rely on source authority, freshness, topic importance, and content quality. Engagement data only becomes useful after the article has been served to enough users, creating a cold-start problem for every single article.
Scale numbers:
- 100K RSS/Atom sources
- 5M raw articles/day ingested
- ~2M unique stories/day (after dedup, ~60% are near-duplicates)
- 50M DAU
- 10K feed requests/sec at peak (personalized feed reads)
- Average article record: ~1.2KB (metadata only, not full text)
Requirements
Functional Requirements
| ID | Requirement | Priority |
|---|---|---|
| FR-01 | Crawl 100K RSS/Atom feeds on adaptive schedules | P0 |
| FR-02 | Ingest and parse articles (title, summary, author, published date, category) | P0 |
| FR-03 | Deduplicate near-identical articles (same story, different sources) | P0 |
| FR-04 | Rank articles by freshness, source authority, topic relevance | P0 |
| FR-05 | Personalized feed per user based on reading history and interests | P1 |
| FR-06 | Breaking news detection and push notification | P1 |
| FR-07 | Category classification (politics, sports, tech, business, etc.) | P0 |
| FR-08 | Multi-language support (English, Spanish, French, German, Japanese) | P1 |
| FR-09 | Search across articles by keyword | P1 |
| FR-10 | Trending topics (most-covered stories in the last hour) | P1 |
| FR-11 | Source management (add, remove, configure sources) | P1 |
| FR-12 | Read history tracking and "read" markers | P1 |
| FR-13 | Topic following (user subscribes to specific topics) | P2 |
| FR-14 | Bookmarking/saving articles for later | P2 |
Non-Functional Requirements
| ID | Requirement | Target |
|---|---|---|
| NFR-01 | Feed generation latency (p50 / p99) | < 50ms / < 200ms |
| NFR-02 | Article ingestion-to-available | < 15 minutes (normal), < 5 minutes (breaking news) |
| NFR-03 | Throughput (feed reads) | 10K personalized feeds/sec |
| NFR-04 | Deduplication accuracy | > 90% precision, > 85% recall |
| NFR-05 | Availability | 99.99% |
| NFR-06 | Crawler politeness | Respect robots.txt, min 1s delay between requests to same domain |
| NFR-07 | Article retention | 30 days (full), 1 year (metadata only) |
| NFR-08 | Breaking news detection latency | < 5 minutes from event to notification |
Scale Estimation
These requirements define what the system must do. Before choosing an architecture, the scale numbers reveal how hard each requirement becomes. The estimates below size every component and justify the resource table at the end.
Crawling
Total sources: 100K
Adaptive polling: ~300K polls/hour = 83 polls/sec
New articles per poll: ~0.2 (80% of polls find nothing new)
New articles: 83 * 0.2 = ~17 articles/sec = ~1.5M/day
After cross-source dedup: ~600K unique stories/day
Storage
Article record (PostgreSQL):
title (100B) + summary (500B) + source_url (200B) + source_id (8B)
+ published_at (8B) + category (50B) + cluster_id (16B)
+ ranking_score (8B) + metadata (300B) = ~1.2KB per article
Daily storage: 1.5M articles * 1.2KB = 1.8GB/day
30-day retention (full): 54GB
1-year retention (metadata only): ~25GB
MinHash signatures: 128 values * 4 bytes = 512 bytes per article
Daily: 1.5M * 512B = 768MB (kept for 48 hours)
Feed Serving
50M DAU, average 5 feed loads/day = 250M feed reads/day
Peak: 10K personalized feeds/sec
Each feed: 50 articles * 500 bytes = 25KB
Bandwidth: 10K * 25KB = 250MB/sec
Valkey cache:
Global feeds (100 categories * 25KB): 2.5MB
Top 100K pre-computed user feeds * 25KB: 2.5GB
User interest profiles: 50M * 200 bytes = 10GB
LSH buckets (48-hour window): ~2GB
Total: ~15GB across 6 shards
Summary
| Resource | Sizing |
|---|---|
| Crawler workers | 50 pods (2 concurrent fetches each, 100 total) |
| PostgreSQL | 100GB, partitioned by month, 1 primary + 2 replicas |
| Valkey Cluster | 15GB across 6 shards |
| Elasticsearch | 3-node cluster, 100GB (30-day index) |
| Kafka | 3 brokers, 12 partitions, 3-day retention |
| ClickHouse | 2-node cluster, 500GB |
| Flink | 3-node cluster |
| Feed Service | 10 pods |
Why Naive Approaches Fail
The scale estimation above reveals the load each component must absorb. Before arriving at a distributed architecture, it is worth tracing what happens when a single server tries to handle these numbers -- the progression of failures clarifies why every subsequent design decision exists.
Single-server bottleneck progression. At 100 sources, a single cron job polls all feeds every 5 minutes, stores articles in a local PostgreSQL instance, and serves feeds via a monolithic API. Total polling: 100 feeds / 300s = 0.33 fetches/sec. Storage: a few hundred articles/day. This works.
At 1,000 sources, polling takes longer than the interval. Each HTTP fetch takes 200-500ms on average (DNS resolution, TLS handshake, XML parsing). Sequential polling of 1,000 feeds: 1,000 * 350ms = 350 seconds -- nearly 6 minutes to complete a 5-minute cycle. The system falls behind permanently. Polls overlap. The single-threaded poller cannot keep up.
At 100,000 sources, connection exhaustion dominates. Even with 50 concurrent connections, polling takes 100,000 / 50 * 350ms = 700 seconds per cycle. The database is overwhelmed: 17 articles/sec means 17 INSERTs/sec plus concurrent feed reads. A single PostgreSQL instance handles this, but add dedup (requiring a scan of recent articles for each insert) and the query planner collapses. Full-table scans for near-duplicate detection at 1.5M rows per day produce query times exceeding 10 seconds. The database becomes the bottleneck for both writes and reads simultaneously.
The math forces decomposition: separate crawling from processing from serving. Each layer scales independently.
Approach 1: Poll all sources at the same frequency. Polling every source every 5 minutes generates 100K * 12 = 1.2M polls per hour. 80% of those polls find nothing new. The crawler fleet burns compute and bandwidth fetching empty feeds. Worse, aggressive polling triggers rate limiting on news sites, getting crawler IPs banned.
Approach 2: Deduplicate by exact title or URL match. This catches only obvious duplicates. Two articles with titles "Massive earthquake hits Japan" and "7.1 magnitude quake strikes central Japan" cover the same event but share no words in common beyond "Japan." Exact matching has roughly 10% recall on real-world news data. 90% of duplicates slip through, filling feeds with redundant stories.
Approach 3: Use BERT embeddings for semantic deduplication. BERT produces 768-dimensional vectors that capture semantic meaning with roughly 95% accuracy. But the costs compound at this scale. Embedding alone requires GPU inference: a single GPU processes roughly 1,000 embeddings per second, so the ingestion rate of 58 articles per second consumes real GPU capacity before any comparison happens. Finding duplicates then means matching each new article against all articles from the last 24 hours -- millions of candidates -- which demands an approximate nearest neighbor index continuously rebuilt under heavy churn. MinHash + LSH meets the dedup accuracy targets (NFR-04) on CPUs at a fraction of the cost.
Approach 4: Rank purely by recency. A 1-minute-old article from a random blog appears above a 30-minute-old article from the BBC. Source authority disappears from the feed. Users see low-quality content constantly rotating to the top. Recency matters, but without source authority and topic importance, the feed becomes noise.
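One concrete shape the blended score described in the TL;DR could take (a sketch, not the production formula; the 6-hour half-life and the coverage weight are assumptions):

```python
import math

HALF_LIFE_HOURS = 6.0   # assumption: freshness halves every 6 hours, so a
                        # 24-hour-old article keeps only ~6% of its freshness

def ranking_score(authority: float, age_hours: float, cluster_size: int) -> float:
    """Blend freshness decay, source authority, and breadth of coverage."""
    freshness = 0.5 ** (age_hours / HALF_LIFE_HOURS)
    coverage = 1.0 + 0.1 * math.log1p(cluster_size)   # widely-covered stories get a mild boost
    return authority * freshness * coverage
```

Under this blend a 30-minute-old story from a high-authority outlet outranks a 1-minute-old post from an unknown blog, and even a top story fades below fresh mid-tier content within a day.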
What is needed: adaptive polling that matches each source's publication frequency, a dedup algorithm that runs in O(N) time with 90% accuracy, and a ranking formula that balances freshness with authority. The architecture that follows addresses each failure mode directly.
Architecture Overview
These numbers shape every architectural decision that follows. The high-level design must handle 83 crawls per second, absorb 1.5M articles per day through dedup and ranking, and serve 10K personalized feeds per second from cache. The architecture is not a linear pipeline. It is a fan-out: a single ingestion stream feeds multiple independent consumers that write to shared storage. The read path is physically and logically separate. It operates only on stored state.
Write Path and Detection Path
Two independent paths branch from Kafka. The write path flows down through Dedup, Classifier, and Ranker. The Ranker is the fan-out write node: it materializes the fully enriched article into PostgreSQL (source of truth), Valkey (feed cache), and Elasticsearch (search index). The detection path flows through Flink in parallel, counting articles per topic and writing trending topics to Valkey. These paths have no dependency on each other.
Read Path
The read path never directly interacts with the write path. It reads materialized state from storage. The Feed Service pulls pre-computed global feeds from Valkey, applies per-user interest weights, filters read history, and enforces diversity constraints. On a Valkey cache miss, it falls back to PostgreSQL. For keyword search, it queries Elasticsearch. No component in the read path knows about Kafka, dedup, or the ranker.
Simplified: crawlers produce data, Kafka distributes it, dedup groups it, the ranker decides order, and the Feed Service serves it. Everything else is machinery supporting these five steps.
Data ownership. PostgreSQL stores durable article state and is the source of truth. Valkey stores precomputed feeds, user interest profiles, and LSH dedup buckets -- all derived or ephemeral. Elasticsearch stores inverted indexes for keyword search. If Valkey loses data, feeds degrade to higher latency but remain correct. If Elasticsearch goes down, search is unavailable but feeds are unaffected.
Every deep dive that follows maps to a specific node in these diagrams.
API Design
With the architecture established, the API contract defines how clients interact with the system. Each endpoint below maps to one of the read-path or write-path components shown in the diagrams above.
Versioning Strategy
All API endpoints are versioned via URL path prefix (/api/v1/, /api/v2/). When a breaking change is required (field removal, type change, semantic change), a new version is introduced. The previous version remains supported for 6 months after the new version launches. Non-breaking changes (new optional fields, new endpoints) are added to the current version without incrementing. Clients receive a Sunset header on deprecated versions with the retirement date.
Authentication
All API requests require a JWT Bearer token in the Authorization header. Tokens are signed with HMAC-SHA256, expire after 15 minutes, and are refreshed via the /api/v1/auth/refresh endpoint using a long-lived refresh token (30-day expiry, stored hashed in PostgreSQL, rotated on each use). Scopes control access: read:feed for feed and search endpoints, write:events for read-event recording, admin:sources for source management. The API Gateway validates the JWT signature and scope before routing to backend services.
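A gateway-side validation sketch using only the standard library (the `scopes` claim name and error strings are assumptions; a production gateway would typically use a vetted JWT library):

```python
import base64, hashlib, hmac, json, time

def b64url_decode(part: str) -> bytes:
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))

def validate(token: str, secret: bytes, required_scope: str) -> dict:
    """Check HMAC-SHA256 signature, expiry, and scope; raise ValueError on any failure."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    claims = json.loads(b64url_decode(payload_b64))
    if claims["exp"] < time.time():
        raise ValueError("token expired")            # client goes to /api/v1/auth/refresh
    if required_scope not in claims.get("scopes", []):
        raise ValueError("scope denied")             # surfaces as SCOPE_DENIED (403)
    return claims
```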
POST /api/v1/auth/refresh
Content-Type: application/json
{
"refresh_token": "dGhpcyBpcyBhIHJlZnJlc2ggdG9rZW4..."
}
Response (200 OK):
{
"access_token": "eyJhbGciOiJIUzI1NiIs...",
"expires_in": 900,
"token_type": "Bearer"
}
Get Personalized Feed
GET /api/v1/feed?category=all&cursor=eyJzY29yZSI6MC45NCwiaWQiOjEyMzQ1fQ&limit=50
Authorization: Bearer {token}
Response (200 OK):
{
"articles": [
{
"id": 12345,
"title": "New Breakthrough in Quantum Computing",
"summary": "Researchers at MIT have achieved...",
"source": {"name": "MIT Technology Review", "authority": 0.92},
"article_url": "https://technologyreview.com/...",
"image_url": "https://cdn.example.com/...",
"published_at": "2026-04-03T10:30:00Z",
"category": "tech",
"cluster_size": 12,
"is_trending": false,
"ranking_score": 0.94
}
],
"trending_topics": ["earthquake japan", "election results"],
"next_cursor": "eyJzY29yZSI6MC44NywiaWQiOjEyMzUwfQ"
}
Pagination uses cursor-based navigation. The cursor encodes the last-seen ranking_score and article_id, ensuring stable pagination even as new articles arrive and scores change. Offset-based pagination (page=2) breaks when rankings shift between requests -- articles appear twice or are skipped entirely. The cursor guarantees monotonic forward progress.
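The cursor format can be sketched as base64url-encoded JSON; with compact separators it round-trips the example cursor shown in the request above:

```python
import base64, json

def encode_cursor(score: float, article_id: int) -> str:
    raw = json.dumps({"score": score, "id": article_id}, separators=(",", ":")).encode()
    return base64.urlsafe_b64encode(raw).decode().rstrip("=")

def decode_cursor(cursor: str):
    try:
        padded = cursor + "=" * (-len(cursor) % 4)
        data = json.loads(base64.urlsafe_b64decode(padded))
        return data["score"], data["id"]
    except (ValueError, KeyError):
        raise ValueError("INVALID_CURSOR")   # surfaces as the 400 INVALID_CURSOR error
```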
Rate Limiting
All responses include rate limiting headers:
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 87
X-RateLimit-Reset: 1711362000
Feed endpoints allow 100 requests per minute per user. Search endpoints allow 30 queries per minute per user. When the limit is exceeded, the API returns 429 Too Many Requests with a Retry-After header.
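A fixed-window counter sketch (an in-memory stand-in for the Valkey INCR + EXPIRE pattern noted in the comment; real enforcement must be atomic in Valkey, not per-pod):

```python
import time

LIMITS = {"feed": 100, "search": 30}   # requests per minute, per user

class FixedWindowLimiter:
    """In-memory stand-in for the Valkey pattern:
    INCR ratelimit:{user}:{endpoint}:{minute}; EXPIRE 60 on first increment."""
    def __init__(self):
        self.counters = {}

    def allow(self, user_id, endpoint, now=None):
        now = time.time() if now is None else now
        key = (user_id, endpoint, int(now // 60))     # one counter per minute window
        count = self.counters.get(key, 0) + 1         # INCR
        self.counters[key] = count
        limit = LIMITS[endpoint]
        return count <= limit, max(limit - count, 0)  # (allowed, X-RateLimit-Remaining)
```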
Search Articles
GET /api/v1/search?q=quantum+computing&category=tech&from=2026-03-01&cursor=eyJzY29yZSI6NS4yLCJpZCI6OTk5fQ&limit=20
Search hits Elasticsearch with a BM25 query, filtered by category and date range. Results are boosted by ranking_score to prioritize authoritative sources. Cursor encodes the BM25 score and document ID for stable pagination.
Record Read Event
POST /api/v1/events/read
Content-Type: application/json
Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000
{
"article_id": 12345,
"dwell_time_sec": 45,
"source": "feed"
}
The Idempotency-Key header prevents duplicate event recording on client retries. The server stores the key in Valkey with a 24-hour TTL; if the same key arrives again, the server returns the original response without re-processing. Read events flow to ClickHouse for analytics and update the user's interest profile in Valkey. Dwell time acts as a quality signal: a user who spends 45 seconds on an article is more engaged than one who bounces after 3 seconds.
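The server-side check can be sketched as follows (an in-memory stand-in for Valkey's atomic SET ... NX EX 86400; function and key names are illustrative):

```python
import json

class IdempotencyStore:
    """Stand-in for Valkey: SET idem:{key} {response} NX EX 86400 (atomic in Valkey)."""
    def __init__(self):
        self.responses = {}

    def put_if_absent(self, key, response):
        if key in self.responses:
            return self.responses[key]    # duplicate: hand back the original response
        self.responses[key] = response
        return None

def record_read(store, idem_key, event):
    response = json.dumps({"status": "recorded", "article_id": event["article_id"]})
    prior = store.put_if_absent(idem_key, response)
    if prior is not None:
        return prior                      # client retry: no re-processing
    # ...publish the event to Kafka here (ClickHouse analytics + interest profile update)
    return response
```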
Other Endpoints
GET /api/v1/trending -- Current trending topics
GET /api/v1/sources -- List sources
POST /api/v1/sources -- Add a new source
Idempotency-Key: {uuid}
GET /api/v1/users/{id}/interests -- User interest profile
PUT /api/v1/users/{id}/interests -- Update interests
GET /api/v1/users/{id}/bookmarks -- Bookmarked articles
POST /api/v1/users/{id}/bookmarks -- Bookmark an article
Idempotency-Key: {uuid}
Error Contract
All error responses follow a structured JSON format with machine-readable codes:
{
"error": {
"code": "FEED_UNAVAILABLE",
"message": "Feed service is temporarily unable to serve personalized feeds.",
"details": "Valkey shard 3 is unreachable. Falling back to global feed.",
"retry_after_sec": 30
}
}
Error codes are stable across API versions:
| Code | HTTP Status | Meaning |
|---|---|---|
| FEED_UNAVAILABLE | 503 | Feed cache and fallback both failed |
| RATE_LIMIT_EXCEEDED | 429 | Per-user rate limit hit |
| DEDUP_UNAVAILABLE | 202 | Article accepted but dedup status uncertain |
| INVALID_CURSOR | 400 | Cursor token is malformed or expired |
| SOURCE_NOT_FOUND | 404 | Requested source ID does not exist |
| AUTH_EXPIRED | 401 | JWT token has expired, refresh required |
| SCOPE_DENIED | 403 | Token lacks required scope for this endpoint |
Data Model
The API contract reveals what data the system must persist. The data model below supports every endpoint above -- from feed serving and search to read-event tracking and interest profiles.
PostgreSQL Schema
CREATE TABLE sources (
id SERIAL PRIMARY KEY,
name VARCHAR(200) NOT NULL,
feed_url TEXT NOT NULL UNIQUE,
site_url TEXT,
language VARCHAR(5) DEFAULT 'en',
category VARCHAR(50),
authority_score FLOAT NOT NULL DEFAULT 0.5,
avg_articles_day FLOAT DEFAULT 1.0,
poll_interval INT NOT NULL DEFAULT 3600,
last_polled_at TIMESTAMPTZ,
last_new_at TIMESTAMPTZ,
enabled BOOLEAN NOT NULL DEFAULT TRUE,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE articles (
id BIGSERIAL PRIMARY KEY,
source_id INT NOT NULL REFERENCES sources(id),
title TEXT NOT NULL,
summary TEXT,
article_url TEXT NOT NULL,
author VARCHAR(200),
published_at TIMESTAMPTZ NOT NULL,
ingested_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
category VARCHAR(50),
language VARCHAR(5) DEFAULT 'en',
cluster_id UUID,
is_primary BOOLEAN DEFAULT FALSE,
ranking_score FLOAT NOT NULL DEFAULT 0,
image_url TEXT,
word_count INT,
minhash_sig BYTEA
) PARTITION BY RANGE (published_at);
-- Cluster lookups for merging and display: only articles with assigned clusters
CREATE INDEX idx_articles_cluster ON articles (cluster_id)
WHERE cluster_id IS NOT NULL;
-- Feed assembly by category: sorted by score for top-N retrieval
CREATE INDEX idx_articles_category ON articles (category, ranking_score DESC);
-- Per-source history: recent articles per source for within-source dedup
CREATE INDEX idx_articles_source ON articles (source_id, published_at DESC);
-- Re-ranking batch job: scan by ingestion window for score recomputation
CREATE INDEX idx_articles_ingested ON articles (ingested_at DESC);
CREATE TABLE user_interests (
user_id BIGINT NOT NULL,
category VARCHAR(50) NOT NULL,
interest_score FLOAT NOT NULL DEFAULT 0.5,
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
PRIMARY KEY (user_id, category)
);
CREATE TABLE read_history (
user_id BIGINT NOT NULL,
article_id BIGINT NOT NULL,
read_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
dwell_time_sec INT,
PRIMARY KEY (user_id, article_id)
) PARTITION BY RANGE (read_at);
Partitioning strategy. The articles table is partitioned by published_at month. This aligns with retention policy: dropping a 30-day-old partition is a metadata operation, not a row-by-row delete. Queries naturally scope to recent partitions because feeds rarely show articles older than a few days.
Why minhash_sig is stored in the articles table. The MinHash signature is needed during retroactive dedup (when the dedup service recovers from a failure and reprocesses articles that entered without cluster assignment). Storing it in PostgreSQL avoids recomputing the signature from text, which would require storing the full article text.
Valkey Key Patterns
# Global feed by category (refreshed every 5 minutes)
feed:global:{category} -> JSON array of top 100 articles
SET feed:global:tech '[{"id":123,"title":"...","score":0.95}]' EX 300
# User personalized feed (computed on demand, cached 5 minutes)
feed:user:{user_id} -> JSON array of top 50 articles
SET feed:user:u42 '[...]' EX 300
# Crawler scheduling (sorted set, score = next_poll_epoch)
ZADD crawler:schedule 1711361100 "source:42"
# Pop next source to poll (pop only when its score <= now;
# re-insert with the next poll time after fetching):
ZPOPMIN crawler:schedule 1
# LSH buckets (48-hour TTL)
lsh:band:{band_id}:{bucket_hash} -> set of article IDs
SADD lsh:band:0:a3f2c1 12345 12390 12401
EXPIRE lsh:band:0:a3f2c1 172800
# User interest vector
user:interests:{user_id} -> hash of category:score pairs
HSET user:interests:u42 tech 0.8 politics 0.3 sports 0.6
EXPIRE user:interests:u42 2592000
# Trending topics (refreshed every 5 minutes)
trending:topics -> JSON array of breaking topics
SET trending:topics '[...]' EX 300
Elasticsearch Mapping
{
"mappings": {
"properties": {
"title": {"type": "text", "analyzer": "standard"},
"summary": {"type": "text", "analyzer": "standard"},
"category": {"type": "keyword"},
"source_name": {"type": "keyword"},
"language": {"type": "keyword"},
"published_at": {"type": "date"},
"ranking_score": {"type": "float"},
"cluster_id": {"type": "keyword"}
}
}
}
Storage Tiering
Data moves through three temperature tiers based on access frequency and age:
| Tier | Technology | Data Stored | Retention | Access Pattern |
|---|---|---|---|---|
| Hot | Valkey (6 shards) | Pre-computed feeds, user interest profiles, LSH dedup buckets, crawler schedule | Feeds: 5-min TTL. Interests: 30-day TTL. LSH: 48-hour TTL. | Sub-millisecond reads. 10K feed reads/sec. 83 crawler pops/sec. |
| Warm | PostgreSQL (partitioned) + Elasticsearch | Articles with full metadata, sources, user read history, search index | PG: 30 days full, 1 year metadata. ES: 30-day index. | PG: re-ranking batch every 15 min, cache-miss fallback. ES: 100 searches/sec. |
| Cold | S3 (Standard-IA) | Archived articles older than 1 year, historical MinHash signatures | Indefinite | Rare. Compliance audits, historical analysis, model training. |
Hot-to-warm is automatic: Valkey TTLs expire and PostgreSQL is the fallback. Warm-to-cold is a scheduled job: a monthly archiver exports articles older than 1 year from PostgreSQL to S3 as Parquet files, then drops the corresponding partition.
Schema Evolution
The schema evolves under three constraints: zero-downtime deploys, backward-compatible Kafka consumers, and no full-table rewrites on a 100GB table.
New columns are always nullable first. Adding sentiment_score FLOAT to the articles table is an ALTER TABLE ... ADD COLUMN with no default -- an instant metadata-only operation in PostgreSQL. A backfill job populates existing rows in batches of 10K; PostgreSQL's UPDATE has no LIMIT clause, so the batch is bounded in a subquery: UPDATE articles SET sentiment_score = ... WHERE id IN (SELECT id FROM articles WHERE sentiment_score IS NULL LIMIT 10000). Application code handles NULL until backfill completes.
Dual-write migration for structural changes. When a Kafka message schema changes (adding a field to the article event), both old-format and new-format consumers must coexist. The producer writes both fields during the transition window. Old consumers ignore the new field. New consumers read the new field. After all consumers are upgraded, the old field is dropped from the next producer release.
Backward-compatible changes only. Column renames, type changes, and NOT NULL constraints on existing columns are prohibited in-place. Instead, add the new column, backfill, migrate application code, then drop the old column in a subsequent release.
Partitioning Strategy
The data model defines the schema; the partitioning strategy determines how that data is physically distributed across nodes. Incorrect partitioning creates hot spots that negate horizontal scaling.
PostgreSQL: RANGE by published_at (Monthly)
The articles table is range-partitioned by published_at into monthly partitions. This aligns with retention (drop old partitions as a metadata operation) and query patterns (feeds query the current month almost exclusively).
Hot partition risk. The current month receives 100% of writes -- 1.5M inserts/day concentrated on a single partition. Mitigation: the partition is on the primary with WAL replication to replicas. Reads are distributed across replicas. The write volume (17 inserts/sec) is well within a single partition's capacity.
Pre-creation. Partitions for the next 3 months are created ahead of time by pg_partman. This prevents a race condition at month boundaries where the first insert of the new month fails because the partition does not exist.
Partition pruning. Queries with WHERE published_at > NOW() - INTERVAL '7 days' touch only the current partition (and possibly the previous one near month boundaries). The planner eliminates all other partitions before execution.
Kafka: 12 Partitions Keyed by source_domain
All articles from the same source land in the same partition, preserving within-source ordering for dedup (detecting articles the crawler already fetched).
Skew risk. CNN and BBC produce 100x more articles than small blogs. A partition receiving CNN, BBC, and Reuters could see 10x the volume of a partition with only low-volume sources. Mitigation: virtual partitioning via hash(source_id) mod 12. This distributes high-volume sources across partitions rather than concentrating them. Within-source ordering is preserved because a given source_id always maps to the same partition.
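The virtual-partitioning key can be sketched as below (the hash choice is illustrative -- any stable hash works, but Python's built-in `hash` is process-salted for strings and unsuitable for cross-process routing):

```python
import hashlib

NUM_PARTITIONS = 12

def partition_for(source_id: int) -> int:
    """Stable assignment: a source always maps to the same partition (preserving
    within-source ordering) while high-volume sources spread across partitions."""
    digest = hashlib.md5(str(source_id).encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS
```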
Valkey: Hash Slot Distribution Across 6 Shards
Valkey Cluster uses 16,384 hash slots distributed across 6 shards (~2,730 slots per shard). Feed keys (feed:global:*) hash to slots based on the category suffix. LSH bucket keys (lsh:band:{band_id}:{bucket_hash}) distribute by band_id prefix, spreading dedup load evenly because band_id ranges from 0 to 31.
User interest keys (user:interests:{user_id}) distribute by user_id hash, producing even distribution across shards given the uniform distribution of user IDs.
Elasticsearch: Index-per-Month with Aliases
Each month gets a dedicated index (articles-2026-04, articles-2026-05). A write alias (articles-write) points to the current month's index. A read alias (articles-read) points to all indexes within the retention window. ILM (Index Lifecycle Management) policy:
- Hot phase (current month): 3 primary shards, 1 replica. Allocated to hot-tier data nodes with SSD storage.
- Warm phase (1-6 months old): Force-merged to 1 segment per shard, moved to warm-tier nodes with HDD storage. Read-only.
- Cold phase (6-12 months old): Shrunk to 1 shard, moved to cold-tier nodes. Searchable but slow.
- Delete phase (> 12 months): Index deleted.
Cross-Partition Queries
A query like "top articles across all time" must hit all PostgreSQL partitions -- a full table scan across 12+ months. This is mitigated by routing such queries to Elasticsearch (warm path) rather than PostgreSQL (hot path). Elasticsearch's inverted index handles cross-time-range queries efficiently through its read alias spanning all active indexes.
Rebalancing
Kafka partition reassignment is an online operation -- the controller replicates data to the new broker assignment while consumers continue reading from the old assignment. PostgreSQL partition attach/detach is online with an exclusive lock lasting < 1 second (blocking only concurrent DDL, not DML). Valkey Cluster resharding migrates hash slots between nodes online using CLUSTER SETSLOT MIGRATING/IMPORTING.
Caching Strategy
The partitioning strategy distributes data across nodes, but most read traffic should never reach those nodes. The caching layer absorbs 95%+ of feed reads and is the primary determinant of p99 latency.
Cache Topology
| Layer | Technology | Data | TTL | Purpose |
|---|---|---|---|---|
| L1 in-process | Node.js LRU cache (per Feed Service pod) | Source metadata (name, authority_score) | 60s | Avoid Valkey round-trip for hot lookups. ~500 sources fit in 1MB. |
| L2 distributed | Valkey (6 shards) | Feeds (global + user), interest profiles, LSH buckets | Feeds: 5 min. Interests: 30 days. LSH: 48 hours. | Serve feeds without touching PostgreSQL. Sub-millisecond reads. |
| L3 source-of-truth | PostgreSQL | All articles, sources, users | Permanent | Durable storage. Fallback on L2 miss. |
Invalidation
Feed caches (feed:global:*) use TTL-based invalidation. The 5-minute TTL means feeds can be at most 5 minutes stale -- acceptable for news aggregation. Source metadata in L1 uses a 60-second TTL; changes to source authority scores propagate within 60 seconds without an explicit invalidation event.
Source configuration changes (enabling/disabling a source, changing poll_interval) trigger event-driven invalidation: a Kafka consumer in the Feed Service listens to the source-config topic and evicts the affected L1 cache entries immediately.
LSH buckets use the 48-hour TTL and expire naturally. No explicit invalidation is needed because the dedup window is inherently time-bounded.
Thundering Herd Mitigation
When feed:global:tech expires and 10K requests arrive simultaneously, all would query PostgreSQL for the same data. Request coalescing via an atomic Valkey lock prevents this:
SET feed:global:tech:lock 1 NX EX 5 -- 5-second lock (plain SETNX cannot attach a TTL atomically)
The first request acquires the lock and computes the feed from PostgreSQL. The remaining requests poll Valkey every 100ms for up to 500ms. If the feed is not populated within 500ms (computation failure or extreme latency), the system falls back to stale cache using GETEX feed:global:tech with a 60-second TTL extension. This serves slightly stale data rather than overwhelming PostgreSQL.
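The coalescing flow, sketched against a minimal fake client (command semantics only; the stale-copy fallback and real TTL expiry are simplified, and all names are illustrative):

```python
import time

class FakeValkey:
    """Just the two commands the pattern needs; real TTL expiry omitted."""
    def __init__(self):
        self.data = {}
    def set(self, key, value, nx=False, ex=None):
        if nx and key in self.data:
            return False                  # lock already held by another request
        self.data[key] = value
        return True
    def get(self, key):
        return self.data.get(key)

def get_global_feed(client, category, compute_from_pg, poll=0.1, timeout=0.5):
    key = f"feed:global:{category}"
    cached = client.get(key)
    if cached is not None:
        return cached
    if client.set(f"{key}:lock", 1, nx=True, ex=5):   # first request wins the 5s lock
        feed = compute_from_pg(category)              # the one expensive PostgreSQL query
        client.set(key, feed, ex=300)
        return feed
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:                # everyone else polls the cache
        time.sleep(poll)
        cached = client.get(key)
        if cached is not None:
            return cached
    return []   # production extends and serves a stale copy (GETEX) instead
```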
Cache Warming
Pre-computed feeds for the most active 100K users (matching the 100K pre-computed user feeds in the scale estimation) are refreshed on a 5-minute cron cycle. Each cycle reads the latest global rankings from Valkey, applies per-user interest weights, and writes the result to feed:user:{user_id}.
On Valkey restart (all shards cold), lazy population recovers the cache: each request that misses Valkey falls through to PostgreSQL, computes the result, and populates the cache. During this warm-up period (roughly 10 minutes), p99 latency spikes to approximately 200ms as PostgreSQL handles the full read load. Pre-warming the top 100K user feeds and all global category feeds reduces this window to approximately 3 minutes.
Eviction Policy
Valkey is configured with maxmemory-policy allkeys-lru. When memory pressure forces evictions:
- Feed keys (feed:global:*, feed:user:*) evict first -- easily recomputed from PostgreSQL in < 50ms per feed.
- LSH bucket keys (lsh:band:*) evict next -- repopulated organically as new articles arrive.
- Interest profile keys (user:interests:*) evict last -- expensive to rebuild because recomputation requires aggregating read history from ClickHouse (a 15-minute batch job).
This priority ordering is not explicitly configured; it is a natural consequence of TTLs. Feed keys have 5-minute TTLs (frequently refreshed, frequently evicted under LRU). Interest profiles have 30-day TTLs (long-lived, evicted only under severe memory pressure).
Cache-DB Consistency
Valkey is a cache, PostgreSQL is the source of truth. Maximum staleness: 5 minutes for feeds (TTL), 30 days for interest profiles (acceptable because interests shift slowly). The system never reads from Valkey and assumes it is authoritative -- a Valkey miss always falls through to PostgreSQL.
Divergence detection: an hourly reconciliation job compares feed:global:* key count and feed lengths against PostgreSQL article counts per category. If the delta exceeds 10% (more articles in PostgreSQL than represented in cached feeds), an alert fires. This catches bugs in the feed materialization pipeline, not normal TTL-based staleness.
Consistency Model
The caching strategy introduces intentional staleness. The consistency model makes these tradeoffs explicit per operation, preventing confusion about what guarantees each path provides.
Per-Operation Guarantees
| Operation | Guarantee | Mechanism |
|---|---|---|
| Article write | Linearizable | Single-leader PostgreSQL primary with synchronous replication |
| Feed read (cache hit) | Eventually consistent | TTL-based, max 5-minute stale |
| Feed read (cache miss) | Read-your-writes | Read from PostgreSQL primary, populate Valkey |
| Search | Eventually consistent | Elasticsearch index lag ~2s behind PostgreSQL |
| Read event write | At-least-once | Kafka acks=all, idempotent ClickHouse insert (deduplicated by event_id) |
| Interest profile update | Eventually consistent | Batch aggregation every 15 minutes from ClickHouse |
| Cross-service (ingest to feed) | Eventually consistent | Kafka-based propagation, ~5s median lag |
CAP and PACELC Position
The system is AP -- it prioritizes availability over consistency. During a network partition, feeds continue serving from Valkey cache (stale data) rather than returning errors. A user may see an article that has been retracted or miss a breaking story that arrived during the partition. This is acceptable because news aggregation tolerates minutes of staleness.
Under normal operation (the PACELC "else" clause), the system chooses latency over consistency: feeds serve from cache even when PostgreSQL has newer data. The 5-minute TTL is the explicit consistency-latency tradeoff. Reducing TTL to 30 seconds increases PostgreSQL read load by 10x; increasing TTL to 30 minutes makes feeds noticeably stale. Five minutes balances both.
This tradeoff is specific to news aggregation. Inventory-critical systems (like an e-commerce flash sale) cannot serve stale stock counts without overselling. The news domain's tolerance for staleness enables the aggressive caching that makes 10K feeds/sec possible.
Subsystem Consistency Boundaries
- Ingestion pipeline: Linearizable within a single source. Kafka partition ordering guarantees articles from the same source are processed in publish order. Cross-source ordering is not guaranteed and not needed.
- Dedup: Eventually consistent. LSH buckets converge within 48 hours. An article that narrowly misses a bucket during a Valkey shard restart will be caught by the nightly cluster drift correction job.
- Ranking: Eventually consistent. The 15-minute re-ranking batch means scores can be up to 15 minutes stale. Breaking news bypasses this via the Flink fast path.
- Feed serving: Eventually consistent. 5-minute TTL. Pre-computed feeds for top 2% users refresh on the same 5-minute cadence.
Technology Selection
The partitioning, caching, and consistency requirements above constrain technology choices. Each component must serve its specific access pattern.
| Component | Technology | Role | Rationale |
|---|---|---|---|
| Article metadata | PostgreSQL 16 | Source of truth for articles, sources, users | Relational model for complex queries. Partitioned by date for retention management. |
| Feed cache | Valkey 8 Cluster | Pre-computed feeds, article cache, user interests | Sub-millisecond reads for feed serving at 10K requests/sec. |
| Full-text search | Elasticsearch | Article search by keyword | BM25 ranking, faceted search by category and source. |
| Dedup index | Valkey | MinHash LSH buckets | Fast bucket lookups during ingestion. 48-hour TTL matches dedup window. |
| Event streaming | Kafka | Article ingestion, breaking news events | Decouples crawling from processing. Buffers ingestion spikes. |
| Stream processing | Apache Flink | Trending topic detection, breaking news | Windowed aggregation on article topics with state management. |
| Engagement analytics | ClickHouse | Read tracking, click-through rates, dwell time | Columnar store optimized for aggregation queries over time-series behavioral data. |
| Crawler scheduling | Valkey Sorted Set | Per-source next-poll-time priority queue | O(log N) insertion, O(log N) pop of earliest source. |
Why PostgreSQL over DynamoDB? The re-ranking batch job runs every 15 minutes and needs to scan articles by category, join with source authority scores, and order by ranking score. These are relational queries. DynamoDB would require denormalizing authority scores into every article record and maintaining separate GSIs for each query pattern. At 100GB of data, PostgreSQL handles this without sharding. If the dataset grew to terabytes, the tradeoff would shift.
Why Valkey over Memcached? Three features drive this. The crawler scheduler needs a sorted set (ZPOPMIN for priority queue). LSH dedup needs sets (SADD/SMEMBERS for bucket membership). User interests need hashes (HGETALL for category-score pairs). Memcached supports only string key-value pairs, meaning all three structures would need application-level serialization and deserialization on every operation.
Why ClickHouse for engagement data? The ranking formula needs click-through rates per article. Computing CTR means aggregating across 250M read events per day. PostgreSQL handles the insert volume, but a query like "CTR for article 12345 in the last 24 hours" requires scanning millions of rows in a row-oriented layout. ClickHouse stores the article_id column contiguously, making this aggregation 10-50x faster.
Queue and Stream Capacity Planning
Technology selection identifies Kafka as the event backbone. Capacity planning ensures the cluster is sized for sustained throughput, burst absorption, and multi-day replay.
Kafka Cluster Sizing
Write throughput:
Sustained: 5M articles/day ≈ 58 articles/sec * 1.2KB avg message ≈ 70 KB/sec
Peak (10x burst during major event): 580 articles/sec * 1.2KB ≈ 700 KB/sec
Partition count:
Target per-partition throughput: 50 messages/sec (conservative, allows headroom)
Peak messages/sec: 580
Partitions needed: 580 / 50 = 11.6, rounded up to 12 for headroom and even
distribution across 3 brokers (4 partitions per broker)
Replication:
Replication factor: 3 (each message stored on all 3 brokers)
min.insync.replicas: 2 (tolerates 1 broker failure without blocking writes)
Retention and storage:
Sustained write rate: 70 KB/sec * 86,400 sec/day = ~6.0 GB/day (per replica)
3 replicas: ~18 GB/day
3-day retention: ~54 GB total Kafka storage
Peak day (10x for 4 hours): adds ~27 GB across replicas, total ~81 GB worst case
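The sizing arithmetic can be checked in a few lines. Constants come from the headline numbers (5M articles/day, 1.2KB messages, 10x burst); the 50 msg/sec per-partition target is the conservative figure assumed here.

```python
import math

ARTICLES_PER_DAY = 5_000_000
AVG_MSG_KB = 1.2
BURST_FACTOR = 10
REPLICATION = 3
RETENTION_DAYS = 3

sustained_msgs_sec = ARTICLES_PER_DAY / 86_400          # ~58 msgs/sec
sustained_kb_sec = sustained_msgs_sec * AVG_MSG_KB      # ~70 KB/sec
peak_msgs_sec = sustained_msgs_sec * BURST_FACTOR       # ~580 msgs/sec

per_partition_target = 50                               # msgs/sec, conservative
partitions = math.ceil(peak_msgs_sec / per_partition_target)  # 12

gb_per_day_per_replica = sustained_kb_sec * 86_400 / 1e6      # ~6.0 GB/day
total_retention_gb = gb_per_day_per_replica * REPLICATION * RETENTION_DAYS
```

Twelve partitions also divide evenly across 3 brokers, which is why 11.6 is rounded to 12 rather than the next integer.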
Consumer Groups
Four independent consumer groups read from the raw-articles topic:
- Dedup pipeline -- primary write path, assigns cluster_id
- Topic classifier -- runs fastText inference, adds category
- Flink breaking news detector -- windowed aggregation for trending topics
- Read-event enricher -- correlates article metadata with ClickHouse events
Each consumer group maintains its own offsets. If Flink restarts, it replays from its last committed offset without affecting the dedup pipeline.
Consumer Lag Thresholds
| Level | Lag | Duration | Meaning | Action |
|---|---|---|---|---|
| Normal | < 1K messages | -- | Consumers keeping pace | None |
| Warning | 10K messages | > 10 min | ~3 min of backlog at the sustained rate | Alert. Check consumer pod health. |
| Critical | 100K messages | > 30 min | ~30 min of backlog; consumer pods likely failed or saturated | Auto-scale consumer pods via HPA. Page on-call. |
| Emergency | 1M messages | > 5 hours | ~5 hours of backlog; systemic failure | Reduce crawler polling frequency by 50% via shared Valkey flag (adaptive backpressure). |
Dead Letter Queue
Messages that fail 3 processing attempts (malformed XML that crashes the parser, encoding errors, articles with no parseable title) are routed to the articles.dlq topic. DLQ retention is 7 days. A daily review job counts DLQ messages by error type and alerts if the count exceeds 100 (indicating a systematic parsing issue rather than isolated bad feeds). After the root cause is fixed, DLQ messages are replayed into the main topic.
Ordering Guarantees
Per-partition ordering is keyed by hash(source_id) mod 12. Articles from the same source are processed in publish order within a single partition. Cross-source ordering is not guaranteed and not needed -- the dedup pipeline operates independently per article, and the ranker determines final ordering by score, not by arrival time.
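The keying scheme can be illustrated with a stable hash. Kafka's default partitioner actually uses murmur2; crc32 stands in here because it is in the Python standard library.

```python
import zlib

NUM_PARTITIONS = 12

def partition_for(source_id: str) -> int:
    """Stable hash so every article from one source lands on the
    same partition, preserving per-source publish order."""
    return zlib.crc32(source_id.encode("utf-8")) % NUM_PARTITIONS
```

Because the mapping is deterministic, two polls of the same feed minutes apart always route to the same partition and are consumed in order.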
Backpressure
Kafka itself is the backpressure buffer. Crawlers produce at whatever rate sources allow. Consumers auto-scale based on lag. If lag exceeds 1M messages (the emergency threshold), crawlers read a shared Valkey flag (crawl:backpressure = true) and reduce their polling frequency by 50% -- doubling poll intervals for all sources. This flag is cleared automatically when lag drops below 100K.
System Flows
Each component above solves one piece of the puzzle. The sequence diagrams below trace requests end-to-end, showing how these pieces connect at runtime -- both under normal operation and during failure.
The Write Path (Ingestion)
Every article flows through this path exactly once. Each step enriches the article with one signal before passing it to the next.
The Ranking Service is the fan-out point. It receives an article that already has a cluster_id (from dedup) and a category (from the classifier), computes a ranking score, and writes the fully enriched record to three stores. Each store holds a different projection of the same data, optimized for its access pattern:
| Store | What it holds | Who reads it | When |
|---|---|---|---|
| PostgreSQL | Full article record (title, summary, URL, cluster_id, minhash_sig, ranking score) | Re-ranking batch job, Feed Service on cache miss | Every 15 min (batch), on Valkey miss (rare) |
| Valkey | Pre-computed feed lists (top 100 articles per category as JSON) | Feed Service for every feed request | 10K reads/sec at peak |
| Elasticsearch | Search-optimized copy (title, summary, category, source, date) | Feed Service for keyword search only | ~100 searches/sec |
PostgreSQL is the source of truth. If Valkey loses data, feeds degrade to slightly higher latency (PostgreSQL fallback) but remain correct. If Elasticsearch goes down, search is unavailable but feeds are unaffected.
ClickHouse closes the feedback loop. The re-ranking batch job reads engagement signals (CTR, dwell time) from ClickHouse and folds them into the ranking formula. New articles start with zero engagement weight. As users interact, engagement data accumulates and shifts ranking over the following hours.
The Read Path (Feed Serving)
Every feed request follows this path. The common case (Valkey cache hit) completes in under 5ms.
The fallback path (cache miss to PostgreSQL) adds roughly 50ms of latency. At a 95% cache hit rate, this affects 500 of the 10K peak requests per second. These misses are self-healing: the Feed Service repopulates the Valkey cache on every miss, so subsequent requests for the same category hit the cache.
Failure Path (Dedup Service Degradation)
The system must continue processing articles even when individual components degrade. The diagram below traces what happens when Valkey LSH buckets become unavailable during dedup.
The circuit breaker opens after 3 consecutive Valkey timeouts (500ms threshold). While open, all dedup checks fall back to PostgreSQL exact-match on article_url -- lower recall (catches only URL-identical duplicates, not near-duplicates) but keeps the pipeline flowing. Articles that pass through during this degraded window carry a dedup_uncertain flag. When Valkey recovers, a backfill job reprocesses flagged articles through the full MinHash + LSH pipeline, merging any missed duplicates into existing clusters.
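A generic sketch of that breaker. The class name, the reset window, and using TimeoutError as the failure signal are simplifications of the 3-consecutive-timeout rule described above, not the actual implementation.

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; while open, every
    call goes straight to the fallback. Retries the primary path
    after `reset_after` seconds (half-open probe)."""
    def __init__(self, threshold=3, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()           # open: skip Valkey entirely
            self.opened_at = None           # half-open: probe primary again
        try:
            result = primary()
            self.failures = 0               # success closes the breaker
            return result
        except TimeoutError:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            return fallback()
```

In the dedup pipeline, `primary` would be the Valkey LSH lookup and `fallback` the PostgreSQL exact-match on article_url.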
These flows show the system operating normally and under failure. The deep dives that follow explain how each component achieves its guarantees.
Adaptive Polling
Problem: How does the system decide when to check each of the 100K sources?
Simple example: The New York Times publishes 200 articles per day. A personal blog publishes once a week. Polling both every 5 minutes means 288 daily polls to the NYT (productive, finds new articles on most checks) and 288 daily polls to the blog (wasteful, finds something new once per 2,016 polls).
Mental model: Think of a mail carrier who learns each household's delivery pattern. The apartment building gets checked every hour because it always has new mail. The vacation cabin gets checked once a week. The carrier adjusts based on what they found last time: if the cabin suddenly has a pile of packages, increase frequency temporarily.
The crawler scheduler uses a Valkey sorted set as a priority queue. Each source is a member, scored by its next-poll-time (epoch seconds). Workers pop the source with the lowest score (earliest next-poll-time), fetch the feed, process new articles, and re-insert the source with an updated next-poll-time calculated by the adaptive algorithm.
def calculate_poll_interval(source):
articles_per_day = source.avg_articles_per_day_30d
if articles_per_day > 100:
base_interval = 300 # 5 minutes (major outlets)
elif articles_per_day > 20:
base_interval = 900 # 15 minutes
elif articles_per_day > 5:
base_interval = 3600 # 1 hour
elif articles_per_day > 1:
base_interval = 7200 # 2 hours
else:
base_interval = 21600 # 6 hours (low-volume sources)
# If last poll found new articles, poll sooner
if source.last_poll_found_new:
interval = int(base_interval * 0.5)
else:
interval = int(base_interval * 1.2)
# Clamp to min 2 minutes, max 24 hours
    return max(120, min(86400, interval))

This reduces total polls from 1.2M per hour (100K sources at a fixed 5-minute interval) to roughly 300K per hour (adaptive) -- a 75% reduction.
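The pop-fetch-reinsert loop around calculate_poll_interval can be sketched with a Python heapq standing in for the Valkey sorted set (ZPOPMIN / ZADD). The helper names here are illustrative.

```python
import heapq
import itertools
import time

_seq = itertools.count()  # tie-breaker so sources never need comparing

def schedule(queue, source, next_poll_time):
    """ZADD equivalent: insert the source keyed by next-poll-time."""
    heapq.heappush(queue, (next_poll_time, next(_seq), source))

def scheduler_tick(queue, fetch, calc_interval, now=time.time):
    """ZPOPMIN equivalent: pop the earliest-due source, poll it,
    and re-insert it with an updated next-poll-time."""
    next_poll, _, source = heapq.heappop(queue)
    if next_poll > now():
        schedule(queue, source, next_poll)   # nothing due yet; put it back
        return None
    new_articles = fetch(source)
    source.last_poll_found_new = bool(new_articles)
    schedule(queue, source, now() + calc_interval(source))
    return new_articles
```

The real system uses the Valkey sorted set so that many workers can share one queue; the heap version shows the same scheduling logic in a single process.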
Politeness enforcement. Each worker maintains a per-domain rate limiter in memory. Maximum 1 request per second per domain. If a domain's RSS feed returns HTTP 429 (rate limited) or the robots.txt specifies a crawl delay, the worker backs off exponentially up to 24 hours. The User-Agent string identifies the service and provides a contact URL so site operators can report issues.
Feed format handling. The parser supports RSS 2.0, Atom 1.0, and JSON Feed. All formats normalize into a common Article struct containing title, summary, author, published timestamp, category, and source URL. For feeds that include only titles without summaries, the article relies on the title alone for dedup and classification.
Tradeoff: Adaptive polling introduces complexity in the scheduler. Fixed-interval polling is simpler to implement and reason about. The roughly 75% reduction in wasted fetches justifies the added complexity at this scale, but for a system with 1,000 sources, fixed-interval polling works fine.
The Ingestion Pipeline
Once articles leave the crawlers, they need a path to storage that tolerates failures and spikes.
Problem: Crawlers must keep fetching feeds even when downstream services slow down or fail. If the dedup service falls behind during a major news event and crawlers wait for it, polling stops, and breaking stories from other sources are missed. The ingestion path must tolerate backpressure without stalling the crawlers.
A direct RPC from crawler to dedup service creates this coupling. If the dedup service takes 500ms per article during a spike, crawlers block. At 170 articles per second (10x spike), the crawler fleet saturates within seconds.
Kafka breaks this coupling. Crawlers publish articles to a raw-articles topic and return immediately to polling the next source. Downstream consumers pull from Kafka at their own rate. During a 10x ingestion spike, Kafka absorbs the burst (minutes to hours of buffer at 3-day retention), while consumers process at a steady pace.
Kafka also enables fan-out. The dedup pipeline and Flink breaking news detector consume the same stream as independent consumer groups. If Flink crashes, the dedup pipeline is unaffected, and vice versa. This isolation is what allows the system invariant to hold: each layer fails independently.
The raw-articles topic uses 12 partitions, keyed by source domain. All articles from the same domain land in the same partition, preserving order for within-source dedup (detecting articles the crawler already fetched in a previous poll). Cross-source dedup (detecting the same story from different outlets) happens in the dedup service itself via MinHash.
Near-Duplicate Detection with MinHash + LSH
With duplicates arriving from hundreds of sources, the dedup service is the core algorithmic challenge. When 200 outlets publish stories about the same earthquake, the system must group them into one cluster and surface the best representative article.
Problem: Each new article must be compared against all articles from the last 24 hours (~5M) to find near-duplicates. Pairwise comparison is O(N^2) = 25 trillion comparisons per day. That is computationally impossible. The algorithm must find near-duplicates in approximately O(N) time.
Simple example: Two articles arrive:
- Article A: "Massive earthquake hits central Japan, 7.1 magnitude"
- Article B: "7.1 magnitude quake strikes Japan, hundreds evacuated"
These share significant textual overlap ("7.1", "Japan", "earthquake"/"quake") but are not identical. Exact title matching fails. The system needs a way to quantify how similar they are and decide whether they cover the same story.
Mental model: Imagine fingerprinting documents by randomly sampling words. If two documents share 60% of the same sampled words, they are probably about the same topic. MinHash formalizes this intuition: it compresses each document into a fixed-size signature such that the probability of two signatures matching at any position equals the Jaccard similarity of the original documents.
Now let's formalize that intuition. The algorithm has four steps.
Step 1: Shingling. Convert each article's text into a set of character n-grams (shingles). A 5-character shingle of "Breaking news" produces: {"Break", "reaki", "eakin", "aking", "king ", "ing n", "ng ne", "g new", " news"}. Character n-grams are more robust than word n-grams because they handle different word forms and partial matches.
Step 2: MinHash signatures. Apply k=128 independent hash functions to the shingle set. For each hash function, record the minimum hash value across all shingles. The resulting 128 minimum values form the MinHash signature (512 bytes at 4 bytes per value). The fraction of matching values between two signatures estimates the Jaccard similarity of the original shingle sets.
import hashlib

def hash_function(seed, shingle):
    # One of k independent hash functions, derived by salting blake2b
    # with the function index (a concrete choice; any seeded hash works)
    h = hashlib.blake2b(shingle.encode("utf-8"), digest_size=4,
                        salt=seed.to_bytes(8, "big"))
    return int.from_bytes(h.digest(), "big")

def compute_minhash(text, num_hashes=128):
    shingles = set()
    for i in range(len(text) - 4):
        shingles.add(text[i:i+5])
    signature = [float('inf')] * num_hashes
    for shingle in shingles:
        for i in range(num_hashes):
            h = hash_function(i, shingle)  # different seeded hash per index
            signature[i] = min(signature[i], h)
    return signature

Step 3: LSH banding. Divide the 128-value signature into b=32 bands of r=4 values each. For each band, hash the r values to produce a band hash. Two articles that share the same band hash in any band become candidate pairs. The probability of two articles with Jaccard similarity J being detected as candidates is 1 - (1 - J^r)^b.
With b=32, r=4: articles with 50% similarity have an ~87% chance of being detected as candidates, and detection is nearly certain (>99.9%) at 70% similarity. Articles with 20% similarity have only a ~5% chance of false pairing, dropping to ~0.3% at 10%.
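These detection probabilities follow directly from the banding formula and can be verified numerically:

```python
def candidate_probability(jaccard, bands=32, rows=4):
    """P(two articles become LSH candidates) = 1 - (1 - J^r)^b,
    where J is the Jaccard similarity of their shingle sets."""
    return 1 - (1 - jaccard ** rows) ** bands
```

The curve is steep around the chosen threshold: nudging r up sharpens the cutoff (fewer false candidates, more misses), while nudging b up shifts it left (more candidates of both kinds).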
Step 4: Cluster assignment. For each new article, compute its MinHash signature, look up LSH buckets in Valkey, find candidate duplicates, compute exact Jaccard similarity with candidates, and assign to an existing cluster (if similarity exceeds 0.4) or create a new cluster. Within each cluster, the article with the highest source authority score becomes the primary representative.
LSH buckets are stored in Valkey with a 48-hour TTL:
# LSH bucket key pattern
lsh:band:{band_id}:{bucket_hash} -> set of article IDs
SADD lsh:band:0:a3f2c1 12345 12390 12401
EXPIRE lsh:band:0:a3f2c1 172800
Performance: Computing MinHash for one article takes roughly 2ms. LSH bucket lookup in Valkey takes under 1ms. At 58 articles per second (5M per day), the dedup pipeline uses less than one CPU core.
Why MinHash over alternatives?
| Approach | Time Complexity | Memory | Accuracy | Tradeoff |
|---|---|---|---|---|
| Exact title match | O(1) per lookup | Low | ~10% recall | Misses everything except obvious duplicates |
| TF-IDF cosine similarity | O(N*D) per article | High (sparse vectors) | ~80% | Does not scale beyond 100K articles |
| SimHash | O(1) per lookup | Low (64-bit hash) | ~70% | Fast, but coarse Hamming-distance matching loses recall on longer documents |
| MinHash + LSH | O(1) average per lookup | Medium (512B per article) | ~90% precision, ~85% recall | Best accuracy-to-cost ratio at this scale |
| BERT embeddings | O(N*768) | High (768-dim vectors) | ~95% | Highest accuracy, but GPU cost prohibitive at 5M articles/day |
MinHash + LSH gives the best tradeoff between accuracy and compute cost. SimHash is cheaper but noticeably less accurate at this similarity threshold. BERT embeddings catch semantically similar rewrites that both lexical methods miss, but the accuracy gain costs roughly 10x more in compute.
Topic Classification
With duplicates clustered, each article needs a category label.
Problem: RSS feed categories are unreliable. Some feeds tag every article as "news." Others use inconsistent taxonomies ("tech" vs "technology" vs "computers"). The ranking system and personalized feeds depend on consistent category labels. Without a classifier, category-based filtering and user interest profiles break down.
Simple example: An article titled "Apple announces M5 chip with 40% performance gain" should be classified as "tech." An article titled "Fed raises interest rates by 25 basis points" should be "business." An article about a tech CEO testifying before Congress should receive both "tech" and "politics."
The constraint here is speed. At 58 articles per second, the classifier runs on every article in the write path. A transformer-based model (BERT, 97% accuracy) takes 50ms per article on GPU, requiring dedicated GPU infrastructure just for classification. A fastText model (92% accuracy) takes under 1ms per article on CPU, running as a lightweight step in the existing processing pipeline with no additional hardware.
The model outputs a probability distribution across 8 categories. Articles receive their top category and a secondary category if the second-highest probability exceeds 0.3. The 5% accuracy gap between fastText and BERT is acceptable because category is one of five ranking signals, weighted at 15%. A misclassified article still ranks correctly if its freshness, authority, and cluster signals are strong.
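A sketch of the thresholding step, assuming the classifier exposes a per-category probability dict (the real fastText output format differs; this shows only the primary/secondary selection rule):

```python
def assign_categories(probs, secondary_threshold=0.3):
    """probs: category -> probability from the classifier.
    Returns (primary, secondary_or_None); a secondary category is
    attached only if its probability exceeds the threshold."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    primary = ranked[0][0]
    secondary = None
    if len(ranked) > 1 and ranked[1][1] > secondary_threshold:
        secondary = ranked[1][0]
    return primary, secondary
```

The tech-CEO-before-Congress example from above would yield a "tech" primary with "politics" secondary whenever the politics probability clears 0.3.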
Article Ranking
Deduplication and classification prepare the article. Ranking decides where it appears.
Problem: With 2M unique stories per day, the system must decide which articles appear first in each feed. A brand-new article has zero engagement data. Ranking must work from the moment an article enters the system.
Simple example: Three articles arrive within the same minute:
- Article A: "Fed raises rates" from Reuters (authority: 0.95), covering an event that 80 other sources also published about
- Article B: "My thoughts on the economy" from a personal blog (authority: 0.15), unique article
- Article C: "Earthquake hits Japan, 7.1 magnitude" from BBC (authority: 0.93), 200 other sources published about this
Article C should rank highest: high authority, massive cluster (indicating major news event), and fresh. Article A should rank second. Article B should rank low despite being fresh and unique because the source has no authority.
Mental model: Think of a weighted voting system. Freshness, source authority, cluster size, and user interest each cast a vote. Freshness decays over time, so old articles naturally fade. Cluster size acts as a proxy for importance: when 200 outlets cover a story, it matters more than a story only one outlet covers.
def compute_ranking_score(article, user_interests=None):
# Freshness: exponential decay with 6-hour half-life
age_hours = (now() - article.published_at).total_seconds() / 3600
freshness = math.exp(-0.693 * age_hours / 6) # 0.693 = ln(2)
# Source authority: curated + algorithmic score (0-1)
authority = article.source.authority_score
# Cluster size: logarithmic bonus for stories covered by many sources
cluster_bonus = min(1.0, math.log(article.cluster_size + 1) / math.log(50))
# Category relevance (if personalized)
category_relevance = 0.5
if user_interests:
category_relevance = user_interests.get(article.category, 0.5)
# Engagement: click-through rate from past 24 hours
ctr = article.click_count / max(article.impression_count, 1)
engagement = min(1.0, ctr * 10)
score = (
0.30 * freshness
+ 0.25 * authority
+ 0.20 * cluster_bonus
+ 0.15 * category_relevance
+ 0.10 * engagement
)
    return score

The 6-hour half-life means an article's freshness score drops to 0.5 after 6 hours, 0.25 after 12 hours, and 0.125 after 18 hours. A high-authority, high-cluster article published 6 hours ago can still outrank a low-authority article published 1 minute ago.
Re-ranking. A batch job recomputes ranking scores every 15 minutes, reading from PostgreSQL, computing new scores, and writing them back. The freshness component ensures scores decay naturally between recomputation cycles. Valkey feed caches are updated after each batch.
Tradeoff: The 15-minute recomputation interval means articles can be up to 15 minutes stale in their ranking position. Reducing this to 1 minute increases PostgreSQL read load by 15x. The 15-minute interval is acceptable because the freshness decay function provides smooth score changes between recomputations. Breaking news bypasses this cycle entirely through a separate fast path.
Ranking produces a global order stored in PostgreSQL and cached in Valkey. Personalization adjusts this order at read time in the Feed Service. The Ranker never knows about individual users.
Breaking News Detection
Problem: When an earthquake strikes, the feed should surface the story within minutes, not wait for the next 15-minute ranking cycle. How does the system distinguish between a normal topic volume and an anomalous spike indicating breaking news?
Simple example: On an average Tuesday, the topic "earthquake" appears in 3 articles per hour across all sources. At 2:15 PM, 45 articles about an earthquake arrive within 15 minutes. The ratio of current volume to historical average is 45 / (3 * 0.25) = 60x. Something happened.
Mental model: A seismograph for news topics. Each topic has a baseline vibration level (rolling average of articles per time window). When the needle spikes well beyond the baseline, an alert fires.
Flink consumes the same Kafka raw-articles stream as the main processing pipeline. It maintains tumbling windows of 5 minutes, counting articles per classified topic within each window.
def detect_breaking_news(window_counts, rolling_averages):
breaking = []
for topic, count in window_counts.items():
avg = rolling_averages.get(topic, 0.5)
ratio = count / max(avg, 0.5)
if ratio > 5.0 and count > 10:
breaking.append({
"topic": topic,
"ratio": ratio,
"count": count,
"average": avg
})
    return sorted(breaking, key=lambda x: x["ratio"], reverse=True)[:20]

The dual threshold (ratio > 5.0 AND count > 10) prevents false alarms. A niche topic with a rolling average near zero would trigger at just 3 articles without the count threshold (the max(avg, 0.5) floor gives ratio = 3 / 0.5 = 6). The absolute minimum of 10 articles ensures that only genuinely significant events break through.
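The design does not specify how rolling_averages is maintained; one common choice is an exponentially weighted moving average, sketched here as an assumption:

```python
def update_rolling_average(avg, count, alpha=0.1):
    """EWMA over per-window article counts: the new window contributes
    alpha, history contributes (1 - alpha). A one-window spike lifts
    the baseline only slightly, so detection stays sensitive."""
    return alpha * count + (1 - alpha) * avg
```

With alpha=0.1, a single 45-article spike moves a baseline of 3.0 only to 7.2; it takes a sustained elevated rate for the baseline to follow, which is exactly the behavior a breaking-news detector wants.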
Fast path. When Flink detects a breaking topic, it immediately pushes the top article from that cluster into Valkey feed caches at position 0, bypassing the normal 15-minute ranking cycle. This reduces detection-to-feed latency from 15 minutes to under 5 minutes.
Notification Service. When Flink flags a breaking topic, the Notification Service delivers alerts to users who follow that topic. It operates through two channels. For users with the mobile app installed, it sends push notifications via APNs (Apple) and FCM (Google). For users who opted into email alerts, it queues a breaking news digest through an email provider. The service looks up topic subscribers from a user_topic_subscriptions table in PostgreSQL, batches them by channel, and dispatches asynchronously. At 50M DAU with an average of 3 followed topics per user, a major breaking event may target 5-10M users. The service rate-limits to avoid notification fatigue: maximum one push notification per topic per hour, and a global cap of 5 push notifications per user per day.
Tradeoff: The 5-minute tumbling window introduces up to 5 minutes of detection delay. Sliding windows with 1-minute granularity would detect faster but require 5x more state in Flink. For news aggregation, 5-minute detection is acceptable. A real-time alerting system (like earthquake early warning) would need sub-second detection, which is a fundamentally different architecture.
Personalized Feed Assembly
Problem: A feed ranked purely by global signals shows every user the same content. A sports fan and a tech enthusiast see identical articles. Personalization must blend global ranking with individual preferences without increasing feed latency beyond 200ms at the 99th percentile.
Simple example: The global top-10 feed contains 4 politics articles, 3 tech articles, 2 sports articles, and 1 business article. A user who reads primarily tech content should see the 3 tech articles promoted higher while still receiving top politics and sports stories for serendipity.
Personalization is a read-path concern. The write path produces globally ranked articles with no knowledge of individual users. The Feed Service applies per-user interest weights when assembling a response. No per-user state exists in the write path.
def assemble_personalized_feed(user_id, category="all", limit=50):
# Get user interest profile from Valkey
interests = valkey.hgetall(f"user:interests:{user_id}")
# Get global top articles (pre-ranked, cached in Valkey)
global_articles = valkey.get(f"feed:global:{category}")
# Filter out already-read articles
read_ids = get_recent_read_ids(user_id, days=7)
candidates = [a for a in global_articles if a.id not in read_ids]
# Re-rank by user interests
for article in candidates:
interest_boost = float(interests.get(article.category, 0.5))
article.personalized_score = article.ranking_score * (0.7 + 0.3 * interest_boost)
candidates.sort(key=lambda a: a.personalized_score, reverse=True)
# Diversity constraint: no more than 3 articles from the same source
diversified = []
source_count = {}
for article in candidates:
count = source_count.get(article.source_id, 0)
if count < 3:
diversified.append(article)
source_count[article.source_id] = count + 1
if len(diversified) >= limit:
break
    return diversified

The interest profile is a hash map of category-to-score pairs stored in Valkey, derived from reading behavior. When a user reads a tech article, the tech interest score increments by a small amount. Scores decay over 30 days so that preferences track changing interests.
```
# Valkey key pattern for user interests
HSET user:interests:u42 tech 0.8 politics 0.3 sports 0.6
EXPIRE user:interests:u42 2592000  # 30-day TTL
```
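The increment-and-decay update described above can be sketched as a pure function. The half-life and per-read boost are assumed values, not taken from the text; in production the result would be written back with `HSET` and the 30-day TTL refreshed:

```python
import math

DECAY_HALF_LIFE_DAYS = 15   # assumed: scores halve every 15 days, fading within ~30
READ_BOOST = 0.05           # assumed increment per article read in that category

def updated_interest(old_score: float, days_since_update: float,
                     read_event: bool) -> float:
    """Decay the stored score toward 0, then apply the read boost, clamped to [0, 1]."""
    decayed = old_score * math.exp(
        -math.log(2) * days_since_update / DECAY_HALF_LIFE_DAYS)
    if read_event:
        decayed += READ_BOOST
    return min(1.0, max(0.0, decayed))
```

Keeping the update a pure function of (old score, elapsed time) means it can be applied lazily on read rather than requiring a scheduled decay job per user.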
Diversity constraint. Without the 3-per-source cap, a user interested in tech would see 10 articles from TechCrunch in a row because TechCrunch has high authority and publishes frequently. The cap forces variety across sources, exposing users to different editorial perspectives.
Pre-computation for heavy users. The on-demand personalization path takes roughly 5ms per feed. At 10K feeds per second, that is manageable. For the most active 1M users (top 2%), feeds are pre-computed every 15 minutes and stored in Valkey. This eliminates the re-ranking computation for the users who generate the most load.
Tradeoff: The 0.7 + 0.3 * interest_boost formula means personalization can shift an article's score by at most 30%. This is a deliberate constraint. Aggressive personalization creates filter bubbles where users only see content matching their existing preferences. The 70% global weight preserves editorial diversity. Tuning this ratio is an ongoing product decision, not a technical one.
Background Reconciliation
Several background processes run continuously to correct drift and reclaim resources. Without them, the system degrades silently over hours.
LSH bucket cleanup. LSH buckets in Valkey have a 48-hour TTL and expire automatically. But if a Valkey shard restarts, buckets are lost before their TTL. New articles repopulate buckets organically as they are ingested, so no manual intervention is needed. The 48-hour window means dedup accuracy may drop temporarily for articles published just before the restart.
Re-ranking consistency. The batch re-ranking job runs every 15 minutes. If it fails or runs slowly, ranking scores become stale -- freshness decay stops being applied. Stale scores cause old articles to remain near the top of feeds. A monitoring alert fires if the re-ranking job has not completed in 20 minutes. The mitigation is straightforward: the job is idempotent and can be restarted without side effects.
Feed cache invalidation. Valkey feed caches have a 5-minute TTL. If the re-ranking job updates scores in PostgreSQL but Valkey still holds the old feed, users see slightly stale rankings for up to 5 minutes. This is acceptable. For breaking news, the fast path bypasses the cache entirely by inserting directly into Valkey at position 0.
Cluster drift correction. Over time, MinHash clusters can drift -- articles that should be in the same cluster end up in different clusters because they arrived outside the 48-hour LSH window. A nightly batch job re-scans the last 48 hours of articles, recomputes MinHash signatures against all active clusters, and merges clusters that should be unified. This catches roughly 2-3% of missed duplicates.
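The merge decision in the nightly job reduces to comparing MinHash signatures: the fraction of matching positions is an unbiased estimate of Jaccard similarity between the underlying shingle sets. A sketch, with an assumed merge threshold:

```python
def minhash_similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature positions estimates Jaccard similarity
    of the two articles' shingle sets."""
    assert len(sig_a) == len(sig_b)
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def should_merge(sig_a: list[int], sig_b: list[int],
                 threshold: float = 0.8) -> bool:
    # assumed threshold; the nightly job would compare cluster representatives
    return minhash_similarity(sig_a, sig_b) >= threshold
```

Because signatures are stored with each article, the batch job never re-shingles text; it only compares fixed-length integer vectors, which is what keeps the nightly re-scan cheap.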
Stale source cleanup. Sources that have returned errors for 30 consecutive days are automatically disabled. A weekly report flags sources whose authority scores have drifted significantly from their 90-day engagement averages, indicating quality changes that need manual review.
Bottlenecks and Mitigations
Seven bottlenecks to watch in production.
1. MinHash computation at ingestion spikes. During a major global event, article ingestion can spike 10x (from 17 to 170 articles per second). The dedup pipeline processes sequentially per partition and may fall behind. Mitigation: The dedup service auto-scales based on Kafka consumer lag. When lag exceeds 10K messages, Kubernetes HPA adds pods. MinHash computation is embarrassingly parallel since each article is independent.
2. Feed serving at 10K requests/sec. Each personalized feed requires a Valkey read, re-ranking over roughly 200 candidates, and diversity filtering. Mitigation: Pre-compute personalized feeds for the top 2% most active users (1M users). For the remaining 98%, compute on demand with a 5-minute cache TTL. On-demand computation takes roughly 5ms per feed.
3. Valkey memory for LSH buckets. Each article creates entries in up to 32 LSH band buckets. At 1.5M articles per day with a 48-hour window, that is up to 96M bucket entries. Mitigation: The 48-hour TTL auto-expires old entries. Bucket values are article IDs (8 bytes each), keeping total LSH memory under 2GB.
4. Elasticsearch indexing lag. Indexing 1.5M articles per day (roughly 17 per second). If the cluster falls behind, search results become stale. Mitigation: Bulk indexing (1,000 documents per batch). With 3 shards and 3 data nodes, the cluster handles 5K documents per second. Current load is well within capacity.
5. ClickHouse write volume. 250M read events per day (roughly 2,900 per second). Mitigation: ClickHouse sustains very high insert throughput when writes are batched. Events are buffered in application memory and flushed in batches of 10K rows.
6. PostgreSQL partition management. Monthly partitions accumulate. Dropping old partitions and creating new ones must happen automatically. Mitigation: pg_partman extension automates partition creation and retention. Drop partitions older than 30 days for articles, older than 90 days for read_history.
7. Stale interest profiles for inactive users. A user who has not visited in 29 days has an interest profile about to expire. If they return, their first feed will be unpersonalized. Mitigation: Fall back to the global feed for users without interest profiles. The experience is still good because global rankings reflect source authority and topic importance. The profile rebuilds within a few sessions.
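The fallback is a one-branch decision in the Feed Service. A sketch, with a dict-based re-rank standing in for the full personalization path and `valkey` injected as any client exposing `hgetall`/`get`:

```python
def feed_for(user_id: str, valkey, category: str = "all") -> list[dict]:
    """Serve a personalized feed when an interest profile exists; otherwise fall
    back to the pre-ranked global feed (sketch; client injected for testing)."""
    interests = valkey.hgetall(f"user:interests:{user_id}")
    articles = valkey.get(f"feed:global:{category}") or []
    if not interests:  # profile expired or never built: global ranking still holds
        return articles
    return sorted(
        articles,
        key=lambda a: a["ranking_score"]
        * (0.7 + 0.3 * float(interests.get(a["category"], 0.5))),
        reverse=True,
    )
```

Because the global feed is already ranked by authority and freshness, the empty-profile branch returns a good feed with zero extra work, which is why the returning-user experience degrades gracefully.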
Component-Level Failure Modeling
Each bottleneck above describes a performance limit. The failure scenarios below describe what happens when components stop working entirely. The system's AP design means feeds continue serving under every scenario -- the question is how much quality degrades.
Crawler Fleet Failure
Scenario: All 50 crawler pods crash due to a bad deploy.
| Time | Event |
|---|---|
| T+0 | Bad deploy crashes all crawler pods |
| T+0 | Valkey schedule accumulates overdue sources |
| T+0 | Existing feeds continue serving from cache |
| T+2min | Kubernetes rolls back (failed health checks) |
| T+5min | Crawler pods restart, begin processing backlog |
| T+15min | Backlog cleared, normal ingestion resumes |
Impact: 15-minute gap in new articles. Users see slightly stale feeds. No data loss because the Valkey schedule retains overdue sources. Breaking news during the gap is delayed by 15 minutes.
Dedup Pipeline Failure
Scenario: MinHash dedup service crashes. Articles enter without cluster assignment.
| Time | Event |
|---|---|
| T+0 | Dedup service pods crash |
| T+0 | Kafka consumer lag grows |
| T+0 | Articles bypass dedup and enter with cluster_id=NULL |
| T+5min | Dedup service recovers, processes backlog |
| T+10min | Retroactive dedup assigns cluster_ids using stored minhash_sig |
Impact: For 5-10 minutes, feeds may show duplicate stories from different sources. This is a cosmetic issue, not a data integrity problem. Retroactive dedup fixes it. The key design decision that enables retroactive dedup: storing the MinHash signature in the articles table so it does not need to be recomputed.
PostgreSQL Failover
Scenario: PostgreSQL primary becomes unavailable.
| Time | Event |
|---|---|
| T+0 | Primary crashes |
| T+0 | Writes fail (new articles cannot be persisted) |
| T+0 | Reads served from Valkey cache (pre-computed feeds still work) |
| T+15s | Synchronous replica promoted by Patroni |
| T+30s | New primary accepts connections, writes resume |
Impact: Article ingestion paused for roughly 30 seconds. Feed serving unaffected because feeds are served from Valkey. No data loss with synchronous replication.
Kafka Broker Failure
Scenario: One of three Kafka brokers fails.
| Time | Event |
|---|---|
| T+0 | Broker crashes |
| T+5s | New partition leaders elected on remaining brokers |
| T+10s | Producers redirect to new leaders |
| T+15s | Consumers resume from last committed offset |
Impact: Up to 15 seconds of increased latency for article ingestion. No data loss with acks=all and min.insync.replicas=2. The two remaining brokers handle the load.
Valkey Shard Failure
Scenario: One of 6 Valkey shards becomes unavailable.
| Time | Event |
|---|---|
| T+0 | Shard crashes, affecting ~17% of cached feeds and LSH buckets |
| T+0 | Feed requests to affected shard return cache miss |
| T+0 | Feed service falls back to PostgreSQL for affected users |
| T+5s | Valkey replica promoted |
| T+30s | Cache begins warming with new requests |
Impact: Feed latency increases from sub-50ms to roughly 100ms for affected users during the 30-second recovery window. LSH buckets on the failed shard are lost, potentially causing temporary dedup misses. These self-heal as new articles repopulate the buckets.
Flink Cluster Failure
Scenario: All Flink task managers crash. Breaking news detection stops.
| Time | Event |
|---|---|
| T+0 | Flink cluster fails |
| T+0 | Breaking news detection and push notifications stop |
| T+0 | Normal feed serving and article ingestion continue unaffected |
| T+5min | Flink restarts from last Kafka committed offset |
| T+10min | Reprocesses missed windows, resumes real-time detection |
Impact: No push notifications for trending topics during the outage. Users still see new articles via the normal ranking pipeline -- they just are not surfaced with breaking-news priority. The Flink consumer replays missed Kafka messages on restart, so no article data is lost. Detection gap: 5-15 minutes depending on restart time.
Tradeoff: Flink is isolated from the critical feed-serving path. This is deliberate -- breaking news detection is a P1 feature, not P0. The system degrades gracefully to "normal ranking only" mode.
Elasticsearch Failure
Scenario: Elasticsearch cluster goes red (no primary shards available).
| Time | Event |
|---|---|
| T+0 | ES cluster unavailable |
| T+0 | Search API returns 503 Service Unavailable |
| T+0 | Feed serving and article ingestion continue unaffected |
| T+5-30min | ES self-heals (shard reallocation) or manual intervention (shard reassignment) |
Impact: Search is unavailable. Feed serving is completely unaffected because feeds are served from Valkey, not Elasticsearch. No data loss -- articles continue being written to PostgreSQL. Once ES recovers, the indexing backlog (buffered in the Ranker) is replayed.
Tradeoff: Search is a P1 feature. The system does not attempt a PostgreSQL-based search fallback because full-text search on a relational database at this volume would produce unacceptable latency and load.
ClickHouse Failure
Scenario: ClickHouse cluster becomes unavailable. Read events stop recording.
| Time | Event |
|---|---|
| T+0 | ClickHouse unavailable |
| T+0 | Read event inserts fail; events buffered in Kafka (3-day retention) |
| T+0 | Personalization uses existing (stale) interest profiles in Valkey |
| T+0 | Re-ranking batch uses last-known engagement scores |
| Recovery | Replay buffered events from Kafka into ClickHouse |
Impact: Personalization quality degrades because interest profiles are not updated with new reading behavior. Ranking continues with stale engagement scores. Kafka's 3-day retention buffers all events during the outage; on recovery, the full backlog is replayed so no behavioral data is permanently lost.
Tradeoff: ClickHouse is not in the hot path for feed serving. Personalization degrades gradually (interest profiles have 30-day TTLs), so even a multi-day outage produces a slow decline in personalization quality rather than a cliff.
Feed Service Pod Crash
Scenario: One of 10 Feed Service pods crashes.
| Time | Event |
|---|---|
| T+0 | Pod crashes, Kubernetes service removes it from load balancer |
| T+0 | In-flight requests on that pod fail (clients retry) |
| T+0 | Load balancer routes traffic to remaining 9 pods |
| T+10s | Kubernetes restarts the pod |
| T+15s | New pod passes health check, added back to load balancer |
Impact: Brief latency spike for users whose requests were in-flight on the crashed pod. The L1 in-process cache on that pod is lost, so the restarted pod misses L1 and sends more requests to Valkey for about 60 seconds until the L1 cache warms. With 9 remaining pods absorbing the load (roughly 11% more per pod), feed serving continues without user-visible degradation.
Deployment and Operations
Failure modeling reveals how the system recovers. The deployment strategy determines how changes are introduced without triggering those failure scenarios.
Crawler workers. Rolling update with maxUnavailable: 10%. Workers are stateless; the schedule lives in Valkey. New workers pick up where old ones left off.
Dedup service. Blue-green deployment. The new version reads from the same Kafka consumer group. Both versions can process articles simultaneously during the transition because MinHash is deterministic: the same article always produces the same signature.
Feed service. Rolling update with maxUnavailable: 0, maxSurge: 25%. Pods are stateless, all state is in Valkey and PostgreSQL.
Flink jobs. Savepoint-based deployment. Take a Flink savepoint, deploy the new job from the savepoint, verify output, tear down the old job. Zero event loss.
Feature Flags
Feature flags gate risky changes without requiring a deploy:
- `new_ranking_formula` -- A/B tests ranking weight changes (e.g., shifting freshness weight from 0.30 to 0.35). 5% of users see the new formula; engagement metrics are compared after 7 days.
- `breaking_news_push` -- Disables push notifications for breaking news without affecting detection or feed surfacing. Used during notification delivery issues or excessive false-positive events.
- `lsh_band_count` -- Tunes dedup sensitivity at runtime. Reducing bands from 32 to 24 lowers recall but reduces Valkey memory. Increasing to 40 catches more near-duplicates at the cost of more false candidates.
Flags are evaluated at request time (feed serving flags) or consumer startup (pipeline flags). A flag management service (LaunchDarkly or equivalent) distributes flag state with < 1 second propagation latency.
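Percentage rollouts like the 5% ranking experiment are typically bucketed deterministically, so a given user stays in the same arm across requests. A minimal sketch of that bucketing (not the LaunchDarkly SDK, whose evaluation model is richer):

```python
import hashlib

def flag_enabled(flag: str, user_id: str, rollout_pct: float) -> bool:
    """Deterministic percentage rollout: hash (flag, user) into [0, 100] and
    compare against the rollout percentage. Same user, same flag -> same arm."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF * 100
    return bucket < rollout_pct
```

Hashing the flag name together with the user id decorrelates experiments: a user in the 5% arm of `new_ranking_formula` is not automatically in the early arm of every other flag.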
Database Migrations
Online DDL. Index rebuilds on the 100GB articles table use pg_repack to rebuild without locking. Standard CREATE INDEX CONCURRENTLY is used for new indexes -- it does not block writes, though it takes longer.
Column additions. New nullable columns are added first (ALTER TABLE ... ADD COLUMN), then backfilled in batches of 10K rows with rate-limited UPDATE statements that throttle to < 5% of CPU on the primary. Application code handles NULL values during the backfill window.
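A batched backfill loop is short but easy to get wrong (unbounded UPDATEs, no pacing). A sketch of the pattern under assumed names: the `lang` column and the pacing constant are illustrative, and the loop is resumable because each batch selects only still-NULL rows:

```python
import time

BATCH = 10_000
PAUSE_S = 0.5  # assumed pacing to keep the primary well under the CPU budget

BACKFILL_SQL = """
UPDATE articles SET lang = 'en'
WHERE id IN (SELECT id FROM articles WHERE lang IS NULL LIMIT {batch})
"""

def backfill(conn) -> int:
    """Run the batched backfill to completion; returns total rows updated.
    Safe to kill and restart: finished rows are excluded by the NULL filter."""
    total = 0
    while True:
        cur = conn.execute(BACKFILL_SQL.format(batch=BATCH))
        conn.commit()
        if cur.rowcount == 0:
            return total
        total += cur.rowcount
        time.sleep(PAUSE_S)
```

Each batch is its own transaction, so long-running locks never accumulate and replication lag stays bounded.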
Schema changes affecting Kafka consumers. The dual-write migration pattern applies: the producer emits both old-format and new-format fields for the duration of the rollout. Avro schema registry enforces backward compatibility -- new schemas must be able to deserialize old messages.
Rollback
- Code rollback:
kubectl rollout undo deployment/<service>completes in < 30 seconds. Kubernetes drains old pods and replaces them with the previous image. - Data rollback: Not possible for article ingestion -- articles are immutable once written. Forward-fix is the only option (mark bad articles with a flag, exclude from feeds, re-process).
- Ranking formula rollback: Immediate via feature flag toggle. No deploy needed. The 15-minute re-ranking batch picks up the reverted formula on its next run.
Observability
Deployment and rollback depend on knowing the system's state. The observability stack provides the signals needed to detect problems, trace root causes, and verify fixes.
Metrics
| Metric | Source | Alert Threshold |
|---|---|---|
| `articles_ingested_per_min` | Kafka consumer | < 300 for 10 min (normally ~1,000) |
| `dedup_precision` | Dedup service (sampled) | < 80% (manual spot-check) |
| `feed_latency_p99` | Feed service | > 200ms for 5 min |
| `crawler_schedule_overdue` | Valkey | > 10K sources overdue by > 30 min |
| `kafka_consumer_lag` | All consumers | > 100K for 10 min |
| `breaking_news_detection_delay` | Flink | > 10 min from event to detection |
| `elasticsearch_index_lag` | Elasticsearch | > 5 min |
Dashboard Panels
- Ingestion pipeline -- articles per minute, dedup rate, average cluster size
- Feed serving -- latency p50/p99, cache hit rate, personalization coverage
- Crawler health -- active workers, polls per minute, error rate per source domain
- Trending topics -- live list with volume and detection timestamp
- Source health -- top sources by article volume, sources returning errors
Logging
Structured JSON logs with consistent fields across all services:
```json
{
  "timestamp": "2026-04-04T10:30:00.123Z",
  "level": "WARN",
  "component": "dedup-service",
  "action": "lsh_lookup_fallback",
  "trace_id": "abc123def456",
  "source_id": 42,
  "article_id": 12345,
  "duration_ms": 502,
  "error": "valkey_timeout"
}
```

Log levels follow strict semantics:
- ERROR -- Component failure requiring intervention (PostgreSQL connection lost, Kafka consumer crash).
- WARN -- Degraded path that self-recovers (dedup-uncertain fallback, cache miss to PostgreSQL, circuit breaker open).
- INFO -- State transitions (pod startup, partition created, re-ranking batch completed, feature flag changed).
Correlation: trace_id is generated by the crawler and propagated via Kafka message headers through all downstream consumers. A single trace_id links the full lifecycle of an article from crawl through dedup, classification, ranking, and feed serving.
Distributed Tracing
OpenTelemetry spans instrument the full article lifecycle:
crawl (200ms) → kafka_produce (5ms) → dedup (3ms) → classify (1ms) → rank (2ms) → pg_write (15ms) → valkey_update (1ms) → es_index (10ms)
Key span budgets:
- `dedup_lsh_lookup`: p99 < 5ms (Valkey round-trip)
- `pg_write`: p99 < 20ms (single INSERT with indexes)
- `feed_assembly`: p99 < 50ms (Valkey read + re-ranking + diversity filter)
Trace sampling: 1% for normal traffic (sufficient for latency distribution analysis). 100% for error paths (every failed request is fully traced). Traces are stored in Jaeger with 7-day retention.
SLI/SLO
| SLI | SLO | Measurement |
|---|---|---|
| Feed latency p99 | < 200ms | Prometheus histogram at Feed Service |
| Article freshness | < 15 min from source publish to feed appearance | End-to-end trace (crawl timestamp to feed cache write) |
| Dedup accuracy (precision) | > 95% | Weekly sample audit: 1000 random cluster pairs manually verified |
| Availability | 99.99% (52 min downtime/year) | External uptime check on /healthz every 30 seconds |
SLO burn rate alerts: if the error budget for feed latency is consumed at 10x the sustainable rate, a page fires. At 2x the rate, a ticket is created. This avoids alerting on brief spikes that do not threaten the monthly budget.
Runbooks
kafka_consumer_lag_high -- Check consumer pod health (kubectl get pods). Check for poison messages in the partition (consumer stuck on a single offset). If pods are healthy, scale consumers (kubectl scale deployment dedup-service --replicas=8). If a poison message is blocking, skip it and route to DLQ.
valkey_memory_high -- Check TTL expiry distribution (valkey-cli --bigkeys). Check for LSH bucket bloat (unusually large sets in lsh:band:*). If memory exceeds 90%, trigger manual eviction of feed keys (SCAN + DEL on feed:global:* -- they are easily recomputed).
feed_latency_spike -- Check Valkey hit rate (should be > 95%). If hit rate dropped, check for recent Valkey restart or shard failure. Check PostgreSQL query plans (pg_stat_statements) for regressions. Check for thundering herd (multiple concurrent cache recomputations for the same key).
Security
Observability exposes system internals for operators. Security ensures those internals -- and user data -- are protected from unauthorized access.
Crawler Identity
All crawlers use a consistent User-Agent string identifying the service and providing a contact URL. robots.txt is fetched and respected for every domain before the first crawl. Crawl delay directives are honored.
Content Attribution
All articles link back to the original source. No full-text storage, only headlines and summaries (fair use). Source attribution is displayed prominently in the UI. Full-text scraping of copyrighted articles without permission is legally risky and architecturally unnecessary for an aggregator.
Input Validation
Source feed URLs are validated before addition: must be valid HTTP/HTTPS, no internal IPs (SSRF prevention), no file:// URIs. Article content is HTML-stripped before storage to prevent XSS.
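The SSRF check amounts to two gates: reject non-HTTP(S) schemes, and reject hostnames that resolve to private, loopback, link-local, or reserved addresses. A sketch using the standard library (`resolve` is injectable so the check can be tested without network access):

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_feed_url(url: str, resolve=socket.getaddrinfo) -> bool:
    """Reject non-HTTP(S) URLs and URLs resolving to internal address space."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = resolve(parsed.hostname, parsed.port or 80)
    except OSError:
        return False
    for info in infos:
        ip = ipaddress.ip_address(info[4][0])  # sockaddr is the 5th tuple field
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            return False
    return True
```

Validation at add-time is not sufficient on its own: the fetch path should re-check the resolved address at connect time, otherwise a DNS record that later flips to an internal IP (DNS rebinding) bypasses the gate.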
Authentication
JWT Bearer tokens authenticate all API requests. Tokens are signed with HMAC-SHA256, expire after 15 minutes, and are refreshed via /api/v1/auth/refresh. The short expiry limits the window of exposure if a token is compromised. Refresh tokens are stored hashed (bcrypt) in PostgreSQL and rotated on each use -- a stolen refresh token becomes invalid after the legitimate user refreshes.
Authorization
Scope-based access control is enforced at the API Gateway before requests reach backend services:
| Scope | Access |
|---|---|
| `read:feed` | Feed retrieval, search, trending topics |
| `write:events` | Read-event recording (dwell time, clicks) |
| `admin:sources` | Source CRUD operations (add, disable, configure) |
Default user tokens carry read:feed and write:events. The admin:sources scope is restricted to operator accounts. Scope violations return 403 Forbidden with error code SCOPE_DENIED.
Encryption
- In transit: TLS 1.3 for all inter-service communication. Valkey clients connect via TLS (not plaintext). Kafka brokers enforce TLS for producer/consumer connections.
- At rest: AES-256 encryption for PostgreSQL tablespaces and S3 archived data. Valkey is not encrypted at rest (performance tradeoff -- data is ephemeral cache and regenerable).
- Key rotation: AWS KMS manages encryption keys with 90-day automatic rotation. Application-layer secrets (JWT signing key, database credentials) rotate quarterly via Kubernetes Secrets with rolling restart.
PII Handling
User read history and interest profiles constitute PII -- they reveal reading habits and topic preferences. These are encrypted at rest in PostgreSQL and ClickHouse. GDPR right-to-erasure is implemented via cascade delete on user_id: deleting a user removes read_history, user_interests, bookmarks, and topic subscriptions. Valkey keys for the user (user:interests:{user_id}, feed:user:{user_id}) are explicitly deleted. ClickHouse events are marked for exclusion from future aggregation queries (columnar storage makes physical deletion expensive, so logical deletion via a deleted flag is used, with physical compaction monthly).
Data retention: user behavioral data (read history, events) is retained for 1 year, then purged. Interest profiles expire via 30-day Valkey TTL and are not persisted long-term.
Testing and Validation
Security controls prevent unauthorized access. Testing validates that the system behaves correctly under expected and unexpected conditions.
Load Testing
k6 scripts simulate 10K concurrent feed requests against the staging environment. The test profile ramps from 100 to 10K virtual users over 5 minutes, sustains peak load for 15 minutes, then ramps down. Pass criteria: p99 < 200ms, error rate < 0.1%, no Valkey connection pool exhaustion. Load tests run weekly in staging with a production-like data volume (30 days of articles, 50M simulated user profiles).
A separate burst test simulates a breaking news scenario: 10x ingestion spike (170 articles/sec into Kafka) concurrent with 2x feed request spike. Pass criteria: Kafka consumer lag does not exceed 10K messages, feed latency p99 stays below 300ms during the burst.
Chaos Engineering
Gremlin or LitmusChaos experiments validate the failure scenarios described in the Component-Level Failure Modeling section:
- Kill 1 of 6 Valkey shards: Verify feed serving continues from PostgreSQL fallback. Measure latency increase (should be < 200ms p99). Verify LSH dedup continues on remaining 5 shards.
- Partition a Kafka broker from the network: Verify no data loss after the partition heals. Verify producer failover completes in < 15 seconds. Verify consumer lag recovers within 5 minutes.
- Kill the entire Flink cluster: Verify feeds still serve normally (no breaking news surfacing). Verify breaking news detection resumes within 15 minutes of restart. Verify no Kafka offset regression.
- Saturate PostgreSQL with slow queries: Verify Valkey cache absorbs feed traffic. Verify re-ranking batch job fails gracefully and alerts fire.
Chaos experiments run monthly in staging. Critical scenarios (Valkey shard kill, Kafka broker partition) run quarterly in production during low-traffic windows.
Integration Tests
An end-to-end pipeline test runs on every deploy to staging:
- Inject a known article (controlled RSS feed with a unique sentinel article).
- Verify the article appears in the Kafka `raw-articles` topic within 10 seconds.
- Verify the article receives a `cluster_id` (dedup) within 30 seconds.
- Verify the article is classified with the correct category within 35 seconds.
- Verify the article appears in the feed for the matching category within 60 seconds.
- Verify the article is searchable in Elasticsearch within 60 seconds.
Failure at any step blocks the deploy from proceeding to production.
Contract Tests
API response schemas are validated via Pact. The Feed Service, Search endpoint, and Event Recording endpoint each have consumer-driven contract tests that verify response shape, required fields, and data types. Contracts are checked on every PR.
Kafka message schemas use Avro with the schema registry enforcing backward compatibility. A new schema version must be able to deserialize messages produced by the previous version. Incompatible schema changes are rejected at registry registration time, preventing broken consumers from reaching production.
Data Validation
- Weekly dedup audit: Sample 1,000 article pairs from the same cluster. Manual review (or ML-assisted review) verifies precision > 95% (articles in the same cluster are genuinely about the same story) and recall > 90% (articles about the same story are in the same cluster).
- Monthly cache-DB reconciliation: Compare Valkey feed contents with PostgreSQL article counts per category. Flag divergence > 5% as a potential feed materialization bug.
- Continuous data quality: A streaming job monitors ingested articles for anomalies -- articles with empty titles, published dates in the future, or summaries shorter than 10 characters. Anomalous articles are flagged and excluded from ranking.
Cost and Capacity
Testing validates correctness; cost analysis validates economic viability. The table below projects costs at the current 1x scale and two growth milestones.
Cost Table
| Component | 1x (50M DAU) | 10x (500M DAU) | 100x (5B DAU) |
|---|---|---|---|
| PostgreSQL (db.r6g.2xlarge) | $1,200/mo | $8,000/mo (4 read replicas) | $40,000/mo (sharded, 10 clusters) |
| Valkey (6 shards, r6g.xlarge) | $600/mo | $3,000/mo (30 shards) | $20,000/mo (300 shards) |
| Kafka (3 brokers, m6i.xlarge) | $400/mo | $2,000/mo (15 brokers) | $12,000/mo (100 brokers) |
| Elasticsearch (3 nodes, r6g.xlarge) | $900/mo | $5,000/mo (15 nodes) | $30,000/mo (100 nodes) |
| ClickHouse (2 nodes) | $500/mo | $3,000/mo (10 nodes) | $18,000/mo (50 nodes) |
| Compute (crawlers + services) | $2,000/mo | $15,000/mo | $100,000/mo |
| Total | ~$5,600/mo | ~$36,000/mo | ~$220,000/mo |
Cost Cliffs
Elasticsearch becomes the most expensive component first. At 10x, full-text indexing of 15M articles/day requires significant compute and storage. Each article generates an inverted index entry for title and summary tokens. Mitigation: index only titles and summaries (not full article bodies -- which the system does not store anyway), and use ILM to move old indexes to warm/cold nodes with cheaper storage.
Valkey memory is the tightest constraint. Feed pre-computation for the top 2% of users consumes 25GB (1M users * 25KB feeds). At 10x, the top 2% is 10M users, requiring 250GB -- far exceeding a single Valkey cluster. Mitigation: switch to cohort-based caching at 10x (10K interest-profile cohorts instead of 10M individual feeds), reducing memory to approximately 250MB.
Hidden Costs
Cross-AZ data transfer. Kafka replication (factor 3) across 3 availability zones means every message is transmitted twice across AZ boundaries. At 1x, this is negligible (~$20/mo). At 100x, a sustained write throughput around 35MB/sec works out to 35MB/sec * 86,400 sec/day * 30 days * 2 cross-AZ copies ≈ 180TB/mo; billed at $0.01/GB on each side of the transfer, that is roughly $3,600/mo. A real cost that does not appear on the Kafka compute bill.
Elasticsearch storage. At 100x, 30-day article retention in ES (with replicas) requires approximately 10TB. At $0.10/GB/mo for gp3 EBS, storage alone costs about $1,000/mo across the cluster.
Optimization Levers
At 10x, three optimizations defer the need for 100x infrastructure:
- Cohort-based feed caching (described above) reduces Valkey memory by 1,000x for pre-computed feeds.
- ES index pruning -- index only articles with `ranking_score > 0.3` (top 40% by score). Low-ranking articles from low-authority sources are unlikely to be searched. Reduces ES volume by 60%.
- Kafka tiered storage -- move old segments to S3, keeping only the last 6 hours on broker disks. Reduces Kafka EBS costs by 90% at the expense of slower replay for consumer group resets.
Multi-Region Considerations
Cost analysis assumes a single-region deployment. Multi-region changes the cost structure and introduces consistency challenges that must be evaluated against the latency benefit.
Current Design: Single-Region (US-East)
The current architecture is single-region. This is deliberate: news aggregation tolerates minutes of staleness, so the complexity and cost of multi-region replication are not justified at 50M DAU. Users in Europe experience 80-120ms additional latency compared to US users, but the feed serving p99 (45ms in-region + 120ms cross-Atlantic = 165ms) still meets the 200ms SLO.
What Changes at 500M DAU
At 500M DAU with a global user base, cross-region latency becomes unacceptable for APAC users (200-300ms added). Three changes are required:
Regional Valkey caches. Deploy Valkey clusters in US-East, EU-West, and APAC-Southeast. Each region serves feeds from its local cache. Cache misses fall through to the central PostgreSQL in US-East. This eliminates cross-region latency for the 95% of requests that hit cache.
Centralized crawlers and dedup. Crawlers remain in US-East. Sources are global, and dedup must be global to prevent cross-region duplicates (the same CNN article should not be independently deduped in each region). Ingested articles propagate to regional Kafka clusters via MirrorMaker 2.
PostgreSQL read replicas. A single-leader primary in US-East with asynchronous read replicas in EU and APAC. Write latency for read-event recording: 100-200ms cross-region (acceptable because event recording is asynchronous and non-blocking). The re-ranking batch job runs against the US-East primary.
Regional vs. Global Data
| Data | Scope | Rationale |
|---|---|---|
| Articles and sources | Global (US-East primary) | Single source of truth prevents cross-region duplicates |
| User interests and read history | Regional (user affinity to nearest region) | Keeps user data close to the user; reduces cross-region writes |
| Feed cache | Regional (local Valkey) | Latency-sensitive; served from nearest cache |
| LSH dedup buckets | Global (US-East Valkey) | Dedup must be globally consistent |
| Search index | Regional (local ES clusters) | Search is latency-sensitive; regional indexes populated via MirrorMaker |
Cross-Region Consistency
Regional feeds are eventually consistent with approximately 5-second lag (Kafka MirrorMaker replication latency). A user in APAC may see a breaking news article 5 seconds after a user in US-East. This is acceptable for news aggregation.
Breaking news detection runs per-region on mirrored Kafka topics. Regional Flink clusters may detect regional trends that the US-East cluster misses (e.g., a local election trending in APAC). This is a feature, not a bug -- regional trending topics are more relevant to regional users.
Failover
Active-passive configuration. If US-East fails completely:
- DNS failover (60-second TTL) routes traffic to EU-West.
- EU-West read replica is promoted to primary (approximately 30 seconds).
- Data loss window: up to 60 seconds of un-replicated writes (asynchronous replication to EU).
- Crawlers restart in EU-West from the Valkey schedule (which is replicated).
- Feed serving resumes from EU-West Valkey cache (already warm for EU users; cold for redirected US users).
Active-active multi-region (both regions accepting writes simultaneously) is not implemented. The complexity of conflict resolution for article dedup and ranking outweighs the availability benefit at 500M DAU. At 5B DAU, active-active becomes necessary, requiring CRDTs (conflict-free replicated data types) for user interest profiles and operational transformation for article metadata.
Three Things That Break at 10x
Multi-region is one scaling challenge. Three domain-specific problems also emerge as the system grows from 50M to 500M DAU with a global, multi-language user base.
1. Dedup Breaks Across Languages
MinHash with character n-grams works well for articles in the same language. An earthquake story in English and the same story in Spanish produce largely disjoint shingle sets, so MinHash sees them as unrelated. At 10x with a global user base, multi-language dedup becomes a P0 requirement.
Worse, shingling is script-bound. CJK scripts (Chinese, Japanese, Korean) do not use whitespace between words, which rules out falling back to word-level shingles, and the character shingles of a Japanese headline about the same earthquake are kana and kanji n-grams that share almost nothing with the English shingles. MinHash recall for cross-language pairs drops below 5%.
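A minimal sketch of why cross-script pairs defeat MinHash, using plain character trigrams and exact Jaccard similarity (the quantity MinHash signatures estimate). The headlines are made up for illustration:

```python
def shingles(text: str, n: int = 3) -> set[str]:
    """Character n-gram shingles, the representation a MinHash signature compresses."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

en = shingles("Magnitude 7.1 earthquake strikes off the coast of Japan")
es = shingles("Un terremoto de magnitud 7,1 sacude la costa de Japon")
ja = shingles("日本沖でマグニチュード7.1の地震が発生")

# Related Latin-script languages share cognate trigrams ("magnitud", "cost"...),
# so the Jaccard estimate is small but nonzero.
print(f"en/es Jaccard: {jaccard(en, es):.3f}")
# Cross-script, the shingle sets are almost entirely disjoint (only the
# digits overlap), so MinHash sees two unrelated documents.
print(f"en/ja Jaccard: {jaccard(en, ja):.3f}")
```

Lowering the LSH similarity threshold does not rescue this: the true Jaccard between scripts is near zero, so there is nothing for the signatures to detect.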
Mitigation path: Separate dedup pipelines per language, with cross-language cluster linking via named entity extraction. If two clusters in different languages mention the same named entities (people, places, organizations extracted via NER and normalized to canonical IDs) within the same 30-minute window, they are linked as cross-language duplicates. This is more expensive than single-language MinHash, but the cost scales linearly with the number of languages rather than quadratically with the number of articles.
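The linking rule can be sketched as a predicate over two clusters. This is a toy version under stated assumptions: `extract_entities` stands in for a real NER model (e.g. spaCy) plus entity normalization, and the Wikidata-style IDs (`Q17` etc.) are hypothetical canonical entity IDs:

```python
from datetime import datetime, timedelta

# Stand-in for NER + normalization; in production, entities are extracted
# at ingest time and attached to the cluster record.
def extract_entities(cluster: dict) -> set[str]:
    return set(cluster["entities"])

def link_cross_language(cluster_a: dict, cluster_b: dict,
                        min_shared: int = 3,
                        window: timedelta = timedelta(minutes=30)) -> bool:
    """Link two story clusters in different languages when they share enough
    canonical entities and first appeared within the same time window."""
    if cluster_a["lang"] == cluster_b["lang"]:
        return False  # same-language pairs are already handled by MinHash/LSH
    if abs(cluster_a["first_seen"] - cluster_b["first_seen"]) > window:
        return False
    shared = extract_entities(cluster_a) & extract_entities(cluster_b)
    return len(shared) >= min_shared

t0 = datetime(2024, 3, 1, 9, 0)
en_cluster = {"lang": "en", "first_seen": t0,
              "entities": {"Q17", "Q8093", "JMA", "magnitude-7.1"}}
ja_cluster = {"lang": "ja", "first_seen": t0 + timedelta(minutes=12),
              "entities": {"Q17", "JMA", "magnitude-7.1", "Q148"}}

print(link_cross_language(en_cluster, ja_cluster))  # True: 3 shared entities, inside window
```

The time window matters: two unrelated stories about the same country can share entities days apart, but three shared entities within 30 minutes of each other is strong evidence of the same event.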
2. Feed Personalization Becomes Computationally Infeasible
At 1x, pre-computing feeds for the top 2% of users (1M users) costs roughly 5,000 seconds of compute per batch (1M feeds * 5ms each); parallelized across 100 pods, that is about 50 seconds of wall-clock time, comfortably inside the 5-minute refresh interval. At 10x, the top 2% is 10M users. With the same 100 pods, pre-computation takes 500 seconds -- longer than the 300-second refresh interval itself.
Individual feed pre-computation does not scale past 10M users. The system must switch to cohort-based caching: cluster users into 10K cohorts based on interest profile similarity (k-means on interest vectors), pre-compute one feed per cohort, and personalize at read time by applying a lightweight cohort-to-user adjustment (< 1ms). This reduces pre-computation from 10M feeds to 10K feeds -- a 1,000x reduction -- with only a small loss in personalization accuracy (cohort members have similar but not identical interests).
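The cohort scheme above can be sketched end-to-end: assign each user to the nearest interest centroid (the k-means assignment step), pre-compute one ranked feed per cohort, and apply a cheap per-user adjustment at read time. The interest dimensions, scores, and the read-history filter are illustrative:

```python
import math

def nearest_cohort(user_vec, centroids):
    """Assign a user to the closest cohort centroid (k-means assignment step)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(user_vec, centroids[i]))

def precompute_cohort_feed(centroid, articles, limit=50):
    """One ranked feed per cohort: article base score weighted by the
    cohort's interest in the article's category."""
    scored = [(a["score"] * centroid[a["category"]], a) for a in articles]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [a["id"] for _, a in scored[:limit]]

def personalize(cohort_feed, user_read_ids):
    """Read-time adjustment (well under 1 ms): drop already-read articles."""
    return [aid for aid in cohort_feed if aid not in user_read_ids]

# Toy setup: 2 cohorts over 3 interest dimensions (tech, sports, politics).
centroids = [[0.8, 0.1, 0.1], [0.1, 0.7, 0.2]]
articles = [
    {"id": "a1", "category": 0, "score": 0.9},   # tech
    {"id": "a2", "category": 1, "score": 0.95},  # sports
    {"id": "a3", "category": 0, "score": 0.6},   # tech
]
cohort_feeds = [precompute_cohort_feed(c, articles) for c in centroids]

user_vec = [0.9, 0.05, 0.05]               # tech-heavy user
cid = nearest_cohort(user_vec, centroids)  # cohort 0
feed = personalize(cohort_feeds[cid], user_read_ids={"a1"})
print(feed)  # ['a3', 'a2'] -- tech-first feed, minus the already-read article
```

The expensive work (scoring and sorting) happens once per cohort; the per-user step is a set-membership filter, which is what keeps read-time latency flat as users grow.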
3. Authority Scoring Requires Continuous Calibration
Static authority scores become stale as sources change quality over time. A once-reputable outlet that pivots to clickbait should see its authority drop. At 1x, manual curation of 100K source scores is feasible (review the top 1,000 sources quarterly, algorithmically derive the rest). At 10x with 1M sources, manual curation breaks.
Production systems derive authority from multiple real-time signals: domain age, citation frequency (how often other sources link to a given source), historical accuracy (correlation between headlines and outcomes), and user engagement quality (dwell time distribution, not just click count). This requires a separate ML pipeline ingesting engagement data from ClickHouse and citation data from the crawlers, producing updated authority scores every 24 hours. The pipeline feeds into the ranking formula as a replacement for the static authority_score column.
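The scoring step of that pipeline reduces to a weighted blend of normalized signals. The weights below are illustrative placeholders; the point of the ML pipeline is to learn them from data rather than hand-pick them:

```python
def authority_score(signals: dict) -> float:
    """Blend normalized (0-1) quality signals into a single authority score.
    Weights are illustrative; a production pipeline would learn them."""
    weights = {
        "domain_age": 0.15,          # normalized age of the domain
        "citation_freq": 0.35,       # how often other sources link here
        "historical_accuracy": 0.30, # headline-vs-outcome track record
        "engagement_quality": 0.20,  # dwell-time distribution, not raw clicks
    }
    # Clamp each signal into [0, 1] so a single bad input cannot dominate.
    score = sum(w * max(0.0, min(1.0, signals[k])) for k, w in weights.items())
    return round(score, 3)

reputable = {"domain_age": 0.9, "citation_freq": 0.8,
             "historical_accuracy": 0.85, "engagement_quality": 0.7}
clickbait = {"domain_age": 0.9, "citation_freq": 0.2,
             "historical_accuracy": 0.3, "engagement_quality": 0.15}

print(authority_score(reputable), authority_score(clickbait))
```

Note that the clickbait source keeps its high domain age: that is exactly the "once-reputable outlet" case, and the other three signals are what pull its score down over the 24-hour refresh cycle.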
Secondary Evolution Items
Beyond these three breaking points, several components evolve incrementally:
Joins disappear from hot paths. The re-ranking batch job currently joins articles with sources for authority scores. At higher volume, the system denormalizes: it stores source_name and authority_score directly in the articles table. Write-time cost increases (every article row must be updated when a source's authority changes), but the read path drops to a single-table read with no join.
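The trade-off can be made concrete with an in-memory stand-in for the denormalized articles table (toy data, not a schema from the design):

```python
# Denormalized rows carry source_name and authority_score inline, so the
# hot read path never joins against the sources table.
articles = {
    "a1": {"title": "t1", "source_id": "s1", "source_name": "Example Wire",
           "authority_score": 0.8},
    "a2": {"title": "t2", "source_id": "s1", "source_name": "Example Wire",
           "authority_score": 0.8},
}

def update_source_authority(source_id: str, new_score: float) -> int:
    """Write-time fan-out: one source-level change rewrites every article row
    for that source. This is the cost paid to make reads join-free."""
    touched = 0
    for article in articles.values():
        if article["source_id"] == source_id:
            article["authority_score"] = new_score
            touched += 1
    return touched

print(update_source_authority("s1", 0.75))  # 2 rows rewritten for 1 source change
```

The fan-out is acceptable here because authority scores change on a 24-hour cadence while feed reads happen millions of times per minute: the write amplification is paid rarely, the join savings constantly.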
The crawler becomes smarter. Mature crawlers learn publication patterns per source: a source that publishes at 9 AM and 3 PM every weekday gets polled heavily at those times and lightly otherwise. WebSub (formerly PubSubHubbub) eliminates polling entirely for sources that support it: the source pushes a notification when new content is available.
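Learning a publication pattern can be as simple as a per-source histogram of historical publish hours, with the poll interval tightened near the hot hours. A minimal sketch, using the 5-minute/6-hour intervals from the design (the one-hour proximity window is an assumption):

```python
from collections import Counter
from datetime import datetime

def learn_publish_hours(publish_timestamps, top_k=3):
    """Histogram the hours-of-day a source has historically published at."""
    hours = Counter(ts.hour for ts in publish_timestamps)
    return {h for h, _ in hours.most_common(top_k)}

def next_poll_interval(now_hour, hot_hours,
                       hot_interval_s=300, cold_interval_s=21_600):
    """Poll every 5 minutes within an hour of a learned publish hour,
    every 6 hours otherwise (circular distance handles midnight wraparound)."""
    near_hot = any(min(abs(now_hour - h), 24 - abs(now_hour - h)) <= 1
                   for h in hot_hours)
    return hot_interval_s if near_hot else cold_interval_s

# A source that has published at 9 AM and 3 PM for ten straight days.
history = [datetime(2024, 3, d, h) for d in range(1, 11) for h in (9, 15)]
hot = learn_publish_hours(history)   # {9, 15}
print(next_poll_interval(9, hot))    # 300   -- near a publish hour
print(next_poll_interval(2, hot))    # 21600 -- quiet period
```

The returned interval is what gets added to the current time before the source is re-inserted into the Valkey sorted-set schedule.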
Interest profiles become collaborative. Content-based personalization (a user who reads tech sees more tech) is augmented by collaborative filtering: users who read similar articles also read these other articles. This requires a user-article interaction matrix with approximate nearest neighbor search over user vectors. The compute cost scales with the number of users, not articles, making it expensive at 500M DAU but valuable for discovery beyond a user's existing interests.
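The collaborative step can be sketched as nearest-neighbor search over binary user-article interaction vectors. This toy version does an exhaustive cosine scan; the production system described above would replace that scan with approximate nearest neighbor search (e.g. an HNSW index):

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def recommend(user_id, interactions, k=2, limit=5):
    """Score articles the user has not read by the similarity of the
    neighbors who did read them ('users like you also read...')."""
    catalog = sorted({a for reads in interactions.values() for a in reads})
    vec = lambda u: [1.0 if a in interactions[u] else 0.0 for a in catalog]
    me = vec(user_id)
    neighbors = sorted(((cosine(me, vec(u)), u)
                        for u in interactions if u != user_id), reverse=True)[:k]
    scores = Counter()
    for sim, neighbor in neighbors:
        if sim <= 0:
            continue  # dissimilar users contribute nothing
        for a in interactions[neighbor] - interactions[user_id]:
            scores[a] += sim
    return [a for a, _ in scores.most_common(limit)]

interactions = {
    "u1": {"a1", "a2", "a3"},
    "u2": {"a1", "a2", "a4"},   # overlaps heavily with u1, also read a4
    "u3": {"a7", "a8"},         # no overlap with u1
}
print(recommend("u1", interactions))  # ['a4'] -- surfaced via the similar user
```

Note what the toy already demonstrates: a4 is recommended even though nothing in u1's own reading history points at it, which is precisely the discovery benefit beyond content-based filtering.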
Explore the Technologies
Core Technologies
| Technology | Role in This Design | Deep Dive |
|---|---|---|
| Kafka | Article ingestion pipeline, event streaming | Kafka |
| Flink | Trending detection, breaking news, aggregation | Apache Flink |
| Elasticsearch | Article full-text search | Elasticsearch |
| Valkey/Redis | Feed cache, crawler scheduling, LSH buckets, user interests | Redis |
| PostgreSQL | Article metadata, source management, user data | PostgreSQL |
| ClickHouse | Read event analytics, interest profile aggregation | ClickHouse |
Infrastructure Patterns
| Pattern | Relevance to This Design | Deep Dive |
|---|---|---|
| Message Queues and Event Streaming | Kafka for decoupled article processing pipeline | Event Streaming |
| Caching Strategies | Multi-layer: Valkey for feeds, in-process for interest profiles | Caching |
| Rate Limiting and Throttling | Per-domain crawler politeness, per-user API rate limits | Rate Limiting |
| Distributed Tracing | OpenTelemetry spans across crawl-to-feed pipeline | Distributed Tracing |
| Metrics and Monitoring | Prometheus for ingestion, feed serving, crawler health | Metrics |
| Load Balancer | Feed service traffic distribution across pods | Load Balancer |
| Hot-Warm-Cold Tiering | Article lifecycle from Valkey to PostgreSQL to S3 | Tiering |
| Database Sharding | PostgreSQL range partitioning by published_at | Sharding |
| Replication and Consistency | PostgreSQL primary-replica, Kafka replication factor 3 | Replication |
Practice this design: Try the News Aggregator interview question to test your understanding with hints and structured guidance.