AI/ML System Design Interview
How AI System Design Interviews Differ
Regular system design interviews are about building systems with predictable behavior. You design a URL shortener, a chat app, a rate limiter. Inputs map to outputs in a way you can reason about. If you implement it correctly, it works correctly. Every time.
AI system design interviews throw that assumption out the window. The system you're designing will sometimes be wrong. Confidently wrong. Wrong in ways you can't easily predict or reproduce. Your job is to design around that fundamental uncertainty, and that single fact changes how you approach everything.
The interviewer doesn't care whether you can explain transformers or walk through backpropagation. They want to know if you can architect a production system where a probabilistic component sits at the center, and everything else handles the mess that creates.
The AI System Design Framework
Before you draw a single box on the whiteboard, ask yourself four questions:
What metric defines success? "Better recommendations" is not a metric. Click-through rate, conversion rate, day-7 user retention, revenue per session. Pick one primary metric and be explicit about the secondary ones you're sacrificing.
What error rate is tolerable? A content moderation system with 1% false negatives means 1 in 100 harmful posts gets through. At 10,000 posts per minute, that's 100 harmful posts per minute reaching users. Is that acceptable? This question forces you to think about business context, not just model performance numbers on a dashboard.
What latency budget do you have? A recommendation system that takes 3 seconds to respond kills the user experience. A fraud detection system that takes 3 seconds might be perfectly fine if it runs asynchronously. Your latency constraints shape your entire architecture: whether you can call an LLM at all, whether you need model distillation, whether you need a tiered approach.
What's the cost envelope? If you're sending 100 million requests per day through GPT-4, your inference bill alone could run well into the millions of dollars per month. Most interviewers want to see that you understand this reality and can design systems that use expensive models surgically, not universally.
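To make these questions concrete, here's a back-of-envelope sketch in Python. The per-token price, request volume, and token counts are illustrative assumptions, not quoted rates; the point is that you should be able to do this arithmetic on the whiteboard.

```python
# Back-of-envelope sizing for the framework questions.
# All prices and volumes are illustrative assumptions, not quoted rates.
REQUESTS_PER_DAY = 100_000_000      # assumed traffic
TOKENS_PER_REQUEST = 1_000          # assumed prompt + completion size
PRICE_PER_1K_TOKENS = 0.03          # assumed blended price for a frontier model

daily_cost = REQUESTS_PER_DAY * (TOKENS_PER_REQUEST / 1_000) * PRICE_PER_1K_TOKENS
print(f"Inference cost: ${daily_cost:,.0f}/day, ${daily_cost * 30:,.0f}/month")

# Error-rate framing: 1% false negatives at 10,000 posts per minute.
POSTS_PER_MINUTE = 10_000
FALSE_NEGATIVE_RATE = 0.01
print(f"Harmful posts reaching users: {POSTS_PER_MINUTE * FALSE_NEGATIVE_RATE:.0f}/minute")
```

Even if the assumed price is off by an order of magnitude, the conclusion survives: expensive models have to be used surgically.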
Common AI System Design Questions
Most AI system design questions fall into three buckets. Recognizing which bucket you're in helps you structure your answer faster.
Content understanding systems (moderation, classification, extraction) follow a multi-tier pattern. Fast, cheap models handle the obvious cases. Medium-cost models handle the middle ground. Expensive LLMs handle the ambiguous tail. Human reviewers handle what the LLMs can't confidently classify. The real design challenge is setting the right confidence thresholds at each tier and building feedback loops so the cheaper tiers get better over time.
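As a sketch of that routing logic, assuming placeholder stand-ins for the rule engine, the small classifier, and the LLM call (none of these names refer to a real API), the tiering might look like the following. The thresholds are illustrative and would come out of your eval pipeline, not be hard-coded.

```python
from dataclasses import dataclass
import random

@dataclass
class Decision:
    label: str          # e.g. "allow" or "block"
    confidence: float   # 0.0 - 1.0
    tier: str           # which tier produced the decision

# Stand-ins for real components: a rule engine, a hosted classifier, an LLM endpoint.
def heuristic_rules(post: str) -> str | None:
    banned = {"spamword"}
    return "block" if any(w in post.lower() for w in banned) else None

def small_classifier(post: str) -> tuple[str, float]:
    return ("allow", random.uniform(0.5, 1.0))   # placeholder score

def llm_judge(post: str) -> tuple[str, float]:
    return ("allow", random.uniform(0.5, 1.0))   # placeholder score

# Illustrative thresholds; in practice they are tuned from the eval pipeline.
CLASSIFIER_THRESHOLD = 0.90
LLM_THRESHOLD = 0.75

def moderate(post: str) -> Decision:
    # Tier 1: cheap heuristics catch the obvious cases.
    if (label := heuristic_rules(post)) is not None:
        return Decision(label, 0.99, tier="heuristic")
    # Tier 2: a small ML classifier handles most remaining traffic.
    label, conf = small_classifier(post)
    if conf >= CLASSIFIER_THRESHOLD:
        return Decision(label, conf, tier="classifier")
    # Tier 3: an LLM sees only the ambiguous tail.
    label, conf = llm_judge(post)
    if conf >= LLM_THRESHOLD:
        return Decision(label, conf, tier="llm")
    # Tier 4: anything still uncertain goes to a human reviewer.
    return Decision("needs_human_review", conf, tier="human")

print(moderate("hello world"))
```

The interesting design work is in where the thresholds sit and in the feedback loop that moves them, not in the routing code itself.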
Recommendation and ranking systems require thinking about the full pipeline: data collection, feature engineering, candidate generation (narrowing millions of items to hundreds), ranking (ordering the hundreds), and serving. The interesting design decisions live in the feature store architecture, the split between online and offline features, and the A/B testing infrastructure that lets you measure whether your changes actually move the metric you care about.
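Here's a compressed sketch of the two-stage serving path, with brute-force similarity standing in for an ANN index and a linear blend standing in for a learned ranker. The item counts, features, and weights are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N_ITEMS, DIM = 100_000, 32

# Built offline: item embeddings (in production, an ANN index such as FAISS or ScaNN).
ITEM_EMBEDDINGS = rng.normal(size=(N_ITEMS, DIM)).astype(np.float32)
# Offline feature computed in batch (e.g. 30-day popularity), refreshed on a schedule.
OFFLINE_POPULARITY = rng.random(N_ITEMS).astype(np.float32)

def generate_candidates(user_embedding: np.ndarray, k: int = 500) -> np.ndarray:
    # Candidate generation: narrow millions of items to a few hundred, cheaply.
    scores = ITEM_EMBEDDINGS @ user_embedding          # brute force stands in for ANN search
    return np.argpartition(-scores, k)[:k]

def rank(candidates: np.ndarray, user_embedding: np.ndarray) -> np.ndarray:
    # Ranking: a heavier model re-scores only the candidates. Here, a linear blend
    # of an online signal (fresh affinity) and an offline feature (popularity).
    online_affinity = ITEM_EMBEDDINGS[candidates] @ user_embedding
    scores = 0.7 * online_affinity + 0.3 * OFFLINE_POPULARITY[candidates]
    return candidates[np.argsort(-scores)][:20]

user_embedding = rng.normal(size=DIM).astype(np.float32)
print(rank(generate_candidates(user_embedding), user_embedding))
```

The split matters because the candidate generator must be cheap enough to scan the whole catalog, while the ranker can afford rich features because it only sees a few hundred items per request.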
Generative systems (chatbots, content creation, code generation) center on retrieval-augmented generation, output quality control, and cost management. Here's the thing most people miss: the retrieval layer matters more than the model in most cases. A mediocre model with excellent retrieval beats a frontier model with no context. Put the same care into your knowledge base, chunking strategy, and retrieval pipeline that you'd put into model selection.
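A toy end-to-end sketch of that retrieval path is below, with bag-of-words cosine similarity standing in for embedding search and three hard-coded strings standing in for a knowledge base; the chunk size and top-k are arbitrary.

```python
from collections import Counter
import math

DOCUMENTS = [
    "Refunds are processed within 5 business days of approval.",
    "Premium subscribers can export reports as CSV or PDF.",
    "Two-factor authentication can be enabled under account settings.",
]

def chunk(text: str, max_words: int = 50) -> list[str]:
    # Naive fixed-size chunking; real systems tune chunk size and overlap carefully.
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def bow(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

CHUNKS = [c for doc in DOCUMENTS for c in chunk(doc)]

def retrieve(query: str, k: int = 2) -> list[str]:
    q = bow(query)
    return sorted(CHUNKS, key=lambda c: cosine(q, bow(c)), reverse=True)[:k]

def build_prompt(query: str) -> str:
    # The retrieved context, not the model, does most of the heavy lifting.
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How long do refunds take?"))
```

Notice that the quality lever here is everything that runs before the model does: what gets chunked, how it's scored, and how much of it makes it into the prompt.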
The Data Flywheel
This is the concept that separates a demo from a product. Every interaction with your AI system generates data: user clicks, corrections, dismissals, escalations to humans. If you design the system to capture this signal and feed it back into training, evaluation, and threshold tuning, your system improves over time without anyone manually intervening.
The flywheel has four stages. Users interact with the system. The system logs predictions alongside outcomes. An evaluation pipeline compares predictions to actual results. Insights from evaluation feed into model retraining, threshold adjustment, and retrieval improvements.
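A minimal sketch of stages two and three, assuming an in-process dictionary stands in for a durable prediction log and that outcomes arrive later keyed by request id:

```python
import time
from collections import defaultdict

# In-memory stand-in for a prediction log; production systems write to a durable
# store and join outcomes to predictions by request id later.
PREDICTION_LOG: dict[str, dict] = {}

def log_prediction(request_id: str, prediction: str, confidence: float) -> None:
    # Flywheel stage 2: log every prediction at serving time.
    PREDICTION_LOG[request_id] = {"prediction": prediction, "confidence": confidence,
                                  "ts": time.time(), "outcome": None}

def log_outcome(request_id: str, outcome: str) -> None:
    # Outcomes (clicks, corrections, human escalations) arrive later.
    if request_id in PREDICTION_LOG:
        PREDICTION_LOG[request_id]["outcome"] = outcome

def evaluate() -> dict[float, float]:
    # Flywheel stage 3: compare predictions to outcomes, bucketed by confidence,
    # so stage 4 (retraining, threshold tuning) has a signal to act on.
    buckets: dict[float, list[int]] = defaultdict(lambda: [0, 0])
    for r in PREDICTION_LOG.values():
        if r["outcome"] is None:
            continue
        b = round(r["confidence"], 1)
        buckets[b][1] += 1
        buckets[b][0] += int(r["prediction"] == r["outcome"])
    return {b: correct / total for b, (correct, total) in buckets.items()}

log_prediction("r1", "spam", 0.92)
log_outcome("r1", "spam")
print(evaluate())   # {0.9: 1.0}
```

The structural decision that makes the flywheel possible is logging predictions and outcomes in a way that lets you join them later; everything downstream depends on that join.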
When an interviewer asks about continuous improvement, they're really asking about the flywheel. When they ask about monitoring, they're partly asking whether you've designed the system to detect when the flywheel breaks down, when the data distribution shifts and your model starts operating on inputs it wasn't trained for.
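One common way to make "detect when the distribution shifts" concrete is a drift statistic such as the population stability index, computed over input features or model scores. The distributions and alerting threshold below are illustrative, not a prescription.

```python
import numpy as np

def population_stability_index(reference: np.ndarray, current: np.ndarray,
                               bins: int = 10) -> float:
    # Compares the production distribution of a feature or score against the
    # distribution the model was trained or calibrated on.
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Avoid division by zero / log(0) in empty buckets.
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(1)
training_scores = rng.normal(0.0, 1.0, 10_000)
production_scores = rng.normal(0.4, 1.2, 10_000)   # simulated shifted distribution

psi = population_stability_index(training_scores, production_scores)
# A common rule of thumb treats PSI above roughly 0.25 as a significant shift.
print(f"PSI = {psi:.3f}")
```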
Answering the Tradeoff Questions
Every AI system design interview reaches a point where the interviewer pushes on tradeoffs. "What if the latency requirement drops to 50ms?" "What if you need to cut costs by 80%?" "What if accuracy needs to go from 95% to 99.5%?"
These moments determine your score. Don't panic and start redesigning from scratch. Instead, work the cost-quality-latency triangle explicitly.
- To reduce latency: use smaller models, pre-compute where possible, add caching layers, move to model distillation, accept lower accuracy on the tail.
- To reduce cost: batch requests, use tiered model selection, cache frequent queries, reduce the percentage of requests hitting expensive models, accept slightly lower quality.
- To increase quality: add human review for low-confidence predictions, use ensemble approaches, invest in better training data, add more retrieval context, accept higher latency and cost.
The best answers put numbers on these tradeoffs. "Switching from GPT-4 to a fine-tuned GPT-3.5 would reduce our per-request cost from $0.03 to $0.002, cut latency from 800ms to 200ms, and based on our eval set, drop accuracy from 94% to 89%. For the 70% of requests that are straightforward classifications, that's an acceptable trade." That level of specificity tells the interviewer you've built real systems, not just read about them.
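If you want to rehearse this, a small script like the one below lets you put a dollar figure on the tiered approach. The per-request figures echo the kind of numbers quoted above, but the volume and traffic split are assumptions for illustration, not published pricing or benchmarks.

```python
# Putting numbers on the cost-quality-latency triangle.
# All figures are illustrative assumptions, not published pricing or benchmarks.
CANDIDATES = {
    "frontier_model":   {"cost_per_request": 0.030, "p50_latency_ms": 800, "eval_accuracy": 0.94},
    "fine_tuned_small": {"cost_per_request": 0.002, "p50_latency_ms": 200, "eval_accuracy": 0.89},
}

REQUESTS_PER_DAY = 2_000_000          # assumed volume
EASY_TRAFFIC_SHARE = 0.70             # share of requests the small model can safely take

def monthly_cost(mix: dict[str, float]) -> float:
    # mix maps model name -> share of traffic routed to it.
    return sum(CANDIDATES[m]["cost_per_request"] * share * REQUESTS_PER_DAY * 30
               for m, share in mix.items())

all_frontier = monthly_cost({"frontier_model": 1.0})
tiered = monthly_cost({"fine_tuned_small": EASY_TRAFFIC_SHARE,
                       "frontier_model": 1.0 - EASY_TRAFFIC_SHARE})

print(f"All-frontier: ${all_frontier:,.0f}/month")
print(f"Tiered mix:   ${tiered:,.0f}/month "
      f"({100 * (1 - tiered / all_frontier):.0f}% cheaper)")
```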
Sample Questions
Design a real-time content moderation system using LLMs that processes 10,000 posts per minute.
They want to see you balance accuracy, latency, and cost. Discuss a multi-tier approach: fast heuristics first, ML classifier second, LLM for ambiguous cases. Address false positive/negative tradeoffs and human-in-the-loop for edge cases.
Design the ML infrastructure for a recommendation system that serves 100 million users.
Cover the full stack: data collection, feature engineering, model training, candidate generation, ranking, and serving. Discuss offline vs. online features, cold start problems, and A/B testing infrastructure.
How would you architect a customer support system that uses LLMs to handle 60% of tickets automatically?
Address retrieval-augmented generation (RAG), knowledge base management, confidence thresholds for automation vs. human handoff, feedback loops for continuous improvement, and cost modeling.
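One piece of that design, picking the confidence threshold that hits a 60% automation target, can be sketched against a labeled eval set. The scores below are synthetic stand-ins; in practice you'd sweep the threshold over real evaluation data and pick the highest one that still clears the automation target while keeping accuracy on automated tickets above the agreed bar.

```python
import numpy as np

# Synthetic eval set: a confidence score per ticket, and whether the model's
# automated reply would have been correct (correlated with confidence here).
rng = np.random.default_rng(2)
confidence = rng.beta(5, 2, 5_000)
correct = rng.random(5_000) < confidence

def automation_stats(threshold: float) -> tuple[float, float]:
    automated = confidence >= threshold
    automation_rate = automated.mean()
    accuracy_when_automated = correct[automated].mean() if automated.any() else float("nan")
    return automation_rate, accuracy_when_automated

for t in (0.5, 0.7, 0.8, 0.9):
    rate, acc = automation_stats(t)
    print(f"threshold={t:.1f}  automated={rate:.0%}  accuracy on automated={acc:.0%}")
```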
Evaluation Criteria
- Demonstrates understanding of ML-specific architecture patterns (feature stores, model serving, A/B testing)
- Addresses data quality, model monitoring, and drift detection as first-class concerns
- Makes explicit cost-quality-latency tradeoffs with reasoning
- Considers the human-in-the-loop component for AI systems that can fail
- Discusses organizational implications: who owns the model, the data, and the evaluation pipeline
Key Points
- AI system design tests your ability to build end-to-end ML systems, not just pick the right model. The model is maybe 10% of the work.
- Always start with the problem definition: what metric are you optimizing, what error rate is acceptable, what latency is required, and what's the cost budget?
- The cost-quality-latency triangle is the defining tradeoff in AI systems. You can't maximize all three, and interviewers want to see you reason through the tension.
- Human-in-the-loop is not optional. No AI system is 100% accurate, and the design must account for graceful fallback to human judgment.
- The data flywheel is what separates good AI products from great ones. Systems that learn from their own outputs compound their advantage over time.
Common Mistakes
- Designing only the model and ignoring the surrounding system. In production ML, the model is roughly 10% of the code and complexity.
- Not discussing cost at all. LLM inference is expensive, and interviewers want to see you reason about unit economics.
- Treating AI outputs as deterministic. They are probabilistic, and your system must handle confidence scores, thresholds, and fallbacks.
- Ignoring the cold start problem. New users, new content, and new categories all break AI systems that depend on historical data.