AI Back-of-Envelope Formulas
QPS from Daily Volume
Convert daily request count to requests per second. QPS = daily_requests / 86,400 (seconds in a day). Since developers mostly code during working hours, peak QPS is roughly 3x the average. Example: 100M requests/day = 1,157 avg QPS, roughly 3,500 peak QPS.
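A minimal Python sketch of the same arithmetic; the 3x multiplier is just the working-hours heuristic above, not a measured value.

SECONDS_PER_DAY = 86_400

def peak_qps(daily_requests: float, peak_multiplier: float = 3.0) -> float:
    # Average QPS from daily volume, scaled by a working-hours peak multiplier.
    return daily_requests / SECONDS_PER_DAY * peak_multiplier

print(round(100_000_000 / SECONDS_PER_DAY))  # ~1,157 avg QPS
print(round(peak_qps(100_000_000)))          # ~3,472 -> call it 3,500 peak QPS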
GPU Count
How many GPUs are needed to handle the load. Formula: peak_QPS / QPS_per_GPU x 2 (failover buffer). The key input is QPS per GPU, which depends on model size: 7B INT4 handles roughly 200 QPS, 34B INT8 roughly 50 QPS, 70B FP16 roughly 15 QPS. Example: 3,500 peak QPS on 7B = 3,500/200 x 2 = 35 GPUs.
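A sketch of the sizing step, assuming the rough per-GPU throughput figures above (rules of thumb, not benchmarks).

import math

QPS_PER_GPU = {"7B-INT4": 200, "34B-INT8": 50, "70B-FP16": 15}  # rough throughput per GPU

def gpu_count(peak_qps: float, model: str, buffer: float = 2.0) -> int:
    # GPUs needed at peak load, doubled for failover headroom.
    return math.ceil(peak_qps / QPS_PER_GPU[model] * buffer)

print(gpu_count(3_500, "7B-INT4"))  # 35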
Model Memory
How much GPU memory a model needs. Start with: weights = parameters x bytes_per_parameter. A 70B model in FP16 (2 bytes each) = 140 GB. In INT4 (0.5 bytes each) = 35 GB. Then add 30-50% on top for KV-cache (the model's working memory during inference) and batch processing overhead.
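The same arithmetic as a sketch; the 40% overhead default sits in the 30-50% range quoted above.

def model_memory_gb(params_billion: float, bytes_per_param: float,
                    overhead: float = 0.4) -> float:
    # Weight memory plus KV-cache / batching overhead, in GB.
    weights_gb = params_billion * bytes_per_param  # 1B params at 1 byte each ~= 1 GB
    return weights_gb * (1 + overhead)

print(model_memory_gb(70, 2.0))  # FP16: 140 GB weights -> ~196 GB with overhead
print(model_memory_gb(70, 0.5))  # INT4: 35 GB weights  -> ~49 GB with overhead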
Vector DB Storage
How much disk space is needed for code or document embeddings. Formula: chunks x dimensions x bytes_per_value, plus 30% for the HNSW search index. Example: 150K code chunks x 1024 dimensions x 4 bytes (float32) = 600 MB raw. With int8 quantization (1 byte each): 150 MB. Much more manageable.
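A sketch of the storage math; the 30% HNSW overhead is the rule of thumb above.

def index_size_mb(chunks: int, dims: int, bytes_per_value: float,
                  hnsw_overhead: float = 0.3) -> float:
    # Raw embedding bytes plus ~30% HNSW index overhead, in MB.
    raw_mb = chunks * dims * bytes_per_value / 1_000_000
    return raw_mb * (1 + hnsw_overhead)

print(index_size_mb(150_000, 1024, 4))  # float32: ~614 MB raw -> ~800 MB with index
print(index_size_mb(150_000, 1024, 1))  # int8:    ~154 MB raw -> ~200 MB with index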
Cost Per Request (API)
What each LLM call costs when using a cloud API. Formula: (input_tokens x input_rate) + (output_tokens x output_rate). The rates vary by model tier. Example for inline completion on 7B at typical small-model rates (roughly $0.0005/1K input, $0.0015/1K output): (1,300 input x $0.0005/1K) + (100 output x $0.0015/1K) = roughly $0.001. For an agent task on 70B: roughly $0.05.
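A sketch of the per-call arithmetic; the rates passed in are illustrative placeholders, not any provider's actual price list.

def request_cost(input_tokens: int, output_tokens: int,
                 input_rate_per_1k: float, output_rate_per_1k: float) -> float:
    # API cost of a single call, given per-1K-token rates in dollars.
    return (input_tokens / 1000 * input_rate_per_1k
            + output_tokens / 1000 * output_rate_per_1k)

print(request_cost(1_300, 100, 0.0005, 0.0015))  # ~$0.0008, call it roughly $0.001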
Monthly Infra Cost
Project monthly spending for either API or self-hosted. API: daily_cost x 30 days. Self-hosted: GPU_count x hourly_rate x 730 hours/month. Example: 58 GPUs x $2/hr x 730 = $85K/month self-hosted, versus $4.5M/month on API for the same traffic. The difference is dramatic at scale.
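A sketch of both projections; the $150K/day API figure is an illustrative daily spend chosen to land on the $4.5M/month in the example, not a quoted price.

HOURS_PER_MONTH = 730

def monthly_self_hosted(gpus: int, hourly_rate: float) -> float:
    return gpus * hourly_rate * HOURS_PER_MONTH

def monthly_api(daily_api_cost: float) -> float:
    return daily_api_cost * 30

print(monthly_self_hosted(58, 2.0))  # ~$84,680/month, call it $85K
print(monthly_api(150_000))          # $4,500,000/month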
Token Count Estimate
Quick way to estimate token counts from code or text. Roughly 4 characters per token, 0.75 words per token. Practical examples: a 2,000-line source file is roughly 15K-20K tokens. A 500-line function is roughly 3K-4K tokens. A short chat message is 20-50 tokens. These estimates help size context windows and estimate costs.
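A sketch of the rule of thumb; real tokenizers vary with language and code style, so treat the output as an estimate only.

def estimate_tokens(text: str) -> int:
    # Rule of thumb: ~4 characters per token.
    return round(len(text) / 4)

def estimate_tokens_from_words(word_count: int) -> int:
    # Rule of thumb: ~0.75 words per token, i.e. ~1.33 tokens per word.
    return round(word_count / 0.75)

print(estimate_tokens("def add(a, b):\n    return a + b\n"))  # ~8 tokens
print(estimate_tokens_from_words(1_000))                      # ~1,333 tokens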
Cache Impact on Cost
How much caching reduces the inference bill. Response cache with a 15-25% hit rate: multiply total cost by 0.75-0.85 (those cached requests are free). KV-cache prefix reuse with 30-50% hit rate: additional 15-25% savings specifically on autocomplete requests. Combined effect: 30-50% total cost reduction. Always model costs with caching.
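A sketch of the combined effect. The 60% autocomplete share of spend is an assumed split (not from the numbers above), used only to show how the KV-prefix savings apply to part of the traffic.

def cost_with_caching(monthly_cost: float, response_hit_rate: float,
                      autocomplete_share: float, kv_prefix_savings: float) -> float:
    # Response-cache hits are free; KV-prefix reuse only trims the autocomplete slice.
    after_response_cache = monthly_cost * (1 - response_hit_rate)
    kv_saving = after_response_cache * autocomplete_share * kv_prefix_savings
    return after_response_cache - kv_saving

# $200K/month, 20% response-cache hits, 60% of spend on autocomplete, 20% KV savings
print(cost_with_caching(200_000, 0.20, 0.60, 0.20))  # ~$140,800 (~30% total reduction)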
Embedding Index Cost
What it costs to build the vector search index. Formula: total_chunks x cost_per_embedding. At API pricing: 150K chunks x $0.0001/chunk = roughly $15 for a full index build. Incremental re-indexing on each file save touches only 1-5 chunks, so the ongoing cost is negligible. The upfront build is a one-time cost per repo.
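The same arithmetic as a sketch, using the API pricing from the example above.

def index_build_cost(total_chunks: int, cost_per_embedding: float) -> float:
    return total_chunks * cost_per_embedding

print(index_build_cost(150_000, 0.0001))  # $15 full index build
print(index_build_cost(5, 0.0001))        # $0.0005 per incremental re-index on save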
Re-Ranking Latency Budget
Re-ranking (the second pass that improves retrieval quality) adds 80-150ms of latency. For inline autocomplete with a 300ms total budget, this is too slow. Re-ranking is typically skipped for L1 completions and used only for agent tasks where the time budget is seconds, not milliseconds. This is a common trade-off: better retrieval quality vs response speed.
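A sketch of the decision: compare the re-ranking cost against the slack left by the rest of the pipeline. The stage timings passed in are illustrative, not measured.

def should_rerank(total_budget_ms: float, other_stages_ms: float,
                  rerank_cost_ms: float = 120) -> bool:
    # Only re-rank if it fits in the slack left by the other pipeline stages.
    return total_budget_ms - other_stages_ms >= rerank_cost_ms

print(should_rerank(300, 300))    # False: the autocomplete budget is already spent
print(should_rerank(5_000, 800))  # True: an agent step measured in seconds can afford it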
Agent Task Token Budget
How many tokens a typical agent task consumes. A simple L2 task (refactor one file): 10K-30K tokens. A complex L2 task (refactor 12 files): 50K-200K tokens. An L3 autonomous build: 500K-2M tokens. Each tool call adds roughly 5K-10K tokens (the tool result enters the context). These estimates help set session budgets.
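A sketch of a session-budget estimator. It treats tool-call tokens as additive to the base task range, which is an assumption about how the estimates above compose.

TASK_TOKEN_RANGE = {           # illustrative low/high estimates from above
    "L2-simple":     (10_000, 30_000),
    "L2-complex":    (50_000, 200_000),
    "L3-autonomous": (500_000, 2_000_000),
}

def session_budget(task: str, tool_calls: int, tokens_per_call: int = 7_500) -> tuple:
    # Base task range plus ~5K-10K tokens per tool call entering the context.
    low, high = TASK_TOKEN_RANGE[task]
    extra = tool_calls * tokens_per_call
    return low + extra, high + extra

print(session_budget("L2-simple", tool_calls=4))  # (40000, 60000)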
Latency Budget (Autocomplete)
How the total 300ms autocomplete latency breaks down across system components. Debounce wait (150ms) + context assembly (30ms) + network to server (10ms) + LLM inference TTFT (100ms) + post-processing validation (5ms) + render ghost text (5ms) = roughly 300ms total. Each component has a hard budget. If inference takes 200ms instead of 100ms, something else must shrink.
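A sketch that makes the "something else must shrink" check explicit; the stage values are the breakdown above.

AUTOCOMPLETE_STAGES_MS = {
    "debounce": 150, "context_assembly": 30, "network": 10,
    "inference_ttft": 100, "post_processing": 5, "render": 5,
}

def latency_slack_ms(stages: dict, total_budget_ms: float = 300) -> float:
    # Positive slack: the pipeline fits. Negative: some stage has to shrink.
    return total_budget_ms - sum(stages.values())

print(latency_slack_ms(AUTOCOMPLETE_STAGES_MS))                             # 0
print(latency_slack_ms({**AUTOCOMPLETE_STAGES_MS, "inference_ttft": 200}))  # -100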
Cache ROI Formula
Whether a caching layer is worth building. Monthly savings = (monthly_inference_cost x cache_hit_rate). Monthly cost of cache infra = Redis/Memcached hosting. ROI = savings / cost. Example: $200K/month inference x 20% hit rate = $40K saved. Redis cluster costs $2K/month. ROI = 20x. Caching almost always pays for itself in AI systems because inference is so expensive.
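The ROI arithmetic as a sketch, using the worked example above.

def cache_roi(monthly_inference_cost: float, hit_rate: float,
              cache_infra_cost: float) -> float:
    # Monthly savings from cache hits divided by the cache's hosting cost.
    savings = monthly_inference_cost * hit_rate
    return savings / cache_infra_cost

print(cache_roi(200_000, 0.20, 2_000))  # 20.0 -> a 20x return on the Redis cluster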