AI Cost Engineering
Cost Per Request
The price of a single LLM call, calculated as: (input_tokens x input price per token) + (output_tokens x output price per token). Providers usually quote rates per million tokens. Like a phone call billed per minute for talking and listening separately. Inline completion: roughly $0.001. Agent task: roughly $0.05. The 50x difference is why cost-aware routing matters.
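A minimal sketch of that formula in Python; the per-token rates below are illustrative assumptions, not any provider's actual pricing:

    # Illustrative rates, not real pricing (assumed for this sketch).
    INPUT_PRICE = 0.25 / 1_000_000    # assumed: $0.25 per million input tokens
    OUTPUT_PRICE = 1.00 / 1_000_000   # assumed: $1.00 per million output tokens

    def cost_per_request(input_tokens: int, output_tokens: int) -> float:
        return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

    # Inline completion: big prompt (surrounding file context), tiny response.
    print(f"${cost_per_request(3000, 50):.4f}")  # $0.0008, in the ~$0.001 range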
Input vs Output Tokens
LLM providers charge separately for reading (input) and writing (output). Input tokens are the prompt sent to the model; output tokens are what the model generates. Output typically costs 2-5x more than input because each output token requires its own generation pass on the GPU, while the input is processed in parallel. Optimize by keeping prompts concise and caching responses.
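A quick worked illustration of the asymmetry, using the same assumed placeholder rates as above (output priced at 4x input):

    # Reading 4,000 tokens costs the same as writing 1,000 at a 4x spread.
    input_cost = 4000 * 0.25 / 1_000_000    # $0.0010
    output_cost = 1000 * 1.00 / 1_000_000   # $0.0010
    # Halving the prompt saves as much as halving the response length,
    # even though the prompt is four times larger.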
Token Estimation
Tokens are the units LLMs process, roughly word-sized pieces. A quick rule of thumb: 1 token is about 4 characters of English text, or 0.75 words. A 100-line function is roughly 500-800 tokens. A 10,000-line codebase is roughly 50K-80K tokens. Code tends to use slightly more tokens than prose because of syntax characters.
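A minimal sketch of those rules of thumb; the model's own tokenizer is the ground truth, and these heuristics only give ballpark counts:

    def estimate_tokens_from_chars(text: str) -> int:
        return max(1, len(text) // 4)     # ~4 characters per token

    def estimate_tokens_from_words(word_count: int) -> int:
        return round(word_count / 0.75)   # ~0.75 words per token

    # Code runs a bit denser than prose because of syntax characters.
    snippet = "def add(a, b):\n    return a + b"
    print(estimate_tokens_from_chars(snippet))   # ~7 tokens
    print(estimate_tokens_from_words(100))       # ~133 tokens for 100 words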
Cost-Aware Routing
Sending different types of requests to different-priced models based on complexity. Like choosing between regular mail and express delivery depending on urgency. Route 90% of simple requests (close bracket, finish variable) to a cheap 7B model at $0.001 each. Reserve the expensive 70B model ($0.05) for the 10% that need multi-step reasoning. Against an unrouted baseline, this typically saves 60-70% overall.
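A toy sketch of complexity-based routing; the tier prices come from the figures above, while the classifier is a stand-in for whatever heuristic or model-based scorer a real system would use:

    CHEAP, EXPENSIVE = 0.001, 0.05   # 7B vs 70B cost per request, from the text

    def route(request: str) -> tuple[str, float]:
        # Placeholder heuristic: short, local edits go to the cheap tier.
        needs_reasoning = len(request) > 200 or any(
            word in request.lower() for word in ("refactor", "design", "plan"))
        return ("70b", EXPENSIVE) if needs_reasoning else ("7b", CHEAP)

    print(route("close the bracket"))                   # ('7b', 0.001)
    print(route("plan the auth migration end to end"))  # ('70b', 0.05)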
Blended Cost Per Request
The average cost across all request types, accounting for routing. If 90% of requests cost $0.001 (7B) and 10% cost $0.05 (70B), the blended cost is (0.9 x $0.001) + (0.1 x $0.05) = $0.0059, about $0.006. This is the number that matters for unit economics and pricing decisions, not the cost of any single model tier.
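The same arithmetic as a sketch, so the traffic mix can be varied:

    # Blended cost = sum over tiers of (traffic share x cost per request).
    mix = {"7b": (0.90, 0.001), "70b": (0.10, 0.05)}   # share, $/request
    blended = sum(share * cost for share, cost in mix.values())
    print(f"${blended:.4f} per request")               # $0.0059, about $0.006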
Response Cache
Storing the model's response keyed by a hash of the prompt, so an identical prompt gets an instant cached answer instead of a fresh GPU computation; catching near-identical prompts requires fuzzier keys, such as normalized prompts or embedding-based lookups. Like a lookup table that remembers recent answers. 15-25% hit rate for code completions. Each hit saves the entire inference call. The single biggest lever on cost.
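A minimal exact-match sketch; hashing only catches byte-identical prompts, and model_call stands in for the real inference client:

    import hashlib

    cache: dict[str, str] = {}

    def cached_complete(prompt: str, model_call) -> str:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in cache:
            return cache[key]            # hit: the entire inference call is skipped
        response = model_call(prompt)    # miss: pay for a fresh completion
        cache[key] = response
        return response

At the 15-25% hit rates cited above, every hit saves the full per-request cost; production caches also need eviction and expiry, omitted here.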
KV-Cache Prefix Reuse
A GPU-level optimization specific to LLM serving. When a developer types one character, 99% of the prompt is unchanged from the last request. The system reuses the attention state (the key/value cache) already computed for the unchanged prefix. Like resuming a book at your bookmark instead of re-reading from page one. 30-50% hit rate during rapid typing.
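The real reuse happens inside the serving engine's attention cache; this sketch only measures the shared prefix between consecutive prompts, which is the quantity that optimization exploits:

    import os

    def reusable_fraction(prev_prompt: str, new_prompt: str) -> float:
        # Length of the longest shared prefix / length of the new prompt.
        shared = len(os.path.commonprefix([prev_prompt, new_prompt]))
        return shared / max(1, len(new_prompt))

    before = "def fibonacci(n):\n    if n < 2:\n        return n\n    re"
    after = before + "t"   # the developer typed one more character
    print(f"{reusable_fraction(before, after):.0%}")   # 98%: nearly all reusable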
API Provider Margin
The markup between raw GPU compute cost and what API providers charge. Providers typically charge roughly 8x the raw compute cost. That margin covers their GPU fleet, networking, redundancy, engineering team, and profit. This is why self-hosting becomes dramatically cheaper at high volume: the 8x markup disappears.
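A back-of-envelope sketch of that markup; the raw compute figure is an assumption chosen so that 8x lands on the $0.001 completion price used throughout this section, and the volume is likewise assumed:

    raw_compute = 0.000125                 # assumed $/completion on your own GPUs
    api_price = 8 * raw_compute            # $0.001, the 8x markup from the text
    daily_volume = 10_000_000              # assumed volume where self-hosting pays

    api_monthly = daily_volume * api_price * 30         # $300K/month via the API
    selfhost_compute = daily_volume * raw_compute * 30  # $37.5K/month raw compute
    # Self-hosting wins once fixed costs (GPU fleet, ops, redundancy) stay
    # below the ~$262K/month gap.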
Revenue-to-Compute Ratio
How much revenue each dollar of compute generates. A completion costs $0.001. A developer makes 100 per day = $0.10/day, or about $3/month in compute. Against a $20/month subscription, that's roughly a 7x ratio. This ratio only holds if routing and caching work well. Without them, the 10% of queries hitting the expensive 70B model can eat 80% of the budget.
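The same arithmetic as a sketch, keeping revenue and compute in the same monthly units:

    completions_per_day = 100
    cost_per_completion = 0.001            # cheap-tier price from the text
    monthly_compute = completions_per_day * cost_per_completion * 30   # $3.00
    subscription = 20.00
    print(f"{subscription / monthly_compute:.1f}x revenue-to-compute")  # 6.7x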
Session Budget
A spending cap on long-running AI tasks to prevent surprise bills. Long L3 autonomous sessions (building entire apps) can consume millions of tokens over hours and silently burn $100+. Set a ceiling (e.g., $15). At 80% consumed: warn. At 100%: stop, checkpoint progress, and present partial results.
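A minimal sketch of such a guard; the checkpoint-and-present step is left to the host system:

    class SessionBudget:
        """Warn at 80% of the ceiling, signal a hard stop at 100%."""

        def __init__(self, ceiling_usd: float = 15.0):
            self.ceiling = ceiling_usd
            self.spent = 0.0
            self.warned = False

        def charge(self, request_cost: float) -> bool:
            """Record one request's cost; False means stop and checkpoint."""
            self.spent += request_cost
            if self.spent >= self.ceiling:
                return False
            if not self.warned and self.spent >= 0.8 * self.ceiling:
                self.warned = True
                print(f"warning: ${self.spent:.2f} of ${self.ceiling:.2f} spent")
            return True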
Cost Projection
Estimating what a task will cost before it starts, based on historical data from similar past sessions. The system tracks token consumption and cost for every completed task and uses that data to predict costs for new tasks. 'Building a Next.js app with auth and Stripe typically costs $8-12 based on 50 similar sessions.' Helps set accurate session budgets.
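A sketch of projecting from history; the session log and the notion of "similar" are assumptions here (real systems match on task features):

    from statistics import quantiles

    # Assumed: costs of past sessions judged similar to the new task.
    past_costs = [7.80, 9.10, 8.40, 11.50, 10.20, 8.90]

    def project_cost(costs: list[float]) -> tuple[float, float]:
        q1, _, q3 = quantiles(costs, n=4)      # quartiles of historical cost
        return q1, q3                          # 25th-75th percentile range

    low, high = project_cost(past_costs)
    print(f"typically ${low:.2f}-${high:.2f} from {len(past_costs)} sessions")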
Monthly Cost Math
The formula for projecting monthly infrastructure spend. Daily volume x cost per request x 30 days. Example: 100M completions/day x $0.001 x 30 = $3M/month on API pricing. With caching (saves 20%) + routing (saves 60% on the expensive 10%): $1.5-2M/month. Always model costs with and without caching to understand the impact.
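The same projection as a sketch, modeled with and without the caching lever:

    daily_volume = 100_000_000
    cost_per_req = 0.001
    base = daily_volume * cost_per_req * 30        # $3.0M/month at API pricing
    with_cache = base * (1 - 0.20)                 # 20% of calls served from cache
    print(f"base ${base:,.0f}/mo, with caching ${with_cache:,.0f}/mo")
    # Routing savings depend on the traffic mix; with the mix above, reserving
    # the expensive tier for the hardest 10% lands around $1.5-2M/month.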