LLM Production Incident Patterns
A Different Kind of Production Incident
LLM integrations have introduced a whole new class of production incidents that traditional runbooks simply do not cover. These systems fail in ways that look nothing like a database going down or a service running out of memory. The API returns 200 OK, response time is within bounds, and the content is complete nonsense. Or the API works perfectly, but the bill at the end of the day reads $50,000 instead of $500.
If you are running LLMs in production, you need incident response patterns built specifically for these failure modes. Your existing playbook was not designed for this.
Provider Dependency and Outage Patterns
When you integrate with an LLM provider, you are picking up an external dependency that behaves differently from anything else in your stack. Traditional APIs have SLAs measured in uptime percentages and latencies in milliseconds. LLM providers have latencies measured in seconds, rate limits measured in tokens per minute, and outage patterns that are genuinely hard to predict.
OpenAI, Anthropic, Google, and the other providers all have outages. Not once a year. Monthly. Sometimes weekly. The incident might be a complete outage where every request returns a 500, or a partial degradation where responses come back slow, truncated, or lower quality than usual. Your architecture has to handle all of these gracefully.
Multi-provider failover is the strongest defense here. Abstract your LLM calls behind a provider-agnostic interface and set up automatic routing. When your primary provider starts returning errors or goes past your latency threshold, route to a secondary provider. This means normalizing prompts across providers (system prompt format, token limits, response parsing), which takes real engineering effort but pays for itself quickly.
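A minimal sketch of that routing layer, assuming each provider is already wrapped in an adapter that normalizes prompts and raises a common error type on failure (the `FailoverRouter` and `ProviderError` names are illustrative, not any particular SDK's API):

```python
import time

class ProviderError(Exception):
    """Raised by a provider adapter on errors, timeouts, or bad responses."""

class FailoverRouter:
    """Route completions to the first healthy provider in priority order."""

    def __init__(self, providers, cooldown_s=60.0):
        self.providers = providers      # list of (name, call) pairs, primary first
        self.cooldown_s = cooldown_s    # how long a failed provider sits out
        self.tripped_until = {}         # provider name -> monotonic deadline

    def complete(self, prompt):
        last_error = None
        for name, call in self.providers:
            if time.monotonic() < self.tripped_until.get(name, 0.0):
                continue  # breaker open: skip straight to the next provider
            try:
                return name, call(prompt)
            except ProviderError as exc:
                last_error = exc
                # Trip the breaker so subsequent requests do not wait on a
                # provider that is already failing.
                self.tripped_until[name] = time.monotonic() + self.cooldown_s
        raise ProviderError(f"all providers failed, last error: {last_error}")
```

Each adapter's `call` is assumed to enforce its own request timeout (set at your latency threshold) and raise `ProviderError` when it fires, so a slow primary fails over to the secondary the same way a hard failure does.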
Semantic caching can eliminate 30-60% of LLM API calls entirely. A lot of queries from different users are semantically identical, or close enough that a cached response works fine. "What is your return policy?" asked by a thousand users does not need a thousand LLM calls. Use embedding similarity to match incoming queries against cached responses, with a tunable similarity threshold.
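A minimal sketch of the matching logic, assuming some `embed` function is available (a hosted embeddings endpoint or a local model); the linear scan stands in for a real vector index, and the 0.92 threshold is a starting point to tune, not a recommendation:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    """Cache LLM responses keyed by query embedding, not exact string match."""

    def __init__(self, embed, threshold=0.92):
        self.embed = embed          # text -> vector
        self.threshold = threshold  # higher = stricter matches, fewer hits
        self.entries = []           # list of (embedding, response) pairs

    def get(self, query):
        qvec = self.embed(query)
        best_score, best_response = 0.0, None
        for vec, response in self.entries:
            score = cosine(qvec, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query, response):
        self.entries.append((self.embed(query), response))
```

Check `get` before every LLM call; on a miss, make the call and `put` the result.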
Graceful degradation is non-negotiable. Every LLM-powered feature needs a defined fallback. For search, fall back to keyword search. For summarization, show the raw text. For a chatbot, display a "temporarily unavailable" message with a link to traditional support. A graceful fallback is infinitely better than a stack trace.
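One way to make that policy mechanical is to route every LLM-backed call through a wrapper, so no feature ships without naming its fallback. A sketch, with the function names in the usage comments being hypothetical placeholders:

```python
def with_fallback(primary, fallback):
    """Wrap an LLM-backed call so failures degrade instead of erroring.

    `primary` is the LLM-powered path; `fallback` is the pre-decided
    non-LLM behavior for the same inputs.
    """
    def call(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            # Record the failure for the incident review, then degrade.
            return fallback(*args, **kwargs)
    return call

# Usage: wire each feature to its pre-decided fallback, e.g.
#   search = with_fallback(llm_semantic_search, keyword_search)
#   summarize = with_fallback(llm_summarize, lambda text: text)  # raw text
```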
Cost Runaway Incidents
This is a failure type that does not exist with traditional infrastructure. A bug in prompt construction creates a loop. A traffic spike sends 10x normal volume to the LLM API. An attacker figures out they can trigger expensive long-context calls through your public endpoint. The result: thousands or tens of thousands of dollars in API charges before anyone catches it.
Cost circuit breakers matter just as much as availability circuit breakers (a combined sketch follows this list):
- Per-minute spend limits. Track API spending in real-time and cut the circuit when spending crosses a threshold. If your normal spend is $10/hour, alert at $30/hour and kill at $100/hour.
- Per-request cost estimation. Before sending a request, estimate its cost based on input token count. Reject requests that would blow a per-request budget. This catches the "someone pasted a 100-page document into the chat" scenario.
- Per-user rate limiting. Do not let one user (or one attacker) eat a disproportionate share of your LLM budget. Use token-based rate limiting per user, not just request-based.
- Daily and monthly budgets. Set hard caps at the provider level (most support this) and in your own application. Belt and suspenders.
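A minimal sketch combining the first two controls; the class name, prices, and limits are all placeholders, so plug in your provider's actual per-token pricing and your own budget numbers:

```python
import time
from collections import deque

class CostBreaker:
    """Refuse LLM calls when estimated spend crosses a budget."""

    def __init__(self, alert_per_hour=30.0, kill_per_hour=100.0,
                 max_request_cost=0.50, usd_per_1k_input_tokens=0.01):
        self.alert_per_hour = alert_per_hour      # page someone
        self.kill_per_hour = kill_per_hour        # open the circuit
        self.max_request_cost = max_request_cost  # per-request ceiling
        self.price = usd_per_1k_input_tokens      # placeholder pricing
        self.spend = deque()                      # (timestamp, cost) pairs

    def check(self, input_tokens):
        """Raise *before* the request is sent if it would breach a limit."""
        cost = input_tokens / 1000.0 * self.price
        if cost > self.max_request_cost:
            # The "someone pasted a 100-page document" scenario.
            raise RuntimeError("request exceeds per-request cost budget")
        now = time.monotonic()
        while self.spend and now - self.spend[0][0] > 3600:
            self.spend.popleft()  # drop spend outside the one-hour window
        hourly = sum(c for _, c in self.spend) + cost
        if hourly > self.kill_per_hour:
            raise RuntimeError("hourly spend cap hit; circuit open")
        if hourly > self.alert_per_hour:
            print("ALERT: LLM spend above alert threshold")  # hook your pager
        self.spend.append((now, cost))
```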
The first time you get a surprise $20,000 bill, you will wish you had built these controls. Do it before that happens.
Prompt Injection Is a Security Incident
When someone successfully injects a prompt into your LLM application, treat it as a security incident. Not a quality issue. Not a UX bug. A security incident.
A successful prompt injection can extract your system prompt (exposing business logic and confidential instructions), manipulate the model into generating harmful content through your application, pull user data from the context window, or bypass access controls if the model has tool-calling capabilities.
Your response should follow your existing security incident protocol:
- Detect. Watch model outputs for signs of injection: system prompt leakage, unexpected tool calls, outputs that do not match expected formats, or content policy violations (see the screening sketch after this list).
- Contain. Disable the affected feature or endpoint. If the model has tool-calling access, revoke it immediately. Do not wait to understand the full scope before containing.
- Investigate. Go through the logs. What did the attacker send? What did the model return? Was any sensitive data exposed? This is why logging inputs and outputs matters. Without logs, you are investigating blind.
- Remediate. Patch the vulnerability. Add input validation for the specific injection pattern. Strengthen your output validation. Reduce the model's permissions if they were too broad.
- Communicate. If user data was exposed, follow your data breach notification procedures. Prompt injection that leaks personal data is a data breach under GDPR.
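For the detection step, a cheap first-pass screen can run on every model output before it reaches the user. This is a heuristic triage sketch, not a substitute for real output validation; the canary fragments are placeholders for distinctive phrases from your own system prompt:

```python
import re

# Fragments of your actual system prompt that should never appear in output.
CANARY_FRAGMENTS = ["You are the support assistant for", "Never reveal"]

SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all )?(previous|above) instructions", re.I),
    re.compile(r"system prompt", re.I),
]

def looks_like_injection(model_output, expected_format=None):
    """Return a reason string if the output looks compromised, else None.

    `expected_format`, if given, is a compiled regex the output must match.
    """
    for fragment in CANARY_FRAGMENTS:
        if fragment in model_output:
            return "system prompt leakage"
    for pattern in SUSPICIOUS_PATTERNS:
        if pattern.search(model_output):
            return "injection-style language in output"
    if expected_format and not expected_format.match(model_output):
        return "output does not match expected format"
    return None
```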
Handling LLM Incidents Step by Step
When an LLM incident hits, work through this decision tree:
Is the LLM provider down? Check their status page, but more importantly, check your own error rate metrics. Status pages always lag behind reality. If the provider is down, activate your failover or graceful degradation. There is nothing else you can do until they recover.
Is the problem latency or quality? Latency issues are easier to detect and deal with. Quality issues require pulling output samples and evaluating them manually or against a golden test set. For high latency, the trigger is straightforward: if P95 exceeds your threshold, route to the backup provider.
Is this a cost incident? If spending is spiking, kill the circuit immediately. Investigate after you have stopped the bleeding. A few minutes of degraded AI features costs far less than an unbounded API bill.
Is this a security incident? If you spot prompt injection, system prompt leakage, or unexpected tool calls, treat it at full severity. Contain first, investigate second.
Is the issue in your code or the provider's? Check if your prompts changed recently. Check if the provider updated their model (they sometimes do this without any notice). Send a test request with a known-good prompt and compare the response against your expected baseline.
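A sketch of that baseline comparison; the canary prompt and keywords are illustrative, so pick a prompt whose healthy answer is stable for your use case:

```python
def canary_check(call_llm, canary_prompt, expected_keywords):
    """Send a known-good prompt and check the response against a baseline.

    `call_llm` is your normal completion function. A miss suggests either
    the provider silently changed the model or your prompt pipeline broke.
    """
    response = call_llm(canary_prompt)
    missing = [kw for kw in expected_keywords
               if kw.lower() not in response.lower()]
    return {"healthy": not missing, "missing": missing, "response": response}

# Run on every deploy and on a schedule, e.g.:
#   status = canary_check(call_llm,
#                         "Summarize in one sentence: The cat sat on the mat.",
#                         expected_keywords=["cat", "mat"])
```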
Build runbooks for each of these scenarios and practice them. The worst time to figure out your LLM incident response is during an actual incident at 2 AM while customers are complaining and costs are climbing.
Incident Timeline
- T+0m: LLM provider API starts returning 429 rate limit errors during peak traffic
- T+1m: Application retry logic amplifies the load; all LLM-powered features start failing
- T+2m: No graceful degradation exists; pages that depend on LLM responses show errors
- T+5m: Customer support gets flooded with complaints about broken AI features
- T+10m: Engineering manually disables LLM features using feature flags
- T+15m: Non-AI fallback paths activated; user experience partially restored
- T+30m: LLM provider comes back up; gradual re-enablement of AI features begins
Detection Signals
- LLM API response time exceeding 10 seconds when the normal P95 is 2-5 seconds
- Rate limit (429) or server error (500/503) responses from the LLM provider
- Cost anomaly, like a sudden spike in per-minute API spending
- Content quality dropping, with users reporting nonsensical or irrelevant AI outputs
- Prompt injection detected, where outputs leak system prompt content or contain unexpected instructions
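A minimal sketch of how these signals can map onto alert checks, assuming your monitoring emits per-window metrics shaped like the dict in the docstring; every threshold here is an illustrative starting point:

```python
def evaluate_signals(metrics, baseline_p95_s=5.0):
    """Return the detection signals that fired for one monitoring window.

    `metrics` example:
      {"p95_latency_s": 12.0, "rate_429": 0.08, "rate_5xx": 0.01,
       "spend_per_min": 4.0, "baseline_spend_per_min": 0.8}
    """
    alerts = []
    if metrics["p95_latency_s"] > max(10.0, 2 * baseline_p95_s):
        alerts.append("latency: P95 far above normal")
    if metrics["rate_429"] > 0.05 or metrics["rate_5xx"] > 0.05:
        alerts.append("provider errors: elevated 429/5xx rate")
    if metrics["spend_per_min"] > 3 * metrics["baseline_spend_per_min"]:
        alerts.append("cost anomaly: per-minute spend spiking")
    return alerts
```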
Prevention
- For every LLM-powered feature, decide ahead of time what the user sees when the LLM is unavailable
- Set up circuit breakers tuned for LLM API latency patterns (seconds, not milliseconds)
- Configure multi-provider failover so you can route to Anthropic when OpenAI is down, and the other way around
- Set rate limit budgets well below provider limits so you have headroom for traffic spikes (see the sketch after this list)
- Cache aggressively. Semantic caching for repeated similar queries can serve 30-60% of requests without making an API call.
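As a sketch of the rate-limit headroom point above, a client-side tokens-per-minute budget can be held deliberately below the provider's ceiling; the 90k limit and 70% headroom factor are made-up numbers, so use your actual provider limits:

```python
import time

class TokenBudget:
    """Client-side tokens-per-minute budget kept below the provider limit.

    Holding our own ceiling well under the provider's means a traffic
    spike hits this limiter first, not the provider's 429 responses.
    """

    def __init__(self, provider_tpm=90_000, headroom=0.7):
        self.budget = int(provider_tpm * headroom)
        self.window_start = time.monotonic()
        self.used = 0

    def try_acquire(self, tokens):
        now = time.monotonic()
        if now - self.window_start >= 60:
            self.window_start, self.used = now, 0  # new minute, reset usage
        if self.used + tokens > self.budget:
            return False  # caller should queue, shed load, or degrade
        self.used += tokens
        return True
```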
Key Points
- LLM provider outages are external dependency failures, similar to AWS going down. You have zero control over when they happen or how long they last.
- LLM latency is measured in seconds, not milliseconds. Your circuit breakers, timeouts, and retry policies need to be calibrated for that reality.
- Cost runaway is a failure type unique to LLMs. A bug that sends long prompts in a loop can burn thousands of dollars in minutes.
- Prompt injection is a security incident. It needs the same response protocol as XSS or SQL injection, not just a bug fix.
- Output quality degradation might not trip any infrastructure alert. The API returns 200 OK while the model produces nonsense.
Common Mistakes
- ✗ Shipping LLM features with no fallback, so when the provider goes down, users see raw error messages
- ✗ Setting circuit breaker timeouts based on traditional API patterns (100-500ms) when LLM responses routinely take 2-10 seconds
- ✗ Not building cost circuit breakers, which lets a runaway bug or traffic spike generate an uncapped API bill
- ✗ Treating LLM provider outages as rare events when they actually happen multiple times a month across major providers
- ✗ Not logging prompt inputs and model outputs, making it impossible to investigate quality incidents or injection attacks after the fact