LLM Production Incident Patterns
A Different Kind of Production Incident
LLM integrations have introduced a whole new class of production incidents that traditional runbooks simply do not cover. These systems fail in ways that look nothing like a database going down or a service running out of memory. The API returns 200 OK, response time is within bounds, and the content is complete nonsense. Or the API works perfectly, but the bill at the end of the day reads $50,000 instead of $500.
If you are running LLMs in production, you need incident response patterns built specifically for these failure modes. Your existing playbook was not designed for this.
Provider Dependency and Outage Patterns
When you integrate with an LLM provider, you are picking up an external dependency that behaves differently from anything else in your stack. Traditional APIs have SLAs measured in uptime percentages and latencies in milliseconds. LLM providers have latencies measured in seconds, rate limits measured in tokens per minute, and outage patterns that are genuinely hard to predict.
OpenAI, Anthropic, Google, and the other providers all have outages. Not once a year. Monthly. Sometimes weekly. The incident might be a complete outage where every request returns a 500, or a partial degradation where responses come back slow, truncated, or lower quality than usual. Your architecture has to handle all of these gracefully.
Multi-provider failover is the strongest defense here. Abstract your LLM calls behind a provider-agnostic interface and set up automatic routing. When your primary provider starts returning errors or goes past your latency threshold, route to a secondary provider. This means normalizing prompts across providers (system prompt format, token limits, response parsing), which takes real engineering effort but pays for itself quickly.
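A minimal sketch of that routing layer, assuming each provider is already wrapped in an adapter that normalizes prompts and raises a common error type on failure (the `FailoverRouter` and `ProviderError` names are illustrative, not any particular SDK's API):

```python
import time

class ProviderError(Exception):
    """Raised by a provider adapter on errors, timeouts, or bad responses."""

class FailoverRouter:
    """Route completions to the first healthy provider in priority order."""

    def __init__(self, providers, cooldown_s=60.0):
        self.providers = providers      # list of (name, call) pairs, primary first
        self.cooldown_s = cooldown_s    # how long a failed provider sits out
        self.tripped_until = {}         # provider name -> monotonic deadline

    def complete(self, prompt):
        last_error = None
        for name, call in self.providers:
            if time.monotonic() < self.tripped_until.get(name, 0.0):
                continue  # breaker open: skip straight to the next provider
            try:
                return name, call(prompt)
            except ProviderError as exc:
                last_error = exc
                # Trip the breaker so subsequent requests do not wait on a
                # provider that is already failing.
                self.tripped_until[name] = time.monotonic() + self.cooldown_s
        raise ProviderError(f"all providers failed, last error: {last_error}")
```

Each adapter's `call` is assumed to enforce its own request timeout (set at your latency threshold) and raise `ProviderError` when it fires, so a slow primary fails over to the secondary the same way a hard failure does.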
Semantic caching can eliminate 30-60% of LLM API calls entirely. A lot of queries from different users are semantically identical, or close enough that a cached response works fine. "What is your return policy?" asked by a thousand users does not need a thousand LLM calls. Use embedding similarity to match incoming queries against cached responses, with a tunable similarity threshold.
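A minimal sketch of the matching logic, assuming some `embed` function is available (a hosted embeddings endpoint or a local model); the linear scan stands in for a real vector index, and the 0.92 threshold is a starting point to tune, not a recommendation:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    """Cache LLM responses keyed by query embedding, not exact string match."""

    def __init__(self, embed, threshold=0.92):
        self.embed = embed          # text -> vector
        self.threshold = threshold  # higher = stricter matches, fewer hits
        self.entries = []           # list of (embedding, response) pairs

    def get(self, query):
        qvec = self.embed(query)
        best_score, best_response = 0.0, None
        for vec, response in self.entries:
            score = cosine(qvec, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query, response):
        self.entries.append((self.embed(query), response))
```

Check `get` before every LLM call; on a miss, make the call and `put` the result.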
Graceful degradation is non-negotiable. Every LLM-powered feature needs a defined fallback. For search, fall back to keyword search. For summarization, show the raw text. For a chatbot, display a "temporarily unavailable" message with a link to traditional support. A graceful fallback is infinitely better than a stack trace.
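One way to make that policy mechanical is to route every LLM-backed call through a wrapper, so no feature ships without naming its fallback. A sketch, with the function names in the usage comments being hypothetical placeholders:

```python
def with_fallback(primary, fallback):
    """Wrap an LLM-backed call so failures degrade instead of erroring.

    `primary` is the LLM-powered path; `fallback` is the pre-decided
    non-LLM behavior for the same inputs.
    """
    def call(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            # Record the failure for the incident review, then degrade.
            return fallback(*args, **kwargs)
    return call

# Usage: wire each feature to its pre-decided fallback, e.g.
#   search = with_fallback(llm_semantic_search, keyword_search)
#   summarize = with_fallback(llm_summarize, lambda text: text)  # raw text
```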
Cost Runaway Incidents
This is a failure type that does not exist with traditional infrastructure. A bug in prompt construction creates a loop. A traffic spike sends 10x normal volume to the LLM API. An attacker figures out they can trigger expensive long-context calls through your public endpoint. The result: thousands or tens of thousands of dollars in API charges before anyone catches it.
Cost circuit breakers matter just as much as availability circuit breakers (a combined sketch follows this list):
- Per-minute spend limits. Track API spending in real-time and cut the circuit when spending crosses a threshold. If your normal spend is $10/hour, alert at $30/hour and kill at $100/hour.
- Per-request cost estimation. Before sending a request, estimate its cost based on input token count. Reject requests that would blow a per-request budget. This catches the "someone pasted a 100-page document into the chat" scenario.
- Per-user rate limiting. Do not let one user (or one attacker) eat a disproportionate share of your LLM budget. Use token-based rate limiting per user, not just request-based.
- Daily and monthly budgets. Set hard caps at the provider level (most support this) and in your own application. Belt and suspenders.
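A minimal sketch combining the first two controls; the class name, prices, and limits are all placeholders, so plug in your provider's actual per-token pricing and your own budget numbers:

```python
import time
from collections import deque

class CostBreaker:
    """Refuse LLM calls when estimated spend crosses a budget."""

    def __init__(self, alert_per_hour=30.0, kill_per_hour=100.0,
                 max_request_cost=0.50, usd_per_1k_input_tokens=0.01):
        self.alert_per_hour = alert_per_hour      # page someone
        self.kill_per_hour = kill_per_hour        # open the circuit
        self.max_request_cost = max_request_cost  # per-request ceiling
        self.price = usd_per_1k_input_tokens      # placeholder pricing
        self.spend = deque()                      # (timestamp, cost) pairs

    def check(self, input_tokens):
        """Raise *before* the request is sent if it would breach a limit."""
        cost = input_tokens / 1000.0 * self.price
        if cost > self.max_request_cost:
            # The "someone pasted a 100-page document" scenario.
            raise RuntimeError("request exceeds per-request cost budget")
        now = time.monotonic()
        while self.spend and now - self.spend[0][0] > 3600:
            self.spend.popleft()  # drop spend outside the one-hour window
        hourly = sum(c for _, c in self.spend) + cost
        if hourly > self.kill_per_hour:
            raise RuntimeError("hourly spend cap hit; circuit open")
        if hourly > self.alert_per_hour:
            print("ALERT: LLM spend above alert threshold")  # hook your pager
        self.spend.append((now, cost))
```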
The first time you get a surprise $20,000 bill, you will wish you had built these controls. Do it before that happens.
Prompt Injection Is a Security Incident
When someone successfully injects a prompt into your LLM application, treat it as a security incident. Not a quality issue. Not a UX bug. A security incident.
A successful prompt injection can extract your system prompt (exposing business logic and confidential instructions), manipulate the model into generating harmful content through your application, pull user data from the context window, or bypass access controls if the model has tool-calling capabilities.
Your response should follow your existing security incident protocol:
- Detect. Watch model outputs for signs of injection: system prompt leakage, unexpected tool calls, outputs that do not match expected formats, or content policy violations (see the screening sketch after this list).
- Contain. Disable the affected feature or endpoint. If the model has tool-calling access, revoke it immediately. Do not wait to understand the full scope before containing.
- Investigate. Go through the logs. What did the attacker send? What did the model return? Was any sensitive data exposed? This is why logging inputs and outputs matters. Without logs, you are investigating blind.
- Remediate. Patch the vulnerability. Add input validation for the specific injection pattern. Strengthen your output validation. Reduce the model's permissions if they were too broad.
- Communicate. If user data was exposed, follow your data breach notification procedures. Prompt injection that leaks personal data is a data breach under GDPR.
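For the detection step, a cheap first-pass screen can run on every model output before it reaches the user. This is a heuristic triage sketch, not a substitute for real output validation; the canary fragments are placeholders for distinctive phrases from your own system prompt:

```python
import re

# Fragments of your actual system prompt that should never appear in output.
CANARY_FRAGMENTS = ["You are the support assistant for", "Never reveal"]

SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all )?(previous|above) instructions", re.I),
    re.compile(r"system prompt", re.I),
]

def looks_like_injection(model_output, expected_format=None):
    """Return a reason string if the output looks compromised, else None.

    `expected_format`, if given, is a compiled regex the output must match.
    """
    for fragment in CANARY_FRAGMENTS:
        if fragment in model_output:
            return "system prompt leakage"
    for pattern in SUSPICIOUS_PATTERNS:
        if pattern.search(model_output):
            return "injection-style language in output"
    if expected_format and not expected_format.match(model_output):
        return "output does not match expected format"
    return None
```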
Handling LLM Incidents Step by Step
When an LLM incident hits, work through this decision tree:
Is the LLM provider down? Check their status page, but more importantly, check your own error rate metrics. Status pages always lag behind reality. If the provider is down, activate your failover or graceful degradation. There is nothing else you can do until they recover.
Is the problem latency or quality? Latency issues are easier to detect and deal with. Quality issues require pulling output samples and evaluating them manually or against a golden test set. For high latency, the trigger is straightforward: if P95 exceeds your threshold, route to the backup provider.
Is this a cost incident? If spending is spiking, kill the circuit immediately. Investigate after you have stopped the bleeding. A few minutes of degraded AI features costs far less than an unbounded API bill.
Is this a security incident? If you spot prompt injection, system prompt leakage, or unexpected tool calls, treat it at full severity. Contain first, investigate second.
Is the issue in your code or the provider's? Check if your prompts changed recently. Check if the provider updated their model (they sometimes do this without any notice). Send a test request with a known-good prompt and compare the response against your expected baseline.
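A sketch of that baseline comparison; the canary prompt and keywords are illustrative, so pick a prompt whose healthy answer is stable for your use case:

```python
def canary_check(call_llm, canary_prompt, expected_keywords):
    """Send a known-good prompt and check the response against a baseline.

    `call_llm` is your normal completion function. A miss suggests either
    the provider silently changed the model or your prompt pipeline broke.
    """
    response = call_llm(canary_prompt)
    missing = [kw for kw in expected_keywords
               if kw.lower() not in response.lower()]
    return {"healthy": not missing, "missing": missing, "response": response}

# Run on every deploy and on a schedule, e.g.:
#   status = canary_check(call_llm,
#                         "Summarize in one sentence: The cat sat on the mat.",
#                         expected_keywords=["cat", "mat"])
```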
Build runbooks for each of these scenarios and practice them. The worst time to figure out your LLM incident response is during an actual incident at 2 AM while customers are complaining and costs are climbing.
Incident Timeline
- T+0m: LLM provider API starts returning 429 rate limit errors during peak traffic
- T+1m: Application retry logic amplifies the load; all LLM-powered features start failing
- T+2m: No graceful degradation exists; pages that depend on LLM responses show errors
- T+5m: Customer support gets flooded with complaints about broken AI features
- T+10m: Engineering manually disables LLM features using feature flags
- T+15m: Non-AI fallback paths activated; user experience partially restored
- T+30m: LLM provider comes back up; gradual re-enablement of AI features begins
Detection Signals
- LLM API response time exceeding 10 seconds when the normal P95 is 2-5 seconds
- Rate limit (429) or server error (500/503) responses from the LLM provider
- Cost anomaly, like a sudden spike in per-minute API spending
- Content quality dropping, with users reporting nonsensical or irrelevant AI outputs
- Prompt injection detected, where outputs leak system prompt content or contain unexpected instructions
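A minimal sketch of how these signals can map onto alert checks, assuming your monitoring emits per-window metrics shaped like the dict in the docstring; every threshold here is an illustrative starting point:

```python
def evaluate_signals(metrics, baseline_p95_s=5.0):
    """Return the detection signals that fired for one monitoring window.

    `metrics` example:
      {"p95_latency_s": 12.0, "rate_429": 0.08, "rate_5xx": 0.01,
       "spend_per_min": 4.0, "baseline_spend_per_min": 0.8}
    """
    alerts = []
    if metrics["p95_latency_s"] > max(10.0, 2 * baseline_p95_s):
        alerts.append("latency: P95 far above normal")
    if metrics["rate_429"] > 0.05 or metrics["rate_5xx"] > 0.05:
        alerts.append("provider errors: elevated 429/5xx rate")
    if metrics["spend_per_min"] > 3 * metrics["baseline_spend_per_min"]:
        alerts.append("cost anomaly: per-minute spend spiking")
    return alerts
```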
Prevention
- For every LLM-powered feature, decide ahead of time what the user sees when the LLM is unavailable
- Set up circuit breakers tuned for LLM API latency patterns (seconds, not milliseconds)
- Configure multi-provider failover so you can route to Anthropic when OpenAI is down, and the other way around
- Set rate limit budgets well below provider limits so you have headroom for traffic spikes (see the sketch after this list)
- Cache aggressively. Semantic caching for repeated similar queries can serve 30-60% of requests without making an API call.
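As a sketch of the rate-limit headroom point above, a client-side tokens-per-minute budget can be held deliberately below the provider's ceiling; the 90k limit and 70% headroom factor are made-up numbers, so use your actual provider limits:

```python
import time

class TokenBudget:
    """Client-side tokens-per-minute budget kept below the provider limit.

    Holding our own ceiling well under the provider's means a traffic
    spike hits this limiter first, not the provider's 429 responses.
    """

    def __init__(self, provider_tpm=90_000, headroom=0.7):
        self.budget = int(provider_tpm * headroom)
        self.window_start = time.monotonic()
        self.used = 0

    def try_acquire(self, tokens):
        now = time.monotonic()
        if now - self.window_start >= 60:
            self.window_start, self.used = now, 0  # new minute, reset usage
        if self.used + tokens > self.budget:
            return False  # caller should queue, shed load, or degrade
        self.used += tokens
        return True
```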
Key Points
- LLM provider outages are external dependency failures, similar to AWS going down. You have zero control over when they happen or how long they last.
- LLM latency is measured in seconds, not milliseconds. Your circuit breakers, timeouts, and retry policies need to be calibrated for that reality.
- Cost runaway is a failure type unique to LLMs. A bug that sends long prompts in a loop can burn thousands of dollars in minutes.
- Prompt injection is a security incident. It needs the same response protocol as XSS or SQL injection, not just a bug fix.
- Output quality degradation might not trip any infrastructure alert. The API returns 200 OK while the model produces nonsense.
Common Mistakes
- ✗ Shipping LLM features with no fallback, so when the provider goes down, users see raw error messages
- ✗ Setting circuit breaker timeouts based on traditional API patterns (100-500ms) when LLM responses routinely take 2-10 seconds
- ✗ Not building cost circuit breakers, which lets a runaway bug or traffic spike generate an uncapped API bill
- ✗ Treating LLM provider outages as rare events when they actually happen multiple times a month across major providers
- ✗ Not logging prompt inputs and model outputs, making it impossible to investigate quality incidents or injection attacks after the fact