Third-Party Dependency Failures
When Someone Else's Problem Becomes Your Problem
You don't control Stripe's uptime. You don't control AWS S3's availability. You don't control Twilio's capacity. But when any of them go down, your customers blame you. Third-party dependency failures are unique because you can't fix the root cause. All you can do is survive until the provider recovers.
The uncomfortable truth: most teams don't test their fallback strategies. They have circuit breakers configured, they have retry logic in place, they might even have a secondary provider on paper. But nobody has actually blocked the primary provider at the network level and verified that the fallback path works end to end.
Dependency Mapping
Start with a complete inventory. Every HTTP call that leaves your network boundary is a dependency. Group them by criticality (a short code sketch of one such inventory follows the list):
- Critical (service fails without it): Payment processors, auth providers, primary database. These need circuit breakers, cached fallbacks, and ideally a secondary provider.
- Important (degraded without it): Email delivery, push notifications, analytics. These need circuit breakers and graceful degradation. Show the user their data without the enrichment. Send the email later.
- Nice-to-have (invisible when down): Feature flags (fall back to defaults), A/B testing, non-critical webhooks. These need timeouts so they don't slow down the critical path.
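One way to keep this inventory honest is to encode it next to the code rather than in a wiki. Below is a minimal Go sketch; the Dependency type, the service names, and the fallback descriptions are illustrative assumptions, not a prescribed schema.

```go
package deps

import "time"

// Criticality buckets, matching the three tiers above.
type Criticality int

const (
	Critical   Criticality = iota // service fails without it
	Important                     // degraded without it
	NiceToHave                    // invisible when down
)

// Dependency is a hypothetical record type; adjust fields to taste.
type Dependency struct {
	Name        string
	Criticality Criticality
	Timeout     time.Duration // hard ceiling on every outbound call
	Fallback    string        // what the service does when this dependency is down
}

// Inventory is an example, not a prescription.
var Inventory = []Dependency{
	{Name: "stripe", Criticality: Critical, Timeout: 3 * time.Second, Fallback: "queue order, try secondary processor"},
	{Name: "auth0", Criticality: Critical, Timeout: 2 * time.Second, Fallback: "accept recently cached sessions"},
	{Name: "sendgrid", Criticality: Important, Timeout: 5 * time.Second, Fallback: "queue email for later delivery"},
	{Name: "feature-flags", Criticality: NiceToHave, Timeout: 200 * time.Millisecond, Fallback: "use compiled-in defaults"},
}
```

Anything you can iterate over you can also drill against, which makes the quarterly failure drills under Prevention easier to automate.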
Circuit Breaker Patterns
A circuit breaker tracks failure rates for an external call. When failures exceed a threshold (say, 50% over 10 seconds), it "opens" and stops making calls for a cooldown period. After the cooldown, it lets one request through (half-open state) to test if the service recovered.
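As a concrete sketch, here is roughly how those numbers map onto sony/gobreaker (one of the libraries mentioned under Prevention). The 50% threshold, 10-second window, 30-second cooldown, and the chargeViaStripe wrapper are illustrative assumptions, not recommended settings.

```go
package payments

import (
	"fmt"
	"io"
	"net/http"
	"time"

	"github.com/sony/gobreaker"
)

// newStripeBreaker wires the thresholds described above into gobreaker.
func newStripeBreaker() *gobreaker.CircuitBreaker {
	return gobreaker.NewCircuitBreaker(gobreaker.Settings{
		Name:     "stripe",
		Interval: 10 * time.Second, // failure counts are evaluated over this window
		Timeout:  30 * time.Second, // cooldown before the half-open probe request
		ReadyToTrip: func(c gobreaker.Counts) bool {
			// Require a minimum request count so one failure in a quiet
			// window doesn't open the circuit.
			return c.Requests >= 10 && float64(c.TotalFailures)/float64(c.Requests) >= 0.5
		},
		OnStateChange: func(name string, from, to gobreaker.State) {
			// Emit a metric or log line here; simultaneous closed-to-open
			// transitions across instances are a strong outage signal.
		},
	})
}

// chargeViaStripe is a hypothetical wrapper; a real integration would go
// through Stripe's SDK rather than a bare HTTP request.
func chargeViaStripe(cb *gobreaker.CircuitBreaker, client *http.Client, req *http.Request) ([]byte, error) {
	body, err := cb.Execute(func() (interface{}, error) {
		resp, err := client.Do(req)
		if err != nil {
			return nil, err
		}
		defer resp.Body.Close()
		if resp.StatusCode >= 500 {
			return nil, fmt.Errorf("stripe returned %d", resp.StatusCode)
		}
		b, err := io.ReadAll(resp.Body)
		if err != nil {
			return nil, err
		}
		return b, nil
	})
	if err != nil {
		return nil, err
	}
	return body.([]byte), nil
}
```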
The circuit breaker alone isn't enough. You need to define what happens when the circuit is open. Returning a 503 to your user is just failing faster. Real resilience means having a degraded but functional experience. Show cached product prices. Queue the order for later processing. Use a backup payment processor.
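Building on the breaker above, here is a sketch of that decision in code. chargeWithAdyen and queueForLater are hypothetical stand-ins for whatever secondary processor or durable queue you actually run; the point is that an open circuit routes to a fallback instead of a 503.

```go
package payments

import (
	"errors"

	"github.com/sony/gobreaker"
)

type Order struct{ ID string }

// chargeOrder tries the primary processor and degrades instead of erroring.
func chargeOrder(cb *gobreaker.CircuitBreaker, order Order) error {
	_, err := cb.Execute(func() (interface{}, error) {
		return nil, chargeWithStripe(order) // primary path
	})
	switch {
	case err == nil:
		return nil
	case errors.Is(err, gobreaker.ErrOpenState), errors.Is(err, gobreaker.ErrTooManyRequests):
		// Circuit is open (or half-open and saturated): don't surface a 503.
		if adyenErr := chargeWithAdyen(order); adyenErr == nil {
			return nil
		}
		return queueForLater(order) // last resort: accept the order, charge later
	default:
		// A real system might also fall back on individual failures here.
		return err
	}
}

// Hypothetical stubs; substitute your own implementations.
func chargeWithStripe(order Order) error { /* Stripe call omitted */ return nil }
func chargeWithAdyen(order Order) error  { /* secondary processor omitted */ return nil }
func queueForLater(order Order) error    { /* durable queue for replay omitted */ return nil }
```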
The Provider Status Page Problem
Don't rely on provider status pages for detection. Cloudflare's status page showed "All Systems Operational" for the first 15 minutes of their 2023 outage. Build your own monitoring for every critical dependency. Track response time, error rate, and availability from your side. You'll detect the outage before the provider acknowledges it.
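A minimal sketch of that kind of probe, assuming the provider exposes some cheap endpoint you can call on a schedule; the 3-second timeout, 30-second interval, and 2-second latency threshold are placeholders to tune.

```go
package depmon

import (
	"context"
	"log"
	"net/http"
	"time"
)

// probe hits an endpoint of your choosing (ideally a cheap, authenticated call,
// not the provider's status page) and records latency and success from your own
// network, so you see the outage before the provider acknowledges it.
func probe(ctx context.Context, client *http.Client, url string) (time.Duration, bool) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return 0, false
	}
	start := time.Now()
	resp, err := client.Do(req)
	elapsed := time.Since(start)
	if err != nil {
		return elapsed, false
	}
	resp.Body.Close()
	return elapsed, resp.StatusCode < 500
}

// watch runs the probe on an interval; replace the log calls with metrics
// (error rate, latency vs. baseline) feeding your normal alerting.
func watch(ctx context.Context, name, url string) {
	client := &http.Client{Timeout: 3 * time.Second}
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			latency, ok := probe(ctx, client, url)
			if !ok || latency > 2*time.Second { // placeholder thresholds
				log.Printf("dependency %s degraded: ok=%v latency=%s", name, ok, latency)
			}
		}
	}
}
```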
Subscribe to provider status pages via RSS or webhook for post-detection context, not for primary detection. When your monitoring fires and the provider's status page confirms it, you know there's nothing to do on your side except activate fallbacks and wait.
Financial Impact Tracking
Track the revenue impact of dependency outages separately. When Stripe is down for 30 minutes and you lose $23,000 in abandoned carts, that number justifies the engineering investment in a secondary payment processor. Without concrete numbers, "we should add a backup payment provider" is just another item on the backlog that never gets prioritized.
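One low-effort way to get that dollar figure is to record the estimated cart value at the moment a checkout is blocked. The metric name and label below are made up for illustration; this sketch assumes the prometheus/client_golang library.

```go
package checkout

import "github.com/prometheus/client_golang/prometheus"

// lostRevenue accumulates the estimated value of carts that could not be
// charged because a dependency was down. Name and label are illustrative.
var lostRevenue = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "checkout_blocked_revenue_dollars_total",
		Help: "Estimated revenue blocked by third-party payment failures.",
	},
	[]string{"dependency"},
)

func init() { prometheus.MustRegister(lostRevenue) }

// recordBlockedCheckout is called wherever the payment path gives up
// (circuit open, fallbacks exhausted) so the outage has a dollar figure
// attached by the time the postmortem is written.
func recordBlockedCheckout(dependency string, cartValueDollars float64) {
	lostRevenue.WithLabelValues(dependency).Add(cartValueDollars)
}
```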
Incident Timeline
- T+0m: Stripe API starts returning 503 errors. Payment processing fails for all checkout flows. No prior warning from Stripe's status page.
- T+2m: Circuit breaker trips after 10 consecutive failures. Checkout page shows 'Payment temporarily unavailable'. Orders queue in a dead letter queue.
- T+5m: On-call paged. Checks internal services first (database, cache, application). Spends 4 minutes before checking Stripe's status page, which now shows degraded performance.
- T+10m: Team activates fallback: cached payment tokens for returning customers process via a secondary payment processor (Adyen). New customers still blocked.
- T+15m: Stripe recovers partially. Team enables retry processing for queued orders. Some duplicate charge risk identified and flagged for manual review.
- T+30m: Full recovery. 847 orders were delayed; 12 customers were double-charged and need refunds. Revenue impact: $23,000 in abandoned carts.
Detection Signals
- Sudden spike in HTTP 5xx responses from a specific external API endpoint
- Circuit breaker state changes from closed to open across multiple service instances simultaneously
- Increased latency (>2x baseline) for requests that involve third-party API calls
- Third-party provider status page showing degraded performance or incident in progress
Prevention
- Implement circuit breakers (Hystrix pattern, resilience4j, or gobreaker) for every external API call with sensible thresholds
- Cache third-party responses where possible. A cached exchange rate from 5 minutes ago is better than no exchange rate at all (see the serve-stale sketch after this list)
- Maintain a dependency map that lists every third-party service, its criticality level, and the fallback strategy for each
- Evaluate SLAs and build your architecture to survive the SLA floor. If a provider promises 99.95% uptime, plan for 4.38 hours of downtime per year
- Run quarterly dependency failure drills. Literally block the third-party API at the firewall level and verify your fallback works
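For the caching item above, here is a sketch of a serve-stale-on-failure wrapper around a third-party call; the fetchRate signature and the TTL are assumptions, not a specific library's API.

```go
package rates

import (
	"sync"
	"time"
)

// fetchRate stands in for the real third-party call; replace with your client.
type fetchRate func(pair string) (float64, error)

// StaleCache serves the last known value when the provider fails, even past
// the normal TTL. A 5-minute-old exchange rate beats no exchange rate.
type StaleCache struct {
	mu      sync.Mutex
	fetch   fetchRate
	ttl     time.Duration
	values  map[string]float64
	fetched map[string]time.Time
}

func NewStaleCache(fetch fetchRate, ttl time.Duration) *StaleCache {
	return &StaleCache{fetch: fetch, ttl: ttl, values: map[string]float64{}, fetched: map[string]time.Time{}}
}

func (c *StaleCache) Get(pair string) (float64, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.values[pair]; ok && time.Since(c.fetched[pair]) < c.ttl {
		return v, nil // fresh enough, skip the network entirely
	}
	v, err := c.fetch(pair)
	if err == nil {
		c.values[pair], c.fetched[pair] = v, time.Now()
		return v, nil
	}
	// Provider is down: fall back to the stale value if we have one.
	if v, ok := c.values[pair]; ok {
		return v, nil
	}
	return 0, err // nothing cached; surface the failure
}
```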
Key Points
- Your availability is bounded by your least reliable dependency. If you depend on five services with 99.9% uptime each, your theoretical maximum is 99.5%
- The 2023 Cloudflare outage took down thousands of websites that had no direct relationship with Cloudflare because they used services that used Cloudflare
- Circuit breakers are necessary but not sufficient. You also need a meaningful degraded experience, not just a faster error message
- Status pages are marketing documents, not monitoring tools. By the time a provider updates their status page, your customers have already noticed
- Multi-provider strategies cost 2-3x in engineering effort but are the only real protection against provider-level outages
Common Mistakes
- Treating third-party API calls the same as internal service calls: no timeouts, no retries, no circuit breakers
- Not testing what actually happens when a dependency fails. Teams assume the circuit breaker works but never verify the fallback path
- Having a single point of failure on a 'reliable' provider. AWS, Stripe, and Cloudflare have all had multi-hour outages