Third-Party Dependency Failures
When Someone Else's Problem Becomes Your Problem
You don't control Stripe's uptime. You don't control AWS S3's availability. You don't control Twilio's capacity. But when any of them go down, your customers blame you. Third-party dependency failures are unique because you can't fix the root cause. All you can do is survive until the provider recovers.
The uncomfortable truth: most teams don't test their fallback strategies. They have circuit breakers configured, they have retry logic in place, they might even have a secondary provider on paper. But nobody has actually blocked the primary provider at the network level and verified that the fallback path works end to end.
Dependency Mapping
Start with a complete inventory. Every HTTP call that leaves your network boundary is a dependency. Group them by criticality (a short code sketch of one such inventory follows the list):
- Critical (service fails without it): Payment processors, auth providers, primary database. These need circuit breakers, cached fallbacks, and ideally a secondary provider.
- Important (degraded without it): Email delivery, push notifications, analytics. These need circuit breakers and graceful degradation. Show the user their data without the enrichment. Send the email later.
- Nice-to-have (invisible when down): Feature flags (fall back to defaults), A/B testing, non-critical webhooks. These need timeouts so they don't slow down the critical path.
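One way to keep this inventory honest is to encode it next to the code rather than in a wiki. Below is a minimal Go sketch; the Dependency type, the service names, and the fallback descriptions are illustrative assumptions, not a prescribed schema.

```go
package deps

import "time"

// Criticality buckets, matching the three tiers above.
type Criticality int

const (
	Critical   Criticality = iota // service fails without it
	Important                     // degraded without it
	NiceToHave                    // invisible when down
)

// Dependency is a hypothetical record type; adjust fields to taste.
type Dependency struct {
	Name        string
	Criticality Criticality
	Timeout     time.Duration // hard ceiling on every outbound call
	Fallback    string        // what the service does when this dependency is down
}

// Inventory is an example, not a prescription.
var Inventory = []Dependency{
	{Name: "stripe", Criticality: Critical, Timeout: 3 * time.Second, Fallback: "queue order, try secondary processor"},
	{Name: "auth0", Criticality: Critical, Timeout: 2 * time.Second, Fallback: "accept recently cached sessions"},
	{Name: "sendgrid", Criticality: Important, Timeout: 5 * time.Second, Fallback: "queue email for later delivery"},
	{Name: "feature-flags", Criticality: NiceToHave, Timeout: 200 * time.Millisecond, Fallback: "use compiled-in defaults"},
}
```

Anything you can iterate over you can also drill against, which makes the quarterly failure drills under Prevention easier to automate.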
Circuit Breaker Patterns
A circuit breaker tracks failure rates for an external call. When failures exceed a threshold (say, 50% over 10 seconds), it "opens" and stops making calls for a cooldown period. After the cooldown, it lets one request through (half-open state) to test if the service recovered.
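As a concrete sketch, here is roughly how those numbers map onto sony/gobreaker (one of the libraries mentioned under Prevention). The 50% threshold, 10-second window, 30-second cooldown, and the chargeViaStripe wrapper are illustrative assumptions, not recommended settings.

```go
package payments

import (
	"fmt"
	"io"
	"net/http"
	"time"

	"github.com/sony/gobreaker"
)

// newStripeBreaker wires the thresholds described above into gobreaker.
func newStripeBreaker() *gobreaker.CircuitBreaker {
	return gobreaker.NewCircuitBreaker(gobreaker.Settings{
		Name:     "stripe",
		Interval: 10 * time.Second, // failure counts are evaluated over this window
		Timeout:  30 * time.Second, // cooldown before the half-open probe request
		ReadyToTrip: func(c gobreaker.Counts) bool {
			// Require a minimum request count so one failure in a quiet
			// window doesn't open the circuit.
			return c.Requests >= 10 && float64(c.TotalFailures)/float64(c.Requests) >= 0.5
		},
		OnStateChange: func(name string, from, to gobreaker.State) {
			// Emit a metric or log line here; simultaneous closed-to-open
			// transitions across instances are a strong outage signal.
		},
	})
}

// chargeViaStripe is a hypothetical wrapper; a real integration would go
// through Stripe's SDK rather than a bare HTTP request.
func chargeViaStripe(cb *gobreaker.CircuitBreaker, client *http.Client, req *http.Request) ([]byte, error) {
	body, err := cb.Execute(func() (interface{}, error) {
		resp, err := client.Do(req)
		if err != nil {
			return nil, err
		}
		defer resp.Body.Close()
		if resp.StatusCode >= 500 {
			return nil, fmt.Errorf("stripe returned %d", resp.StatusCode)
		}
		b, err := io.ReadAll(resp.Body)
		if err != nil {
			return nil, err
		}
		return b, nil
	})
	if err != nil {
		return nil, err
	}
	return body.([]byte), nil
}
```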
The circuit breaker alone isn't enough. You need to define what happens when the circuit is open. Returning a 503 to your user is just failing faster. Real resilience means having a degraded but functional experience. Show cached product prices. Queue the order for later processing. Use a backup payment processor.
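Building on the breaker above, here is a sketch of that decision in code. chargeWithAdyen and queueForLater are hypothetical stand-ins for whatever secondary processor or durable queue you actually run; the point is that an open circuit routes to a fallback instead of a 503.

```go
package payments

import (
	"errors"

	"github.com/sony/gobreaker"
)

type Order struct{ ID string }

// chargeOrder tries the primary processor and degrades instead of erroring.
func chargeOrder(cb *gobreaker.CircuitBreaker, order Order) error {
	_, err := cb.Execute(func() (interface{}, error) {
		return nil, chargeWithStripe(order) // primary path
	})
	switch {
	case err == nil:
		return nil
	case errors.Is(err, gobreaker.ErrOpenState), errors.Is(err, gobreaker.ErrTooManyRequests):
		// Circuit is open (or half-open and saturated): don't surface a 503.
		if adyenErr := chargeWithAdyen(order); adyenErr == nil {
			return nil
		}
		return queueForLater(order) // last resort: accept the order, charge later
	default:
		// A real system might also fall back on individual failures here.
		return err
	}
}

// Hypothetical stubs; substitute your own implementations.
func chargeWithStripe(order Order) error { /* Stripe call omitted */ return nil }
func chargeWithAdyen(order Order) error  { /* secondary processor omitted */ return nil }
func queueForLater(order Order) error    { /* durable queue for replay omitted */ return nil }
```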
The Provider Status Page Problem
Don't rely on provider status pages for detection. Cloudflare's status page showed "All Systems Operational" for the first 15 minutes of their 2023 outage. Build your own monitoring for every critical dependency. Track response time, error rate, and availability from your side. You'll detect the outage before the provider acknowledges it.
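A minimal sketch of that kind of probe, assuming the provider exposes some cheap endpoint you can call on a schedule; the 3-second timeout, 30-second interval, and 2-second latency threshold are placeholders to tune.

```go
package depmon

import (
	"context"
	"log"
	"net/http"
	"time"
)

// probe hits an endpoint of your choosing (ideally a cheap, authenticated call,
// not the provider's status page) and records latency and success from your own
// network, so you see the outage before the provider acknowledges it.
func probe(ctx context.Context, client *http.Client, url string) (time.Duration, bool) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return 0, false
	}
	start := time.Now()
	resp, err := client.Do(req)
	elapsed := time.Since(start)
	if err != nil {
		return elapsed, false
	}
	resp.Body.Close()
	return elapsed, resp.StatusCode < 500
}

// watch runs the probe on an interval; replace the log calls with metrics
// (error rate, latency vs. baseline) feeding your normal alerting.
func watch(ctx context.Context, name, url string) {
	client := &http.Client{Timeout: 3 * time.Second}
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			latency, ok := probe(ctx, client, url)
			if !ok || latency > 2*time.Second { // placeholder thresholds
				log.Printf("dependency %s degraded: ok=%v latency=%s", name, ok, latency)
			}
		}
	}
}
```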
Subscribe to provider status pages via RSS or webhook for post-detection context, not for primary detection. When your monitoring fires and the provider's status page confirms it, you know there's nothing to do on your side except activate fallbacks and wait.
Financial Impact Tracking
Track the revenue impact of dependency outages separately. When Stripe is down for 30 minutes and you lose $23,000 in abandoned carts, that number justifies the engineering investment in a secondary payment processor. Without concrete numbers, "we should add a backup payment provider" is just another item on the backlog that never gets prioritized.
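One low-effort way to get that dollar figure is to record the estimated cart value at the moment a checkout is blocked. The metric name and label below are made up for illustration; this sketch assumes the prometheus/client_golang library.

```go
package checkout

import "github.com/prometheus/client_golang/prometheus"

// lostRevenue accumulates the estimated value of carts that could not be
// charged because a dependency was down. Name and label are illustrative.
var lostRevenue = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "checkout_blocked_revenue_dollars_total",
		Help: "Estimated revenue blocked by third-party payment failures.",
	},
	[]string{"dependency"},
)

func init() { prometheus.MustRegister(lostRevenue) }

// recordBlockedCheckout is called wherever the payment path gives up
// (circuit open, fallbacks exhausted) so the outage has a dollar figure
// attached by the time the postmortem is written.
func recordBlockedCheckout(dependency string, cartValueDollars float64) {
	lostRevenue.WithLabelValues(dependency).Add(cartValueDollars)
}
```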
Incident Timeline
- T+0m: Stripe API starts returning 503 errors. Payment processing fails for all checkout flows. No prior warning from Stripe's status page.
- T+2m: Circuit breaker trips after 10 consecutive failures. Checkout page shows 'Payment temporarily unavailable'. Orders queue in a dead letter queue.
- T+5m: On-call paged. Checks internal services first (database, cache, application). Spends 4 minutes before checking Stripe's status page, which now shows degraded performance.
- T+10m: Team activates fallback: cached payment tokens for returning customers process via a secondary payment processor (Adyen). New customers still blocked.
- T+15m: Stripe recovers partially. Team enables retry processing for queued orders. Some duplicate charge risk identified and flagged for manual review.
- T+30m: Full recovery. 847 orders were delayed; 12 customers were double-charged and need refunds. Revenue impact: $23,000 in abandoned carts.
Detection Signals
- Sudden spike in HTTP 5xx responses from a specific external API endpoint
- Circuit breaker state changes from closed to open across multiple service instances simultaneously
- Increased latency (>2x baseline) for requests that involve third-party API calls
- Third-party provider status page showing degraded performance or incident in progress
Prevention
- Implement circuit breakers (Hystrix pattern, resilience4j, or gobreaker) for every external API call with sensible thresholds
- Cache third-party responses where possible. A cached exchange rate from 5 minutes ago is better than no exchange rate at all (see the serve-stale sketch after this list)
- Maintain a dependency map that lists every third-party service, its criticality level, and the fallback strategy for each
- Evaluate SLAs and build your architecture to survive the SLA floor. If a provider promises 99.95% uptime, plan for 4.38 hours of downtime per year
- Run quarterly dependency failure drills. Literally block the third-party API at the firewall level and verify your fallback works
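For the caching item above, here is a sketch of a serve-stale-on-failure wrapper around a third-party call; the fetchRate signature and the TTL are assumptions, not a specific library's API.

```go
package rates

import (
	"sync"
	"time"
)

// fetchRate stands in for the real third-party call; replace with your client.
type fetchRate func(pair string) (float64, error)

// StaleCache serves the last known value when the provider fails, even past
// the normal TTL. A 5-minute-old exchange rate beats no exchange rate.
type StaleCache struct {
	mu      sync.Mutex
	fetch   fetchRate
	ttl     time.Duration
	values  map[string]float64
	fetched map[string]time.Time
}

func NewStaleCache(fetch fetchRate, ttl time.Duration) *StaleCache {
	return &StaleCache{fetch: fetch, ttl: ttl, values: map[string]float64{}, fetched: map[string]time.Time{}}
}

func (c *StaleCache) Get(pair string) (float64, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.values[pair]; ok && time.Since(c.fetched[pair]) < c.ttl {
		return v, nil // fresh enough, skip the network entirely
	}
	v, err := c.fetch(pair)
	if err == nil {
		c.values[pair], c.fetched[pair] = v, time.Now()
		return v, nil
	}
	// Provider is down: fall back to the stale value if we have one.
	if v, ok := c.values[pair]; ok {
		return v, nil
	}
	return 0, err // nothing cached; surface the failure
}
```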
Key Points
- Your availability is bounded by your least reliable dependency. If you depend on five services with 99.9% uptime each, your theoretical maximum is 99.5%
- The 2023 Cloudflare outage took down thousands of websites that had no direct relationship with Cloudflare because they used services that used Cloudflare
- Circuit breakers are necessary but not sufficient. You also need a meaningful degraded experience, not just a faster error message
- Status pages are marketing documents, not monitoring tools. By the time a provider updates their status page, your customers have already noticed
- Multi-provider strategies cost 2-3x in engineering effort but are the only real protection against provider-level outages
Common Mistakes
- Treating third-party API calls the same as internal service calls: no timeouts, no retries, no circuit breakers
- Not testing what actually happens when a dependency fails. Teams assume the circuit breaker works but never verify the fallback path
- Having a single point of failure on a 'reliable' provider. AWS, Stripe, and Cloudflare have all had multi-hour outages