Distributed Transaction Patterns
The Four Patterns, Honestly
Not all distributed transaction patterns are equal, and the industry has spent a decade learning which ones belong where. Here is what actually works in production, not what looks clean on a whiteboard.
Two-Phase Commit (2PC) gets a bad reputation, but it has a legitimate niche. Within a single vendor's database cluster (PostgreSQL with multiple databases, Oracle RAC, Google Spanner), 2PC works reliably because the coordinator and participants share failure detection and recovery mechanisms. The problems start when you try to run 2PC across heterogeneous systems. A coordinator crash between the prepare and commit phases leaves participants holding locks until the coordinator recovers, which can mean indefinitely. In a microservices world where "participant" means "another team's service," that lock becomes a production incident.
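To make the blocking failure mode concrete, here is a minimal in-memory sketch of the two phases. The `Participant` class, the `two_phase_commit` function, and the `crash_after_prepare` flag are hypothetical names invented for illustration; a real coordinator would also write its decision to a durable log so it can finish the protocol after a restart.

```python
from enum import Enum


class State(Enum):
    INIT = "init"
    PREPARED = "prepared"   # voted yes; locks held until a decision arrives
    COMMITTED = "committed"
    ABORTED = "aborted"


class Participant:
    """Hypothetical in-memory participant; a real one would be a database."""

    def __init__(self, name: str, can_commit: bool = True):
        self.name = name
        self.can_commit = can_commit
        self.state = State.INIT

    def prepare(self) -> bool:
        # Phase 1: vote. A "yes" vote means locks stay held until phase 2.
        self.state = State.PREPARED if self.can_commit else State.ABORTED
        return self.can_commit

    def commit(self) -> None:
        self.state = State.COMMITTED

    def abort(self) -> None:
        self.state = State.ABORTED


def two_phase_commit(participants, crash_after_prepare: bool = False) -> str:
    # Phase 1: collect votes from every participant.
    votes = [p.prepare() for p in participants]
    if crash_after_prepare:
        # Simulated coordinator crash: no decision is ever delivered.
        return "coordinator crashed"
    # Phase 2: commit only if every vote was yes, otherwise abort everyone.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"
```

Calling `two_phase_commit([orders, billing], crash_after_prepare=True)` leaves both participants in `State.PREPARED`. Neither can safely commit or abort on its own, which is exactly the stuck-lock situation described above.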
Saga with Choreography uses events. Each service completes its local transaction and publishes an event. The next service picks it up and continues. No central coordinator. This works well for simple, linear flows: Order Service publishes OrderCreated, Payment Service charges the card and publishes PaymentCharged, Inventory Service reserves stock. Clean. But when the flow branches, when services need to react to multiple events, when you need to answer "what state is this order in right now?" you are assembling the answer from event logs across 5 services.
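The linear flow can be sketched with a toy in-process event bus. Everything here is an assumption made for illustration: `subscribe`, `publish`, and the `StockReserved` event name are invented, and a real system would use a broker such as Kafka rather than direct function calls.

```python
from collections import defaultdict

subscribers = defaultdict(list)   # event name -> handlers
event_log = []                    # every published event, in order


def subscribe(event_name, handler):
    subscribers[event_name].append(handler)


def publish(event_name, payload):
    event_log.append(event_name)
    for handler in subscribers[event_name]:
        handler(payload)


# Each "service" only knows which event it reacts to and which it emits.
def payment_service(order):
    # Charge the card (omitted), then announce the result.
    publish("PaymentCharged", order)


def inventory_service(order):
    # Reserve stock (omitted), then announce the result.
    publish("StockReserved", order)


subscribe("OrderCreated", payment_service)
subscribe("PaymentCharged", inventory_service)

publish("OrderCreated", {"order_id": "o-1"})
```

Notice that no single function describes the whole flow. The sequence exists only as the combination of `subscribe` calls, which is why tracing gets harder as services and branches are added.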
Saga with Orchestration uses a central coordinator that tells each service what to do and tracks the overall state. Temporal (a fork of Uber's Cadence, started by Cadence's original creators) is the dominant open-source option here. The orchestrator holds the workflow definition, manages retries, handles timeouts, and persists state durably. The single point of failure concern is real but manageable. Temporal itself runs as a highly available cluster. Netflix, Stripe, and Coinbase all use orchestrated sagas in production.
Transactional Outbox solves a narrower problem: how to atomically update your database and publish an event. Write the event to an outbox table in the same database transaction as your business data. A separate process (Debezium, a polling worker) reads the outbox and publishes to Kafka. This is not a full saga pattern, but it is the building block most sagas need, and it is where most teams should start.
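A minimal sketch of the pattern using SQLite from the Python standard library. The table layout, `create_order`, and `relay_once` are hypothetical names chosen for this example, and the `send` callback stands in for a Kafka producer; this shows the polling-worker variant, not Debezium's log-based capture.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        topic TEXT NOT NULL,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
""")


def create_order(order_id: str) -> None:
    # One transaction: both rows commit together or neither does.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, 'created')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("OrderCreated", json.dumps({"order_id": order_id})),
        )


def relay_once(send) -> int:
    """Polling relay: publish unpublished rows in order, then mark them."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        send(topic, payload)  # may raise; the row then stays unpublished
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

If the relay crashes after `send` succeeds but before the row is marked, the event is published again on the next pass. The outbox therefore gives at-least-once delivery, so consumers still need to tolerate duplicates.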
When Step 4 of 5 Fails
Consider an e-commerce checkout: (1) create order, (2) charge payment, (3) reserve inventory, (4) schedule shipping, (5) send confirmation. Step 4 fails: shipping cannot be scheduled because the warehouse is at capacity.
With an orchestrated saga, the orchestrator detects the failure and runs compensating transactions in reverse order. Inventory gets unreserved. Payment gets refunded. Order status moves to "cancelled." Each compensation is a new forward transaction, not a database rollback.
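The reverse-order compensation can be sketched in a few lines. This is a toy, assuming invented names (`run_saga`, `StepFailed`, `record`) and in-memory state; it is not how Temporal is programmed, and a production orchestrator would persist progress after every step so it can resume after a crash.

```python
class StepFailed(Exception):
    pass


def run_saga(steps, ctx):
    """steps: list of (name, action, compensation); compensation may be None."""
    completed = []
    for name, action, compensate in steps:
        try:
            action(ctx)
        except StepFailed:
            # Undo finished steps newest-first with new forward actions.
            for done_name, done_compensate in reversed(completed):
                if done_compensate is not None:
                    done_compensate(ctx)
            return f"failed at {name}"
        completed.append((name, compensate))
    return "completed"


log = []


def record(message):
    # Build a step that just records what a real service call would do.
    return lambda ctx: log.append(message)


def schedule_shipping(ctx):
    raise StepFailed("warehouse at capacity")


checkout = [
    ("create_order", record("order created"), record("order cancelled")),
    ("charge_payment", record("payment charged"), record("payment refunded")),
    ("reserve_inventory", record("stock reserved"), record("stock released")),
    ("schedule_shipping", schedule_shipping, None),
    ("send_confirmation", record("email sent"), None),
]

result = run_saga(checkout, ctx={})
```

After the run, `log` holds the three forward actions followed by "stock released", "payment refunded", and "order cancelled", in that order. The confirmation email is never sent because the saga stops at the failed step.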
The subtle problem: the customer's credit card was charged and then refunded. That is a different experience than never being charged. The charge reduces their available credit until the refund posts, and a refund typically takes 3-5 business days to appear. This is why many payment-heavy systems place an authorization hold instead of charging immediately. The hold reserves funds without capturing them, and if the saga fails, you simply release the hold instead of issuing a refund.
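The difference between a hold and a charge can be modeled with a small class. `CardAccount` and its methods are hypothetical and exist only to illustrate the bookkeeping; real payment gateways expose comparable authorize, capture, and release (void) operations, but their APIs and settlement rules differ.

```python
class CardAccount:
    """Hypothetical card account used to illustrate hold vs. capture."""

    def __init__(self, limit_cents: int):
        self.limit_cents = limit_cents
        self.held_cents = 0       # authorized but not captured
        self.captured_cents = 0   # actually charged

    @property
    def available_cents(self) -> int:
        return self.limit_cents - self.held_cents - self.captured_cents

    def authorize(self, amount_cents: int) -> None:
        if amount_cents > self.available_cents:
            raise ValueError("insufficient available credit")
        self.held_cents += amount_cents

    def capture(self, amount_cents: int) -> None:
        # Convert a hold into a real charge once the saga has succeeded.
        if amount_cents > self.held_cents:
            raise ValueError("cannot capture more than is held")
        self.held_cents -= amount_cents
        self.captured_cents += amount_cents

    def release(self, amount_cents: int) -> None:
        # Compensation: drop the hold; no charge or refund ever appears.
        self.held_cents -= min(amount_cents, self.held_cents)
```

In a saga, step 2 would call `authorize`, the final step would call `capture`, and the compensation for step 2 becomes `release`. The customer's statement never shows a charge-and-refund pair.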
Choosing the Right Pattern
Start with the Transactional Outbox for reliable event publishing. If your workflows are simple linear chains across 2-3 services, choreographed sagas work fine. Once you have branching logic, human approval steps, long-running processes (hours or days), or more than 3 participating services, invest in an orchestration framework like Temporal. Reserve 2PC for single-vendor database clusters where the coordinator is part of the database engine itself.
The most common mistake is reaching for a saga framework before you need one. If two services need to coordinate, a direct API call with a retry and a compensating endpoint is simpler, easier to debug, and easier to operate than wiring up Temporal or building a choreography layer. Complexity should be proportional to the coordination problem you actually have.
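For the two-service case, the whole "framework" can be a retry helper and one compensating call. The sketch below assumes the remote calls are passed in as plain callables; `TransientError`, `call_with_retry`, and `place_order` are names invented for this example, and a real client would wrap HTTP or RPC calls.

```python
import time


class TransientError(Exception):
    pass


def call_with_retry(operation, attempts: int = 3, base_delay: float = 0.1):
    """Retry a callable on TransientError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))


def place_order(reserve_stock, charge_card, release_stock):
    reservation = call_with_retry(reserve_stock)
    try:
        return call_with_retry(charge_card)
    except Exception:
        # Compensating call: undo the step that already succeeded.
        call_with_retry(lambda: release_stock(reservation))
        raise
```

This has an obvious gap: if the process dies between the failed charge and the release, the reservation leaks. Whether that risk justifies an orchestrator is exactly the proportionality question raised above.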
Key Points
- The moment you split a monolith into services, you lose ACID transactions across service boundaries. Everything after that is damage control. Accept this early and design accordingly.
- Transactional Outbox is the pattern most teams should start with. It gives you reliable event publishing without the operational burden of a full saga framework.
- Choreographed sagas are elegant in diagrams and nightmarish to debug in production. If your saga spans more than 3 services, use orchestration.
- Compensating transactions are not rollbacks. They are new forward actions that semantically undo previous work. Refunding a payment is not the same as never charging it.
- Two-phase commit works within a single database vendor's cluster. The moment you cross vendor boundaries (Postgres to Kafka, MySQL to DynamoDB), 2PC falls apart.
Common Mistakes
- Designing compensations as an afterthought. Teams build the happy path across 5 services, then realize they have no way to undo step 3 when step 4 fails. Compensation logic should be designed alongside the forward path.
- Using choreographed sagas across more than 3 services. By the time you have 6 services publishing and consuming events for a single business operation, no one can trace the full flow without a distributed tracing tool and 30 minutes of detective work.
- Treating the saga orchestrator as a simple state machine. Production orchestrators need durable state, retry policies, timeout handling, and dead-letter queues. Uber built Cadence (the project Temporal was later forked from) specifically because lightweight state machines were not enough.
- Skipping idempotency on saga participants. Network retries will deliver duplicate commands. Every service in a saga must handle being called twice with the same request without producing duplicate side effects.
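The idempotency point from the last item can be sketched with a participant that remembers responses by key. `PaymentService` and its `idempotency_key` parameter are hypothetical; the sketch keeps state in a dictionary, whereas a real service would store the key in the same database transaction as the side effect, typically behind a unique constraint.

```python
class PaymentService:
    """Hypothetical saga participant that deduplicates by idempotency key."""

    def __init__(self):
        self.results = {}        # idempotency key -> stored response
        self.charges_made = 0    # side effects actually performed

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A retried command returns the stored response; nothing runs twice.
        if idempotency_key in self.results:
            return self.results[idempotency_key]
        self.charges_made += 1
        response = {
            "charge_id": f"ch-{self.charges_made}",
            "amount_cents": amount_cents,
        }
        self.results[idempotency_key] = response
        return response
```

The caller, usually the orchestrator, generates the key once per saga step and reuses it on every retry, so a duplicate delivery of the same command produces the same response and a single charge.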