Saga Pattern: Distributed Transactions Without 2PC
A saga executes a multi-step business transaction across services as a sequence of local transactions, each with a compensating transaction that undoes it. If any step fails, the compensations for the steps already completed are run. It is the usual replacement for two-phase commit (2PC) in microservice architectures. Two flavours: orchestrated (a coordinator drives) or choreographed (each service reacts to events).
What it is
A saga is a way to execute a business transaction that spans multiple services without using two-phase commit. The transaction becomes a sequence of local transactions, each in one service, each with a defined compensating transaction that undoes it.
If everything succeeds, the saga commits in pieces over time. If a step fails, the compensations for the already-completed steps run in reverse order (driven by the coordinator in an orchestrated saga, or by failure events in a choreographed one). The system reaches a consistent state either way.
This is the standard pattern for any multi-service workflow that touches multiple data stores: place an order (charge card + reserve inventory + dispatch shipping), book a trip (hotel + flight + car rental), provision a tenant (create account + allocate database + send welcome email).
Why not 2PC
Two-phase commit needs a transaction coordinator that holds locks across all participants until everyone agrees to commit. Two big problems at scale:
- Locks span the network. All participating services hold locks for the duration of the transaction. A slow service makes everyone wait. A network partition can leave locks held indefinitely.
- Strong consistency, weak availability. A coordinator failure during commit can leave participants blocked in an in-doubt state, still holding their locks, until the coordinator recovers or an operator intervenes manually.
Saga trades strong consistency for availability. Steps commit individually; readers may see "order created but not yet paid" briefly. The saga ensures eventual consistency by either completing or compensating.
The two flavours
Orchestrated: a coordinator service holds the workflow. It calls each step, handles errors, runs compensations on failure. The workflow is visible in one place. Easier to understand, easier to add steps, easier to monitor. The coordinator can be implemented as code, or with a workflow engine (Temporal, Cadence, AWS Step Functions, Camunda).
Choreographed: no coordinator. Each service publishes events when it completes work; other services subscribe and react. The workflow is implicit in the chain of subscriptions. More decoupled, but the workflow lives in many places, debugging is harder, and adding a new step means changing multiple services.
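To make the "workflow is implicit in the chain of subscriptions" point concrete, here is a minimal sketch of a choreographed saga. It assumes an in-memory `EventBus` standing in for a real message broker; the event names (`OrderPlaced`, `InventoryReserved`, `PaymentFailed`, and so on) and the three services are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Minimal in-memory event bus; a stand-in for a real message broker. */
final class EventBus {
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();
    final List<String> published = new ArrayList<>(); // event log, for inspection only

    void subscribe(String eventType, Consumer<String> handler) {
        subscribers.computeIfAbsent(eventType, k -> new ArrayList<>()).add(handler);
    }

    void publish(String eventType, String orderId) {
        published.add(eventType + ":" + orderId);
        for (Consumer<String> h : subscribers.getOrDefault(eventType, new ArrayList<>())) {
            h.accept(orderId);
        }
    }
}

final class ChoreographyDemo {
    /** Wires three hypothetical services together and returns the resulting event log. */
    static List<String> run(String orderId, boolean paymentSucceeds) {
        EventBus bus = new EventBus();

        // Inventory service: reserves on OrderPlaced, compensates on PaymentFailed.
        bus.subscribe("OrderPlaced", id -> bus.publish("InventoryReserved", id));
        bus.subscribe("PaymentFailed", id -> bus.publish("InventoryReleased", id));

        // Payment service: charges once inventory is reserved.
        bus.subscribe("InventoryReserved", id ->
                bus.publish(paymentSucceeds ? "PaymentCharged" : "PaymentFailed", id));

        // Shipping service: dispatches once payment is charged.
        bus.subscribe("PaymentCharged", id -> bus.publish("ShipmentDispatched", id));

        bus.publish("OrderPlaced", orderId);
        return bus.published;
    }

    public static void main(String[] args) {
        System.out.println(run("o-1", true));
        System.out.println(run("o-2", false));
    }
}
```

No single class in this sketch lists the steps of the order workflow: the sequence only emerges from who subscribes to what, which is the decoupling and the debugging cost described above.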
For most teams, orchestrated is the right default. Workflow engines like Temporal handle the persistence, retry, and visibility, leaving the steps as the only thing to write.
Compensations are not inverses
A compensation isn't always "the inverse operation". Consider:
- Charging a card: compensation is refund (different operation, different audit trail).
- Sending an email: compensation is sending another email ("never mind, the order was cancelled"). The original can't be unsent.
- Allocating an ID: compensation is marking the ID as released, not deleting it (other systems may have already referenced it).
Designing the compensation often takes more thought than the forward step. Get it right or the saga produces inconsistent states.
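Two of the cases above can be sketched in code. This is an illustrative sketch only: `PaymentLedger` and `IdRegistry` are hypothetical classes, not part of any real payment or ID service. The point is that each compensation adds a new fact (a refund entry, a "released" marker) rather than erasing the original one.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical payment ledger: a refund is a new entry, not a deleted charge. */
final class PaymentLedger {
    private final List<String> entries = new ArrayList<>();
    private long balanceCents = 0;

    String charge(String orderId, long cents) {
        String chargeId = "ch-" + orderId;
        balanceCents += cents;
        entries.add("CHARGE " + chargeId + " " + cents);
        return chargeId;
    }

    /** Compensation for charge(): appends a REFUND entry and keeps the audit trail. */
    void refund(String chargeId, long cents) {
        balanceCents -= cents;
        entries.add("REFUND " + chargeId + " " + cents);
    }

    long balanceCents() { return balanceCents; }
    List<String> entries() { return new ArrayList<>(entries); }
}

/** Hypothetical ID allocator: released IDs are marked, never deleted or reused. */
final class IdRegistry {
    private final Map<Integer, String> stateById = new HashMap<>();
    private int nextId = 1;

    int allocate() {
        int id = nextId++;
        stateById.put(id, "ALLOCATED");
        return id;
    }

    /** Compensation for allocate(): other systems may still reference the ID. */
    void release(int id) {
        stateById.put(id, "RELEASED");
    }

    String state(int id) { return stateById.get(id); }
}
```

After a charge followed by a refund, the balance is back to zero but the ledger holds two entries; after an allocate followed by a release, the ID still exists and the next allocation gets a fresh one.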
Persistence is essential
A real saga persists state between steps. Without it, a coordinator crash loses the saga's progress; the system can't tell which compensations to run. Standard practice: write the saga's current step to a database after each transition. On startup, the coordinator scans for unfinished sagas and resumes them.
This is exactly what workflow engines provide out of the box. Rolling a custom saga without persistence is the path to inconsistent production data.
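For teams that do roll their own, the bookkeeping might look roughly like the sketch below. It assumes a hypothetical `SagaStore` backed by in-memory maps as a stand-in for a database table; the status strings and method names are illustrative, and a real store would write each transition durably.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** In-memory stand-in for a durable saga table; a real one would be a database. */
final class SagaStore {
    private final Map<String, List<String>> completedSteps = new LinkedHashMap<>();
    private final Map<String, String> status = new LinkedHashMap<>();

    void begin(String sagaId) {
        completedSteps.put(sagaId, new ArrayList<>());
        status.put(sagaId, "RUNNING");
    }

    /** Called after each forward step commits, before the next one starts. */
    void recordStep(String sagaId, String step) {
        completedSteps.get(sagaId).add(step);
    }

    void finish(String sagaId, String finalStatus) {
        status.put(sagaId, finalStatus);
    }

    /** Sagas still marked RUNNING were interrupted, for example by a coordinator crash. */
    List<String> unfinished() {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, String> e : status.entrySet()) {
            if (e.getValue().equals("RUNNING")) {
                ids.add(e.getKey());
            }
        }
        return ids;
    }

    /** Steps to compensate for a saga, most recently completed first. */
    List<String> stepsToCompensate(String sagaId) {
        List<String> steps = new ArrayList<>(completedSteps.get(sagaId));
        Collections.reverse(steps);
        return steps;
    }

    public static void main(String[] args) {
        SagaStore store = new SagaStore();
        store.begin("s2");
        store.recordStep("s2", "reserve");
        store.recordStep("s2", "charge");
        // Simulated crash here: on restart, the coordinator scans the store.
        for (String id : store.unfinished()) {
            System.out.println(id + " -> compensate " + store.stepsToCompensate(id));
        }
    }
}
```

This sketch compensates every interrupted saga; a coordinator could equally resume forward from the last recorded step. Either way, a step that was in flight at crash time may have completed without being recorded, which is one reason the steps and compensations themselves need to be idempotent.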
Compensations have to be idempotent. Use the same idempotency-key pattern as forward operations. State must be persisted between steps, either in a custom saga store or via a workflow engine. And plan for compensation failure, because it does happen: the saga moves into a "needs manual intervention" state, alerts fire, and an operator follows a recovery process that was designed before it was needed.
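A minimal sketch of an idempotent compensation, assuming a hypothetical `RefundService` that remembers which idempotency keys it has already processed (a real service would keep this in durable storage, and the key format is an assumption):

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical refund service that deduplicates on an idempotency key. */
final class RefundService {
    private final Map<String, String> refundsByKey = new HashMap<>();
    private int refundsIssued = 0;

    /** Returns the same refund id for a repeated key and moves money only once. */
    String refund(String chargeId, String idempotencyKey) {
        String existing = refundsByKey.get(idempotencyKey);
        if (existing != null) {
            return existing; // replayed compensation: no second refund
        }
        refundsIssued++;
        String refundId = "rf-" + refundsIssued + "-" + chargeId;
        refundsByKey.put(idempotencyKey, refundId);
        return refundId;
    }

    int refundsIssued() { return refundsIssued; }
}
```

Calling `refund("ch-1", "saga-42-refund")` twice returns the same refund id both times and issues one refund, so a coordinator that replays the compensation after a crash does not double-refund.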
For new code on Java/Go/Python, look at Temporal SDKs first. They handle the persistence and resumption logic; the application writes the activity functions and the workflow definition.
Implementations
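A coordinator runs each step. If step N fails, it runs the compensations for steps N-1 down to 1, in reverse order. The example keeps its compensation list in memory for brevity; a production coordinator persists state between steps so it can resume after a crash.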
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class OrderSaga {
        // inventory, payments, shipping, notifications, and log are injected
        // service clients; their declarations are omitted here.

        public OrderResult process(Order order) {
            // Compensations for the steps completed so far, in forward order.
            // Held in memory only: a production saga would also persist progress.
            List<Runnable> compensations = new ArrayList<>();
            try {
                // Step 1: reserve inventory
                String reservation = inventory.reserve(order);
                compensations.add(() -> inventory.release(reservation));

                // Step 2: charge payment
                String charge = payments.charge(order.amount, order.card,
                        "saga-" + order.id);  // idempotency key
                compensations.add(() -> payments.refund(charge));

                // Step 3: ship
                String shipment = shipping.dispatch(order);
                compensations.add(() -> shipping.cancel(shipment));

                // Step 4: notify customer
                notifications.sendOrderConfirmed(order);
                // Notification has no meaningful compensation; skip

                return OrderResult.success(reservation, charge, shipment);
            } catch (Exception e) {
                // Run compensations in REVERSE order
                Collections.reverse(compensations);
                for (Runnable c : compensations) {
                    try {
                        c.run();
                    } catch (Exception ce) {
                        // Keep going so the remaining compensations still run;
                        // this one needs manual intervention, so alert.
                        log.error("compensation failed", ce);
                    }
                }
                throw new SagaFailed(e);
            }
        }
    }

Key points
- Local transactions per step. Each has a compensating transaction (cancel order, refund payment, restore inventory).
- On failure at step N, run compensations for steps N-1, N-2, ..., 1 in reverse order.
- Compensations must be idempotent: they may be replayed. Use idempotency keys.
- Orchestrated: one coordinator decides next step, calls each service, handles failures. Easier to reason about.
- Choreographed: each service publishes events, others react. Decouples services but harder to debug.
Follow-up questions
- Saga vs 2PC?
- Orchestrated vs choreographed?
- Why must compensations be idempotent?
- What if a compensation itself fails?
Gotchas
- Forgetting idempotency keys: replays double-charge, double-refund
- Not persisting state between steps: coordinator crash loses progress
- Choreographed saga across many services: workflow is invisible, debugging is awful
- Treating compensations as 'just like forward, in reverse': they're not, semantics differ (refund != negative charge)
- No alert on compensation failure: silent inconsistency
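One way to avoid the last gotcha is to retry each compensation a bounded number of times and escalate explicitly when retries run out. The sketch below assumes a hypothetical `CompensationRunner`; the status strings are illustrative, and the `alerts` list stands in for a real paging or alerting system.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical runner that retries compensations and escalates on failure. */
final class CompensationRunner {
    final List<String> alerts = new ArrayList<>(); // stand-in for a real alerting system

    /**
     * Runs the compensations in the order given (the caller passes them already
     * reversed), retrying each up to maxAttempts. Returns the final saga status.
     */
    String runAll(String sagaId, List<Runnable> compensations, int maxAttempts) {
        boolean allSucceeded = true;
        for (int i = 0; i < compensations.size(); i++) {
            if (!tryWithRetries(compensations.get(i), maxAttempts)) {
                allSucceeded = false;
                alerts.add("saga " + sagaId + ": compensation " + i
                        + " failed after " + maxAttempts + " attempts");
            }
        }
        return allSucceeded ? "COMPENSATED" : "NEEDS_MANUAL_INTERVENTION";
    }

    private boolean tryWithRetries(Runnable compensation, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                compensation.run();
                return true;
            } catch (RuntimeException e) {
                // Retrying is only safe because compensations are idempotent.
            }
        }
        return false;
    }
}
```

A transient failure is absorbed by the retries and the saga ends as `COMPENSATED`; a compensation that keeps failing leaves the saga in `NEEDS_MANUAL_INTERVENTION` with an alert recorded, while the remaining compensations still run.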