Saga Pattern
Why Not Just Use 2PC?
Two-phase commit (2PC) works when all participants are part of the same system, controlled by the same team, with similar availability targets. If your order, inventory, and payment services all sit on PostgreSQL databases that you operate, 2PC is a reasonable choice.
But in a real microservice architecture, these are separate services run by separate teams, deployed independently, with different databases. The payment service might be a third-party API. The shipping service might go down for maintenance on Tuesdays. Holding distributed locks across all of these while waiting for each to say "yes, I can commit" is fragile. If any participant is slow, every participant is blocked.
Sagas take a different approach. Instead of locking everything and committing atomically, a saga executes each step independently and, on failure, undoes the completed steps with compensating actions. You trade atomicity and isolation for availability and loose coupling.
How Sagas Work
A saga breaks a distributed transaction into a series of local transactions. Each local transaction updates one service and publishes an event or notifies the next step. If any step fails, previously completed steps are undone by running compensating transactions in reverse order.
Example: E-commerce order flow
Forward steps:
1. Order Service: Create order (status: pending)
2. Inventory Service: Reserve items
3. Payment Service: Charge customer
4. Shipping Service: Create shipment
If Payment (step 3) fails:
- Compensate Inventory: Release reserved items
- Compensate Order: Mark order as cancelled
Each compensating action is a new operation, not a database rollback. The inventory reservation was committed. The compensation is a new transaction that releases it.
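The forward and compensating steps above can be sketched as a small in-process runner. This is a minimal illustration, not a production design: the `Step` and `run_saga` names are assumptions made for this example, the services are stand-in lambdas, and a real implementation would also have to retry compensations that fail.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]      # forward local transaction
    compensate: Callable[[], None]  # new operation that semantically undoes it


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # The failed step never committed, so only earlier steps are undone.
            for done in reversed(completed):
                done.compensate()
            return False
        completed.append(step)
    return True


log: List[str] = []


def fail_payment() -> None:
    raise RuntimeError("card declined")  # simulated failure at step 3


order_flow = [
    Step("order", lambda: log.append("create order"), lambda: log.append("cancel order")),
    Step("inventory", lambda: log.append("reserve items"), lambda: log.append("release items")),
    Step("payment", fail_payment, lambda: log.append("refund")),
    Step("shipping", lambda: log.append("create shipment"), lambda: log.append("cancel shipment")),
]

succeeded = run_saga(order_flow)
```

After the run, `log` holds the two forward steps followed by their compensations in reverse order, and the shipping step never executes.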
Choreography: Event-Driven Sagas
In choreography, there is no central coordinator. Each service listens for events and reacts.
Order Service creates the order and publishes "OrderCreated." Inventory Service listens for "OrderCreated," reserves items, and publishes "InventoryReserved." Payment Service listens for "InventoryReserved," charges the customer, and publishes "PaymentProcessed." Shipping Service listens for "PaymentProcessed" and creates the shipment.
If Payment fails, it publishes "PaymentFailed." Inventory Service listens for "PaymentFailed" and releases the reservation. Order Service listens and cancels the order.
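The event chain above can be sketched with a toy in-process event bus. Everything here is an assumption made for illustration: the `EventBus` class, the handler names, and the `card_ok` flag that forces a payment failure. A real system would use a message broker with asynchronous, at-least-once delivery.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict, List, Set

Event = Dict[str, Any]
Handler = Callable[[Event], None]


class EventBus:
    """Toy synchronous bus standing in for a message broker."""

    def __init__(self) -> None:
        self.handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.published: List[str] = []

    def subscribe(self, name: str, handler: Handler) -> None:
        self.handlers[name].append(handler)

    def publish(self, name: str, event: Event) -> None:
        self.published.append(name)
        for handler in list(self.handlers[name]):
            handler(event)


bus = EventBus()
orders: Dict[str, str] = {}   # Order Service state
reserved: Set[str] = set()    # Inventory Service state
shipments: List[str] = []     # Shipping Service state


def create_order(order_id: str, card_ok: bool) -> None:
    orders[order_id] = "pending"
    bus.publish("OrderCreated", {"order_id": order_id, "card_ok": card_ok})


def reserve_items(event: Event) -> None:
    reserved.add(event["order_id"])
    bus.publish("InventoryReserved", event)


def charge_customer(event: Event) -> None:
    bus.publish("PaymentProcessed" if event["card_ok"] else "PaymentFailed", event)


def create_shipment(event: Event) -> None:
    shipments.append(event["order_id"])


def release_items(event: Event) -> None:
    reserved.discard(event["order_id"])   # compensation in Inventory Service


def cancel_order(event: Event) -> None:
    orders[event["order_id"]] = "cancelled"   # compensation in Order Service


bus.subscribe("OrderCreated", reserve_items)
bus.subscribe("InventoryReserved", charge_customer)
bus.subscribe("PaymentProcessed", create_shipment)
bus.subscribe("PaymentFailed", release_items)
bus.subscribe("PaymentFailed", cancel_order)

create_order("o-1", card_ok=False)  # payment fails, compensations run
create_order("o-2", card_ok=True)   # happy path
```

Note that no single function describes the whole flow; it only emerges from the `subscribe` calls, which is the property the following sections discuss.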
When Choreography Works
- Small sagas with 2-3 steps
- Simple, linear flows without branching
- Teams that own the full flow and can reason about it end-to-end
- Systems that already use event-driven architecture extensively
When Choreography Breaks Down
With 6+ services, the event flow becomes a tangled web. Service A publishes events that trigger Service B, which triggers C, which triggers D. Understanding the full flow requires reading event handlers across multiple codebases. Debugging a stuck saga means correlating events across service logs.
Adding a new step means modifying multiple services to listen for new events and publish new ones. Removing a step risks breaking downstream listeners. The coupling is implicit (through event names) rather than explicit (through a workflow definition), which makes it harder to reason about.
Orchestration: Centralized Coordination
In orchestration, a saga coordinator (also called a saga execution coordinator or SEC) manages the flow. It knows the full sequence of steps, executes them in order, and handles failures.
The coordinator calls each service synchronously or asynchronously:
1. Call Order Service to create the order. If success, proceed.
2. Call Inventory Service to reserve items. If success, proceed.
3. Call Payment Service to charge. If failure, run compensations.
The coordinator persists its state (current step, outcomes of previous steps) so it can recover after a crash. If the coordinator goes down between steps 2 and 3, it restarts from its last persisted state and continues.
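A minimal sketch of that recovery behaviour, assuming a dict as a stand-in for the durable store and a hypothetical `SagaCoordinator` class. The crash is simulated with an exception raised before step 3 runs; compensation handling is left out to keep the checkpointing visible.

```python
import json
from typing import Callable, Dict, List, Tuple

StepFn = Callable[[], None]


class CoordinatorCrash(Exception):
    """Simulates the coordinator process dying."""


class SagaCoordinator:
    def __init__(self, saga_id: str, steps: List[Tuple[str, StepFn]],
                 store: Dict[str, str]) -> None:
        self.saga_id = saga_id
        self.steps = steps
        self.store = store  # stand-in for a durable table keyed by saga ID

    def run(self) -> None:
        # Load the last checkpoint, or start from the first step.
        state = json.loads(self.store.get(self.saga_id, '{"next_step": 0}'))
        for index in range(state["next_step"], len(self.steps)):
            _name, action = self.steps[index]
            action()
            state["next_step"] = index + 1
            self.store[self.saga_id] = json.dumps(state)  # checkpoint after each step


executed: List[str] = []
crash = {"armed": True}


def charge_payment() -> None:
    if crash["armed"]:
        crash["armed"] = False
        raise CoordinatorCrash()  # dies after step 2's checkpoint, before step 3 runs
    executed.append("charge_payment")


steps: List[Tuple[str, StepFn]] = [
    ("create_order", lambda: executed.append("create_order")),
    ("reserve_items", lambda: executed.append("reserve_items")),
    ("charge_payment", charge_payment),
]
durable_store: Dict[str, str] = {}

try:
    SagaCoordinator("saga-1", steps, durable_store).run()
except CoordinatorCrash:
    pass
checkpoint_after_crash = json.loads(durable_store["saga-1"])["next_step"]

# A fresh coordinator instance resumes from the persisted checkpoint.
SagaCoordinator("saga-1", steps, durable_store).run()
```

In this sketch a crash between running a step and writing its checkpoint would cause that step to run again on recovery, which is one reason the steps themselves need to be idempotent.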
Benefits of Orchestration
- The full workflow is visible in one place (the coordinator's definition)
- Adding, removing, or reordering steps modifies one component
- Debugging is straightforward: check the coordinator's state log
- Complex flows (branching, parallel steps, conditional logic) are natural
- Timeouts and retries are centralized
The Coordinator Is Not a Single Point of Failure
A common objection: doesn't the coordinator introduce a single point of failure? Not if it is implemented correctly. Platforms like Temporal persist every state transition to a durable store. If a coordinator process crashes and restarts, it picks up where it left off. The coordinator process is effectively stateless at runtime; all saga state lives in the persistent store.
Compensating Transactions: The Hard Part
Designing compensating actions is where most saga implementations get tricky.
Reversible operations: creating a reservation, placing a hold on funds, setting a status flag. The compensating action undoes them: release the reservation, remove the hold, reset the flag.
Irreversible operations: sending an email, charging a credit card (you can refund, but the charge happened), shipping a physical package. For these, you need strategies:
- Delay execution: do not send the confirmation email until the saga fully completes. Queue it as a "pending" action.
- Accept partial failure: charge the card, and if a later step fails, issue a refund. The customer sees a charge and then a refund. Not ideal, but correct.
- Human intervention: some failures require manual resolution. A shipped package that needs to be recalled is an operational problem, not an algorithmic one.
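The "delay execution" strategy can be sketched as a queue of deferred side effects that is flushed only when the saga succeeds. The `DeferredActions` class and `place_order` function are hypothetical names for this example, and the email is represented by appending to a list.

```python
from typing import Callable, List, Tuple


class DeferredActions:
    """Holds irreversible side effects until the saga outcome is known."""

    def __init__(self) -> None:
        self._pending: List[Tuple[str, Callable[[], None]]] = []

    def defer(self, label: str, action: Callable[[], None]) -> None:
        self._pending.append((label, action))

    def commit(self) -> List[str]:
        """Saga succeeded: run every deferred action in order."""
        ran = []
        for label, action in self._pending:
            action()
            ran.append(label)
        self._pending.clear()
        return ran

    def discard(self) -> List[str]:
        """Saga failed: drop the actions without running them."""
        dropped = [label for label, _ in self._pending]
        self._pending.clear()
        return dropped


sent: List[str] = []


def place_order(payment_ok: bool) -> str:
    deferred = DeferredActions()
    deferred.defer("confirmation email", lambda: sent.append("confirmation email"))
    if not payment_ok:
        deferred.discard()  # nothing to un-send, because nothing was sent
        return "cancelled"
    deferred.commit()
    return "confirmed"


failed_outcome = place_order(payment_ok=False)
sent_after_failure = list(sent)
ok_outcome = place_order(payment_ok=True)
```

In a real system the pending queue would itself need to be durable, otherwise a crash between saga completion and `commit` would lose the email.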
Idempotency: Non-Negotiable
Every saga step and every compensating action must be idempotent. This is not optional. Here is why.
Step 3 runs. The service processes the request and sends a response. The network drops the response. The coordinator times out and retries step 3. The service receives the request a second time.
If step 3 is "charge $50 to the customer" and it is not idempotent, the customer gets charged $100. If it is idempotent (the service checks whether this charge already exists using a unique request ID), the second call is a no-op.
Idempotency keys (unique request IDs generated by the coordinator and passed with each call) are the standard pattern. The service stores the request ID with the result. On retry, it returns the stored result without re-executing.
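A minimal sketch of the idempotency-key pattern, assuming an in-memory dict in place of the service's database. The `PaymentService` class and its fields are hypothetical names for this example.

```python
import uuid
from typing import Dict


class PaymentService:
    def __init__(self) -> None:
        self.results: Dict[str, dict] = {}  # request ID -> stored result
        self.total_charged_cents = 0

    def charge(self, request_id: str, customer: str, amount_cents: int) -> dict:
        # Retry of a request we already processed: return the stored result.
        if request_id in self.results:
            return self.results[request_id]
        self.total_charged_cents += amount_cents
        result = {
            "charge_id": f"ch-{len(self.results) + 1}",
            "customer": customer,
            "amount_cents": amount_cents,
        }
        self.results[request_id] = result
        return result


service = PaymentService()
request_id = str(uuid.uuid4())  # generated once by the coordinator for this step

first = service.charge(request_id, "cust-42", 5000)
retry = service.charge(request_id, "cust-42", 5000)  # response was lost, coordinator retries
```

The customer is charged $50 once even though the call ran twice. In a real service the key lookup and the charge would need to happen in the same local transaction (or behind a unique constraint) so that two concurrent retries cannot both pass the check.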
Temporal: The Modern Saga Platform
Temporal (open source, a fork of Uber's Cadence) is one of the most widely used saga orchestration platforms. It deserves specific attention because it changes how you think about sagas.
In Temporal, a saga is a "workflow." Each step is an "activity." The workflow code looks like regular sequential code:
    order = createOrder(details)
    reservation = reserveInventory(order)
    try:
        payment = processPayment(order)
    except PaymentError:
        # Compensate completed steps in reverse order, then surface the failure.
        releaseInventory(reservation)
        cancelOrder(order)
        raise
    shipment = createShipment(order)
This looks like normal code, but Temporal records the result of every activity call in a durable event history. If the process crashes between reserveInventory and processPayment, Temporal restarts the workflow and replays it against that history, reusing the recorded results of completed activities instead of running them again.
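Temporal handles retries and timeouts at the platform level. Activities are retried automatically on transient failures (with configurable backoff). Because a retried activity can run more than once, activities should still be idempotent; the workflow and activity IDs that Temporal assigns can serve as the basis for an idempotency key. Compensation logic stays in your workflow code, as in the try/except block above, and runs with the same durability guarantees as the forward steps.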
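The replay idea can be illustrated with a toy context that records activity results in a history list and reuses them on re-execution. This is not Temporal's API: `ReplayContext`, `order_workflow`, and the stub activities are assumptions made for this sketch, and the list stands in for a durable store.

```python
from typing import Any, Callable, List


class ReplayContext:
    """Toy stand-in for a durable execution engine; not Temporal's API."""

    def __init__(self, history: List[Any]) -> None:
        self.history = history  # pretend this list lives in a durable store
        self._cursor = 0

    def activity(self, fn: Callable[..., Any], *args: Any) -> Any:
        if self._cursor < len(self.history):
            result = self.history[self._cursor]  # replay: reuse the recorded result
        else:
            result = fn(*args)                   # first execution: run and record
            self.history.append(result)
        self._cursor += 1
        return result


calls: List[str] = []


def create_order(details: dict) -> dict:
    calls.append("create_order")
    return {"id": "o-1", **details}


def reserve_inventory(order: dict) -> str:
    calls.append("reserve_inventory")
    return "r-1"


def process_payment(order: dict) -> str:
    calls.append("process_payment")
    return "p-1"


def order_workflow(ctx: ReplayContext, details: dict, crash_before_payment: bool = False):
    order = ctx.activity(create_order, details)
    reservation = ctx.activity(reserve_inventory, order)
    if crash_before_payment:
        raise RuntimeError("simulated worker crash")
    payment = ctx.activity(process_payment, order)
    return order["id"], reservation, payment


history: List[Any] = []
try:
    order_workflow(ReplayContext(history), {"sku": "abc"}, crash_before_payment=True)
except RuntimeError:
    pass

# Re-running the same workflow code against the saved history skips completed activities.
result = order_workflow(ReplayContext(history), {"sku": "abc"})
```

Each activity function runs exactly once across both executions, which only works because the workflow code issues the same activity calls in the same order every time it is replayed.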
AWS Step Functions: Managed Orchestration
AWS Step Functions provides saga orchestration as a managed service. You define the workflow as an Amazon States Language (ASL) JSON document with states, transitions, error handlers, and compensations.
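As a rough illustration, the order saga could be expressed as an ASL state machine in which each task's `Catch` clause routes to the compensation chain. The definition is built here as a Python dict and serialized to JSON; the Lambda ARNs, function names, and state names are hypothetical, and a real definition would normally add `Retry` policies as well.

```python
import json
from typing import Dict, Optional

# Hypothetical account, region, and function names.
ARN = "arn:aws:lambda:us-east-1:123456789012:function:"


def task(function: str, next_state: Optional[str], on_error: str) -> Dict:
    """Forward step: on any error, jump to the given compensation state."""
    state: Dict = {
        "Type": "Task",
        "Resource": ARN + function,
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": on_error}],
    }
    if next_state is None:
        state["End"] = True
    else:
        state["Next"] = next_state
    return state


def compensation(function: str, next_state: str) -> Dict:
    """Compensating step: always continues down the compensation chain."""
    return {"Type": "Task", "Resource": ARN + function, "Next": next_state}


definition = {
    "Comment": "Order saga sketch",
    "StartAt": "CreateOrder",
    "States": {
        "CreateOrder": task("create-order", "ReserveInventory", on_error="SagaFailed"),
        "ReserveInventory": task("reserve-inventory", "ChargePayment", on_error="CancelOrder"),
        "ChargePayment": task("charge-payment", "CreateShipment", on_error="ReleaseInventory"),
        "CreateShipment": task("create-shipment", None, on_error="RefundPayment"),
        # Compensations run in reverse order of the forward steps.
        "RefundPayment": compensation("refund-payment", "ReleaseInventory"),
        "ReleaseInventory": compensation("release-inventory", "CancelOrder"),
        "CancelOrder": compensation("cancel-order", "SagaFailed"),
        "SagaFailed": {"Type": "Fail", "Error": "OrderSagaFailed"},
    },
}

asl_json = json.dumps(definition, indent=2)
```

The failure entry point determines how much is undone: a failed shipment enters the chain at `RefundPayment`, while a failed payment enters at `ReleaseInventory`.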
Step Functions is simpler than Temporal (no code for the orchestrator, just configuration) but less flexible. Complex branching logic and dynamic step generation are harder to express in ASL than in Temporal's code-based workflows, and Standard workflows are limited to one year of execution, which matters for very long-running sagas.
For AWS-native architectures with Lambda functions as saga steps, Step Functions is the path of least resistance. For complex workflows or multi-cloud deployments, Temporal gives you more control.
Saga vs. 2PC: The Real Trade-Off
| Property | 2PC | Saga |
|---|---|---|
| Consistency | Strong (atomic) | Eventual |
| Isolation | Full (locks held) | None (intermediate states visible) |
| Availability | Blocked if any participant down | Continues with compensation |
| Coupling | Tight (all participants must respond) | Loose (each step independent) |
| Complexity | Protocol complexity | Business logic complexity |
| Use case | Databases under one team's control | Cross-service workflows |
The saga gives up isolation and atomicity. Between steps, other transactions can see the intermediate state. An order that has been created but not yet paid is visible to queries. This is called the "dirty read" problem in sagas.
Countermeasures include: semantic locks (mark the order as "pending" so queries can filter it), commutative updates (design operations so the order of concurrent updates does not matter), and pessimistic views (reorder the saga's steps so that the riskiest updates happen last, limiting the damage a dirty read can do).
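A semantic lock can be sketched as a status flag that readers and writers both respect. The `OrderStore` class and its method names are hypothetical; the point is only that a pending order is hidden from "committed" queries and rejects conflicting operations until its saga finishes.

```python
from enum import Enum
from typing import Dict, List


class OrderStatus(Enum):
    PENDING = "pending"      # semantic lock: a saga is still in flight
    CONFIRMED = "confirmed"
    CANCELLED = "cancelled"


class OrderStore:
    def __init__(self) -> None:
        self._status: Dict[str, OrderStatus] = {}

    def begin_saga(self, order_id: str) -> None:
        self._status[order_id] = OrderStatus.PENDING

    def end_saga(self, order_id: str, succeeded: bool) -> None:
        self._status[order_id] = (
            OrderStatus.CONFIRMED if succeeded else OrderStatus.CANCELLED
        )

    def committed_orders(self) -> List[str]:
        """Queries filter out orders whose saga has not finished."""
        return sorted(
            oid for oid, status in self._status.items()
            if status is OrderStatus.CONFIRMED
        )

    def request_cancellation(self, order_id: str) -> bool:
        """Other operations refuse to touch an order while it is locked."""
        if self._status[order_id] is OrderStatus.PENDING:
            return False  # caller should retry once the saga completes
        self._status[order_id] = OrderStatus.CANCELLED
        return True


store = OrderStore()
store.begin_saga("o-1")
store.begin_saga("o-2")
store.end_saga("o-1", succeeded=True)

visible = store.committed_orders()
blocked = store.request_cancellation("o-2")  # still pending
allowed = store.request_cancellation("o-1")  # saga finished
```

Whether a blocked caller should fail fast or wait for the lock to clear is a design choice that depends on the operation.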
When Sagas Are Overkill
If your "distributed transaction" is really just two database writes that could live in the same database, put them in the same database and use a local transaction. Adding saga infrastructure for something that does not need it wastes engineering time and adds runtime complexity.
If your flow is fire-and-forget (send a notification, update analytics), use a simple message queue with at-least-once delivery. Sagas are for flows where failure requires compensating previously completed work.
If your consistency requirements are strict (financial regulations, legal compliance), evaluate whether eventual consistency with compensating actions meets those requirements. Sometimes the answer is no, and you need synchronous coordination despite the availability cost.
Key Points
- A saga is a sequence of local transactions where each step has a compensating action. If step 3 fails, steps 2 and 1 are undone by running their compensating actions in reverse order. This gives you eventual consistency across services without locking resources.
- Two execution styles: choreography (each service listens for events and decides what to do next) and orchestration (a central coordinator tells each service what to do). Choreography is simpler for 2-3 services. Orchestration scales better and is easier to debug.
- Sagas replace 2PC in microservice architectures because holding distributed locks across services is impractical. A payment service and a shipping service operated by different teams with different SLAs should not be coupled by a blocking transaction protocol.
- Compensating actions are not the same as rollbacks. A database rollback undoes uncommitted changes. A compensating action is a new forward operation that semantically reverses a committed change. Refunding a charge is a compensating action for processing a payment.
- Temporal (a fork of Uber's Cadence), AWS Step Functions, and Axon Framework are among the most common saga orchestration platforms. They handle retries, timeouts, state persistence, and failure tracking so you do not have to build all of that from scratch.
Common Mistakes
- Assuming every operation has a clean compensating action. Some operations cannot be undone. You cannot un-send an email. You cannot un-ship a package. For these, use techniques like delayed execution (do not send until the saga completes) or accept that some manual cleanup will be needed.
- Not making saga steps idempotent. If step 3 fails halfway and gets retried, it might run twice. If it charges a customer without checking whether the charge already happened, you double-charge them. Every step and every compensating action must be safe to retry.
- Using choreography with more than 4-5 services. Without a central coordinator, the flow is spread across event handlers in multiple services. Debugging why a saga got stuck means tracing events across service logs. Orchestration keeps the full flow visible in one place.
- Ignoring the 'in-flight' state. Between steps, the system is in a partially completed state. A user checking their order might see inventory reserved but no payment processed. Design your UIs and APIs to handle intermediate states gracefully.