Saga Pattern

Why Not Just Use 2PC?

Two-phase commit (2PC) works when all participants are part of the same system, controlled by the same team, with similar availability targets. If your order, inventory, and payment services all run on PostgreSQL databases that you operate, 2PC is a reasonable choice.

But in a real microservice architecture, these are separate services run by separate teams, deployed independently, with different databases. The payment service might be a third-party API. The shipping service might go down for maintenance on Tuesdays. Holding distributed locks across all of these while waiting for each to say "yes, I can commit" is fragile. If any participant is slow, every participant is blocked.

Sagas take a different approach. Instead of locking everything and committing atomically, execute each step independently and roll back on failure using compensating actions. You trade atomicity for availability and loose coupling.

How Sagas Work

A saga breaks a distributed transaction into a series of local transactions. Each local transaction updates one service and publishes an event or notifies the next step. If any step fails, previously completed steps are undone by running compensating transactions in reverse order.

Example: E-commerce order flow

Forward steps:

1. Order Service: Create order (status: pending)
2. Inventory Service: Reserve items
3. Payment Service: Charge customer
4. Shipping Service: Create shipment

If Payment (step 3) fails:

Compensate Inventory: Release reserved items
Compensate Order: Mark order as cancelled

Each compensating action is a new operation, not a database rollback. The inventory reservation was committed. The compensation is a new transaction that releases it.
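The forward-then-compensate flow above can be sketched as a small runner. This is a minimal illustration, not a production implementation: the `run_saga` function, the `Step` tuple, and the step names are hypothetical, and the actions are stubs standing in for calls to real services.

```python
from typing import Callable, List, Tuple

# Each step pairs a forward action with the compensation that undoes it.
# (name, action, compensation) -- all names here are illustrative.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception as exc:
            log.append(f"{name} failed: {exc}")
            # Undo only the steps that actually committed, newest first.
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated {done_name}")
            return False
        completed.append(step)
        log.append(f"{name} ok")
    return True


def fail_payment() -> None:
    raise RuntimeError("card declined")


log: List[str] = []
steps: List[Step] = [
    ("create_order", lambda: None, lambda: None),
    ("reserve_inventory", lambda: None, lambda: None),
    ("charge_customer", fail_payment, lambda: None),
    ("create_shipment", lambda: None, lambda: None),
]
succeeded = run_saga(steps, log)
```

With the payment step failing, the shipment step never runs, and the inventory and order compensations run in that order.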

Choreography: Event-Driven Sagas

In choreography, there is no central coordinator. Each service listens for events and reacts.

Order Service creates the order and publishes "OrderCreated." Inventory Service listens for "OrderCreated," reserves items, and publishes "InventoryReserved." Payment Service listens for "InventoryReserved," charges the customer, and publishes "PaymentProcessed." Shipping Service listens for "PaymentProcessed" and creates the shipment.

If Payment fails, it publishes "PaymentFailed." Inventory Service listens for "PaymentFailed" and releases the reservation. Order Service listens and cancels the order.
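As a rough sketch of the event flow just described, the following uses an in-memory `EventBus` in place of a real message broker. The class, the handler functions, and the `card_ok` payload field are assumptions made for illustration; a real system would deliver events asynchronously and durably.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict], None]


class EventBus:
    """In-memory stand-in for a message broker (synchronous delivery)."""

    def __init__(self) -> None:
        self.handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.published: List[str] = []

    def subscribe(self, event: str, handler: Handler) -> None:
        self.handlers[event].append(handler)

    def publish(self, event: str, payload: Dict) -> None:
        self.published.append(event)
        for handler in list(self.handlers[event]):
            handler(payload)


bus = EventBus()
orders: Dict[str, str] = {}
reserved: Dict[str, bool] = {}


def on_order_created(evt: Dict) -> None:  # Inventory Service
    reserved[evt["order_id"]] = True
    bus.publish("InventoryReserved", evt)


def on_inventory_reserved(evt: Dict) -> None:  # Payment Service
    # "card_ok" is a hypothetical flag that simulates the charge outcome.
    bus.publish("PaymentProcessed" if evt["card_ok"] else "PaymentFailed", evt)


def release_reservation(evt: Dict) -> None:  # Inventory Service compensation
    reserved[evt["order_id"]] = False


def cancel_order(evt: Dict) -> None:  # Order Service compensation
    orders[evt["order_id"]] = "cancelled"


bus.subscribe("OrderCreated", on_order_created)
bus.subscribe("InventoryReserved", on_inventory_reserved)
bus.subscribe("PaymentFailed", release_reservation)
bus.subscribe("PaymentFailed", cancel_order)

# Order Service creates the order, then announces it.
orders["o-1"] = "pending"
bus.publish("OrderCreated", {"order_id": "o-1", "card_ok": False})
```

Note that no component holds the whole flow: the sequence only emerges from which handler subscribes to which event name.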

When Choreography Works

Small sagas with 2-3 steps
Simple, linear flows without branching
Teams that own the full flow and can reason about it end-to-end
Systems that already use event-driven architecture extensively

When Choreography Breaks Down

With 6+ services, the event flow becomes a tangled web. Service A publishes events that trigger Service B, which triggers C, which triggers D. Understanding the full flow requires reading event handlers across multiple codebases. Debugging a stuck saga means correlating events across service logs.

Adding a new step means modifying multiple services to listen for new events and publish new ones. Removing a step risks breaking downstream listeners. The coupling is implicit (through event names) rather than explicit (through a workflow definition), which makes it harder to reason about.

Orchestration: Centralized Coordination

In orchestration, a saga coordinator (also called a saga execution coordinator or SEC) manages the flow. It knows the full sequence of steps, executes them in order, and handles failures.

The coordinator calls each service synchronously or asynchronously:

1. Call Order Service to create the order. If success, proceed.
2. Call Inventory Service to reserve items. If success, proceed.
3. Call Payment Service to charge. If failure, run compensations.

The coordinator persists its state (current step, outcomes of previous steps) so it can recover after a crash. If the coordinator goes down between steps 2 and 3, it restarts from its last persisted state and continues.
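One way to picture the crash recovery described above is a coordinator that checkpoints after every step. The sketch below assumes a JSON file as the durable store and uses hypothetical names (`Coordinator`, `SimulatedCrash`, `crash_before`); a real coordinator would use a database and would also record step outcomes.

```python
import json
import os
import tempfile
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], None]]


class SimulatedCrash(Exception):
    """Raised to imitate the coordinator process dying mid-saga."""


class Coordinator:
    """Runs steps in order, checkpointing progress to a JSON file."""

    def __init__(self, steps: List[Step], state_path: str) -> None:
        self.steps = steps
        self.state_path = state_path

    def _load(self) -> int:
        if not os.path.exists(self.state_path):
            return 0
        with open(self.state_path) as f:
            return json.load(f)["next_step"]

    def _save(self, next_step: int) -> None:
        with open(self.state_path, "w") as f:
            json.dump({"next_step": next_step}, f)

    def run(self, crash_before: Optional[int] = None) -> None:
        for i in range(self._load(), len(self.steps)):
            if crash_before == i:
                raise SimulatedCrash(f"crashed before step {i}")
            _, action = self.steps[i]
            action()
            # A crash between action() and _save() would repeat the step
            # on restart, which is why steps must be idempotent.
            self._save(i + 1)


calls: List[str] = []
steps: List[Step] = [
    ("create_order", lambda: calls.append("create_order")),
    ("reserve_inventory", lambda: calls.append("reserve_inventory")),
    ("charge_customer", lambda: calls.append("charge_customer")),
]
state_path = os.path.join(tempfile.mkdtemp(), "saga-state.json")

try:
    Coordinator(steps, state_path).run(crash_before=2)
except SimulatedCrash:
    pass

# A fresh coordinator instance resumes from the persisted checkpoint.
Coordinator(steps, state_path).run()
```

The second coordinator instance shares no memory with the first; everything it needs comes from the checkpoint file.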

Benefits of Orchestration

The full workflow is visible in one place (the coordinator's definition)
Adding, removing, or reordering steps modifies one component
Debugging is straightforward: check the coordinator's state log
Complex flows (branching, parallel steps, conditional logic) are natural
Timeouts and retries are centralized

The Coordinator Is Not a Single Point of Failure

A common objection: doesn't the coordinator introduce a single point of failure? Not if it is implemented correctly. Platforms like Temporal persist every state transition to a durable store. If a coordinator process crashes, a restarted or replacement process loads that state and picks up where it left off. The coordinator process is stateless at runtime; all state lives in the persistent store, so the coordinator is only as available as that store.

Compensating Transactions: The Hard Part

Designing compensating actions is where most saga implementations get tricky.

Reversible operations: creating a reservation, placing a hold on funds, setting a status flag. The compensating action undoes them: release the reservation, remove the hold, reset the flag.

Irreversible operations: sending an email, charging a credit card (you can refund, but the charge happened), shipping a physical package. For these, you need strategies:

Delay execution: do not send the confirmation email until the saga fully completes. Queue it as a "pending" action.
Accept partial failure: charge the card, and if a later step fails, issue a refund. The customer sees a charge and then a refund. Not ideal, but correct.
Human intervention: some failures require manual resolution. A shipped package that needs to be recalled is an operational problem, not an algorithmic one.
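The "delay execution" strategy can be sketched as a queue of deferred side effects that is flushed only once the saga's outcome is known. The `PendingActions` class and its method names are hypothetical; in a real system the queue would need to be persisted alongside the saga state.

```python
from typing import Callable, List


class PendingActions:
    """Holds irreversible side effects until the saga's outcome is known."""

    def __init__(self) -> None:
        self._queue: List[Callable[[], None]] = []

    def defer(self, action: Callable[[], None]) -> None:
        self._queue.append(action)

    def commit(self) -> None:
        """Saga succeeded: run the side effects in the order they were queued."""
        queued, self._queue = self._queue, []
        for action in queued:
            action()

    def discard(self) -> None:
        """Saga failed: drop the side effects; nothing needs compensating."""
        self._queue.clear()


sent: List[str] = []

# Saga 1 fails after the email was queued: the email is never sent.
failed_saga = PendingActions()
failed_saga.defer(lambda: sent.append("confirmation for o-1"))
failed_saga.discard()

# Saga 2 completes: the email goes out only now.
ok_saga = PendingActions()
ok_saga.defer(lambda: sent.append("confirmation for o-2"))
ok_saga.commit()
```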

Idempotency: Non-Negotiable

Every saga step and every compensating action must be idempotent. This is not optional. Here is why.

Step 3 runs. The service processes the request and sends a response. The network drops the response. The coordinator times out and retries step 3. The service receives the request a second time.

If step 3 is "charge $50 to the customer" and it is not idempotent, the customer gets charged $100. If it is idempotent (the service checks whether this charge already exists using a unique request ID), the second call is a no-op.

Idempotency keys (unique request IDs generated by the coordinator and passed with each call) are the standard pattern. The service stores the request ID with the result. On retry, it returns the stored result without re-executing.
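A minimal sketch of the idempotency-key pattern, assuming an in-memory dictionary as the result store. The `PaymentService` class and its fields are hypothetical; a real service would write the request ID and the charge in the same database transaction so the two cannot diverge.

```python
import uuid
from typing import Dict


class PaymentService:
    """Toy payment service that deduplicates charges by request ID."""

    def __init__(self) -> None:
        self.total_charged = 0
        self._results: Dict[str, dict] = {}

    def charge(self, request_id: str, amount: int) -> dict:
        # Retry of a request we already processed: return the stored result.
        if request_id in self._results:
            return self._results[request_id]
        self.total_charged += amount
        result = {"charge_id": f"ch-{len(self._results) + 1}", "amount": amount}
        self._results[request_id] = result
        return result


service = PaymentService()

# The coordinator generates one ID per logical step and reuses it on retry.
request_id = str(uuid.uuid4())
first = service.charge(request_id, 50)
retry = service.charge(request_id, 50)  # response was lost, so we retry
```

The retry returns the same charge record, and the customer is charged $50 once rather than twice.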

Temporal: The Modern Saga Platform

Temporal (open source, evolved from Uber's Cadence) is a widely used platform for orchestrating sagas. It deserves specific attention because it changes how you think about sagas.

In Temporal, a saga is a "workflow." Each step is an "activity." The workflow code looks like regular sequential code:

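# Pseudocode: each call below stands for a Temporal activity invocation.
order = createOrder(details)
reservation = reserveInventory(order)
try:
    payment = processPayment(order)
except PaymentError:
    # Compensate the completed steps in reverse order, then surface the failure.
    releaseInventory(reservation)
    cancelOrder(order)
    raise
shipment = createShipment(order)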

This looks like normal code, but Temporal makes it durable: the result of every activity call is recorded in a persisted event history. If the process crashes between reserveInventory and processPayment, Temporal restarts the workflow and replays it from that history, reusing the recorded results of completed activities instead of running them again.

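Temporal handles retries and timeouts at the platform level. Activities are retried automatically on transient failures (with configurable backoff). Because a retried activity can run more than once, activities should still be idempotent; the workflow and activity IDs that Temporal assigns can serve as the basis for an idempotency key. Compensation logic is not generated for you: it is ordinary workflow code, like the except block above.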
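To make "retried automatically with configurable backoff" concrete, here is a generic retry helper. It is not Temporal's API: `retry_with_backoff`, `TransientError`, and the parameter names are hypothetical, and the sketch only illustrates the kind of policy a platform applies on your behalf.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """A failure worth retrying, such as a timeout."""


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,
    initial_delay: float = 1.0,
    factor: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds, multiplying the wait by factor each time."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the saga
            sleep(delay)
            delay *= factor
    raise AssertionError("unreachable")


attempts: List[int] = []
delays: List[float] = []


def flaky_charge() -> str:
    """Fails twice with a transient error, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("gateway timeout")
    return "charged"


# Record the delays instead of actually sleeping.
result = retry_with_backoff(flaky_charge, sleep=delays.append)
```

Only errors classified as transient are retried; anything else propagates immediately so the saga can compensate.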

AWS Step Functions: Managed Orchestration

AWS Step Functions provides saga orchestration as a managed service. You define the workflow as an Amazon States Language (ASL) JSON document with states, transitions, and error handlers (Retry and Catch fields). Compensations are modeled as ordinary states that a Catch handler transitions to.
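A small state machine definition gives a feel for what such a document contains. The sketch below builds it as a Python dictionary and serializes it to JSON; the state names, the account ID, and the Lambda function ARNs are placeholders, not real resources.

```python
import json

# Placeholder ARN prefix: hypothetical account and functions.
ARN = "arn:aws:lambda:us-east-1:123456789012:function:"

state_machine = {
    "Comment": "Order saga sketch: reserve, charge, compensate on failure",
    "StartAt": "ReserveInventory",
    "States": {
        "ReserveInventory": {
            "Type": "Task",
            "Resource": ARN + "reserve-inventory",
            "Next": "ChargeCustomer",
        },
        "ChargeCustomer": {
            "Type": "Task",
            "Resource": ARN + "charge-customer",
            # Retry timeouts with backoff before giving up.
            "Retry": [
                {
                    "ErrorEquals": ["States.Timeout"],
                    "IntervalSeconds": 2,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            # Any remaining error routes to the compensation state.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "ReleaseInventory"}],
            "End": True,
        },
        "ReleaseInventory": {
            "Type": "Task",
            "Resource": ARN + "release-inventory",
            "Next": "SagaFailed",
        },
        "SagaFailed": {
            "Type": "Fail",
            "Error": "PaymentFailed",
            "Cause": "Payment failed; inventory reservation released",
        },
    },
}

asl_json = json.dumps(state_machine, indent=2)
```

The compensation path is explicit in the document: the Catch on ChargeCustomer names ReleaseInventory as its target, and that state ends in a Fail state so the execution is reported as failed.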

Step Functions is simpler than Temporal (no code for the orchestrator, just configuration) but less flexible. Complex branching logic, dynamic step generation, and long-running workflows (months) are harder to express in ASL than in Temporal's code-based workflows.

For AWS-native architectures with Lambda functions as saga steps, Step Functions is the path of least resistance. For complex workflows or multi-cloud deployments, Temporal gives you more control.

Saga vs. 2PC: The Real Trade-Off

Property	2PC	Saga
Consistency	Strong (atomic)	Eventual
Isolation	Full (locks held)	None (intermediate states visible)
Availability	Blocked if any participant down	Continues with compensation
Coupling	Tight (all participants must respond)	Loose (each step independent)
Complexity	Protocol complexity	Business logic complexity
Use case	Tightly coupled databases under one owner	Cross-service workflows

The saga gives up isolation and atomicity. Between steps, other transactions can see the intermediate state. An order that has been created but not yet paid is visible to queries. This is called the "dirty read" problem in sagas.

Countermeasures include: semantic locks (mark the order as "pending" so queries can filter it), commutative updates (design operations so the order of concurrent updates does not matter), and pessimistic views (reorder the saga's steps so that a dirty read exposes the lower-risk intermediate state).
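The semantic-lock countermeasure amounts to readers honoring an application-level flag. In this sketch the `orders` dictionary stands in for a table, and the function names are hypothetical.

```python
from typing import Dict, List

# "pending" acts as the semantic lock: the saga for o-2 is still in flight.
orders: Dict[str, dict] = {
    "o-1": {"status": "confirmed", "total": 40},
    "o-2": {"status": "pending", "total": 25},
    "o-3": {"status": "confirmed", "total": 10},
}


def visible_orders(all_orders: Dict[str, dict]) -> List[str]:
    """Readers skip rows whose saga has not finished."""
    return sorted(
        oid for oid, order in all_orders.items() if order["status"] != "pending"
    )


def confirmed_revenue(all_orders: Dict[str, dict]) -> int:
    """Aggregate only over rows that are no longer semantically locked."""
    return sum(
        order["total"]
        for order in all_orders.values()
        if order["status"] == "confirmed"
    )


before = confirmed_revenue(orders)

# The saga's final step clears the lock by moving the order out of "pending".
orders["o-2"]["status"] = "confirmed"
after = confirmed_revenue(orders)
```

The lock is advisory: it only works if every reader applies the filter, which is why it is usually enforced in a shared query layer or view.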

When Sagas Are Overkill

If your "distributed transaction" is really just two database writes that could live in the same database, put them in the same database and use a local transaction. Adding saga infrastructure for something that does not need it wastes engineering time and adds runtime complexity.

If your flow is fire-and-forget (send a notification, update analytics), use a simple message queue with at-least-once delivery. Sagas are for flows where failure requires compensating previously completed work.

If your consistency requirements are strict (financial regulations, legal compliance), evaluate whether eventual consistency with compensating actions meets those requirements. Sometimes the answer is no, and you need synchronous coordination despite the availability cost.

