Saga Pattern: Distributed Transactions Without 2PC

What it is

A saga is a way to execute a business transaction that spans multiple services without using two-phase commit. The transaction becomes a sequence of local transactions, each in one service, each with a defined compensating transaction that undoes it.

If everything succeeds, the saga commits in pieces over time. If a step fails, the coordinator runs compensations for the already-completed steps in reverse order. The system reaches a consistent state either way.

This is the standard pattern for any multi-service workflow that touches multiple data stores: place an order (charge card + reserve inventory + dispatch shipping), book a trip (hotel + flight + car rental), provision a tenant (create account + allocate database + send welcome email).
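As a rough illustration of the flow described above, the following Python sketch runs the order example as an in-memory orchestrated saga. The `Step` class, `run_saga`, and the step names are hypothetical, chosen for illustration. It assumes a failed step rolls back its own local transaction, so only previously completed steps are compensated.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]       # forward local transaction
    compensate: Callable[[], None]   # undoes the action's business effect


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # The failed step is assumed to have rolled back locally,
            # so only the steps that already committed are compensated.
            for done in reversed(completed):
                done.compensate()
            return False
        completed.append(step)
    return True


log: List[str] = []


def fail_shipping() -> None:
    raise RuntimeError("no carrier available")


order_saga = [
    Step("charge_card", lambda: log.append("charge"), lambda: log.append("refund")),
    Step("reserve_inventory", lambda: log.append("reserve"), lambda: log.append("release")),
    Step("dispatch_shipping", fail_shipping, lambda: log.append("cancel_shipping")),
]

ok = run_saga(order_saga)
print(ok, log)
```

This sketch keeps everything in memory, so it loses its place if the process crashes. A production saga would also persist its progress between steps.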

Why not 2PC

Two-phase commit needs a transaction coordinator that holds locks across all participants until everyone agrees to commit. Two big problems at scale:

Locks span the network. All participating services hold locks for the duration of the transaction. A slow service makes everyone wait. A network partition can leave locks held indefinitely.
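Strong consistency, weak availability. If the coordinator fails after participants have voted to commit, those participants are left in doubt: they cannot commit or abort on their own, and they keep holding their locks until the coordinator recovers or an operator intervenes.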

A saga trades isolation and strong consistency for availability. Steps commit individually, so readers may briefly see intermediate states such as "order created but not yet paid". The saga ensures eventual consistency by either running to completion or compensating.

The two flavours

Orchestrated: a coordinator service holds the workflow. It calls each step, handles errors, runs compensations on failure. The workflow is visible in one place. Easier to understand, easier to add steps, easier to monitor. The coordinator can be implemented as code, or with a workflow engine (Temporal, Cadence, AWS Step Functions, Camunda).

Choreographed: no coordinator. Each service publishes events when it completes work; other services subscribe and react. The workflow is implicit in the chain of subscriptions. More decoupled, but the workflow lives in many places, debugging is harder, and adding a new step means changing multiple services.
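To make the "workflow is implicit in the chain of subscriptions" point concrete, here is a minimal choreography sketch. The `EventBus` class is a synchronous in-process stand-in for a real message broker, and the event names and the `wire_services` helper are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict], None]


class EventBus:
    """Synchronous in-process stand-in for a message broker."""

    def __init__(self) -> None:
        self._subs: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.history: List[str] = []   # every event published, in order

    def subscribe(self, event: str, handler: Handler) -> None:
        self._subs[event].append(handler)

    def publish(self, event: str, payload: Dict) -> None:
        self.history.append(event)
        for handler in list(self._subs[event]):
            handler(payload)


def wire_services(bus: EventBus, in_stock: bool) -> None:
    # Payment service: charges on a new order, refunds if inventory rejects.
    bus.subscribe("OrderPlaced", lambda p: bus.publish("PaymentCharged", p))
    bus.subscribe("InventoryRejected", lambda p: bus.publish("PaymentRefunded", p))

    # Inventory service: reacts to a successful charge.
    def on_charged(p: Dict) -> None:
        bus.publish("InventoryReserved" if in_stock else "InventoryRejected", p)

    bus.subscribe("PaymentCharged", on_charged)

    # Order service: closes the order out either way.
    bus.subscribe("InventoryReserved", lambda p: bus.publish("OrderConfirmed", p))
    bus.subscribe("PaymentRefunded", lambda p: bus.publish("OrderCancelled", p))


bus = EventBus()
wire_services(bus, in_stock=False)
bus.publish("OrderPlaced", {"order_id": "o-1"})
print(bus.history)
```

No single function here describes the whole workflow. The order of events only emerges from who subscribes to what, which is the debugging cost the choreographed style carries.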

For most teams, orchestrated is the right default. Workflow engines like Temporal handle persistence, retries, and visibility, leaving the steps and the workflow definition as the only things to write.

Compensations are not inverses

A compensation isn't always "the inverse operation". Consider:

Charging a card: compensation is refund (different operation, different audit trail).
Sending an email: compensation is sending another email ("never mind, the order was cancelled"). The original can't be unsent.
Allocating an ID: compensation is marking the ID as released, not deleting it (other systems may have already referenced it).

Designing the compensation often takes more thought than the forward step. Get it wrong and the saga leaves the system in an inconsistent state.
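Two of the cases above can be sketched in a few lines. The `PaymentLedger` and `IdRegistry` classes are hypothetical and exist only to show the shape of a compensation that adds a new record rather than erasing the old one.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class PaymentLedger:
    """Append-only ledger: a refund is a new entry, not a deleted charge."""
    entries: List[Tuple[str, str, int]] = field(default_factory=list)

    def charge(self, order_id: str, cents: int) -> None:
        self.entries.append(("charge", order_id, cents))

    def refund(self, order_id: str, cents: int) -> None:
        self.entries.append(("refund", order_id, cents))

    def net(self, order_id: str) -> int:
        sign = {"charge": 1, "refund": -1}
        return sum(sign[kind] * cents
                   for kind, oid, cents in self.entries if oid == order_id)


@dataclass
class IdRegistry:
    """IDs are never deleted; a compensated allocation is marked released."""
    status: Dict[str, str] = field(default_factory=dict)

    def allocate(self, new_id: str) -> None:
        self.status[new_id] = "active"

    def release(self, existing_id: str) -> None:
        self.status[existing_id] = "released"


ledger = PaymentLedger()
ledger.charge("o-1", 4200)
ledger.refund("o-1", 4200)     # compensation: second entry, audit trail intact

ids = IdRegistry()
ids.allocate("tenant-7")
ids.release("tenant-7")        # compensation: the ID stays resolvable

print(ledger.net("o-1"), len(ledger.entries), ids.status)
```

After compensation the net balance is zero, but both ledger entries remain and the ID still exists, so anything that referenced either one can still resolve it.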

Persistence is essential

A real saga persists state between steps. Without it, a coordinator crash loses the saga's progress; the system can't tell which compensations to run. Standard practice: write the saga's current step to a database after each transition. On startup, the coordinator scans for unfinished sagas and resumes them.
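A minimal sketch of such a saga store, using Python's built-in `sqlite3` module. The table layout, the status values, and the function names are assumptions for illustration. A real store would also record the saga's input and each step's result.

```python
import sqlite3
from typing import List, Tuple


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS saga ("
        " saga_id TEXT PRIMARY KEY,"
        " steps_done INTEGER NOT NULL,"   # number of forward steps committed
        " status TEXT NOT NULL)"          # RUNNING | COMPENSATING | DONE | FAILED
    )
    return conn


def record(conn: sqlite3.Connection, saga_id: str, steps_done: int, status: str) -> None:
    """Persist the saga's position after each transition."""
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "INSERT OR REPLACE INTO saga (saga_id, steps_done, status) VALUES (?, ?, ?)",
            (saga_id, steps_done, status),
        )


def unfinished(conn: sqlite3.Connection) -> List[Tuple[str, int, str]]:
    """What a coordinator would scan for on startup."""
    rows = conn.execute(
        "SELECT saga_id, steps_done, status FROM saga "
        "WHERE status IN ('RUNNING', 'COMPENSATING') ORDER BY saga_id"
    )
    return rows.fetchall()


conn = open_store()
record(conn, "s-1", 1, "RUNNING")
record(conn, "s-1", 2, "RUNNING")   # crash happens after step 2
record(conn, "s-2", 3, "DONE")

pending = unfinished(conn)
print(pending)
```

On restart, the coordinator would resume `s-1` from step 3 or begin compensating steps 2 and 1. Because a crash can land between performing a step and recording it, the steps themselves need to be safe to repeat.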

This is exactly what workflow engines provide out of the box. Rolling a custom saga without persistence is the path to inconsistent production data.

Compensations must be idempotent: use the same idempotency-key pattern as for forward operations. State must be persisted between steps, either in a custom saga store or by a workflow engine. And plan for compensation failure, because it does happen: the saga enters a "needs manual intervention" state, alerts fire, and operators follow a manual recovery process that was designed before it was needed.

For new code on Java/Go/Python, look at Temporal SDKs first. They handle the persistence and resumption logic; the application writes the activity functions and the workflow definition.

Follow-up questions

▸Saga vs 2PC?

2PC (two-phase commit) gives atomic commit with strong consistency: either all participants commit or all abort. It requires a transaction coordinator and locks held across services for the full duration of the transaction. It scales poorly beyond a few services, and it blocks under network partitions or coordinator failure. A saga gives eventual consistency: each step commits its own local transaction, and on failure, compensations undo the completed steps. It scales to many services and tolerates partitions, but readers may see partial state during execution.

▸Orchestrated vs choreographed?

Orchestrated: coordinator service knows the workflow, calls each service in order, handles failures centrally. Pros: visible flow, easy to add steps, easy to monitor. Cons: coordinator becomes a single point of complexity. Choreographed: each service reacts to events; no coordinator. Pros: services are independent. Cons: workflow is implicit (no single place shows it), debugging spans logs across services. Default to orchestrated unless there's a strong reason otherwise.

▸Why must compensations be idempotent?

Because the saga can crash and restart. The coordinator may run a compensation, crash before recording success, restart, and run it again. Idempotent compensations mean the second run is a no-op. Use idempotency keys per compensation, the same way they're used for forward operations.
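The crash-and-replay scenario can be sketched as follows. `RefundService` is a hypothetical in-memory stand-in for a payment client that deduplicates on an idempotency key, and deriving the key from the saga ID and step name is one possible convention, not a requirement.

```python
from typing import Dict


class RefundService:
    """Hypothetical payment client that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self._seen: Dict[str, str] = {}   # idempotency key -> refund id
        self.refunds_issued = 0

    def refund(self, idempotency_key: str, cents: int) -> str:
        if idempotency_key in self._seen:
            # Replay: return the earlier result, do not refund again.
            return self._seen[idempotency_key]
        self.refunds_issued += 1
        refund_id = f"re_{self.refunds_issued}"
        self._seen[idempotency_key] = refund_id
        return refund_id


def compensation_key(saga_id: str, step: str) -> str:
    # Deterministic, so a restarted coordinator derives the same key.
    return f"{saga_id}:{step}:compensate"


svc = RefundService()
key = compensation_key("s-1", "charge_card")

first = svc.refund(key, 4200)    # coordinator crashes before recording success
second = svc.refund(key, 4200)   # after restart, the compensation runs again

print(first, second, svc.refunds_issued)
```

The key must be derived from data that survives the crash. A random key generated on each attempt would defeat the deduplication.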

▸What if a compensation itself fails?

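Now the saga is stuck: it can neither complete nor fully roll back. Retry first, since many compensation failures are transient. If the compensation still fails to converge, the saga is left with an inconsistency that needs human intervention: persist the saga state, alert on-call, and reconcile manually. Don't pretend this can be automated away; design for the case where a human has to look. (This is not the same thing as the saga's "pivot transaction", a term the saga literature uses for the go/no-go step after which the saga no longer rolls back.)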
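One way to sketch that escalation path is below. The function name, the status strings, and the fixed retry count are assumptions, and a real coordinator would persist the returned status and raise an alert rather than just return it.

```python
from typing import Callable, Dict, List, Tuple


def compensate_all(
    compensations: List[Tuple[str, Callable[[], None]]],
    max_attempts: int = 3,
) -> Tuple[str, List[str]]:
    """Run compensations in the given order; return (status, names still pending)."""
    for index, (name, compensate) in enumerate(compensations):
        for attempt in range(1, max_attempts + 1):
            try:
                compensate()
                break
            except Exception:
                if attempt == max_attempts:
                    # Stop here and hand the remainder to a human.
                    pending = [n for n, _ in compensations[index:]]
                    return "NEEDS_MANUAL_INTERVENTION", pending
    return "COMPENSATED", []


calls: Dict[str, int] = {"refund": 0}


def flaky_refund() -> None:
    calls["refund"] += 1
    raise ConnectionError("payment provider unreachable")


status, pending = compensate_all(
    [("release_inventory", lambda: None), ("refund_card", flaky_refund)]
)
print(status, pending, calls["refund"])
```

This sketch stops at the first compensation that exhausts its retries, which keeps the reverse ordering intact. Continuing with the remaining compensations is also defensible when they are independent of each other.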
