API Aggregator (Backend-for-Frontend)
Service that calls N downstream services concurrently and merges results into a single response. Common in mobile/web BFFs. Patterns: fan out with errgroup/TaskGroup/gather, per-call timeouts, partial-result tolerance (return what we have if some fail), per-downstream circuit breakers.
What it is
An API aggregator (or BFF, Backend for Frontend) is a service that calls multiple downstream services concurrently and merges their results into a single response. It exists to keep clients (especially mobile) simple: one HTTP call gets everything they need.
The shape: client → aggregator → fan out to N services → merge → respond.
Without the aggregator, the client has to make N calls, manage parallelism on a constrained device, and handle partial failures itself. With it, the aggregator does that work in the data center where it's faster and easier.
What makes a good aggregator
Concurrent fan-out. Total latency is max of downstream latencies, not sum. If user-service is 50ms, cart-service is 80ms, recs-service is 200ms, the aggregator returns at 200ms (the slowest). Sequential calls would take 330ms.
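The max-not-sum arithmetic is easy to demonstrate. A minimal sketch using CompletableFuture, with the 50/80/200ms latencies from above simulated by sleeps (service names and the dedicated three-thread pool are illustrative):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class FanOut {
    // Simulated downstream call with a fixed latency.
    static String slowCall(String name, long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { throw new IllegalStateException(e); }
        return name;
    }

    // Fan out three simulated downstreams and return the observed wall-clock latency.
    static long timedFanOutMillis() {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        try {
            long start = System.nanoTime();
            var user = CompletableFuture.supplyAsync(() -> slowCall("user", 50), pool);
            var cart = CompletableFuture.supplyAsync(() -> slowCall("cart", 80), pool);
            var recs = CompletableFuture.supplyAsync(() -> slowCall("recs", 200), pool);
            CompletableFuture.allOf(user, cart, recs).join();
            return (System.nanoTime() - start) / 1_000_000;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Close to the slowest call (~200ms), well under the 330ms sum.
        System.out.println("elapsed ~" + timedFanOutMillis() + "ms");
    }
}
```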
Per-call timeout. Each downstream gets its own deadline. One slow downstream shouldn't push the whole page over budget. The block-level timeout (whole-page deadline) is the upper bound; per-call timeouts cap individual downstreams.
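A per-call timeout with a fallback can be sketched with CompletableFuture.completeOnTimeout; the 100ms budget and the 500ms "slow today" downstream are illustrative numbers:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PerCallTimeout {
    // A recs call that is slow today: 500ms against a 100ms per-call budget.
    static List<String> slowRecs() {
        try { Thread.sleep(500); } catch (InterruptedException e) { /* cancelled: fine */ }
        return List.of("item-1");
    }

    static List<String> recsWithTimeout() {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            return CompletableFuture
                .supplyAsync(PerCallTimeout::slowRecs, pool)
                .completeOnTimeout(List.of(), 100, TimeUnit.MILLISECONDS) // fall back to empty
                .join();
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(recsWithTimeout()); // [] -- the page renders without recs
    }
}
```

The per-call budget (100ms here) should sit comfortably inside the whole-page deadline so one slow downstream can never consume the entire budget.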
Partial-result tolerance. Some downstreams are critical (without user-service, the page is broken). Some are optional (without recs, the page renders without that section). Decide per-downstream; structure the code so optional failures degrade gracefully.
Per-downstream resilience. Each downstream gets its own circuit breaker, its own bulkhead, its own retry. When one is unhealthy, only its calls suffer; calls to healthy ones continue.
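A minimal sketch of a count-based per-downstream breaker. The threshold, demo wiring, and CircuitOpenException name are illustrative; production code typically uses a library such as Resilience4j, which adds half-open probing and time-windowed failure rates:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

public class Breaker {
    static class CircuitOpenException extends RuntimeException {}

    private final int threshold;
    private final AtomicInteger consecutiveFailures = new AtomicInteger();

    Breaker(int threshold) { this.threshold = threshold; }

    <T> T call(Callable<T> task) throws Exception {
        if (consecutiveFailures.get() >= threshold) {
            throw new CircuitOpenException();   // open: fail fast, no timeout wait
        }
        try {
            T result = task.call();
            consecutiveFailures.set(0);         // success resets the breaker
            return result;
        } catch (Exception e) {
            consecutiveFailures.incrementAndGet();
            throw e;
        }
    }

    // Trip the breaker with three failures, then show the fast-fail path.
    static String demo() {
        var breaker = new Breaker(3);
        for (int i = 0; i < 3; i++) {
            try { breaker.call(() -> { throw new RuntimeException("recs down"); }); }
            catch (Exception e) { /* failure counted */ }
        }
        try {
            return breaker.call(() -> "recs");
        } catch (CircuitOpenException e) {
            return "fallback";                  // instant, not after a timeout
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // fallback
    }
}
```

One breaker instance per downstream is the point: the recs breaker opening has no effect on calls flowing through the user or cart breakers.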
Caching. The single biggest win for aggregator performance. Per-downstream cache (the user object is stable for seconds), page-level cache (the assembled response when client and parameters match), edge cache (CDN-level when applicable).
The structured-concurrency idiom
Modern languages make this easy:
- Java 21+: StructuredTaskScope handles fan-out, cancellation on first failure, and the deadline.
- Go: errgroup.Group with WithContext and SetLimit.
- Python: asyncio.TaskGroup with asyncio.timeout.
- Older Java: CompletableFuture.allOf with manual timeout management.
These primitives bake in the cancellation semantics: when one task fails, siblings are cancelled. When the deadline fires, everything is cancelled. The aggregator code stays clean.
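For the older-Java row above, a sketch of CompletableFuture.allOf with a block-level deadline (service names are illustrative). Note what "manual timeout management" means here: orTimeout fails the composed future, but it does not interrupt the still-running tasks; cancelling them is on you:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AllOfDeadline {
    // allOf joins the fan-out; orTimeout gives the whole block a hard deadline.
    static String load() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            var user = CompletableFuture.supplyAsync(() -> "user-42", pool);
            var cart = CompletableFuture.supplyAsync(() -> "cart-42", pool);
            CompletableFuture.allOf(user, cart)
                .orTimeout(300, TimeUnit.MILLISECONDS) // whole-block deadline
                .join();                               // throws if the deadline fires
            return user.join() + "+" + cart.join();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(load()); // user-42+cart-42
    }
}
```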
Partial results in practice
The pattern: separate the critical fan-out from the optional fan-out.
Critical calls go through an all-or-nothing structure (Go's errgroup, Java's ShutdownOnFailure scope or allSuccessfulOrThrow-style joiner): any single failure fails the whole. These are the calls that produce data the page absolutely needs.
Optional calls run independently. Use gather(return_exceptions=True) or try/catch per call. Failure produces a default (empty list, null) rather than an error.
This split is product-driven: ask the designer "what does the page do if recs is unavailable?" If the answer is "hide the section", recs is optional. If it's "show an error", it's critical.
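The split can also be sketched with plain CompletableFuture (hypothetical user/cart/recs calls): critical futures propagate failure through join, while the optional one falls back through exceptionally:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PartialResults {
    record Page(String user, String cart, List<String> recs) {}

    static Page load() {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        try {
            // Critical: a failure in either of these fails the whole page via join().
            var user = CompletableFuture.supplyAsync(() -> "user-42", pool);
            var cart = CompletableFuture.supplyAsync(() -> "cart-42", pool);
            // Optional: a recs failure degrades to an empty section, not an error.
            var recs = CompletableFuture
                .<List<String>>supplyAsync(() -> { throw new RuntimeException("recs down"); }, pool)
                .exceptionally(e -> List.of());
            return new Page(user.join(), cart.join(), recs.join());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(load()); // the page renders with an empty recs section
    }
}
```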
Caching
Three places to cache:
Per-downstream: cache the result of each downstream call by its inputs. The user object for userId 42 is the same for some seconds. Cheap to maintain (key by call inputs), invalidated when the underlying data changes.
Page-level: the assembled response if the inputs match exactly. Higher hit rate but harder to invalidate (any underlying change invalidates).
CDN/edge: for public or weakly-personalised pages. The aggregator returns cacheable headers; the CDN serves repeat requests without ever touching the aggregator. Free if applicable.
Most teams underuse the per-downstream cache because it requires plumbing (cache key derivation, TTL, invalidation). It pays off heavily on traffic-heavy aggregators.
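A per-downstream cache keyed by call inputs can be sketched with a ConcurrentHashMap and a TTL. Names and the 5-second TTL are illustrative; this sketch checks expiry lazily on read and has no explicit invalidation hook, which a real deployment would need:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class TtlCache<K, V> {
    private record Entry<V>(V value, long expiresAtNanos) {}

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlNanos;

    TtlCache(long ttlMillis) { this.ttlNanos = ttlMillis * 1_000_000; }

    // Key by the downstream call's inputs; serve cached values until the TTL expires.
    V get(K key, Function<K, V> loader) {
        Entry<V> e = entries.get(key);
        if (e != null && System.nanoTime() < e.expiresAtNanos) return e.value(); // hit
        V v = loader.apply(key);                                                 // miss: call downstream
        entries.put(key, new Entry<>(v, System.nanoTime() + ttlNanos));
        return v;
    }

    // Two lookups for the same user produce a single downstream call.
    static int demoDownstreamCalls() {
        var calls = new AtomicInteger();
        var userCache = new TtlCache<String, String>(5_000); // user object stable for ~5s
        Function<String, String> fetchUser = id -> { calls.incrementAndGet(); return "user-" + id; };
        userCache.get("42", fetchUser);
        userCache.get("42", fetchUser);                      // cache hit, no second call
        return calls.get();
    }

    public static void main(String[] args) {
        System.out.println("downstream calls: " + demoDownstreamCalls()); // 1
    }
}
```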
When an aggregator is unnecessary
Single-page apps with a thin GraphQL layer (the GraphQL server does similar fan-out, just with a different API shape). Internal services that already get exactly the data they need. Low-traffic apps where the fan-out per request doesn't matter.
For mobile-facing or high-traffic web APIs with multiple backend systems, the aggregator pays for itself in client simplicity and parallelism wins.
Implementations
All three calls run concurrently in the StructuredTaskScope. Total latency is max of the three, not sum. With joinUntil, the whole block has a hard deadline.
record OrderPage(User user, Cart cart, List<Item> recs) {}

OrderPage loadOrderPage(String userId) throws Exception {
    // ShutdownOnFailure cancels the sibling tasks as soon as one fails.
    try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
        var userTask = scope.fork(() -> userService.fetch(userId));
        var cartTask = scope.fork(() -> cartService.fetch(userId));
        var recsTask = scope.fork(() -> recsService.fetch(userId));

        scope.joinUntil(Instant.now().plus(Duration.ofMillis(300)));
        scope.throwIfFailed();

        return new OrderPage(userTask.get(), cartTask.get(), recsTask.get());
    }
}

Each downstream has its own breaker. When recs is unhealthy, calls to it fail fast (no timeout wait). The page renders without recs immediately, not 300ms slower.
OrderPage loadOrderPage(String userId) throws Exception {
    List<Item> recs;
    try {
        recs = recsBreaker.call(() -> recsService.fetch(userId));
    } catch (CircuitOpenException e) {
        recs = List.of(); // fall back instantly
    }

    // ... user and cart with their own breakers ...

    return new OrderPage(user, cart, recs);
}

Key points
- Fan out concurrently: total latency = max of downstreams, not sum.
- Per-call timeout: one slow downstream shouldn't drag the whole response.
- Partial-result tolerance: decide which downstreams are critical, which are optional.
- Per-downstream circuit breaker: when one is unhealthy, fail it fast and degrade gracefully.
- Cache aggressively: page-level cache and per-downstream cache reduce fan-out cost.
Follow-up questions
- What's the total-latency calculation for an aggregator?
- How to decide which downstreams are critical vs optional?
- Should the aggregator be its own service?
- How are aggregator responses cached?
Gotchas
- Sequential awaits instead of concurrent fan-out: latency becomes sum, not max.
- No per-call timeout: one slow downstream blocks the whole page.
- Treating all downstreams as critical: any single failure breaks the page.
- No circuit breaker per downstream: calls keep timing out against an unhealthy service.
- No caching: every request fans out to all backends and the databases get hammered.