DynamoDB
AWS managed NoSQL database with single-digit millisecond reads at any scale
For teams building on AWS with well-defined access patterns, DynamoDB is probably the right database. Full stop. It is the one NoSQL service where operations are truly handed off to AWS with no need to think about nodes, disk space, and failover. The catch is that convenience comes at a price, both in dollars and in flexibility. The data model must be designed around queries upfront, and changing access patterns later is painful.
The service grew out of Amazon's 2007 Dynamo paper. After more than a decade of running it in production for things like the Amazon.com shopping cart and Prime Video, AWS has turned it into something genuinely battle-tested. That does not mean it is the right tool for everything. But for the problems it solves, nothing else on AWS comes close.
How It Works Internally
DynamoDB stores data across SSD-backed storage nodes organized into partitions. Each table gets divided into partitions based on the hash of the partition key (consistent hashing). Every partition holds up to 10GB of data and gets replicated across three availability zones: one leader replica handles writes, two follower replicas serve eventually consistent reads. For strongly consistent reads, those always go to the leader.
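The routing step can be sketched as a pure function. DynamoDB's actual hash function is internal and undocumented, so this is an illustration of the idea with MD5 standing in and a fixed partition count:

```python
import hashlib

def route_to_partition(partition_key: str, num_partitions: int) -> int:
    """Hash the partition key and map it onto one of the table's partitions.
    MD5 is an illustrative stand-in; DynamoDB's real hash is not public."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    key_hash = int.from_bytes(digest, "big")
    return key_hash % num_partitions

# Items with the same partition key always land on the same partition,
# which is why key distribution decides whether load spreads or concentrates.
assert route_to_partition("TENANT#123", 8) == route_to_partition("TENANT#123", 8)
```

Because the mapping is deterministic, one overused key value concentrates all of its traffic on a single partition no matter how many partitions the table has.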
The request router sits in front of everything. It authenticates every request through IAM, hashes the partition key to find the right partition, and routes the request to the correct storage node. On writes, the leader applies the change to its local B-tree storage, replicates to both followers, and acknowledges the write after two of three replicas confirm. This is a quorum write. The result: 99.999% availability for Global Tables (multi-region) and 99.99% for single-region tables.
There are two types of secondary indexes, and they behave very differently. Local Secondary Indexes (LSIs) share the base table's partition key but use a different sort key. They enable sorting data differently within a partition. Two big constraints: they must be created at table creation time, and they share the partition's 10GB limit. Global Secondary Indexes (GSIs) have an entirely different partition key. Under the hood, a GSI is basically a separate managed table. It gets its own set of partitions, updates asynchronously from the base table, is always eventually consistent, and consumes its own provisioned throughput.
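The create-time distinction shows up directly in the table definition. Below is a boto3-style `create_table` parameter dict (table, index, and attribute names are made up for illustration; no API call is made). The LSI reuses the table's partition key with a different sort key and must be declared here, at creation time; the GSI gets its own key schema and, in provisioned mode, its own throughput:

```python
# Hypothetical "Orders" table with one LSI and one GSI.
create_table_params = {
    "TableName": "Orders",
    "AttributeDefinitions": [
        {"AttributeName": "CustomerId", "AttributeType": "S"},
        {"AttributeName": "OrderDate", "AttributeType": "S"},
        {"AttributeName": "OrderTotal", "AttributeType": "N"},
        {"AttributeName": "Status", "AttributeType": "S"},
    ],
    "KeySchema": [
        {"AttributeName": "CustomerId", "KeyType": "HASH"},
        {"AttributeName": "OrderDate", "KeyType": "RANGE"},
    ],
    "LocalSecondaryIndexes": [{
        "IndexName": "ByTotal",  # same partition key, different sort key
        "KeySchema": [
            {"AttributeName": "CustomerId", "KeyType": "HASH"},
            {"AttributeName": "OrderTotal", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},
    }],
    "GlobalSecondaryIndexes": [{
        "IndexName": "ByStatus",  # entirely different partition key
        "KeySchema": [{"AttributeName": "Status", "KeyType": "HASH"}],
        "Projection": {"ProjectionType": "KEYS_ONLY"},
        "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
    }],
    "BillingMode": "PROVISIONED",
    "ProvisionedThroughput": {"ReadCapacityUnits": 10, "WriteCapacityUnits": 10},
}
```

Passing these parameters to `client.create_table(**create_table_params)` would create the table; adding an LSI later would require recreating it, while GSIs can be added to an existing table.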
Production Architecture
In practice, most teams that get good at DynamoDB end up using single-table design, a pattern Rick Houlihan popularized. Multiple entity types go into one table using partition key overloading. For example, a SaaS app might use PK=TENANT#123, SK=PROFILE for tenant data and PK=TENANT#123, SK=ORDER#2024-001 for orders. One Query call returns a tenant and all their orders. This pattern plays to DynamoDB's biggest strength: fast, targeted, single-partition queries.
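The key overloading above can be made concrete with a few lines of Python. Table and attribute names here are illustrative; the request dicts follow boto3's low-level parameter shape but no call is made:

```python
def tenant_pk(tenant_id: str) -> str:
    return f"TENANT#{tenant_id}"

# One partition holds the tenant profile and all of its orders.
items = [
    {"PK": tenant_pk("123"), "SK": "PROFILE", "name": "Acme Corp"},
    {"PK": tenant_pk("123"), "SK": "ORDER#2024-001", "total": 42},
    {"PK": tenant_pk("123"), "SK": "ORDER#2024-002", "total": 17},
]

# One Query fetches the profile plus every order for the tenant.
query_all = {
    "TableName": "AppTable",  # hypothetical single-table name
    "KeyConditionExpression": "PK = :pk",
    "ExpressionAttributeValues": {":pk": {"S": tenant_pk("123")}},
}

# A narrower key condition returns only the orders, in sort-key order.
query_orders = {
    "TableName": "AppTable",
    "KeyConditionExpression": "PK = :pk AND begins_with(SK, :prefix)",
    "ExpressionAttributeValues": {
        ":pk": {"S": tenant_pk("123")},
        ":prefix": {"S": "ORDER#"},
    },
}
```

Because every item shares the partition key, both queries are single-partition operations, which is exactly the access shape DynamoDB is fastest at.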
Turn on DynamoDB Streams with NEW_AND_OLD_IMAGES for change data capture. Wire streams to Lambda functions for materialized view maintenance, search index synchronization (e.g., OpenSearch), analytics pipeline ingestion, and cross-service event propagation. Global Tables use Streams internally for multi-region replication with last-writer-wins conflict resolution. It works well, though last-writer-wins is something to think carefully about before relying on.
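A stream consumer in Lambda can look roughly like the sketch below. The event shape matches what the Streams/Lambda integration delivers with `NEW_AND_OLD_IMAGES`; the two index-sync functions are placeholders for whatever downstream system the changes feed:

```python
processed = []

def sync_to_search_index(image):      # placeholder downstream writer
    processed.append(("upsert", image))

def delete_from_search_index(image):  # placeholder downstream delete
    processed.append(("delete", image))

def handler(event, context=None):
    """Process one batch of DynamoDB Streams records."""
    for record in event["Records"]:
        kind = record["eventName"]    # INSERT | MODIFY | REMOVE
        images = record["dynamodb"]
        if kind == "REMOVE":
            delete_from_search_index(images["OldImage"])
        else:
            # MODIFY records carry both images, so computing diffs is possible here.
            sync_to_search_index(images["NewImage"])

# Sample batch in the shape the integration delivers (attribute values
# use DynamoDB JSON, i.e. type-tagged like {"S": "..."}).
sample = {"Records": [
    {"eventName": "INSERT",
     "dynamodb": {"NewImage": {"PK": {"S": "TENANT#123"}, "SK": {"S": "PROFILE"}}}},
    {"eventName": "REMOVE",
     "dynamodb": {"OldImage": {"PK": {"S": "TENANT#123"}, "SK": {"S": "ORDER#2024-001"}}}},
]}
handler(sample)
```

In production the handler should also be idempotent, since Lambda retries failed batches and records can be delivered more than once.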
For read-heavy workloads, put a DAX cluster (minimum 3 nodes for production) between the application and DynamoDB. DAX drops read latency to microseconds with write-through caching. One thing that trips people up: DAX keeps two caches. GetItem results land in an item cache keyed by primary key, while Query and Scan results land in a query cache keyed on the full request parameters, so queries whose filter expressions or values vary rarely hit. Measure the hit rate before committing to the cost of running a DAX cluster.
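A toy model makes the hit-rate pitfall concrete. This is not the DAX client, just two dictionaries imitating how its caches are keyed:

```python
# Toy model: item cache keyed by primary key, query cache keyed by the
# full request parameters (here, partition key plus filter expression).
item_cache, query_cache = {}, {}

def get_item(pk):
    hit = pk in item_cache
    item_cache.setdefault(pk, {"PK": pk})
    return hit

def query(pk, filter_expr):
    key = (pk, filter_expr)   # any parameter variation changes the cache key
    hit = key in query_cache
    query_cache.setdefault(key, [])
    return hit

get_item("TENANT#123")                        # miss (cold cache)
assert get_item("TENANT#123") is True         # hit: same primary key

query("TENANT#123", "total > :v1")            # miss (cold cache)
assert query("TENANT#123", "total > :v2") is False  # miss: parameters differ
```

Repeated point reads cache beautifully; a workload that mostly issues queries with varying filters or values gets far less out of the same cluster.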
Capacity Planning
Each DynamoDB partition supports 3,000 read capacity units (RCU) and 1,000 write capacity units (WCU). One RCU provides one strongly consistent read per second for items up to 4KB; an eventually consistent read costs half an RCU. One WCU provides one write per second for items up to 1KB. These are the numbers to internalize.
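The arithmetic is simple but worth writing down, because the rounding is per request: sizes round up to the next 4KB (reads) or 1KB (writes) boundary. A minimal calculator, following the documented unit definitions:

```python
import math

def rcu_for_read(item_size_bytes: int, strongly_consistent: bool = True) -> float:
    """RCUs consumed by one read per second: 1 RCU per 4KB (strongly
    consistent), half that for eventually consistent reads."""
    units = math.ceil(item_size_bytes / 4096)
    return units if strongly_consistent else units / 2

def wcu_for_write(item_size_bytes: int) -> int:
    """WCUs consumed by one write per second: 1 WCU per 1KB."""
    return math.ceil(item_size_bytes / 1024)

assert rcu_for_read(4096) == 1           # 4KB strong read = 1 RCU
assert rcu_for_read(4097) == 2           # one byte over rounds up to 2
assert rcu_for_read(4096, False) == 0.5  # eventual consistency is half price
assert wcu_for_write(3500) == 4          # 3.5KB write rounds up to 4 WCU
```

The round-up is why item size discipline matters: a 4.1KB item costs twice the read budget of a 4.0KB one.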
For cost optimization, use provisioned capacity with auto-scaling for predictable workloads. Target around 70% utilization. Use on-demand for spiky or unpredictable traffic. The price difference is real: on-demand costs roughly $1.25 per million write units vs $0.00065 per WCU-hour in provisioned mode. That is about 6.5x more expensive at steady state. Monitor ConsumedReadCapacityUnits and ConsumedWriteCapacityUnits per partition using CloudWatch Contributor Insights. This is how hot partitions get spotted before they cause throttling. Also, keep item sizes under 4KB when possible. Every 4KB costs one RCU, so bloated items drain the read budget fast.
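The on-demand vs provisioned gap can be sanity-checked from the prices quoted above (verify against current regional pricing before relying on the numbers; the multiple quoted in the text assumes high sustained utilization, and it shrinks as utilization drops):

```python
# Steady-state, back-of-the-envelope comparison for writes.
on_demand_per_million_writes = 1.25    # $ per 1M write request units (quoted above)
provisioned_wcu_hour = 0.00065         # $ per WCU-hour (quoted above)

# One WCU sustained for an hour can absorb 3,600 writes of <=1KB.
provisioned_per_million = provisioned_wcu_hour / 3600 * 1_000_000

ratio = on_demand_per_million_writes / provisioned_per_million
print(f"provisioned: ${provisioned_per_million:.3f}/M writes; "
      f"on-demand is {ratio:.1f}x at full utilization")
```

At a 70% utilization target the effective provisioned cost rises proportionally, so the real-world multiple is somewhat lower, which is exactly why steady workloads belong on provisioned capacity and spiky ones on on-demand.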
Failure Scenarios
Scenario 1: Hot partition throttling despite sufficient table capacity. Someone uses date as the partition key for an events table. Every write for the current day hits one partition, which maxes out at 1,000 WCU no matter how much total capacity the table has. Adaptive capacity helps a bit, but it has limits. The application starts throwing ProvisionedThroughputExceededException while CloudWatch shows table-level utilization at only 20%. That gap between table-level metrics and partition-level reality is where this bug hides. Detection: turn on CloudWatch Contributor Insights to find the most throttled partition keys. Fix: add a random suffix to the partition key (e.g., 2024-01-15#3 where the suffix is 0-9), spreading writes across 10 partitions. At read time, issue 10 parallel queries and merge results client-side. This is called write sharding. It works, but it adds complexity to the read path. Worth it when there is no other option for distributing writes.
Scenario 2: GSI backpressure throttling the base table. The GSI is under-provisioned relative to the write volume on the base table. DynamoDB replicates writes to GSIs asynchronously, but when the GSI cannot keep up, a replication backlog builds. If that backlog grows large enough, DynamoDB starts throttling writes on the base table itself to keep the GSI from falling irrecoverably behind. Write throttling appears on the base table even though the base table has plenty of capacity. This one is confusing to debug without knowing to look at GSI metrics. Detection: monitor ReplicationLatency for existing GSIs and ThrottledRequests on the base table alongside GSI metrics. Fix: provision GSI write capacity to at least match the base table's write throughput. Or just use on-demand capacity mode for both the table and GSI so they scale together automatically.
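The provisioning fix for Scenario 2 is an `update_table` call shaped like the parameter dict below (table and index names are illustrative, and no API call is made; the point is that the GSI's write capacity is set from the base table's write throughput rather than guessed independently):

```python
base_table_wcu = 1000  # illustrative: the base table's provisioned write capacity

# boto3-style update_table parameters raising the GSI's write capacity to
# at least match the base table, so async replication cannot build a
# backlog and back-pressure base-table writes.
update_table_params = {
    "TableName": "Orders",
    "GlobalSecondaryIndexUpdates": [{
        "Update": {
            "IndexName": "ByStatus",
            "ProvisionedThroughput": {
                "ReadCapacityUnits": 100,
                "WriteCapacityUnits": base_table_wcu,  # >= base-table writes
            },
        },
    }],
}
```

Note that a GSI with a smaller key than the base table can need even more write capacity than this, since every base-table write that touches an indexed attribute can produce two index writes (a delete of the old entry plus a put of the new one).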
Pros
- Fully managed, zero operational overhead
- Single-digit millisecond latency at any scale
- Automatic scaling with on-demand capacity
- Built-in DAX caching layer
- Global Tables for multi-region replication
Cons
- Vendor lock-in to AWS
- Expensive at large scale compared to self-managed alternatives
- Default quota of 20 GSIs per table (raisable via a quota increase)
- Item size limited to 400 KB
- Complex pricing model (RCU/WCU)
When to use
- Building on AWS and want zero ops overhead
- Need predictable single-digit millisecond performance
- Access patterns are known and well-defined
- Serverless architectures with Lambda
When NOT to use
- Need complex relational queries or joins
- Want multi-cloud or vendor-neutral solution
- Data model is highly relational
- Need full-text search capabilities
Key Points
- Partition key design is the single most important decision. DynamoDB hashes the partition key to decide which storage partition handles the request, so key distribution is the primary factor in performance.
- Each partition supports up to 3,000 RCU and 1,000 WCU (or 10GB of data). Exceeding any limit triggers a partition split, but hot partitions still bottleneck at per-partition throughput limits.
- On-demand mode handles unpredictable traffic by instantly scaling to 2x the previous peak, but it costs roughly 6.5x more per request than provisioned capacity. Use provisioned with auto-scaling for steady workloads.
- DynamoDB Streams captures every item-level change in order, enabling CDC patterns, materialized views, cross-region replication (Global Tables), and event-driven architectures with Lambda triggers.
- DAX (DynamoDB Accelerator) delivers microsecond read latency with two in-memory caches: an item cache for point reads and a query cache keyed on exact request parameters. It does not compute aggregations, and workloads with varied query parameters see little benefit.
- Single-table design patterns model multiple entity types in one table using partition key overloading (e.g., PK=USER#123, PK=ORDER#456). This reduces the number of requests by co-locating related data in the same partition.
Common Mistakes
- ✗Hot partition keys. Using a low-cardinality partition key (e.g., status, date) concentrates traffic on a few partitions, causing ProvisionedThroughputExceededException even when table-level capacity looks fine.
- ✗Not creating GSIs for secondary access patterns. Performing Scan operations to find items by non-key attributes costs O(table size) in both time and RCU. Design a GSI for every distinct query pattern.
- ✗Over-provisioning in provisioned mode. Setting WCU/RCU based on peak traffic without auto-scaling wastes money during off-peak hours. Turn on auto-scaling with a target utilization around 70%.
- ✗Scanning instead of querying. Scan reads every item in the table and filters client-side. Query targets a single partition and applies key conditions server-side, using orders of magnitude fewer RCUs.
- ✗Not using batch operations. BatchGetItem and BatchWriteItem reduce round trips by up to 25x (max 25 items per batch for writes, 100 for reads) and they should be treated as mandatory for any bulk data processing.
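The batching mistake above is usually fixed with a small chunking helper. The sketch below builds BatchWriteItem request bodies (table name is illustrative, no call is made); real code must also resend any UnprocessedItems from each response:

```python
from itertools import islice

MAX_BATCH_WRITE = 25  # BatchWriteItem limit per request

def chunks(items, size):
    """Yield successive fixed-size batches from an iterable."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

# 103 items become 5 requests instead of 103 individual PutItem round trips.
items = [{"PK": {"S": f"ITEM#{i}"}} for i in range(103)]
requests = [
    {"RequestItems": {"MyTable": [{"PutRequest": {"Item": item}} for item in batch]}}
    for batch in chunks(items, MAX_BATCH_WRITE)
]
assert len(requests) == 5
assert len(requests[0]["RequestItems"]["MyTable"]) == 25
```

Each dict in `requests` is the shape boto3's `batch_write_item` expects; the higher-level `Table.batch_writer()` context manager does this chunking and the UnprocessedItems retries automatically.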