DynamoDB
AWS managed NoSQL database with single-digit millisecond reads at any scale
For teams building on AWS with well-defined access patterns, DynamoDB is probably the right database. Full stop. It is the one NoSQL service where operations are truly handed off to AWS with no need to think about nodes, disk space, and failover. The catch is that convenience comes at a price, both in dollars and in flexibility. The data model must be designed around queries upfront, and changing access patterns later is painful.
The service grew out of Amazon's 2007 Dynamo paper. After more than a decade of running it in production for things like the Amazon.com shopping cart and Prime Video, AWS has turned it into something genuinely battle-tested. That does not mean it is the right tool for everything. But for the problems it solves, nothing else on AWS comes close.
How It Works Internally
DynamoDB stores data across SSD-backed storage nodes organized into partitions. Each table gets divided into partitions based on the hash of the partition key (consistent hashing). Every partition holds up to 10GB of data and gets replicated across three availability zones: one leader replica handles writes, two follower replicas serve eventually consistent reads. For strongly consistent reads, those always go to the leader.
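The routing step can be sketched as a pure function. DynamoDB's actual hash function is internal and undocumented, so this is an illustration of the idea with MD5 standing in and a fixed partition count:

```python
import hashlib

def route_to_partition(partition_key: str, num_partitions: int) -> int:
    """Hash the partition key and map it onto one of the table's partitions.
    MD5 is an illustrative stand-in; DynamoDB's real hash is not public."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    key_hash = int.from_bytes(digest, "big")
    return key_hash % num_partitions

# Items with the same partition key always land on the same partition,
# which is why key distribution decides whether load spreads or concentrates.
assert route_to_partition("TENANT#123", 8) == route_to_partition("TENANT#123", 8)
```

Because the mapping is deterministic, one overused key value concentrates all of its traffic on a single partition no matter how many partitions the table has.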
The request router sits in front of everything. It authenticates every request through IAM, hashes the partition key to find the right partition, and routes the request to the correct storage node. On writes, the leader applies the change to its local B-tree storage, replicates to both followers, and acknowledges the write after two of three replicas confirm. This is a quorum write. The result: 99.999% availability for Global Tables (multi-region) and 99.99% for single-region tables.
There are two types of secondary indexes, and they behave very differently. Local Secondary Indexes (LSIs) share the base table's partition key but use a different sort key. They enable sorting data differently within a partition. Two big constraints: they must be created at table creation time, and they share the partition's 10GB limit. Global Secondary Indexes (GSIs) have an entirely different partition key. Under the hood, a GSI is basically a separate managed table. It gets its own set of partitions, updates asynchronously from the base table, is always eventually consistent, and consumes its own provisioned throughput.
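The create-time distinction shows up directly in the table definition. Below is a boto3-style `create_table` parameter dict (table, index, and attribute names are made up for illustration; no API call is made). The LSI reuses the table's partition key with a different sort key and must be declared here, at creation time; the GSI gets its own key schema and, in provisioned mode, its own throughput:

```python
# Hypothetical "Orders" table with one LSI and one GSI.
create_table_params = {
    "TableName": "Orders",
    "AttributeDefinitions": [
        {"AttributeName": "CustomerId", "AttributeType": "S"},
        {"AttributeName": "OrderDate", "AttributeType": "S"},
        {"AttributeName": "OrderTotal", "AttributeType": "N"},
        {"AttributeName": "Status", "AttributeType": "S"},
    ],
    "KeySchema": [
        {"AttributeName": "CustomerId", "KeyType": "HASH"},
        {"AttributeName": "OrderDate", "KeyType": "RANGE"},
    ],
    "LocalSecondaryIndexes": [{
        "IndexName": "ByTotal",  # same partition key, different sort key
        "KeySchema": [
            {"AttributeName": "CustomerId", "KeyType": "HASH"},
            {"AttributeName": "OrderTotal", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},
    }],
    "GlobalSecondaryIndexes": [{
        "IndexName": "ByStatus",  # entirely different partition key
        "KeySchema": [{"AttributeName": "Status", "KeyType": "HASH"}],
        "Projection": {"ProjectionType": "KEYS_ONLY"},
        "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
    }],
    "BillingMode": "PROVISIONED",
    "ProvisionedThroughput": {"ReadCapacityUnits": 10, "WriteCapacityUnits": 10},
}
```

Passing these parameters to `client.create_table(**create_table_params)` would create the table; adding an LSI later would require recreating it, while GSIs can be added to an existing table.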
Production Architecture
In practice, most teams that get good at DynamoDB end up using single-table design, a pattern Rick Houlihan popularized. Multiple entity types go into one table using partition key overloading. For example, a SaaS app might use PK=TENANT#123, SK=PROFILE for tenant data and PK=TENANT#123, SK=ORDER#2024-001 for orders. One Query call returns a tenant and all their orders. This pattern plays to DynamoDB's biggest strength: fast, targeted, single-partition queries.
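The key overloading above can be made concrete with a few lines of Python. Table and attribute names here are illustrative; the request dicts follow boto3's low-level parameter shape but no call is made:

```python
def tenant_pk(tenant_id: str) -> str:
    return f"TENANT#{tenant_id}"

# One partition holds the tenant profile and all of its orders.
items = [
    {"PK": tenant_pk("123"), "SK": "PROFILE", "name": "Acme Corp"},
    {"PK": tenant_pk("123"), "SK": "ORDER#2024-001", "total": 42},
    {"PK": tenant_pk("123"), "SK": "ORDER#2024-002", "total": 17},
]

# One Query fetches the profile plus every order for the tenant.
query_all = {
    "TableName": "AppTable",  # hypothetical single-table name
    "KeyConditionExpression": "PK = :pk",
    "ExpressionAttributeValues": {":pk": {"S": tenant_pk("123")}},
}

# A narrower key condition returns only the orders, in sort-key order.
query_orders = {
    "TableName": "AppTable",
    "KeyConditionExpression": "PK = :pk AND begins_with(SK, :prefix)",
    "ExpressionAttributeValues": {
        ":pk": {"S": tenant_pk("123")},
        ":prefix": {"S": "ORDER#"},
    },
}
```

Because every item shares the partition key, both queries are single-partition operations, which is exactly the access shape DynamoDB is fastest at.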
Turn on DynamoDB Streams with NEW_AND_OLD_IMAGES for change data capture. Wire streams to Lambda functions for materialized view maintenance, search index synchronization (e.g., OpenSearch), analytics pipeline ingestion, and cross-service event propagation. Global Tables use Streams internally for multi-region replication with last-writer-wins conflict resolution. It works well, though last-writer-wins is something to think carefully about before relying on.
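A stream consumer in Lambda can look roughly like the sketch below. The event shape matches what the Streams/Lambda integration delivers with `NEW_AND_OLD_IMAGES`; the two index-sync functions are placeholders for whatever downstream system the changes feed:

```python
processed = []

def sync_to_search_index(image):      # placeholder downstream writer
    processed.append(("upsert", image))

def delete_from_search_index(image):  # placeholder downstream delete
    processed.append(("delete", image))

def handler(event, context=None):
    """Process one batch of DynamoDB Streams records."""
    for record in event["Records"]:
        kind = record["eventName"]    # INSERT | MODIFY | REMOVE
        images = record["dynamodb"]
        if kind == "REMOVE":
            delete_from_search_index(images["OldImage"])
        else:
            # MODIFY records carry both images, so computing diffs is possible here.
            sync_to_search_index(images["NewImage"])

# Sample batch in the shape the integration delivers (attribute values
# use DynamoDB JSON, i.e. type-tagged like {"S": "..."}).
sample = {"Records": [
    {"eventName": "INSERT",
     "dynamodb": {"NewImage": {"PK": {"S": "TENANT#123"}, "SK": {"S": "PROFILE"}}}},
    {"eventName": "REMOVE",
     "dynamodb": {"OldImage": {"PK": {"S": "TENANT#123"}, "SK": {"S": "ORDER#2024-001"}}}},
]}
handler(sample)
```

In production the handler should also be idempotent, since Lambda retries failed batches and records can be delivered more than once.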
For read-heavy workloads, put a DAX cluster (minimum 3 nodes for production) between the application and DynamoDB. DAX drops read latency to microseconds with write-through caching. One thing that trips people up: DAX keeps two caches. GetItem results land in an item cache keyed by primary key, while Query and Scan results land in a query cache keyed on the full request parameters, so queries whose filter expressions or values vary rarely hit. Measure the hit rate before committing to the cost of running a DAX cluster.
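A toy model makes the hit-rate pitfall concrete. This is not the DAX client, just two dictionaries imitating how its caches are keyed:

```python
# Toy model: item cache keyed by primary key, query cache keyed by the
# full request parameters (here, partition key plus filter expression).
item_cache, query_cache = {}, {}

def get_item(pk):
    hit = pk in item_cache
    item_cache.setdefault(pk, {"PK": pk})
    return hit

def query(pk, filter_expr):
    key = (pk, filter_expr)   # any parameter variation changes the cache key
    hit = key in query_cache
    query_cache.setdefault(key, [])
    return hit

get_item("TENANT#123")                        # miss (cold cache)
assert get_item("TENANT#123") is True         # hit: same primary key

query("TENANT#123", "total > :v1")            # miss (cold cache)
assert query("TENANT#123", "total > :v2") is False  # miss: parameters differ
```

Repeated point reads cache beautifully; a workload that mostly issues queries with varying filters or values gets far less out of the same cluster.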
Capacity Planning
Each DynamoDB partition supports 3,000 read capacity units (RCU) and 1,000 write capacity units (WCU). One RCU provides one strongly consistent read per second for items up to 4KB; an eventually consistent read costs half an RCU. One WCU provides one write per second for items up to 1KB. These are the numbers to internalize.
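The arithmetic is simple but worth writing down, because the rounding is per request: sizes round up to the next 4KB (reads) or 1KB (writes) boundary. A minimal calculator, following the documented unit definitions:

```python
import math

def rcu_for_read(item_size_bytes: int, strongly_consistent: bool = True) -> float:
    """RCUs consumed by one read per second: 1 RCU per 4KB (strongly
    consistent), half that for eventually consistent reads."""
    units = math.ceil(item_size_bytes / 4096)
    return units if strongly_consistent else units / 2

def wcu_for_write(item_size_bytes: int) -> int:
    """WCUs consumed by one write per second: 1 WCU per 1KB."""
    return math.ceil(item_size_bytes / 1024)

assert rcu_for_read(4096) == 1           # 4KB strong read = 1 RCU
assert rcu_for_read(4097) == 2           # one byte over rounds up to 2
assert rcu_for_read(4096, False) == 0.5  # eventual consistency is half price
assert wcu_for_write(3500) == 4          # 3.5KB write rounds up to 4 WCU
```

The round-up is why item size discipline matters: a 4.1KB item costs twice the read budget of a 4.0KB one.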
For cost optimization, use provisioned capacity with auto-scaling for predictable workloads. Target around 70% utilization. Use on-demand for spiky or unpredictable traffic. The price difference is real: on-demand costs roughly $1.25 per million write units vs $0.00065 per WCU-hour in provisioned mode. That is about 6.5x more expensive at steady state. Monitor ConsumedReadCapacityUnits and ConsumedWriteCapacityUnits per partition using CloudWatch Contributor Insights. This is how hot partitions get spotted before they cause throttling. Also, keep item sizes under 4KB when possible. Every 4KB costs one RCU, so bloated items drain the read budget fast.
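The on-demand vs provisioned gap can be sanity-checked from the prices quoted above (verify against current regional pricing before relying on the numbers; the multiple quoted in the text assumes high sustained utilization, and it shrinks as utilization drops):

```python
# Steady-state, back-of-the-envelope comparison for writes.
on_demand_per_million_writes = 1.25    # $ per 1M write request units (quoted above)
provisioned_wcu_hour = 0.00065         # $ per WCU-hour (quoted above)

# One WCU sustained for an hour can absorb 3,600 writes of <=1KB.
provisioned_per_million = provisioned_wcu_hour / 3600 * 1_000_000

ratio = on_demand_per_million_writes / provisioned_per_million
print(f"provisioned: ${provisioned_per_million:.3f}/M writes; "
      f"on-demand is {ratio:.1f}x at full utilization")
```

At a 70% utilization target the effective provisioned cost rises proportionally, so the real-world multiple is somewhat lower, which is exactly why steady workloads belong on provisioned capacity and spiky ones on on-demand.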
Failure Scenarios
Scenario 1: Hot partition throttling despite sufficient table capacity. Someone uses date as the partition key for an events table. Every write for the current day hits one partition, which maxes out at 1,000 WCU no matter how much total capacity the table has. Adaptive capacity helps a bit, but it has limits. The application starts throwing ProvisionedThroughputExceededException while CloudWatch shows table-level utilization at only 20%. That gap between table-level metrics and partition-level reality is where this bug hides. Detection: turn on CloudWatch Contributor Insights to find the most throttled partition keys. Fix: add a random suffix to the partition key (e.g., 2024-01-15#3 where the suffix is 0-9), spreading writes across 10 partitions. At read time, issue 10 parallel queries and merge results client-side. This is called write sharding. It works, but it adds complexity to the read path. Worth it when there is no other option for distributing writes.
Scenario 2: GSI backpressure throttling the base table. The GSI is under-provisioned relative to the write volume on the base table. DynamoDB replicates writes to GSIs asynchronously, but when the GSI cannot keep up, a replication backlog builds. If that backlog grows large enough, DynamoDB starts throttling writes on the base table itself to keep the GSI from falling irrecoverably behind. Write throttling appears on the base table even though the base table has plenty of capacity. This one is confusing to debug without knowing to look at GSI metrics. Detection: monitor ReplicationLatency for existing GSIs and ThrottledRequests on the base table alongside GSI metrics. Fix: provision GSI write capacity to at least match the base table's write throughput. Or just use on-demand capacity mode for both the table and GSI so they scale together automatically.
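The provisioning fix for Scenario 2 is an `update_table` call shaped like the parameter dict below (table and index names are illustrative, and no API call is made; the point is that the GSI's write capacity is set from the base table's write throughput rather than guessed independently):

```python
base_table_wcu = 1000  # illustrative: the base table's provisioned write capacity

# boto3-style update_table parameters raising the GSI's write capacity to
# at least match the base table, so async replication cannot build a
# backlog and back-pressure base-table writes.
update_table_params = {
    "TableName": "Orders",
    "GlobalSecondaryIndexUpdates": [{
        "Update": {
            "IndexName": "ByStatus",
            "ProvisionedThroughput": {
                "ReadCapacityUnits": 100,
                "WriteCapacityUnits": base_table_wcu,  # >= base-table writes
            },
        },
    }],
}
```

Note that a GSI with a smaller key than the base table can need even more write capacity than this, since every base-table write that touches an indexed attribute can produce two index writes (a delete of the old entry plus a put of the new one).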
Pros
- Fully managed, zero operational overhead
- Single-digit millisecond latency at any scale
- Automatic scaling with on-demand capacity
- Built-in DAX caching layer
- Global Tables for multi-region replication
Cons
- Vendor lock-in to AWS
- Expensive at large scale compared to self-managed alternatives
- Default quota of 20 GSIs per table (raisable via a quota increase)
- Item size limited to 400 KB
- Complex pricing model (RCU/WCU)
When to use
- Building on AWS and want zero ops overhead
- Need predictable single-digit millisecond performance
- Access patterns are known and well-defined
- Serverless architectures with Lambda
When NOT to use
- Need complex relational queries or joins
- Want multi-cloud or vendor-neutral solution
- Data model is highly relational
- Need full-text search capabilities
Key Points
- Partition key design is the single most important decision. DynamoDB hashes the partition key to decide which storage partition handles the request, so key distribution is the primary factor in performance.
- Each partition supports up to 3,000 RCU and 1,000 WCU (or 10GB of data). Exceeding any limit triggers a partition split, but hot partitions still bottleneck at per-partition throughput limits.
- On-demand mode handles unpredictable traffic by instantly scaling to 2x the previous peak, but it costs roughly 6.5x more per request than provisioned capacity. Use provisioned with auto-scaling for steady workloads.
- DynamoDB Streams captures every item-level change in order, enabling CDC patterns, materialized views, cross-region replication (Global Tables), and event-driven architectures with Lambda triggers.
- DAX (DynamoDB Accelerator) delivers microsecond read latency with two in-memory caches: an item cache for point reads and a query cache keyed on exact request parameters. It does not compute aggregations, and workloads with varied query parameters see little benefit.
- Single-table design patterns model multiple entity types in one table using partition key overloading (e.g., PK=USER#123, PK=ORDER#456). This reduces the number of requests by co-locating related data in the same partition.
Common Mistakes
- ✗Hot partition keys. Using a low-cardinality partition key (e.g., status, date) concentrates traffic on a few partitions, causing ProvisionedThroughputExceededException even when table-level capacity looks fine.
- ✗Not creating GSIs for secondary access patterns. Performing Scan operations to find items by non-key attributes costs O(table size) in both time and RCU. Design a GSI for every distinct query pattern.
- ✗Over-provisioning in provisioned mode. Setting WCU/RCU based on peak traffic without auto-scaling wastes money during off-peak hours. Turn on auto-scaling with a target utilization around 70%.
- ✗Scanning instead of querying. Scan reads every item in the table and filters client-side. Query targets a single partition and applies key conditions server-side, using orders of magnitude fewer RCUs.
- ✗Not using batch operations. BatchGetItem and BatchWriteItem reduce round trips by up to 25x (max 25 items per batch for writes, 100 for reads) and they should be treated as mandatory for any bulk data processing.
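The batching mistake above is usually fixed with a small chunking helper. The sketch below builds BatchWriteItem request bodies (table name is illustrative, no call is made); real code must also resend any UnprocessedItems from each response:

```python
from itertools import islice

MAX_BATCH_WRITE = 25  # BatchWriteItem limit per request

def chunks(items, size):
    """Yield successive fixed-size batches from an iterable."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

# 103 items become 5 requests instead of 103 individual PutItem round trips.
items = [{"PK": {"S": f"ITEM#{i}"}} for i in range(103)]
requests = [
    {"RequestItems": {"MyTable": [{"PutRequest": {"Item": item}} for item in batch]}}
    for batch in chunks(items, MAX_BATCH_WRITE)
]
assert len(requests) == 5
assert len(requests[0]["RequestItems"]["MyTable"]) == 25
```

Each dict in `requests` is the shape boto3's `batch_write_item` expects; the higher-level `Table.batch_writer()` context manager does this chunking and the UnprocessedItems retries automatically.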