ADR Template & Best Practices — Decision Frameworks
Difficulty: Intermediate
Stakeholders: Staff Engineers, Architects, Engineering Managers
Key Points for ADR Template & Best Practices
- ADRs capture the context, decision, and consequences of significant architectural choices so that future reasoning stays transparent
- Keep ADRs immutable once accepted. If a decision gets superseded, create a new ADR that references the original
- Store ADRs in the repository alongside the code they affect. Decisions should live where the code lives
- Include a Status field (Proposed, Accepted, Deprecated, Superseded) to track the lifecycle of each decision
- Link related ADRs together to form a decision graph. Individual decisions rarely exist in isolation
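A minimal template covering the fields above might look like the following (the headings, numbering, and example decision are illustrative, not a standard):

```
# ADR-0042: Adopt PostgreSQL for the orders service

Status: Accepted
Date: 2024-03-12
Supersedes: —
Superseded by: —

## Context
The orders service needs transactional writes plus ad-hoc reporting,
and the team already operates PostgreSQL elsewhere.

## Decision
Use PostgreSQL rather than adding a second database technology.

## Consequences
+ One operational runbook shared with existing services
- Reporting queries compete with OLTP traffic until a read replica exists

## Alternatives considered
- DynamoDB: rejected because ad-hoc reporting was a hard requirement
```

Keeping the whole document to one or two screens is what makes the "ADRs get read" property hold.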
Common Mistakes with ADR Template & Best Practices
- Writing ADRs after the fact with revisionist context. Capture the decision when it is made, including the uncertainty you felt at the time
- Making ADRs too long and detailed. A good ADR is 1-2 pages, not a design document
- Not recording rejected alternatives. The 'why not' is often more valuable than the 'why'
- Treating ADRs as bureaucratic overhead instead of team knowledge. If nobody reads them, your template is too heavy
Related to ADR Template & Best Practices
RFC Writing Guide, Trade-off Analysis Framework
API Versioning Strategy — Architecture Patterns
Difficulty: Advanced
Stakeholders: API Architects, Backend Engineers, Product Managers
Key Points for API Versioning Strategy
- URL path versioning (/v1/users) is the most visible and cache-friendly approach but forces clients to update URLs on every major version
- Header-based versioning (Stripe uses Stripe-Version: 2023-10-16) keeps URLs clean and enables per-request version pinning
- Non-breaking changes like adding fields should never require a new version. Only breaking changes like removing fields or changing types warrant a version bump
- Every API version needs a published sunset date. Stripe gives 24 months of overlap. Twilio gives 12 months minimum
- Version negotiation at the API gateway layer lets backend services stay version-unaware while the gateway handles translation
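Header-based negotiation with deprecation signals can be sketched as follows. The header names follow common convention (the date-based version values mirror Stripe's style), but the specific names, dates, and default policy here are assumptions for illustration:

```python
# Sketch of header-based version negotiation at an API gateway.
# Supported versions, header names, and the sunset date are hypothetical.
SUPPORTED_VERSIONS = ["2023-10-16", "2024-04-01"]
DEFAULT_VERSION = "2024-04-01"

def resolve_version(headers: dict) -> str:
    """Pick the API version for a request: honor an explicit pin,
    reject unknown values, fall back to the current default."""
    requested = headers.get("API-Version")
    if requested is None:
        return DEFAULT_VERSION
    if requested not in SUPPORTED_VERSIONS:
        raise ValueError(f"unknown API version: {requested}")
    return requested

def deprecation_headers(version: str) -> dict:
    """Attach programmatic sunset signals for non-current versions,
    so consumers get machine-readable warnings, not just changelogs."""
    if version != DEFAULT_VERSION:
        return {"Deprecation": "true",
                "Sunset": "Wed, 01 Oct 2025 00:00:00 GMT"}
    return {}
```

Because the gateway owns this logic, backend services can stay version-unaware and receive only the resolved version.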
Common Mistakes with API Versioning Strategy
- Creating a new version for every minor change. This fragments the API surface and multiplies maintenance cost across teams
- Skipping deprecation headers and sunset timelines. Consumers need programmatic signals, not just changelog entries
- Maintaining parallel codepaths in the same service for each version. Use a transformation layer at the edge instead
- Treating internal and external APIs with the same versioning rigor. Internal APIs between services you own can tolerate more flexibility
Related to API Versioning Strategy
Trade-off Analysis Framework, Event-Driven vs Request-Driven Architecture
Backend for Frontend — Architecture Patterns
Difficulty: Advanced
Stakeholders: Frontend Engineers, Backend Engineers, Architects
Key Points for Backend for Frontend
- A BFF is not a proxy or a gateway. It is a dedicated aggregation and reshaping layer that knows exactly what its client needs and returns nothing more
- The team that builds the client should own its BFF. When backend teams own it, every frontend change becomes a cross-team ticket and velocity tanks
- BFF shines when your mobile app needs 3 fields from 4 services on a single screen. Without it, you either over-fetch or make the client orchestrate multiple calls on a cellular connection
- You can combine BFF with GraphQL. Some of the best setups use a thin GraphQL layer as the BFF, giving the frontend team schema ownership while keeping backend services clean
- If you only have one client type, you almost certainly do not need a BFF. A well-designed API with field selection will serve you better with half the operational surface area
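The aggregation-and-reshaping role can be sketched in a few lines. The service calls below are stand-ins for real HTTP/gRPC clients, and the field names are hypothetical:

```python
# Sketch of a mobile BFF endpoint: aggregate several upstream calls and
# return exactly what one screen renders, nothing more.
def fetch_user(user_id):      # stand-in for the user service
    return {"id": user_id, "name": "Ada", "email": "ada@example.com"}

def fetch_orders(user_id):    # stand-in for the order service
    return [{"id": "o1", "total": 42.0, "status": "shipped"}]

def mobile_home_screen(user_id):
    """One BFF call replaces multiple client round-trips on a cellular
    connection, and strips every field the screen does not display."""
    user = fetch_user(user_id)
    orders = fetch_orders(user_id)
    return {
        "displayName": user["name"],
        "openOrders": sum(1 for o in orders if o["status"] != "delivered"),
        "lastOrderTotal": orders[-1]["total"] if orders else None,
    }
```

Note that the email field never leaves the BFF: the client cannot grow an accidental dependency on data it was never sent.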
Common Mistakes with Backend for Frontend
- Letting business logic creep into the BFF layer. Once someone puts a discount calculation in the web BFF, you will find a different discount calculation in the mobile BFF within a month
- Building separate BFFs when your mobile and web clients actually need the same data. BFF solves divergent needs, not hypothetical future divergence
- Skipping shared libraries for authentication, logging, and error handling across BFFs. You end up with three different auth implementations that drift apart over time
- Treating the BFF as a backend team deliverable. It sits in the frontend team's critical path and should be planned, prioritized, and deployed on their cadence
Related to Backend for Frontend
GraphQL vs REST Decision, API Versioning Strategy
C4 Architecture Model — Architecture Patterns
Difficulty: Advanced
Stakeholders: Staff Engineers, Architects, Tech Leads
Key Points for C4 Architecture Model
- C4 gives your team four zoom levels for architecture diagrams: Context, Container, Component, and Code. Most teams only need the first two or three. Level 4 is rarely worth maintaining by hand
- The Context diagram is the most valuable and the most skipped. If you cannot draw one, you do not understand your system's boundaries well enough to make good architectural decisions
- Diagrams that nobody updates are worse than no diagrams. They create false confidence. Tie diagram reviews to quarterly planning or major releases, not to good intentions
- C4 works because it separates audience from detail. The Context diagram is for everyone. The Component diagram is for the team that owns the service. Stop showing the same diagram to every audience
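Diagram-as-code keeps Context diagrams cheap to update alongside releases. A Level 1 sketch in Mermaid's C4 syntax, with a hypothetical system:

```mermaid
C4Context
    title Context diagram for Internet Banking (illustrative)
    Person(customer, "Customer", "Holds one or more accounts")
    System(banking, "Internet Banking", "Lets customers view balances and make payments")
    System_Ext(email, "Email Provider", "Sends notifications")
    Rel(customer, banking, "Uses")
    Rel(banking, email, "Sends email via", "SMTP")
```

Because the source lives in the repository, diagram updates can be reviewed in the same pull request as the change that made them necessary.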
Common Mistakes with C4 Architecture Model
- Jumping straight to Component or Code diagrams without a Context diagram. This is like writing functions before understanding the requirements. You end up with detailed views of a system nobody has mapped at the high level
- Treating C4 diagrams as one-time documentation artifacts. Architecture evolves. If your diagrams reflect the system from eighteen months ago, they are actively misleading people
- Putting every service and database on a single diagram. The whole point of zoom levels is to manage complexity. If your Container diagram has forty boxes, split it
- Using C4 for systems that do not need it. A single service with a database and a queue does not need four layers of diagrams. Use judgment
Related to C4 Architecture Model
ADR Template & Best Practices, Cell-Based Architecture
Cache Invalidation Strategies — Architecture Patterns
Difficulty: Advanced
Stakeholders: Backend Engineers, Staff Engineers, SRE
Key Points for Cache Invalidation Strategies
- Cache-aside (lazy loading) is the safest default pattern. The application checks cache first, falls back to the database, and populates cache on miss
- TTL-based expiration is simple but trades freshness for simplicity. Set TTLs based on how stale the data can be, not on arbitrary round numbers
- Event-driven invalidation using CDC (Change Data Capture) from the database gives near-real-time cache consistency without coupling write paths to cache logic
- Cache stampede happens when a popular key expires and hundreds of concurrent requests hit the database simultaneously. Use locking, probabilistic early expiration, or background refresh to prevent it
- CDN cache purging has propagation delays of 1-30 seconds depending on the provider. Design your system to tolerate this window
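The cache-aside flow plus a basic stampede guard can be sketched in one small class. This is a minimal in-process illustration, not a production cache (a real deployment would use Redis or similar, with per-key locking):

```python
import threading
import time

class CacheAside:
    """Cache-aside with per-key TTL and a lock to blunt stampedes:
    only one caller recomputes an expired key; the rest wait for it."""

    def __init__(self, loader, ttl_seconds=60):
        self.loader = loader          # fallback to the database on miss
        self.ttl = ttl_seconds
        self._data = {}               # key -> (value, expires_at)
        self._lock = threading.Lock()

    def get(self, key):
        entry = self._data.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]           # cache hit
        with self._lock:              # stampede guard: one loader at a time
            entry = self._data.get(key)
            if entry and entry[1] > time.monotonic():
                return entry[0]       # another caller refreshed while we waited
            value = self.loader(key)  # cache miss: hit the database
            self._data[key] = (value, time.monotonic() + self.ttl)
            return value

    def invalidate(self, key):
        self._data.pop(key, None)     # e.g. driven by a CDC event
```

The double-check inside the lock is what prevents every waiting request from re-querying the database once the first loader finishes.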
Common Mistakes with Cache Invalidation Strategies
- Setting the same TTL for all cache keys regardless of access patterns and staleness tolerance. A user profile can be stale for 5 minutes; an account balance cannot be stale at all
- Caching database query results without tracking which records contribute to that result. When a record changes, you cannot invalidate the right cache entries
- Ignoring cache warming on deploy. A cold cache after deployment can overload the database during the first few minutes of traffic
Related to Cache Invalidation Strategies
Event-Driven vs Request-Driven Architecture, Trade-off Analysis Framework
Cell-Based Architecture — Architecture Patterns
Difficulty: Expert
Stakeholders: Principal Engineers, Platform Engineers, SRE Leads
Key Points for Cell-Based Architecture
- Cells are isolation boundaries, not scaling units. The primary value is blast radius reduction. A bad deploy, a runaway query, or a poisoned config change only affects one cell's users, not your entire customer base
- Cell sizing is a business decision disguised as a technical one. Slack's cells serve roughly 50K concurrent users each. DoorDash sized theirs by metro region. The right boundary depends on your failure cost per customer segment
- Cross-cell communication must be treated as a foreign API call with circuit breakers, retries, and explicit contracts. The moment cells start sharing state liberally, you have a distributed monolith with extra network hops
- Cell-aware routing at the edge is the linchpin. If your routing layer cannot deterministically map a request to the correct cell within 1-2ms, you will eat the latency budget before your application code runs
- Operational tooling cost dwarfs the infrastructure cost. You need per-cell dashboards, per-cell deployment pipelines, per-cell runbooks, and engineers who can reason about 20+ independent environments simultaneously
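Deterministic cell routing can be sketched as follows. Real systems typically keep an explicit customer-to-cell mapping table so cells can be rebalanced; the hash fallback, cell count, and override entries here are illustrative assumptions:

```python
import hashlib

# Hypothetical edge routing table: pinned/migrated customers first,
# then a deterministic hash for everyone else.
CELL_OVERRIDES = {"bigcorp": "cell-07"}
NUM_CELLS = 20

def route_to_cell(customer_id: str) -> str:
    """Map a request to its cell deterministically and cheaply, so the
    lookup fits inside a 1-2ms edge routing budget."""
    if customer_id in CELL_OVERRIDES:       # explicit mapping wins
        return CELL_OVERRIDES[customer_id]
    digest = hashlib.sha256(customer_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % NUM_CELLS
    return f"cell-{bucket:02d}"
```

The override table is what makes rebalancing possible: migrating a hot customer means updating one row, not changing the hash function for everyone.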
Common Mistakes with Cell-Based Architecture
- Sharing a database across cells. This is the single most common way to accidentally couple cells together. Each cell needs its own data store, even if that means duplicating reference data
- Making cells too small. A team that built 200 cells for 50K users discovered they spent more time on cell management than on product development. Cell count should grow with customer scale, not ahead of it
- Deploying to all cells simultaneously. Staggered rollouts across cells (deploy to 2 cells, observe for 30 minutes, continue) give you a natural canary mechanism. Deploying everywhere at once defeats the isolation benefit
- Ignoring cell rebalancing. Customers grow, usage patterns shift, and cells become hot. Without automated rebalancing or at least clear playbooks for cell migration, you end up with severe skew within 6-12 months
Related to Cell-Based Architecture
Multi-Region Architecture, Zero-Downtime Deployment Patterns
CQRS & Event Sourcing — Data Architecture
Difficulty: Expert
Stakeholders: Staff Engineers, Architects, Data Engineers
Key Points for CQRS & Event Sourcing
- CQRS separates read and write models so each can be optimized independently. Write models enforce business invariants, read models are denormalized for query performance
- Event sourcing stores state as a sequence of immutable events rather than mutable rows. This gives you a complete audit trail and enables temporal queries
- CQRS and event sourcing are independent patterns. You can use CQRS without event sourcing, and event sourcing without CQRS, though they pair well together
- Event sourcing makes your system's history a first-class citizen. You can reconstruct state at any point in time by replaying events up to that timestamp
- Projection management is the hidden operational cost. Each read model requires a projector that processes events and updates the denormalized view
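The replay mechanic behind temporal queries can be shown in a few lines. The event shapes are hypothetical, but the fold-over-events structure is the core of the pattern:

```python
# Minimal event-sourcing sketch: current state is a fold over an
# immutable, append-only event sequence.
events = [
    {"type": "AccountOpened",  "ts": 1},
    {"type": "MoneyDeposited", "ts": 2, "amount": 100},
    {"type": "MoneyWithdrawn", "ts": 3, "amount": 30},
    {"type": "MoneyDeposited", "ts": 4, "amount": 50},
]

def replay(events, as_of=None):
    """Reconstruct state at any point in time by replaying events
    up to that timestamp; as_of=None replays everything."""
    state = {"balance": 0}
    for e in events:
        if as_of is not None and e["ts"] > as_of:
            break
        if e["type"] == "MoneyDeposited":
            state["balance"] += e["amount"]
        elif e["type"] == "MoneyWithdrawn":
            state["balance"] -= e["amount"]
    return state
```

A read-model projector is the same fold run continuously against new events, writing its result into a denormalized store instead of returning it.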
Common Mistakes with CQRS & Event Sourcing
- Applying CQRS/ES to every service. These patterns add real complexity and only pay off for domains with audit requirements, complex business rules, or temporal query needs
- Using the event store as the source of truth for queries. The event store is the write model. Build separate read-optimized projections for queries
- Designing events that are too granular. 'FieldXChanged' events create noise. Prefer domain events that capture business intent like 'OrderShipped'
- Not planning for event schema evolution. Your event store will contain events written years ago, and you need upcasting strategies to handle those old formats
- Ignoring projection lag. Read models are eventually consistent with the write model. If your domain requires immediate read-your-writes consistency, you need a synchronous path or causal consistency mechanisms
Related to CQRS & Event Sourcing
Event-Driven vs Request-Driven Architecture, Trade-off Analysis Framework
Data Mesh Architecture — Data Architecture
Difficulty: Expert
Stakeholders: Data Engineers, Architects, Engineering Directors
Key Points for Data Mesh Architecture
- Data mesh solves an organizational bottleneck, not a technical one. If your centralized data team can fulfill requests in under two weeks, you probably do not need it
- Domain ownership without a mature self-serve platform just distributes the burden. You need the platform before you need the org change
- Federated governance must be automated from day one. If governance depends on human review, it will fail at exactly the scale where you needed it most
- Start with one domain that already has strong data engineering skills and a clear data product. Expanding before you have a working reference implementation creates chaos
- The real cost is not infrastructure. It is the ongoing investment in data literacy across every domain team, which requires training, hiring, and protected time
Common Mistakes with Data Mesh Architecture
- Relabeling existing ETL pipelines as 'data products' without changing ownership, SLAs, or discoverability. This changes nothing except the Jira labels
- Building the self-serve platform from scratch when dbt, Airflow, Datahub, and your cloud provider's managed services cover 80% of what you need
- Forcing data mesh on an organization under 200 engineers. At that scale, a centralized data team with good prioritization is simpler, cheaper, and faster
- Skipping the 'data as a product' mindset shift. If domain teams treat their data outputs as side effects of their services rather than first-class products with SLAs, consumers will not trust them
Related to Data Mesh Architecture
ML Pipeline & Feature Store Architecture, CQRS & Event Sourcing
Database Selection Framework — Decision Frameworks
Difficulty: Intermediate
Stakeholders: Staff Engineers, Data Engineers, Architects
Key Points for Database Selection Framework
- Start with query patterns, not the database. List the top 10 queries your application will run and let that drive the selection
- PostgreSQL is the safe default for most applications. It handles JSON, full-text search, geospatial, and time-series reasonably well before you need a specialist
- Operational complexity matters more than benchmarks. A database your team can operate confidently at 2 AM is better than one that wins synthetic benchmarks
- Multi-model databases reduce operational burden but sacrifice peak performance, while specialists win their niche and struggle elsewhere: DynamoDB is fast for key-value access but painful for ad-hoc analytics
- Data gravity is real. Once you have terabytes in one system, migration cost dominates all other considerations
Common Mistakes with Database Selection Framework
- Choosing a database based on hype or conference talks instead of matching it to actual access patterns
- Running benchmarks on empty datasets. Performance characteristics change dramatically at 100GB, 1TB, and 10TB
- Ignoring the backup, restore, and disaster recovery story. A database you cannot reliably restore is a liability
Related to Database Selection Framework
CQRS & Event Sourcing, Trade-off Analysis Framework
Distributed Transaction Patterns — Architecture Patterns
Difficulty: Expert
Stakeholders: Staff Engineers, Architects, Backend Engineers
Key Points for Distributed Transaction Patterns
- The moment you split a monolith into services, you lose ACID transactions across service boundaries. Everything after that is damage control. Accept this early and design accordingly
- Transactional Outbox is the pattern most teams should start with. It gives you reliable event publishing without the operational burden of a full saga framework
- Choreographed sagas are elegant in diagrams and nightmarish to debug in production. If your saga spans more than 3 services, use orchestration
- Compensating transactions are not rollbacks. They are new forward actions that semantically undo previous work. Refunding a payment is not the same as never charging it
- Two-phase commit works within a single database vendor's cluster. The moment you cross vendor boundaries (Postgres to Kafka, MySQL to DynamoDB), 2PC falls apart
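The Transactional Outbox pattern can be sketched with SQLite standing in for the service's database. The table and topic names are illustrative; in production the relay is a poller or a CDC tool like Debezium reading the log:

```python
import json
import sqlite3

# Business write and outbox row commit in ONE local ACID transaction,
# so the event cannot be lost or published without the state change.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY, topic TEXT,
                         payload TEXT, published INTEGER DEFAULT 0);
""")

def place_order(order_id: str):
    with db:  # one transaction covers both inserts
        db.execute("INSERT INTO orders VALUES (?, 'placed')", (order_id,))
        db.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders", json.dumps({"event": "OrderPlaced", "order_id": order_id})))

def relay_once(publish):
    """Publish pending outbox rows to the broker, then mark them sent.
    Delivery is at-least-once: consumers must still be idempotent."""
    rows = db.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, topic, payload in rows:
        publish(topic, json.loads(payload))
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()
```

If the relay crashes between publish and the update, the row is re-published on the next run, which is why the pattern pairs naturally with consumer-side idempotency.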
Common Mistakes with Distributed Transaction Patterns
- Designing compensations as an afterthought. Teams build the happy path across 5 services, then realize they have no way to undo step 3 when step 4 fails. Compensation logic should be designed alongside the forward path
- Using choreographed sagas across more than 3 services. By the time you have 6 services publishing and consuming events for a single business operation, no one can trace the full flow without a distributed tracing tool and 30 minutes of detective work
- Treating the saga orchestrator as a simple state machine. Production orchestrators need durable state, retry policies, timeout handling, and dead-letter queues. Uber built Cadence (now Temporal) specifically because lightweight state machines were not enough
- Skipping idempotency on saga participants. Network retries will deliver duplicate commands. Every service in a saga must handle being called twice with the same request without producing duplicate side effects
Related to Distributed Transaction Patterns
CQRS & Event Sourcing, Event-Driven vs Request-Driven Architecture
Domain-Driven Design — Architecture Patterns
Difficulty: Advanced
Stakeholders: Staff Engineers, Architects, Tech Leads
Key Points for Domain-Driven Design
- Bounded contexts are the single most valuable concept in DDD. Get these right and most of your service boundary problems disappear. Get them wrong and you end up with a distributed monolith that is worse than what you started with
- Ubiquitous language is not a documentation exercise. It is the shared vocabulary your team uses in code, conversations, tickets, and design docs. When the code uses different words than the business, bugs hide in the translation layer
- Keep aggregates small. An Order aggregate that pulls in Customer, Product, Inventory, and Shipping is not an aggregate, it is your entire database with extra steps
- DDD is a tool for complex domains, not a universal architecture style. If your app is basically CRUD with some validation rules, you are adding ceremony for no benefit
- Event storming with actual domain experts in the room will teach you more about your system in two hours than six months of reading source code
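A small aggregate can be sketched as follows: the Order owns its lines and enforces its own invariants, while Customer and Product stay outside the boundary as plain identifiers. The invariant itself (a maximum line count) is a hypothetical business rule:

```python
# Sketch of a small aggregate: references by id, invariants enforced
# inside the boundary, no embedded Customer/Product/Inventory objects.
class Order:
    MAX_LINES = 50  # hypothetical business rule owned by this aggregate

    def __init__(self, order_id: str, customer_id: str):
        self.order_id = order_id
        self.customer_id = customer_id   # reference, not an embedded Customer
        self.lines = []

    def add_line(self, product_id: str, quantity: int):
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        if len(self.lines) >= self.MAX_LINES:
            raise ValueError("order is full")
        self.lines.append({"product_id": product_id, "quantity": quantity})
```

Because the aggregate touches only its own rows, a transaction that saves it stays small, avoiding the contention problems of database-sized aggregates.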
Common Mistakes with Domain-Driven Design
- Creating bounded contexts that map to technical layers (API context, database context, messaging context) instead of business capabilities
- Building one canonical data model shared across all services, which forces every team to agree on field names and data shapes for entities they use differently
- Making aggregates too large by stuffing related entities together, which causes contention, slow writes, and transaction failures under load
- Skipping the domain modeling phase and jumping straight to microservices, then discovering six months later that the service boundaries are in the wrong places
Related to Domain-Driven Design
Monolith to Microservices Migration, CQRS & Event Sourcing, Data Mesh Architecture
Event-Driven vs Request-Driven Architecture — Architecture Patterns
Difficulty: Advanced
Stakeholders: Staff Engineers, Architects
Key Points for Event-Driven vs Request-Driven Architecture
- Request-driven (synchronous) is simpler to reason about and debug. Choose it as the default unless you have a concrete reason to go with events
- Event-driven architecture works best when producers should not know about consumers. It enables real service independence and extensibility
- Eventual consistency is the price you pay for event-driven decoupling. Make sure your domain can actually tolerate it before you commit
- Hybrid architectures are what most production systems look like. Use synchronous calls for queries and commands that need immediate consistency, events for side effects and cross-domain coordination
- Event schemas are your API contract. Invest in schema registries and backward-compatible evolution from day one
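The producer-ignorance property can be shown with a minimal in-process bus. Event names and handlers here are hypothetical, and a production system would use a durable broker with async consumers rather than synchronous dispatch:

```python
from collections import defaultdict

# In-process sketch: the producer publishes a domain event and knows
# nothing about who consumes it; consumers attach independently.
class EventBus:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self._subscribers[event_type]:
            handler(payload)   # real buses dispatch asynchronously

bus = EventBus()
audit, emails = [], []
bus.subscribe("OrderShipped", audit.append)                        # audit log
bus.subscribe("OrderShipped", lambda e: emails.append(e["customer"]))  # notifier
bus.publish("OrderShipped", {"order_id": "o1", "customer": "ada"})
```

Adding a third consumer requires no change to the publishing code, which is the extensibility the pattern buys at the cost of eventual consistency.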
Common Mistakes with Event-Driven vs Request-Driven Architecture
- Using events for everything, including simple request-response flows. Events add latency, complexity, and debugging difficulty for no benefit when a synchronous call would do the job
- Not defining event ownership. Every event type must have exactly one producing service that owns its schema
- Building event chains that create implicit coupling. If Service A emits event X which triggers Service B which emits event Y which triggers Service C, you have a distributed monolith with extra steps
- Ignoring poison pill messages. A single malformed event can block an entire consumer partition if you do not have dead-letter queue handling
Related to Event-Driven vs Request-Driven Architecture
CQRS & Event Sourcing, Monolith to Microservices Migration
Feature Flag Architecture — Architecture Patterns
Difficulty: Advanced
Stakeholders: Platform Engineers, Product Engineers, Engineering Managers
Key Points for Feature Flag Architecture
- Feature flags come in four types: release flags (temporary, for deployment decoupling), experiment flags (A/B tests), ops flags (circuit breakers, kill switches), and permission flags (premium features)
- Local SDK evaluation is 100x faster than remote evaluation. LaunchDarkly and Unleash both support local evaluation by keeping a copy of the flag configuration in sync with the SDK (LaunchDarkly streams updates; Unleash SDKs poll)
- Flag lifecycle management is critical. Every flag needs an owner, a creation date, and an expiration date. Flags without expiration become permanent tech debt
- Progressive delivery uses feature flags to gradually expose changes: internal users, then beta users, then 1%, 5%, 25%, 100%. This catches issues at each stage with minimal blast radius
- Build your own flag system only if you have fewer than 20 flags and no targeting requirements. Beyond that, buy a solution. The maintenance cost of a custom system grows faster than expected
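The progressive-delivery mechanic depends on deterministic bucketing, which can be sketched in one function. The flag name in the usage is hypothetical:

```python
import hashlib

def in_rollout(flag: str, user_id: str, percent: float) -> bool:
    """Deterministic percentage rollout: the same user gets the same
    answer on every request, and raising the percentage only adds users
    (never flips existing ones out). Hashing flag+user together avoids
    the same users always being the guinea pigs for every flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < percent / 100
```

Walking a flag from 1% to 5% to 25% to 100% is then just a config change, with each stage strictly containing the previous one.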
Common Mistakes with Feature Flag Architecture
- Leaving release flags in the code for months after the feature is fully launched. This creates dead code paths that confuse new engineers and increase testing surface area
- Nesting feature flags inside other feature flags. The combinatorial explosion of states makes testing impossible and reasoning about behavior unreliable
- Evaluating flags on every function call instead of once at the request boundary. This adds latency and makes behavior inconsistent if a flag changes mid-request
- Not logging flag evaluations. When debugging production issues, knowing which flag values a user received is essential for reproducing the problem
Related to Feature Flag Architecture
ADR Template & Best Practices, RFC Writing Guide
GraphQL vs REST Decision — Decision Frameworks
Difficulty: Intermediate
Stakeholders: API Architects, Frontend Engineers, Backend Engineers
Key Points for GraphQL vs REST Decision
- GraphQL reduces over-fetching and under-fetching by letting clients request exactly the fields they need. This matters most for mobile apps on slow networks
- REST has simpler caching semantics. HTTP caching works out of the box with GET requests, ETags, and CDN integration. GraphQL POST requests require custom cache strategies
- The N+1 query problem in GraphQL is real and requires explicit solutions like DataLoader. Without batching, a single GraphQL query can trigger hundreds of database calls
- GraphQL federation (Apollo Federation, GraphQL Mesh) lets multiple teams own parts of a unified schema. This is its strongest advantage in microservice architectures
- Public APIs are almost always better served by REST. Documentation is simpler, caching works natively, and consumers do not need to learn a query language
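The batching idea behind DataLoader can be sketched synchronously (the real DataLoader is asynchronous and batches within an event-loop tick; the class and field names here are illustrative):

```python
# Minimal DataLoader-style batcher: resolvers enqueue keys, and one
# batched fetch replaces N individual database queries.
class Loader:
    def __init__(self, batch_fn):
        self.batch_fn = batch_fn   # list of ids -> dict {id: record}
        self._pending = []

    def load(self, key):
        future = {}                     # filled in when dispatch() runs
        self._pending.append((key, future))
        return future

    def dispatch(self):
        keys = [k for k, _ in self._pending]
        unique = list(dict.fromkeys(keys))      # dedupe, keep order
        records = self.batch_fn(unique)         # ONE query for all keys
        for key, future in self._pending:
            future["value"] = records[key]
        self._pending.clear()

db_calls = []
def batch_get_users(ids):
    db_calls.append(ids)                        # one round-trip total
    return {i: {"id": i, "name": f"user-{i}"} for i in ids}

loader = Loader(batch_get_users)
futures = [loader.load(i) for i in [1, 2, 1, 3]]  # naive: 4 queries
loader.dispatch()                                  # batched: 1 query
```

Without this layer, a query for 100 posts and their authors issues 1 + 100 database calls; with it, 2.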
Common Mistakes with GraphQL vs REST Decision
- Adopting GraphQL because it is trendy without a concrete problem like mobile over-fetching or multi-team schema ownership to justify the complexity
- Exposing your database schema directly through GraphQL types. The API schema should represent your domain model, not your table structure
- Allowing unbounded query depth without complexity limits. A malicious or careless client can craft a query that brings down your server
Related to GraphQL vs REST Decision
Event-Driven vs Request-Driven Architecture, Trade-off Analysis Framework
Idempotency & Exactly-Once Processing — Architecture Patterns
Difficulty: Expert
Stakeholders: Staff Engineers, Backend Engineers, Architects
Key Points for Idempotency & Exactly-Once Processing
- There is no exactly-once delivery in a distributed system. Networks lose packets, brokers crash, consumers restart. What you can build is exactly-once processing, and the burden falls entirely on the consumer
- Stripe's idempotency key pattern (client-generated UUID, server-side dedup with 24h TTL) is the gold standard for API idempotency. Copy it. Seriously. Their engineering blog post from 2017 remains the best practical reference
- Kafka's exactly-once semantics (EOS) guarantees atomic writes across partitions within a single Kafka cluster. It does not guarantee exactly-once processing in your application. Your consumer still needs idempotency logic
- The Transactional Outbox pattern with CDC (Debezium reading the WAL) solves the dual-write problem: updating your database and publishing an event atomically without distributed transactions
- Non-deterministic operations (timestamps, UUIDs, external API calls) are idempotency's hardest edge case. If retrying an operation produces a different result each time, deduplication alone is not enough. You need to capture and replay the original result
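The key pattern can be sketched in a small class. This is an in-memory illustration (production systems use Redis or a database for the dedup store, and must also handle concurrent requests with the same key, which this sketch does not):

```python
import time

class IdempotencyStore:
    """Idempotency-key dedup in the Stripe style: the first execution's
    response is captured and replayed for retries within the TTL, which
    also handles non-deterministic operations (the original result is
    returned, not recomputed)."""

    def __init__(self, ttl_seconds=24 * 3600):   # 24h, per the pattern above
        self.ttl = ttl_seconds
        self._store = {}   # key -> (response, expires_at)

    def execute(self, key, operation):
        entry = self._store.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]                      # replay captured result
        response = operation()                   # side effect runs once
        self._store[key] = (response, time.monotonic() + self.ttl)
        return response
```

A retried charge with the same client-generated key gets the original charge id back instead of creating a second charge.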
Common Mistakes with Idempotency & Exactly-Once Processing
- Relying on Kafka consumer group offsets for exactly-once guarantees. Consumer commits offset, processes message, crashes before completing side effects. On restart, the message is skipped. You now have data loss, not duplication
- Setting idempotency key TTLs too short. A 1-hour TTL means a client retrying after a network timeout 2 hours later will create a duplicate. Stripe uses 24 hours. Payment systems should consider 48-72 hours
- Implementing idempotency at the API gateway level but not in downstream services. The gateway deduplicates HTTP requests, but internal retries from service mesh, queue consumers, or scheduled jobs bypass the gateway entirely
- Treating database unique constraints as a complete idempotency solution. A unique constraint prevents duplicate inserts, but it does not prevent duplicate side effects like sending emails, calling third-party APIs, or publishing events
Related to Idempotency & Exactly-Once Processing
Distributed Transaction Patterns, CQRS & Event Sourcing
LLM Integration Architecture Patterns — Architecture Patterns
Difficulty: Advanced
Stakeholders: Staff Engineers, Architects, ML Engineers
Key Points for LLM Integration Architecture Patterns
- LLM API calls are fundamentally different from traditional APIs: non-deterministic, slow (seconds not milliseconds), expensive per call, and outputs need validation before you can use them
- A gateway pattern centralizes retry logic, rate limiting, cost tracking, prompt versioning, and model routing in one layer instead of scattering these concerns across services
- Semantic caching on embeddings of similar prompts can cut LLM costs by 40-70% for applications with repetitive query patterns
- Design for model portability from day one by abstracting provider-specific APIs behind a unified interface. The model landscape shifts every few months
- Treat all LLM outputs as untrusted input that must be validated, sanitized, and constrained before it reaches business logic or end users
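The gateway pattern can be sketched as a thin class that owns retries and cost controls behind a provider-agnostic interface. The providers here are fake stand-ins, not any vendor's SDK, and a real gateway would add rate limiting, prompt versioning, and output validation:

```python
# Sketch of an LLM gateway: one layer owns retries, cost tracking, and
# model routing; business logic never touches a provider SDK directly.
class LLMGateway:
    def __init__(self, providers, budget_usd):
        self.providers = providers  # name -> callable(prompt) -> (text, cost)
        self.spent = 0.0
        self.budget = budget_usd    # hard cap against runaway loops

    def complete(self, prompt, model, retries=2):
        if self.spent >= self.budget:
            raise RuntimeError("LLM budget exhausted")
        last_err = None
        for _ in range(retries + 1):
            try:
                text, cost = self.providers[model](prompt)
                self.spent += cost
                return text
            except Exception as err:   # transient provider failure: retry
                last_err = err
        raise last_err
```

Swapping models becomes a routing-table change in one place, and the budget check turns a runaway retry storm into a loud failure instead of a surprise invoice.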
Common Mistakes with LLM Integration Architecture Patterns
- Scattering raw LLM API calls throughout business logic, making it impossible to swap models, track costs, or enforce consistent prompt patterns
- No cost controls on LLM usage. A single runaway loop or retry storm can burn thousands of dollars in minutes without anyone noticing
- Treating LLM responses as deterministic, then building brittle parsing logic that breaks when the model rephrases its output slightly
- Skipping output validation and feeding raw model responses directly into databases, APIs, or user-facing surfaces without sanitization
- Over-engineering with autonomous agents and complex chains before validating that a simple single-prompt pattern actually solves the problem
Related to LLM Integration Architecture Patterns
Event-Driven vs Request-Driven Architecture, Trade-off Analysis Framework
ML Pipeline & Feature Store Architecture — Data Architecture
Difficulty: Expert
Stakeholders: Staff Engineers, Data Engineers, ML Engineers
Key Points for ML Pipeline & Feature Store Architecture
- Feature stores solve the training-serving skew problem by providing a single source of truth for feature computation logic, used in both offline training and online inference
- ML pipelines operate in two fundamentally different modes (batch and real-time), each with different latency, throughput, and consistency requirements that call for different infrastructure
- Data versioning and lineage tracking are not optional extras. Without them, you cannot reproduce results, debug model regressions, or meet audit requirements
- The feature computation layer is typically the most expensive component in the ML stack. Optimizing it through incremental computation and caching delivers the highest cost impact
- Model serving needs the same deployment rigor as code: canary rollouts, rollback capability, traffic splitting, and health checks are non-negotiable
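The single-source-of-truth idea can be sketched with a tiny feature registry: one definition is executed by both the offline training job and the online serving path. The feature name and input shape are hypothetical:

```python
# Sketch of a feature registry: each feature is defined once and the
# same code path serves batch training and online inference, which is
# what eliminates training-serving skew.
FEATURES = {}

def feature(name):
    def register(fn):
        FEATURES[name] = fn
        return fn
    return register

@feature("order_count_7d")
def order_count_7d(user):
    return len([o for o in user["orders"] if o["age_days"] <= 7])

def feature_vector(user):
    """Called by the batch training pipeline AND the online scorer."""
    return {name: fn(user) for name, fn in FEATURES.items()}
```

Real feature stores add materialization (precomputing values into an online store) and point-in-time-correct joins for training data, but the shared-definition core is the same.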
Common Mistakes with ML Pipeline & Feature Store Architecture
- Letting data scientists build ad-hoc notebook pipelines for prototyping and then trying to push those same pipelines to production without re-engineering them
- Building separate feature computation logic for training and serving, which guarantees subtle numerical differences that degrade model performance in production
- Treating ML infrastructure as a data science problem instead of a platform engineering problem. This leads to fragile, undocumented systems that only their creator can operate
- Ignoring data quality monitoring, which means your model silently degrades for weeks before anyone notices that upstream data changes broke input distributions
Related to ML Pipeline & Feature Store Architecture
CQRS & Event Sourcing, Event-Driven vs Request-Driven Architecture
Monolith to Microservices Migration — Migration Strategies
Difficulty: Advanced
Stakeholders: Architects, VPs of Engineering, Directors
Key Points for Monolith to Microservices Migration
- The Strangler Fig pattern is the safest migration strategy. Wrap the monolith with a proxy and gradually route traffic to new services
- Extract services along domain boundaries, not technical layers. A 'User Service' is better than a 'Database Service'
- Data is the hardest part. Shared databases between the monolith and services create coupling that defeats the purpose of decomposition
- You need robust observability before you start splitting. You cannot debug distributed systems with monolith-era logging
- Most organizations that fail at microservices migration fail because they decompose too aggressively too early. Start with 2-3 services, not 20
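The Strangler Fig routing decision can be sketched as a prefix table consulted by the fronting proxy. The route entries and hostnames are hypothetical:

```python
# Strangler Fig routing sketch: migrated path prefixes go to new
# services; everything else still hits the monolith. The table grows
# as extraction proceeds, gradually "strangling" the monolith.
MIGRATED_ROUTES = {
    "/users": "http://user-service.internal",
    "/orders": "http://order-service.internal",
}
MONOLITH = "http://monolith.internal"

def upstream_for(path: str) -> str:
    for prefix, service in MIGRATED_ROUTES.items():
        if path == prefix or path.startswith(prefix + "/"):
            return service
    return MONOLITH   # safe default: unmigrated traffic is untouched
```

Rolling back a troubled extraction is a one-line table change, which is exactly the safety property the pattern is chosen for.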
Common Mistakes with Monolith to Microservices Migration
- Big bang rewrite, where you attempt to rewrite the entire monolith at once instead of incrementally migrating bounded contexts
- Distributed monolith, where extracted services still share a database, deploy together, and cannot function independently
- Ignoring the shared library trap. Common libraries that embed business logic create hidden coupling between services
- Not establishing service contracts before splitting. Changing APIs after services are deployed is exponentially harder
- Underestimating the operational cost. Each new service needs its own CI/CD pipeline, monitoring, alerting, and on-call rotation
Related to Monolith to Microservices Migration
Event-Driven vs Request-Driven Architecture, CQRS & Event Sourcing
Multi-Region Architecture — Architecture Patterns
Difficulty: Expert
Stakeholders: Principal Engineers, Platform Engineers, SRE Leads
Key Points for Multi-Region Architecture
- Active-passive is 10x simpler than active-active. Start with active-passive and promote to active-active only when latency requirements demand it
- Cross-region replication lag is bounded by physics. US-East to EU-West is roughly 80ms round-trip. Your conflict resolution strategy must account for this
- Data sovereignty laws like GDPR may require that certain data never leaves a geographic boundary, which constrains replication topology
- Multi-region doubles or triples infrastructure cost. Budget for it explicitly and justify it against the business value of the uptime improvement
- DNS failover with Route 53 or Cloudflare health checks is the simplest entry point. You can build more sophisticated routing later
Common Mistakes with Multi-Region Architecture
- Building active-active without a clear conflict resolution strategy. Two regions accepting writes to the same record simultaneously will produce data corruption
- Testing failover only during planned exercises. Chaos engineering practices like randomly failing one region during business hours reveal gaps that planned tests miss
- Assuming the database vendor handles multi-region automatically. CockroachDB and Spanner do, but most databases require careful configuration and ongoing tuning
- Ignoring the blast radius of shared global services. A single global authentication service defeats the purpose of multi-region isolation
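As a sketch of the simplest conflict resolution strategy the first mistake calls for, here is last-writer-wins over an assumed `updated_at` timestamp. Note the caveat: wall clocks skew across regions, so production systems often step up to hybrid logical clocks or per-field version vectors.

```python
def lww_merge(a: dict, b: dict) -> dict:
    """Last-writer-wins merge of two replica versions of the same record.

    Assumes each record carries an 'updated_at' epoch-millis timestamp
    stamped by the accepting region. LWW silently drops the losing write,
    which is acceptable for some data (profile settings) and catastrophic
    for other data (account balances) -- choose per record type.
    """
    return a if a["updated_at"] >= b["updated_at"] else b

us_east = {"name": "Ada", "updated_at": 1700000000500}
eu_west = {"name": "Ada L.", "updated_at": 1700000000900}
merged = lww_merge(us_east, eu_west)  # eu_west wins: later timestamp
```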
Related to Multi-Region Architecture
CQRS & Event Sourcing, Trade-off Analysis Framework
RFC Writing Guide — Communication Patterns
Difficulty: Intermediate
Stakeholders: Staff Engineers, Architects, Tech Leads
Key Points for RFC Writing Guide
- An RFC is a proposal document that invites structured feedback before committing to a design direction
- The problem statement is the most important section. If reviewers do not agree on the problem, they cannot evaluate the solution
- Always include at least two alternatives with honest trade-off analysis. This proves you explored the design space
- Define explicit review timelines and approval criteria upfront to prevent RFCs from lingering indefinitely
- A rejected RFC is a success. It means you avoided a costly mistake before writing any code
Common Mistakes with RFC Writing Guide
- Burying the ask. Put what you need from reviewers (approval, feedback, alternative suggestions) in the first paragraph
- Writing a 20-page RFC when a 2-page ADR would be enough. Match the document weight to the decision's impact
- Not including a rollout plan. Reviewers need to understand how you will ship this incrementally, not just what the end state looks like
- Treating the RFC as a spec. An RFC proposes a direction; a design doc specifies implementation details
Related to RFC Writing Guide
ADR Template & Best Practices, Trade-off Analysis Framework
Schema Evolution Governance — Architecture Patterns
Difficulty: Advanced
Stakeholders: Staff Engineers, Platform Engineers, Architects
Key Points for Schema Evolution Governance
- A required field added to a shared Kafka event will break every downstream consumer that uses strict deserialization, and new readers will fail on old events that lack the field. Schema evolution is a coordination problem, not a serialization problem
- Protobuf and Avro solve different evolution challenges. Protobuf gives you better tooling, type safety, and performance. Avro gives you schema evolution by default with its reader-writer schema resolution. Pick based on your ecosystem, not blog post benchmarks
- Schema compatibility modes (backward, forward, full) are not academic categories. They map directly to deployment order. Backward compatibility means you can deploy consumers before producers. Forward means the opposite. Full means deploy in any order
- Consumer-driven contract testing catches breaking changes that schema registries miss. A field that is technically compatible can still break a consumer if the semantics change (e.g., a timestamp field switching from UTC to local time)
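The compatibility modes above can be illustrated with a toy checker over simplified schemas. A real registry (e.g. Confluent's) does this over full Avro or Protobuf definitions; the schema shape below is a deliberate simplification.

```python
# A schema here is just {field_name: (type_name, has_default)}.
def backward_compatible(old: dict, new: dict) -> bool:
    """Can a reader using `new` read data that was written with `old`?"""
    for field, (ftype, has_default) in new.items():
        if field not in old:
            if not has_default:       # new required field: old data lacks it
                return False
        elif old[field][0] != ftype:  # type change breaks decoding
            return False
    return True

v1 = {"user_id": ("string", False)}
v2 = {"user_id": ("string", False), "region": ("string", True)}   # optional add: OK
v3 = {"user_id": ("string", False), "region": ("string", False)}  # required add: breaks
```

Forward compatibility is the mirror image (old readers against new data), and "full" is both checks at once, which is what buys you deploy-in-any-order.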
Common Mistakes with Schema Evolution Governance
- Treating schema compatibility checks as optional. Without CI enforcement, a developer will eventually push a breaking change to a shared topic at 4pm on a Friday, and three downstream teams will spend their evening debugging silent data loss
- Using JSON without a schema at all. Teams love the flexibility of schemaless JSON events until they discover that Producer A sends 'user_id' as a string while Producer B sends 'userId' as an integer, and both have been writing to the same topic for six months
- Assuming backward compatibility is always sufficient. If your producers deploy before consumers (common in platform teams that ship shared events), you need forward compatibility so that old consumers can read new messages
- Skipping semantic versioning for schemas. A schema that adds an optional field feels safe, but if that field changes the interpretation of existing fields, consumers need to know. Version your schemas explicitly and document what changed
Related to Schema Evolution Governance
API Versioning Strategy, Event-Driven vs Request-Driven Architecture
Serverless vs Containers Decision — Decision Frameworks
Difficulty: Intermediate
Stakeholders: Platform Engineers, Architects, Engineering Managers
Key Points for Serverless vs Containers Decision
- Serverless wins on cost below roughly 1 million requests per month. Above that threshold, containers on reserved instances become significantly cheaper
- Cold starts on AWS Lambda average 200-500ms for Python/Node and 1-3 seconds for Java. This matters for user-facing latency budgets
- Containers give you full control over the runtime, networking, and debugging. Serverless trades that control for zero infrastructure management
- Stateful workloads like WebSocket connections, long-running batch jobs, and ML inference are poor fits for serverless
- Vendor lock-in with serverless is real and concentrates in trigger bindings, IAM policies, and orchestration (Step Functions, EventBridge). Your business logic is portable; the glue around it is not
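The cost crossover in the first key point is worth sanity-checking against your own workload. A back-of-the-envelope sketch with assumed prices (Lambda request and GB-second rates, a reserved container host at roughly $0.012/hour; check current pricing for your region, since the crossover moves with memory, duration, and instance size):

```python
def lambda_monthly_cost(requests: int, avg_ms: int = 100, mem_gb: float = 0.5) -> float:
    # Assumed rates: $0.20 per 1M requests, $0.0000166667 per GB-second.
    gb_seconds = requests * (avg_ms / 1000) * mem_gb
    return requests / 1_000_000 * 0.20 + gb_seconds * 0.0000166667

def reserved_container_monthly_cost(hourly: float = 0.012) -> float:
    return hourly * 730  # ~730 hours per month, always on

low = lambda_monthly_cost(1_000_000)        # ≈ $1.03
high = lambda_monthly_cost(10_000_000)      # ≈ $10.33
fixed = reserved_container_monthly_cost()   # $8.76 regardless of traffic
```

With these particular assumptions the crossover lands closer to 8-9M requests/month than 1M, which is exactly the point: run the arithmetic for your own memory size, duration, and commitment discounts rather than trusting any rule of thumb.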
Common Mistakes with Serverless vs Containers Decision
- Comparing Lambda cost against on-demand EC2 pricing instead of reserved or spot instances. The cost crossover shifts dramatically with commitment discounts
- Deploying latency-sensitive APIs on serverless without provisioned concurrency, then blaming the platform for cold start performance
- Building complex orchestrations with Step Functions when a container running a simple loop would be easier to debug and cheaper to run
Related to Serverless vs Containers Decision
ADR Template & Best Practices, Trade-off Analysis Framework
Strangler Fig Pattern — Migration Strategies
Difficulty: Advanced
Stakeholders: Staff Engineers, Architects, Engineering Managers
Key Points for Strangler Fig Pattern
- The strangler fig pattern replaces a legacy system incrementally by routing traffic through a facade that delegates to either the old or new system per route
- Start with the highest-value, lowest-risk routes. A read-only endpoint with clear inputs and outputs is a better first migration target than a complex transactional workflow
- Anti-corruption layers translate between the legacy system's domain model and the new system's model. Without this boundary, the new system inherits legacy assumptions
- Parity verification is non-negotiable. Run both systems in parallel and compare outputs before cutting over each route. Differences reveal undocumented behavior
- Expect the migration to take 12-24 months for a system of significant size. Plan for this timeline and secure organizational commitment upfront
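The facade in the first key point fits in a few lines. The handlers and routes below are illustrative; a production facade is usually an API gateway or reverse proxy, ideally with percentage-based rollout per route rather than an all-or-nothing flip.

```python
# Minimal strangler facade: a route table decides, per path, whether a
# request goes to the legacy monolith or a new service. Migrating a route
# means adding one entry; finishing means decommissioning it in the legacy
# system, not just leaving both alive.
def legacy_handler(path: str) -> str:
    return f"legacy:{path}"

def orders_service(path: str) -> str:
    return f"orders-svc:{path}"

ROUTES = {
    "/orders": orders_service,   # migrated
    # everything else still falls through to the monolith
}

def facade(path: str) -> str:
    handler = ROUTES.get(path, legacy_handler)
    return handler(path)
```

During parity verification you would call both handlers for a migrated route and diff the responses before trusting the new entry.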
Common Mistakes with Strangler Fig Pattern
- Trying to understand the entire legacy system before starting migration. You will never fully understand it. Start with one route, learn as you go, and iterate
- Building the new system as a complete rewrite behind the facade instead of migrating incrementally. This defeats the purpose of the pattern and reintroduces big-bang risk
- Neglecting to decommission migrated routes from the legacy system. Without active decommissioning, you end up maintaining two systems indefinitely
- Underestimating the undocumented behavior embedded in the legacy system. Edge cases that nobody remembers will surface during parity testing
Related to Strangler Fig Pattern
Monolith to Microservices Migration, ADR Template & Best Practices
Trade-off Analysis Framework — Decision Frameworks
Difficulty: Advanced
Stakeholders: Staff Engineers, Architects, VPs of Engineering
Key Points for Trade-off Analysis Framework
- Every architectural decision involves trade-offs. The goal is not to eliminate them but to make them explicit and intentional
- Use weighted scoring matrices to compare options objectively. Assign weights based on business priorities, not personal preferences
- Classify decisions by reversibility: Type 1 (irreversible, high cost) decisions deserve deep analysis, while Type 2 (reversible) decisions should be made quickly
- Blast radius assessment determines how many teams, services, or users are affected if the decision turns out to be wrong
- Document the trade-offs you accepted, not just the option you chose. That context is invaluable when you revisit decisions later
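The weighted scoring matrix can be sketched directly. Weights, criteria, scores, and option names below are all illustrative; the one non-negotiable is agreeing on the weights before anyone scores an option, so nobody tunes them toward a favourite.

```python
# Weights reflect business priorities and must sum to 1.0;
# scores are 1-5 per criterion.
WEIGHTS = {"operational_cost": 0.40, "team_expertise": 0.35, "performance": 0.25}

OPTIONS = {
    "managed_postgres":      {"operational_cost": 4, "team_expertise": 5, "performance": 3},
    "self_hosted_cassandra": {"operational_cost": 2, "team_expertise": 2, "performance": 5},
}

def weighted_score(scores: dict) -> float:
    return sum(WEIGHTS[criterion] * s for criterion, s in scores.items())

ranked = sorted(OPTIONS, key=lambda name: weighted_score(OPTIONS[name]), reverse=True)
```

Record the matrix in the ADR: the losing scores are the documented trade-offs you accepted.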
Common Mistakes with Trade-off Analysis Framework
- Analysis paralysis, where you spend 3 weeks analyzing a Type 2 decision that could be reversed in a day
- Optimizing for a single dimension (usually performance) while ignoring operational complexity, team expertise, and hiring implications
- Using gut feeling for Type 1 decisions. Irreversible choices deserve structured analysis even when you have strong intuition
- Not revisiting decisions when the context changes. A trade-off that made sense 18 months ago may no longer hold
Related to Trade-off Analysis Framework
ADR Template & Best Practices, CQRS & Event Sourcing
Zero-Downtime Deployment Patterns — Migration Strategies
Difficulty: Advanced
Stakeholders: Platform Engineers, SRE, Backend Engineers
Key Points for Zero-Downtime Deployment Patterns
- Blue-green deployments give you instant rollback by switching traffic between two identical environments. The cost is maintaining double the infrastructure during deployment
- Canary releases route a small percentage of traffic (1-5%) to the new version first, validating metrics before full rollout. This catches issues that staging environments miss
- Database schema changes are the hardest part of zero-downtime deploys. The expand-contract pattern splits migrations into backward-compatible steps
- Health checks must verify application readiness, not just process liveness. A container that starts but cannot connect to its database should not receive traffic
- Every deployment needs a rollback plan that can execute in under 60 seconds. If rollback requires a database migration, it is not a real rollback plan
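The expand-contract steps can be demonstrated end-to-end with stdlib sqlite3 (table and columns are illustrative, and a production backfill runs in batches, not one UPDATE):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT)")
db.execute("INSERT INTO users (full_name) VALUES ('Grace Hopper')")

# Expand step 1: add nullable columns -- no defaults, no table rewrite.
# Old code that only knows full_name keeps working.
db.execute("ALTER TABLE users ADD COLUMN first_name TEXT")
db.execute("ALTER TABLE users ADD COLUMN last_name TEXT")

# Expand step 2: backfill existing rows (batched in production).
db.execute("""
    UPDATE users
    SET first_name = substr(full_name, 1, instr(full_name, ' ') - 1),
        last_name  = substr(full_name, instr(full_name, ' ') + 1)
    WHERE first_name IS NULL
""")

row = db.execute("SELECT full_name, first_name, last_name FROM users").fetchone()
# Old readers still see full_name; new readers use the split columns.
# Contract step (dropping full_name) only runs after every deployed
# version has stopped reading it -- a separate, later migration.
```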
Common Mistakes with Zero-Downtime Deployment Patterns
- Running ALTER TABLE statements that lock the table during deployment. Before version 11, PostgreSQL rewrites the entire table under an exclusive lock for ADD COLUMN with a DEFAULT value
- Deploying database migrations and application code in the same step. Separate them so the old code works with the new schema and vice versa
- Skipping canary validation metrics. Deploying to canary and immediately promoting to 100% defeats the purpose
- Not testing rollback procedures. The first time you roll back should not be during an incident
Related to Zero-Downtime Deployment Patterns
Monolith to Microservices Migration, ADR Template & Best Practices