ADR Template & Best Practices — Decision Frameworks
Difficulty: Intermediate
Stakeholders: Staff Engineers, Architects, Engineering Managers
Key Points for ADR Template & Best Practices
- ADRs capture the context, decision, and consequences of significant architectural choices — making future reasoning transparent
- Keep ADRs immutable once accepted — if a decision is superseded, create a new ADR that references and supersedes the original
- Store ADRs in the repository alongside the code they affect — decisions should live where the code lives
- Include a Status field (Proposed, Accepted, Deprecated, Superseded) to track the lifecycle of each decision
- Link related ADRs together to form a decision graph — individual decisions rarely exist in isolation
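The points above can be condensed into a lightweight template. The section names below loosely follow Michael Nygard's widely used format; the ADR number, title, and dates are illustrative:

```markdown
# ADR-0007: Use PostgreSQL for the orders service

Status: Accepted (supersedes ADR-0003)
Date: 2024-03-12

## Context
What forces are at play? What constraints and uncertainty existed at decision time?

## Decision
The choice made, stated in one or two sentences.

## Consequences
What becomes easier, what becomes harder, and what we are now committed to.

## Alternatives Considered
Each rejected option and, briefly, why not.
```

Keeping the whole document to one or two pages, as noted above, is what makes it readable enough that the team actually uses it.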
Common Mistakes with ADR Template & Best Practices
- Writing ADRs after the fact with revisionist context — capture the decision when it is made, including the uncertainty you felt at the time
- Making ADRs too long and detailed — a good ADR is 1-2 pages, not a design document
- Not recording rejected alternatives — the 'why not' is often more valuable than the 'why'
- Treating ADRs as bureaucratic overhead instead of team knowledge — if nobody reads them, your template is too heavy
Related to ADR Template & Best Practices
RFC Writing Guide, Trade-off Analysis Framework
CQRS & Event Sourcing — Data Architecture
Difficulty: Expert
Stakeholders: Staff Engineers, Architects, Data Engineers
Key Points for CQRS & Event Sourcing
- CQRS separates read and write models, allowing each to be optimized independently — write models enforce business invariants, read models are denormalized for query performance
- Event sourcing stores state as a sequence of immutable events rather than mutable rows — this provides a complete audit trail and enables temporal queries
- CQRS and event sourcing are independent patterns — you can use CQRS without event sourcing, and event sourcing without CQRS, though they complement each other well
- Event sourcing makes the system's history a first-class citizen — you can reconstruct state at any point in time by replaying events up to that timestamp
- Projection management is the hidden operational cost — each read model requires a projector that processes events and updates the denormalized view
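The replay idea above can be sketched in a few lines. The Order events and fields here are illustrative, not taken from any specific framework:

```python
from dataclasses import dataclass, field

# Illustrative domain events; real systems would also carry IDs and timestamps.
@dataclass(frozen=True)
class OrderPlaced:
    items: tuple

@dataclass(frozen=True)
class ItemRemoved:
    sku: str

@dataclass(frozen=True)
class OrderShipped:
    pass

@dataclass
class OrderState:
    items: list = field(default_factory=list)
    shipped: bool = False

def apply(state: OrderState, event) -> OrderState:
    """Pure transition: current state + one event -> next state."""
    if isinstance(event, OrderPlaced):
        state.items = list(event.items)
    elif isinstance(event, ItemRemoved):
        state.items = [s for s in state.items if s != event.sku]
    elif isinstance(event, OrderShipped):
        state.shipped = True
    return state

def replay(events) -> OrderState:
    """Reconstruct current state by folding over the immutable event log.
    Replaying a prefix of the log yields state as of that point in time."""
    state = OrderState()
    for event in events:
        state = apply(state, event)
    return state

log = [OrderPlaced(items=("sku-1", "sku-2")), ItemRemoved(sku="sku-1"), OrderShipped()]
print(replay(log))  # items=['sku-2'], shipped=True
```

Temporal queries fall out of the same mechanism: replay only the events with a timestamp at or before the moment you care about.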
Common Mistakes with CQRS & Event Sourcing
- Applying CQRS/ES to every service — these patterns add significant complexity and are only justified for domains with audit requirements, complex business rules, or temporal query needs
- Making the event store the source of truth for queries — the event store is the write model; build separate read-optimized projections for queries
- Designing events that are too granular — 'FieldXChanged' events create noise; prefer domain events that capture business intent like 'OrderShipped'
- Not planning for event schema evolution — your event store will contain events written years ago; you need upcasting strategies to handle old event formats
- Ignoring projection lag — read models are eventually consistent with the write model; if your domain requires immediate read-your-writes consistency, you need a synchronous path or causal consistency mechanisms
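The schema-evolution point above is usually handled with upcasters: pure functions that rewrite old event payloads into the current schema before the domain layer sees them. The event shapes and version numbers in this sketch are hypothetical:

```python
def upcast_order_placed_v1(event: dict) -> dict:
    """v1 stored a single 'item'; v2 stores a list of 'items'."""
    new = {k: v for k, v in event.items() if k != "item"}
    new["version"] = 2
    new["items"] = [event["item"]]
    return new

# Registry keyed by (event type, stored version). As schemas evolve,
# upcasters chain: v1 -> v2 -> v3 -> current.
UPCASTERS = {
    ("OrderPlaced", 1): upcast_order_placed_v1,
}

def upcast(event: dict) -> dict:
    """Apply upcasters until the event reaches the current schema version."""
    while (fn := UPCASTERS.get((event["type"], event["version"]))):
        event = fn(event)
    return event

old = {"type": "OrderPlaced", "version": 1, "item": "sku-1"}
print(upcast(old))  # {'type': 'OrderPlaced', 'version': 2, 'items': ['sku-1']}
```

Because the stored events stay immutable, upcasting happens on read; the event store itself is never rewritten.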
Related to CQRS & Event Sourcing
Event-Driven vs Request-Driven Architecture, Trade-off Analysis Framework
Event-Driven vs Request-Driven Architecture — Architecture Patterns
Difficulty: Advanced
Stakeholders: Staff Engineers, Architects
Key Points for Event-Driven vs Request-Driven Architecture
- Request-driven (synchronous) is simpler to reason about and debug — choose it as the default unless you have a specific reason for events
- Event-driven architecture shines when producers should not know about consumers — it enables true service independence and extensibility
- Eventual consistency is the price you pay for event-driven decoupling — ensure your domain can tolerate it before committing
- Hybrid architectures are the norm in production — use synchronous for queries and commands that need immediate consistency, events for side effects and cross-domain coordination
- Event schemas are your API contract — invest in schema registries and backward-compatible evolution from day one
Common Mistakes with Event-Driven vs Request-Driven Architecture
- Using events for everything, including simple request-response flows — events add latency, complexity, and debugging difficulty without benefit when a synchronous call would suffice
- Not defining event ownership — every event type must have exactly one producing service that owns its schema
- Building event chains that create implicit coupling — if Service A emits event X which triggers Service B which emits event Y which triggers Service C, you have a distributed monolith with extra steps
- Ignoring poison pill messages — a single malformed event can block an entire consumer partition if you do not have dead-letter queue handling
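The poison-pill point above can be sketched as a consumer loop that retries a bounded number of times and then routes the message to a dead-letter destination, so the rest of the partition keeps flowing. The message shapes and handler here are illustrative:

```python
MAX_ATTEMPTS = 3

def consume(messages, handle, dead_letters: list):
    """Process messages in order; after MAX_ATTEMPTS failures, park a
    poison message in the dead-letter list instead of blocking forever."""
    for msg in messages:
        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                handle(msg)
                break
            except Exception as err:
                if attempt == MAX_ATTEMPTS:
                    dead_letters.append({"message": msg, "error": str(err)})
                # otherwise retry; a real consumer would back off here

# Usage: a handler that chokes on one malformed event.
def handle(msg):
    if "order_id" not in msg:
        raise ValueError("malformed event")

dlq = []
consume([{"order_id": 1}, {"bad": True}, {"order_id": 2}], handle, dlq)
print(len(dlq))  # the malformed event landed in the DLQ; the others processed
```

The essential property is that one bad message costs bounded work and then gets out of the way, with enough context recorded to diagnose it later.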
Related to Event-Driven vs Request-Driven Architecture
CQRS & Event Sourcing, Monolith to Microservices Migration
LLM Integration Architecture Patterns — Architecture Patterns
Difficulty: Advanced
Stakeholders: Staff Engineers, Architects, ML Engineers
Key Points for LLM Integration Architecture Patterns
- LLM API calls are fundamentally different from traditional APIs: non-deterministic, slow (seconds not milliseconds), expensive per call, and outputs require validation before use
- A gateway pattern centralizes retry logic, rate limiting, cost tracking, prompt versioning, and model routing in one layer instead of scattering these concerns across services
- Semantic caching (serving a cached response when a new prompt's embedding is close enough to a previous prompt's) can cut LLM costs by 40-70% for applications with repetitive query patterns
- Design for model portability from day one by abstracting provider-specific APIs behind a unified interface, because the model landscape shifts every few months
- Treat all LLM outputs as untrusted input that must be validated, sanitized, and constrained before reaching business logic or end users
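A minimal sketch of the gateway pattern described above. `complete_fn` stands in for any vendor SDK call, and the cost and budget numbers are illustrative; a production gateway would also handle rate limiting, prompt versioning, and model routing:

```python
import time
from typing import Callable

class LLMGateway:
    """Single choke point for LLM calls: retries with backoff, cost
    tracking, and a provider-agnostic interface (all names illustrative)."""

    def __init__(self, complete_fn: Callable[[str], str],
                 cost_per_call: float, budget: float, max_retries: int = 2):
        self.complete_fn = complete_fn
        self.cost_per_call = cost_per_call
        self.budget = budget
        self.spent = 0.0
        self.max_retries = max_retries

    def complete(self, prompt: str) -> str:
        # Hard budget stop: a runaway loop fails fast instead of burning money.
        if self.spent + self.cost_per_call > self.budget:
            raise RuntimeError("LLM budget exhausted")
        for attempt in range(self.max_retries + 1):
            try:
                self.spent += self.cost_per_call
                return self.complete_fn(prompt)
            except Exception:
                if attempt == self.max_retries:
                    raise
                time.sleep(2 ** attempt)  # exponential backoff between retries

# Usage with a fake provider function in place of a real SDK:
gw = LLMGateway(lambda p: f"echo: {p}", cost_per_call=0.01, budget=0.05)
print(gw.complete("hello"))
```

Swapping providers then means swapping `complete_fn`, not touching every call site — which is the portability point made above.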
Common Mistakes with LLM Integration Architecture Patterns
- Scattering raw LLM API calls throughout business logic, making it impossible to swap models, track costs, or enforce consistent prompt patterns
- Deploying without cost controls on LLM usage, so a single runaway loop or retry storm can burn thousands of dollars in minutes before anyone notices
- Treating LLM responses as deterministic, then building brittle parsing logic that breaks when the model rephrases its output slightly differently
- Skipping output validation and feeding raw model responses directly into databases, APIs, or user-facing surfaces without sanitization
- Over-engineering with autonomous agents and complex chains before validating that a simple single-prompt pattern solves the actual problem
Related to LLM Integration Architecture Patterns
Event-Driven vs Request-Driven Architecture, Trade-off Analysis Framework
ML Pipeline & Feature Store Architecture — Data Architecture
Difficulty: Expert
Stakeholders: Staff Engineers, Data Engineers, ML Engineers
Key Points for ML Pipeline & Feature Store Architecture
- Feature stores solve the training-serving skew problem by providing a single source of truth for feature computation logic used in both offline training and online inference
- ML pipelines operate in two fundamentally different modes, batch and real-time, each with different latency, throughput, and consistency requirements that demand different infrastructure
- Data versioning and lineage tracking are not optional extras. Without them, you cannot reproduce results, debug model regressions, or meet audit requirements
- The feature computation layer is typically the most expensive component in the ML stack, and optimizing it through incremental computation and caching has the highest cost impact
- Model serving needs the same deployment rigor as code: canary rollouts, rollback capability, traffic splitting, and health checks are non-negotiable
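The training-serving skew point above reduces to one rule: a single feature-definition registry shared by the offline and online paths. The feature names and logic below are illustrative and not tied to any feature-store product:

```python
# The only place feature logic lives — used by both paths below.
FEATURES = {
    "avg_order_value": lambda orders: sum(orders) / len(orders) if orders else 0.0,
    "order_count": lambda orders: len(orders),
}

def compute_features(orders: list[float]) -> dict:
    """Shared core computation, identical for training and serving."""
    return {name: fn(orders) for name, fn in FEATURES.items()}

def build_training_rows(history: dict[str, list[float]]) -> list[dict]:
    """Offline path: batch-compute features for every user."""
    return [{"user": u, **compute_features(o)} for u, o in history.items()]

def serve_features(orders: list[float]) -> dict:
    """Online path: same function, one entity, low latency."""
    return compute_features(orders)

rows = build_training_rows({"u1": [10.0, 30.0], "u2": []})
print(rows)
print(serve_features([10.0, 30.0]))  # identical logic to the offline path
```

Duplicating `compute_features` per path, as the mistake below notes, is exactly what reintroduces the subtle numerical drift a feature store exists to prevent.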
Common Mistakes with ML Pipeline & Feature Store Architecture
- Letting data scientists build ad-hoc notebook pipelines for prototyping and then trying to push those same pipelines to production without re-engineering them
- Building separate feature computation logic for training and serving, which guarantees subtle numerical differences that degrade model performance in production
- Treating ML infrastructure as a data science problem instead of a platform engineering problem, leading to fragile, undocumented systems that only their creator can operate
- Ignoring data quality monitoring, which means your model silently degrades for weeks before anyone notices upstream data changes broke input distributions
Related to ML Pipeline & Feature Store Architecture
CQRS & Event Sourcing, Event-Driven vs Request-Driven Architecture
Monolith to Microservices Migration — Migration Strategies
Difficulty: Advanced
Stakeholders: Architects, VPs of Engineering, Directors
Key Points for Monolith to Microservices Migration
- The Strangler Fig pattern is the safest migration strategy — wrap the monolith with a proxy and gradually route traffic to new services
- Extract services along domain boundaries, not technical layers — a 'User Service' is better than a 'Database Service'
- Data is the hardest part — shared databases between monolith and services create coupling that defeats the purpose of decomposition
- You need robust observability before you start splitting — you cannot debug distributed systems with monolith-era logging
- Most organizations that fail at microservices migration fail because they decompose too aggressively too early — start with 2-3 services, not 20
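The Strangler Fig idea above reduces to a routing table with the monolith as the default upstream; paths move to the table one bounded context at a time. The paths and service URLs below are illustrative:

```python
# Allow-list of extracted routes; everything else still hits the monolith.
EXTRACTED_ROUTES = {
    "/users": "http://user-service.internal",
    "/billing": "http://billing-service.internal",
}
MONOLITH = "http://monolith.internal"

def route(path: str) -> str:
    """Return the upstream for a request path; default to the monolith."""
    for prefix, upstream in EXTRACTED_ROUTES.items():
        if path.startswith(prefix):
            return upstream
    return MONOLITH

print(route("/users/42"))  # routed to the extracted user service
print(route("/orders/7"))  # still served by the monolith
```

In practice this lives in an API gateway or reverse-proxy config rather than application code, but the shape is the same: migration progress is literally the growth of the routing table, and rollback is deleting an entry.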
Common Mistakes with Monolith to Microservices Migration
- Big bang rewrite — attempting to rewrite the entire monolith at once instead of incrementally migrating bounded contexts
- Distributed monolith — extracting services that still share a database, deploy together, and cannot function independently
- Ignoring the shared library trap — common libraries that embed business logic create hidden coupling between services
- Not establishing service contracts before splitting — changing APIs after services are deployed is exponentially harder
- Underestimating the operational cost — each new service needs its own CI/CD pipeline, monitoring, alerting, and on-call rotation
Related to Monolith to Microservices Migration
Event-Driven vs Request-Driven Architecture, CQRS & Event Sourcing
RFC Writing Guide — Communication Patterns
Difficulty: Intermediate
Stakeholders: Staff Engineers, Architects, Tech Leads
Key Points for RFC Writing Guide
- An RFC is a proposal document that invites structured feedback before committing to a design direction
- The problem statement is the most important section — if reviewers do not agree on the problem, they cannot evaluate the solution
- Always include at least two alternatives with honest trade-off analysis — this proves you explored the design space
- Define explicit review timelines and approval criteria upfront to prevent RFCs from lingering indefinitely
- A rejected RFC is a success — it means you avoided a costly mistake before writing code
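An illustrative skeleton reflecting the points above (the ask up front, alternatives, a rollout plan, and an explicit timeline); all field text is placeholder:

```markdown
# RFC: <title>

**The ask:** Approval to proceed with Option A by <date>. Feedback on the
rollout plan is especially welcome.

## Problem Statement
The problem, stated so reviewers can agree on it before evaluating solutions.

## Proposed Solution
Direction, not implementation detail.

## Alternatives Considered
- Option B: rejected because ...
- Do nothing: rejected because ...

## Rollout Plan
How this ships incrementally.

## Review Timeline
Comments close <date>; decision by <date>; approvers: <names>.
```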
Common Mistakes with RFC Writing Guide
- Burying the ask — put what you need from reviewers (approval, feedback, alternative suggestions) in the first paragraph
- Writing a 20-page RFC when a 2-page ADR would suffice — match the document weight to the decision impact
- Not including a rollout plan — reviewers need to understand how you will ship this incrementally, not just what the end state looks like
- Treating the RFC as a spec — an RFC proposes direction, a design doc specifies implementation details
Related to RFC Writing Guide
ADR Template & Best Practices, Trade-off Analysis Framework
Trade-off Analysis Framework — Decision Frameworks
Difficulty: Advanced
Stakeholders: Staff Engineers, Architects, VPs of Engineering
Key Points for Trade-off Analysis Framework
- Every architectural decision involves trade-offs — the goal is not to eliminate trade-offs but to make them explicit and intentional
- Use weighted scoring matrices to compare options objectively — assign weights based on business priorities, not personal preferences
- Classify decisions by reversibility: Type 1 (irreversible, high cost) decisions deserve deep analysis; Type 2 (reversible) decisions should be made quickly
- Blast radius assessment determines how many teams, services, or users are affected if the decision turns out to be wrong
- Document the trade-offs you accepted, not just the option you chose — this context is invaluable when revisiting decisions later
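The weighted scoring matrix above can be sketched in a few lines; the criteria, weights, and scores here are illustrative (weights from business priorities, scores of 1-5 per criterion):

```python
# Weights should sum to 1.0 and come from business priorities, not preference.
WEIGHTS = {"operational_cost": 0.4, "time_to_market": 0.35, "performance": 0.25}

# Hypothetical options scored 1-5 on each criterion.
OPTIONS = {
    "managed_service": {"operational_cost": 4, "time_to_market": 5, "performance": 3},
    "self_hosted":     {"operational_cost": 2, "time_to_market": 2, "performance": 5},
}

def weighted_score(scores: dict) -> float:
    """Sum of weight * score across all criteria."""
    return sum(WEIGHTS[c] * s for c, s in scores.items())

ranked = sorted(OPTIONS, key=lambda o: weighted_score(OPTIONS[o]), reverse=True)
for name in ranked:
    print(f"{name}: {weighted_score(OPTIONS[name]):.2f}")
```

The matrix does not make the decision for you; its value is forcing the weights, and therefore the priorities, into the open where they can be challenged.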
Common Mistakes with Trade-off Analysis Framework
- Analysis paralysis — spending 3 weeks analyzing a Type 2 decision that can be reversed in a day
- Optimizing for a single dimension (usually performance) while ignoring operational complexity, team expertise, and hiring implications
- Using gut feeling for Type 1 decisions — irreversible choices deserve structured analysis even when you have strong intuition
- Not revisiting decisions when the context changes — a trade-off that made sense 18 months ago may no longer hold
Related to Trade-off Analysis Framework
ADR Template & Best Practices, CQRS & Event Sourcing