Migration Program Management
A Migration Story Worth Learning From
In 2018, a fintech company I advised started migrating from a Django monolith to Go microservices. The plan was 12 months. Engineering leadership approved it based on a thorough technical proposal, clear phases, and a compelling argument about scaling limits.
The first three phases went well. By month five, they had extracted auth, payments, and notifications into separate services. Traffic was routing cleanly. The team was ahead of schedule. They celebrated at an all-hands.
Then they hit the order processing service. It had 340 database tables, implicit coupling to six other domains, and business logic that predated anyone still at the company. The engineer who originally built it had left two years prior. Documentation was sparse. Every attempt to extract a bounded context revealed another hidden dependency.
Month 12 arrived. The migration was 65% complete. Feature development had slowed to a crawl because the remaining services were the ones that touched everything. Product leadership started asking hard questions. By month 18, the VP of Engineering was replaced. The new VP paused the migration at 75%, leaving three services running on the old monolith indefinitely. The company had spent roughly $4M in engineering time and captured maybe 60% of the expected value.
The migration did not fail because of technology. It failed because the team scoped the project based on the easy extractions and assumed the hard ones would follow a similar pattern.
Why Migrations Actually Fail
The last 20% problem. The first services you extract are the ones with clean boundaries: auth, notifications, feature flags. These move quickly and build false confidence. The remaining services are the tangled ones, with circular dependencies, shared database tables, and business logic encoded in SQL views. Budget at least 40% of total effort for the final 20% of scope.
Organizational patience runs out. Stripe's migration from Ruby monolith to services took years. Twitter's shift from Ruby to Scala was a multi-year effort. Airbnb's move from monolith to SOA is ongoing after several years. These are companies with deep engineering benches and strong executive commitment. Your timeline will not be shorter. If your migration plan depends on sustained executive patience beyond 18 months, build in explicit value-delivery milestones that renew that patience.
The "almost done" death spiral. Without hard completion criteria, migrations enter a phase where they are perpetually 90% complete. Teams move on to other priorities. The old system stays running because someone still depends on it. Dual systems create operational burden that slowly drains the team. This is worse than not migrating at all, because you now maintain two systems instead of one.
Migration Readiness Checklist
Before committing to a migration, run through these hard criteria. If you cannot answer yes to all of them, you are not ready.
Ownership clarity. Is there a single engineering leader (director or above) who owns the migration outcome and has authority to allocate people across teams? If migration responsibility is distributed across multiple managers with competing priorities, it will lose every staffing negotiation.
Dependency map complete. Have you mapped every service-to-service and service-to-database dependency for the systems being migrated? Not at the architecture diagram level. At the "which tables does this service read from that it does not own" level. If you discover dependencies during migration instead of before it, every discovery adds weeks.
Rollback plan per phase. Can you revert each phase independently without data loss? This means dual-write capability, traffic splitting infrastructure, and tested rollback procedures. If your rollback plan is "re-deploy the old code," you do not have a rollback plan, because any data written through the new path after cutover will be missing from the system you roll back to.
Staffing committed. Do you have named engineers assigned to the migration with explicit percentage allocations in their sprint commitments? "The team will work on migration when they have bandwidth" means the migration will never finish.
Business case documented with kill criteria. Is there a written document that states: "This migration is justified because [specific business outcome]. If by [date] we have not achieved [measurable milestone], we will reassess continuation." Without kill criteria, failing migrations become zombie projects that drain resources for years.
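The dependency-map item in the checklist above asks which tables each service touches that it does not own. As a rough illustration, here is a minimal Python sketch of that check. The ownership inventory, the access records, and every service and table name are hypothetical; in practice the access list would have to be distilled from query logs or code analysis.

```python
from collections import defaultdict

# Hypothetical inventory: which service owns each table.
TABLE_OWNERS = {
    "users": "auth",
    "sessions": "auth",
    "orders": "orders",
    "invoices": "payments",
}

# Hypothetical observations: (service, table, access mode).
OBSERVED_ACCESS = [
    ("auth", "users", "write"),
    ("orders", "orders", "write"),
    ("orders", "users", "read"),
    ("orders", "invoices", "read"),
    ("payments", "invoices", "write"),
    ("payments", "orders", "write"),
    ("notifications", "audit_log", "read"),
]


def cross_ownership_access(owners, accesses):
    """Group every access to a table that the accessing service does not own."""
    hidden = defaultdict(list)
    for service, table, mode in accesses:
        # Tables missing from the inventory are flagged as UNOWNED.
        owner = owners.get(table, "UNOWNED")
        if owner != service:
            hidden[service].append((table, owner, mode))
    return dict(hidden)


report = cross_ownership_access(TABLE_OWNERS, OBSERVED_ACCESS)
for service, entries in sorted(report.items()):
    for table, owner, mode in entries:
        print(f"{service} {mode}s {table} (owned by {owner})")
```

Each line of the report is a dependency that would otherwise surface mid-migration. Writes to a table owned by another service and accesses to tables nobody owns are usually the ones worth resolving first.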
Phasing for Real Value Delivery
Each phase must deliver standalone value. If the migration gets cancelled after any phase, the completed work should still justify its cost.
Bad phasing: Phase 1 sets up infrastructure. Phase 2 extracts services. Phase 3 migrates traffic. Phase 4 decommissions old code. This fails because phases 1 and 2 deliver zero user-facing value. If the project gets cut after phase 2, you have new infrastructure running nothing and old infrastructure running everything.
Good phasing: Phase 1 extracts the auth service and routes 100% of auth traffic through it. The monolith auth code is deleted. Measurable outcome: auth latency drops from 200ms to 50ms, and the auth team can deploy independently. Even if nothing else gets migrated, this phase has paid for itself.
Define completion criteria that are unambiguous. "Service X handles 100% of production traffic for the user authentication domain, the monolith code path is deleted, all tests pass against the new service, and on-call runbooks are updated" is a completion criterion. "Auth service is mostly working and we are planning to cut over soon" is not.
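One way to keep a completion criterion unambiguous is to write it down as a set of boolean checks. The sketch below is a minimal Python illustration of the auth example above. The `PhaseStatus` fields and function names are hypothetical, and how each value gets measured is left open.

```python
from dataclasses import dataclass


@dataclass
class PhaseStatus:
    """Hypothetical snapshot of one migration phase."""
    name: str
    new_service_traffic_pct: float  # share of production traffic on the new service
    monolith_code_deleted: bool
    tests_passing: bool
    runbooks_updated: bool


def unmet_criteria(status: PhaseStatus) -> list[str]:
    """Return the criteria that are still unmet; an empty list means done."""
    checks = [
        (status.new_service_traffic_pct >= 100.0,
         "new service handles 100% of production traffic"),
        (status.monolith_code_deleted, "monolith code path is deleted"),
        (status.tests_passing, "all tests pass against the new service"),
        (status.runbooks_updated, "on-call runbooks are updated"),
    ]
    return [label for ok, label in checks if not ok]


def is_complete(status: PhaseStatus) -> bool:
    return not unmet_criteria(status)


auth = PhaseStatus("auth", 100.0, True, True, True)
orders = PhaseStatus("orders", 95.0, False, True, True)

print(auth.name, "complete:", is_complete(auth))
print(orders.name, "still open:", unmet_criteria(orders))
```

A phase at 95% of traffic with the old code path still in place reports two open items, so "mostly working" cannot be recorded as done.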
Protecting Feature Velocity
Dedicate 60-70% of engineering capacity to migration and 30-40% to feature work. This extends the timeline but preserves organizational support. The moment product leadership feels that the migration has halted all customer-facing progress, they will start lobbying to pause it.
Staff the migration team with engineers who want to do infrastructure work. Drafting your best feature engineers against their will produces resentful people writing careless migration code. Worse, they leave mid-project, taking context with them.
Create visible progress metrics that non-engineers can understand. "72% of traffic now served by new services" is meaningful to a product VP. "We refactored the repository layer and updated 340 integration tests" is not. Both are real progress, but only one buys you continued support.
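A traffic-share metric like the one quoted above is simple to compute once request counts per backend are available. This is a minimal Python sketch. The backend names and counts are made up for illustration, and the real counts would come from whatever load balancer or gateway metrics you already collect.

```python
def migrated_traffic_pct(requests_by_backend: dict[str, int],
                         new_backends: set[str]) -> float:
    """Percentage of requests served by backends that count as migrated."""
    total = sum(requests_by_backend.values())
    if total == 0:
        return 0.0
    migrated = sum(count for backend, count in requests_by_backend.items()
                   if backend in new_backends)
    return 100.0 * migrated / total


# Hypothetical request counts for one reporting period.
counts = {
    "monolith": 280_000,
    "auth-svc": 400_000,
    "payments-svc": 250_000,
    "notifications-svc": 70_000,
}
new_services = {"auth-svc", "payments-svc", "notifications-svc"}

pct = migrated_traffic_pct(counts, new_services)
print(f"{pct:.0f}% of traffic now served by new services")
```

Reporting the same number on the same schedule matters more than the exact definition, as long as the definition does not change between reports.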
When to Cut Losses
Specific signals that a migration should be reassessed:
Cost math has flipped. Recalculate quarterly: estimated remaining cost (people, time, infrastructure) versus estimated remaining benefit (performance, reliability, developer productivity). If remaining cost exceeds remaining benefit by more than 30%, stop. Run this with real numbers. Suppose you have spent $2M and estimate $3M more to finish, the total expected benefit was $4M, and the work completed so far has already captured about $2M of it. The remaining $2M of benefit does not justify $3M of cost: the cost exceeds the benefit by 50%, well past the 30% threshold. The $2M already spent is sunk and does not enter the comparison.
Key engineers have left. If the two or three people with the deepest context on the migration have departed and backfill hiring is taking longer than eight weeks, pause. New engineers will take months to rebuild that context, and the error rate during that ramp-up period is high.
The target architecture has shifted. If you started migrating to Kubernetes and the industry (or your company) has moved toward serverless for the relevant workloads, continuing the original plan is throwing good money after bad. Reassess the target, not just the execution.
Milestones are consistently 2x over estimate. One missed milestone is normal. Three consecutive milestones taking twice as long as planned is a systemic estimation failure that will not self-correct.
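Two of the signals above, the cost comparison and the milestone overruns, are plain arithmetic and can be checked mechanically each quarter. The following is a minimal Python sketch. The function names and the sample milestone data are hypothetical, while the 30% tolerance, the 2x factor, and the three-in-a-row rule come from the text.

```python
def cost_exceeds_benefit(remaining_cost: float, remaining_benefit: float,
                         tolerance: float = 0.30) -> bool:
    """True when remaining cost exceeds remaining benefit by more than tolerance.

    Sunk cost is deliberately not a parameter.
    """
    return remaining_cost > remaining_benefit * (1 + tolerance)


def consecutive_overruns(milestones, factor: float = 2.0, streak: int = 3) -> bool:
    """True when `streak` milestones in a row took at least `factor` times the plan.

    `milestones` is a chronological list of (planned_weeks, actual_weeks).
    """
    run = 0
    for planned, actual in milestones:
        run = run + 1 if actual >= factor * planned else 0
        if run >= streak:
            return True
    return False


# Figures from the cost example: $3M remaining cost, $2M remaining benefit.
print("stop on cost:", cost_exceeds_benefit(3.0, 2.0))

# Hypothetical milestone history, in weeks.
history = [(4, 5), (4, 8), (6, 13), (3, 7)]
print("systemic estimation failure:", consecutive_overruns(history))
```

Neither check makes the decision for you. Their value is that the trigger gets agreed before the numbers come in, which makes it harder to argue from sunk cost afterwards.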
Stopping a migration is a leadership act, not a failure. Present the data: total spent, remaining estimate, updated ROI calculation, risks of continuing, and a plan for stabilizing the current hybrid state. Preserve what you have built, archive what you have not, and document why you stopped so the next leader does not repeat the same path.
Key Points
- The last 20% of any migration contains 80% of the pain. Budget and staff for it explicitly, or it will drag on for years as a zombie project.
- Feature velocity must stay above 60% during migration. Drop below that and you lose executive sponsorship faster than you lose engineers.
- A migration without a dedicated program manager is a migration that will stall at the first cross-team dependency.
- Define 'done' for each phase with measurable criteria: traffic percentage, latency targets, old code paths deleted. 'Mostly migrated' is not a milestone.
- The decision to stop a failing migration should be based on remaining-cost-vs-remaining-benefit math, not sunk cost.
Common Mistakes
- Estimating the migration timeline based on the easy parts. The first 50% goes smoothly; the last 20% involves the services nobody wants to touch, the edge cases nobody documented, and the integrations nobody fully understands.
- Running the migration as 'extra work' on top of normal sprint commitments instead of staffing a dedicated team with explicit allocation.
- Celebrating early milestones too loudly, which sets unrealistic expectations for the pace of later phases when complexity increases.
- Skipping the dual-write/shadow-read verification phase because it 'slows things down.' This is where you catch data consistency bugs before they reach production.
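To make the dual-write/shadow-read idea in the last item concrete, here is a minimal Python sketch. It uses plain dictionaries as stand-ins for the old and new data stores, and the `DualWriteStore` class and its method names are hypothetical. A real version would also need error handling around the new store, so that a failure there never breaks a production write.

```python
class DualWriteStore:
    """Writes go to both stores. Reads are served from the old store and
    compared against the new one (a shadow read)."""

    def __init__(self, old: dict, new: dict):
        self.old = old
        self.new = new
        self.mismatches = []  # (key, value in old store, value in new store)

    def write(self, key, value):
        self.old[key] = value  # the old store remains the source of truth
        self.new[key] = value

    def read(self, key):
        expected = self.old.get(key)
        shadow = self.new.get(key)
        if shadow != expected:
            # Record the disagreement, but keep serving the old store's answer.
            self.mismatches.append((key, expected, shadow))
        return expected


# "order:1" predates dual-writing and was never backfilled into the new store.
old_db = {"order:1": "paid"}
new_db = {}
store = DualWriteStore(old_db, new_db)

store.write("order:2", "pending")
store.read("order:2")  # both stores agree
store.read("order:1")  # the shadow read exposes the missing backfill

print(store.mismatches)
```

Callers only ever see the old store's answer, so a mismatch is recorded without affecting production. The record that was never backfilled is the kind of data consistency bug this phase exists to surface before cutover.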