Strategies for Successful Monolith Splitting

In our previous post, we explored how to identify business domains in your monolith and create a plan for splitting. Now, let’s dive into the strategies for executing this plan effectively. We’ll cover modularization techniques, handling ongoing development during the transition, and measuring your progress.

If you are in the early stages of the chart, you can probably look into modularization. If, however, you are towards the right-hand side (like we were), you will need to take some more drastic action.

If you are on the right-hand side, then your monolith is at the point where you need to stop writing code there NOW.

There are two things to consider:

  • For new domains, or significant new features in existing domains, start them outside straight away
  • For existing domains, build a new system for each of them and move the code out

Once your code is in a new system, you get all the benefits straight away on that code. You aren’t waiting for an entire system to migrate before you see results in your velocity. This is why we say start with the high-volume change areas and domains first.

How do you stop writing code there “now”? Apply the open-closed principle at the system level:

  1. Open for extension: Extend functionality by consuming events and calling APIs from new systems
  2. Closed for modification: Limit changes to the monolith, aim to get to the point where it’s only crucial bug fixes

This pattern encourages you to move to the new high development velocity systems.
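
To make the “open for extension” half concrete, here’s a minimal sketch of the new-system side. The bus client, event name, and payload are hypothetical stand-ins, not our actual stack:

```typescript
// Hypothetical message-bus client; substitute Kafka, SQS, etc.
declare const bus: {
  subscribe<T>(topic: string, handler: (event: T) => Promise<void>): void;
};

// Shape of an event the monolith already emits (illustrative).
interface OrderPlaced {
  orderId: string;
  customerId: string;
  totalCents: number;
}

// A brand-new loyalty feature, implemented entirely in the new system
// by consuming the event. The monolith stays closed for modification.
bus.subscribe<OrderPlaced>("order-placed", async (event) => {
  await awardLoyaltyPoints(event.customerId, event.totalCents);
});

async function awardLoyaltyPoints(customerId: string, cents: number): Promise<void> {
  // New-system logic, writing to the new system's own datastore.
}
```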

Modularization: The First Step for Those on the Left of the Chart

Before fully separating your monolith into distinct services, it’s often beneficial to start with modularization within the existing system, sometimes described as building a “modular monolith”. This approach can be particularly effective for younger monoliths.

Modularization is a good strategy when:

  • Your monolith is relatively young and manageable
  • You want to gradually improve the system’s architecture without a complete overhaul
  • You need to maintain the existing system while preparing for future splits

However, be wary of common pitfalls in this process:

  • Avoid over-refactoring; focus on creating clear boundaries between modules
  • Ensure your modularization efforts align with your identified business domains
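
One way to make those boundaries real in code is to give each domain module a single public entry point, so other modules can’t reach into its internals. A minimal sketch, using a hypothetical “billing” domain:

```typescript
// billing/index.ts: the only file other modules are allowed to import from.
export { createInvoice } from "./create-invoice";
export type { Invoice } from "./types";

// Tax rules, DB access, and other internals are deliberately not
// re-exported, so no other module can grow a dependency on them.
```

A lint rule such as ESLint’s no-restricted-imports can then fail the build when someone imports billing internals directly, keeping the boundary honest.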

For ancient monoliths with extremely slow velocity, a more drastic “lift and shift” approach into a new system is recommended.

Integrating New Systems with the Monolith, for Those to the Right

When new requirements come in, especially for new domains, start implementing them in new systems immediately. This approach helps prevent your monolith from growing further while you’re trying to split it.

Integrating new systems with your monolith requires these considerations:

  1. Add events for everything that happens in your monolith, especially around data or state changes
  2. Listen to these events from new systems
  3. When new systems need to call back to the monolith, use the monolith’s APIs

This event-driven approach allows for loose coupling between your old and new systems, facilitating a smoother transition.
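
Here’s a minimal sketch of points 1 and 3 from that list, with hypothetical `db` and `bus` clients standing in for whatever your monolith actually uses:

```typescript
// Stand-ins for the monolith's existing data layer and a message bus.
declare const db: {
  customers: { update(id: string, patch: object): Promise<void> };
};
declare const bus: {
  publish(topic: string, payload: object): Promise<void>;
};

interface Address { street: string; city: string; }

// Point 1: every state change in the monolith also emits an event.
async function updateCustomerAddress(id: string, address: Address): Promise<void> {
  await db.customers.update(id, { address }); // existing monolith write
  await bus.publish("customer.address-changed", { id, address }); // new: notify new systems
}

// Point 3: when a new system needs data back from the monolith, it calls
// the monolith's API rather than reaching into its database, e.g.:
//   await fetch(`https://monolith.internal/api/customers/${id}`);
```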

Existing Domains: The Copy-Paste Approach for Those to the Right

If your monolith is in particularly bad shape, sometimes the best approach is the simplest: build a new system, copy-paste the code across, and route to it from the L7 router as step one. Don’t get bogged down trying to improve everything right away. Focus on basic linting and formatting, but avoid major refactoring or upgrades at this stage. The goal is to get the code into the new system first, then improve it incrementally.

However, this approach comes with its own set of challenges. Here are some pitfalls to watch out for:

Resist the urge to upgrade everything: A common mistake is trying to upgrade frameworks or libraries during the split. For example, one team, 20% into their split, decided to upgrade React from version 16 to 18 and move all tests from Enzyme to React Testing Library in the new system. This meant that for the remaining 80% of the code, they not only had to move it but also refactor tests and deal with breaking React changes. They ended up reverting to React 16 and keeping Enzyme until further into the migration.

Remember: the sooner your code gets into the new system, the sooner you get faster.

Don’t ignore critical issues: While the “just copy-paste” approach can be efficient, it’s not an excuse to ignore important issues. In one case, a team following this advice submitted a merge request that contained a privilege escalation security bug, which was fortunately caught in code review. When you encounter critical issues like security vulnerabilities, fix them immediately – don’t wait.

Balance speed with improvements: It’s okay to make some improvements as you go. Simple linting fixes that can be auto-applied by your IDE or refactoring blocking calls into proper async/await patterns are worth the effort. It’s fine to spend a few extra hours on a multi-day job to make things a bit nicer, as long as it doesn’t significantly delay your migration.
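
As an example of the kind of cheap improvement we mean, here’s a fire-and-forget promise chain rewritten as proper async/await during the move. Function names are illustrative:

```typescript
// Stand-ins for calls that exist in the code being migrated.
declare function fetchUser(id: string): Promise<{ id: string; name: string }>;
declare function fetchOrders(userId: string): Promise<object[]>;
declare function render(user: object, orders: object[]): void;

// Before: nested .then() chain copied from the monolith; errors are
// silently dropped because nothing awaits or catches the promise.
function loadUserOld(id: string): void {
  fetchUser(id).then((user) => {
    fetchOrders(user.id).then((orders) => render(user, orders));
  });
}

// After: a few minutes' work in the new system; failures now propagate
// to the caller instead of vanishing.
async function loadUser(id: string): Promise<void> {
  const user = await fetchUser(id);
  const orders = await fetchOrders(user.id);
  render(user, orders);
}
```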

The key is to find the right balance. Move quickly, but don’t sacrifice the integrity of your system. Make improvements where they’re easy and impactful, but avoid getting sidetracked by major upgrades or refactors until the bulk of the migration is complete.

Measuring Progress and Impact, Part 1: Velocity

Your goal is to have business impact. Impact comes from the velocity game to start with, so that’s where our measurements start.

Number of MRs on new vs old systems: Initially, focus on getting as many engineers onto the new (high-velocity) systems as possible. Compare your number of MRs on old vs new over time and monitor the change to make sure you are having the impact here first.

Overall MR growth: If the total number of MRs across all systems is growing significantly, it might indicate incorrect splitting, or incremental work dragging across both systems.

Work tracking across repositories: Ask engineers to use the same JIRA ID (or equivalent) for related work across repositories, in the branch name or MR title, to track units of work spanning both old and new systems.

Velocity metrics on old vs new: Don’t assume your new systems will always be better. Compare old vs new on your velocity metrics and make sure you are seeing the difference.
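
If you’re on GitLab, the MR comparison can be pulled straight from the API; the X-Total response header gives the match count without paging (it caps out on very large result sets). The host, token, and project IDs below are placeholders:

```typescript
declare const process: { env: Record<string, string | undefined> };

// Count merged MRs for one project since a given date.
async function mergedMrCount(projectId: number, since: string): Promise<number> {
  const res = await fetch(
    `https://gitlab.example.com/api/v4/projects/${projectId}` +
      `/merge_requests?state=merged&created_after=${since}&per_page=1`,
    { headers: { "PRIVATE-TOKEN": process.env.GITLAB_TOKEN ?? "" } }
  );
  return Number(res.headers.get("x-total") ?? 0);
}

// Compare old vs new over the same window and watch the ratio move:
// const oldCount = await mergedMrCount(123, "2024-01-01T00:00:00Z"); // monolith
// const newCount = await mergedMrCount(456, "2024-01-01T00:00:00Z"); // new system
```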

Ok, now when you hit critical mass on the above (for us, we called it at about 80%), you will need to shift. The long tail there will offer less ROI on velocity; it’ll become a support game, and you need to face it differently.

Measuring Progress and Impact, Part 2: Traffic

So at this time it’s best to look at traffic. Moving high-volume pages/endpoints should, in theory, reduce the impact if there’s an issue with the legacy system, thereby reducing the support load. This might not be true for your systems; you may have valuable endpoints with low traffic, so you need to work out the best way for you.

Traffic distribution: Look per page or per endpoint at where the biggest piece of the pie is.

Low traffic: Look per page or per endpoint at where there is low traffic; this may lead you to find features you can deprecate.
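
A quick-and-dirty way to get both views is to count requests per path from an access log. The log file name and field position below assume common log format; adapt them to yours:

```typescript
import { readFileSync } from "node:fs";

// Tally requests per path. In common log format the request path is the
// 7th whitespace-separated field (index 6).
const counts = new Map<string, number>();
for (const line of readFileSync("access.log", "utf8").split("\n")) {
  const path = line.split(" ")[6];
  if (path) counts.set(path, (counts.get(path) ?? 0) + 1);
}

const sorted = [...counts.entries()].sort((a, b) => b[1] - a[1]);
console.log("biggest pieces of the pie:", sorted.slice(0, 20));
console.log("deprecation candidates:", sorted.slice(-20));
```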

As you move functionality to new services, you may discover features in the monolith that are rarely used. Raise these with product and stakeholders and ask, “What’s the value this brings vs the effort to migrate and maintain it?” Depending on the answer, consider:

  1. deprecating the page or endpoint
  2. combining functionality into other similar pages/endpoints to reduce codebase size

Remember, every line of code you don’t move is a win for your migration efforts.

Conclusion

Splitting a monolith is a complex process that requires a strategic approach tailored to your system’s current state. Whether you’re dealing with a younger, more manageable monolith or an ancient system with slow velocity, there’s a path forward.

The key is to stop adding to the monolith immediately, start new development in separate systems, and approach existing code pragmatically – sometimes a simple copy-paste is the best first step. As you progress, shift your focus from velocity metrics to traffic distribution and support impact.

Remember, the goal is to improve your system’s overall health and development speed. By thoughtfully planning your split, building new features in separate systems, and closely tracking your progress, you can successfully transition from a monolithic to a microservices architecture.

In our next and final post of this series, we’ll discuss how to finish strong, including strategies for cleaning up your codebase, maintaining momentum, and ensuring you complete the splitting process. Stay tuned!

Your Load Balancer Will Kill You

Let’s start by talking about the traditional way people scale applications.

You have a startup, you have a new idea, so you throw something out there fast. Maybe it’s a Rails app with a MongoDB backend if you’re unlucky, or something like that.

Now things are going pretty well. Maybe you have time and you rewrite into something sensible at this point as your business grows and you get more devs, so now you’re on a nice React website with a .NET or Node backend. But you’re going slow due to too many users, so you start to scale. The first thing people do is this: put a load balancer in front and horizontally scale the web layer.

Now that doesn’t seem too bad. With a few cache optimizations you’re probably handling a few thousand simultaneous users and feeling happy. But you keep growing, and now you need to handle tens of thousands, so the architecture starts to break out vertically.

So let’s imagine something like the below. If we’ve been good little vegemites, we have a good separation of domains, and so are able to scale the DB by separating the domains out into microservices on the backend, each with their own independent data.

Our website then ends up looking a bit like a BFF (Backend For Frontend), and we scale nicely, up to tens or into the hundreds of thousands of users. And if you are using AWS especially, you are going to have these lovely elastically scaling load balancers everywhere.

Now when everything is working it’s fine, but let’s look at a failure scenario.

One of the API B servers goes offline. Like, dead, total hardware failure. What happens in the seconds that follow?

To start, let’s look at load balancer redundancy methods. LBs use a health-check endpoint; an aggressive setting would be to ping it every 2 seconds, then after 2 consecutive failures take the node offline.

Let’s also take the example that we are getting 1,000 requests per second from our BFF, spread across 3 API B nodes, so the dead node owns roughly 333 requests per second.

Second 1
Lose 333 Requests

Second 2
Lose 333 Requests
Health check fails first time

Second 3
Lose 333 Requests

Second 4
Lose 333 Requests
Health check fails second time and LB stops sending traffic to node

So in this scenario we’ve lost about 1300 requests, but we’ve recovered.

Now you say: but how about we get more aggressive with the health check? This only goes so far.

At scale, the more common outages are not ones where things go totally offline (although this does happen); they are ones where things go “slow”.

Now imagine we have aggressive LB health checks (the ones above are already very aggressive, so you usually can’t get much more so), and things are going “slow” to the point that health checks are randomly timing out. You’ll start to see nodes pop on and offline randomly, and your load will get unevenly distributed, usually to the point where you may even have periods of no nodes online. 503s FTW! I’ve witnessed this first hand; it happens with aggressive health checks 🙂

Next: what happens if your load balancer goes offline? While load balancers are generally very reliable, config updates and firmware updates are the times when they most commonly fail, but even then, they can still succumb to hardware failure.

If you are working in e-commerce like I have been for the last 15-odd years, then traffic is money; every bit of traffic you lose can potentially be costing you money.

Also, when you start to get to very large scale, the natural entropy of hardware means hardware failure becomes more common. For example, if you have, say, 5,000 physical servers in your cloud, how often will you have a failure that takes applications offline?

And it doesn’t matter if you are running AWS, Kubernetes, etc.; hardware failure still takes things offline. Your VMs and containers may restart with little to no data loss, but they still go offline for periods.

How do we deal with this, then? How about client-side weighted round-robin?

WTF is that? I hear you say. Good Question!

It’s where we move the load-balancing mechanism to the client that is calling the backend. There are several advantages to doing this.

This is usually coupled with a service discovery system; we use Consul these days, but there are lots out there.

The basic concept is that the client gets a list of all available nodes for a given service. It will then maintain its own in-memory list of them and round-robin through them, similar to a load balancer.
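
For the discovery half, Consul’s health API will hand you only the passing instances of a service. A sketch, with the Consul address and service name as placeholders:

```typescript
// GET /v1/health/service/<name>?passing returns healthy instances only.
async function getNodes(service: string): Promise<string[]> {
  const res = await fetch(
    `http://localhost:8500/v1/health/service/${service}?passing`
  );
  const entries: Array<{ Service: { Address: string; Port: number } }> =
    await res.json();
  return entries.map((e) => `http://${e.Service.Address}:${e.Service.Port}`);
}

// Refresh this list every few seconds so dead nodes age out quickly.
// const nodes = await getNodes("api-b");
```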

This removes infrastructure (i.e. cost and complexity).

The big difference comes, though, in that the client can retry the request on a different node. You can implement retries when you have a load balancer in front of you, but you are in effect rolling the dice; having knowledge of the endpoints on the client side means that retries can be targeted at a different server to the one that errored or timed out.

What’s the Weighting part though?

Each client maintains its own list of failures. So, for example, if a client got a 500 or a timeout from a node, it would weight that node down and start to call it less. This causes a node-specific back-off, which is extremely important in the more common at-scale outage of “it’s slow”: if a particular node has been smashed a bit too much by something and is overloaded, the clients will back off that guy.
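
Here’s a minimal sketch of the whole client: weighted selection, down-weighting on failure, slow recovery on success, and a retry aimed at a different node. The numbers (halve on failure, +1 per success, 1-second timeout) are illustrative, and weighted random selection stands in for a strict round-robin to keep it short:

```typescript
type BackendNode = { url: string; weight: number };

class WeightedClient {
  private nodes: BackendNode[];

  constructor(urls: string[]) {
    this.nodes = urls.map((url) => ({ url, weight: 100 })); // full weight to start
  }

  // Pick a node with probability proportional to its current weight.
  private pick(exclude?: string): BackendNode {
    const candidates = this.nodes.filter((n) => n.url !== exclude);
    let r = Math.random() * candidates.reduce((s, n) => s + n.weight, 0);
    for (const n of candidates) {
      r -= n.weight;
      if (r <= 0) return n;
    }
    return candidates[candidates.length - 1];
  }

  // Halve the weight on failure (fast back-off); creep it back up on success.
  // The floor of 1 means a downed node still sees the odd probe request.
  private report(node: BackendNode, ok: boolean): void {
    node.weight = ok
      ? Math.min(100, node.weight + 1)
      : Math.max(1, Math.floor(node.weight / 2));
  }

  // One attempt, then one retry deliberately targeted at a different node.
  async request(path: string): Promise<Response> {
    const first = this.pick();
    try {
      const res = await fetch(first.url + path, { signal: AbortSignal.timeout(1000) });
      this.report(first, res.ok);
      if (res.ok) return res;
    } catch {
      this.report(first, false); // timeout or connection error
    }
    const second = this.pick(first.url);
    const res = await fetch(second.url + path, { signal: AbortSignal.timeout(1000) });
    this.report(second, res.ok);
    return res;
  }
}
```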

Let’s look at the same scenario as before, with API B and a node going offline. We usually set our timeouts to 500 ms to 1 second for our API requests, so let’s say 1 second. As the requests start to fail, they will retry on the next node in the list and weight down the offline server in the local client’s list of servers/weightings. Here’s what it looks like:

Second 1
220 Requests take 1 second longer

Second 2
60 Requests take 1 second longer

Second 3
3 Requests take 1 second longer

Second 4
3 Requests take 1 second longer

Second 5
~0 requests affected; the weighting has effectively removed the node

The round-robin weighting kicks in at the first failure. As we only have 3 web servers in this scenario and they are high volume, the back-off isn’t decremented in periods of seconds; it’s in numbers of requests.

Eventually we get to the point where we are trying the API once every few seconds with a request from each web server, until it comes back online or until the service discovery system kicks in and tells us it’s dead (which usually takes under 10 seconds).

But the end result is 0 lost requests.

And this is why I don’t use load balancers any more 🙂