Given that there would only be one service, it made sense to move all the destination code into a single repo, which meant merging all of the different dependencies and tests as well. We knew this was going to be messy.
For each of the 120 unique dependencies, we committed to having one version for all of our destinations. As we moved destinations over, we'd check which dependencies each one was using and update them to the latest versions. We fixed anything in the destinations that broke with the newer versions.
With this transition, we no longer needed to keep track of the differences between dependency versions. All our destinations were using the same version, which significantly reduced the complexity across the codebase. Maintaining destinations now became less time-consuming and less risky.
We also wanted a test suite that allowed us to quickly and easily run all our destination tests. Running all the tests was one of the main blockers when making updates to the shared libraries we discussed earlier.
Fortunately, the destination tests all had a similar structure. They had basic unit tests to verify that our custom transform logic was correct, and they executed HTTP requests to the partner's endpoint to verify that events showed up in the destination as expected.
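For illustration only, here's a rough sketch of what one of these destination test files might have looked like; the destination name, transform module, and partner endpoint are all hypothetical:

```js
// Hypothetical destination test: a unit test for the transform logic, plus a
// test that actually calls the partner's API over HTTP.
const assert = require('assert');
const axios = require('axios');
const { transform } = require('../lib/transform'); // hypothetical transform module

describe('example-destination', () => {
  it('maps a track call to the partner payload', () => {
    const payload = transform({ event: 'Order Completed', properties: { total: 29.99 } });
    assert.strictEqual(payload.event_name, 'Order Completed');
    assert.strictEqual(payload.revenue, 29.99);
  });

  it('delivers the event to the partner endpoint', async () => {
    // This is the kind of outbound request that made the suite slow and flaky.
    const res = await axios.post(
      'https://api.partner.example.com/track', // hypothetical endpoint
      transform({ event: 'Order Completed', properties: { total: 29.99 } })
    );
    assert.strictEqual(res.status, 200);
  });
});
```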
Recall that the original motivation for separating each destination codebase into its own repo was to isolate test failures. However, this turned out to be a false advantage. Tests that made HTTP requests were still failing with some frequency, and with destinations separated into their own repos, there was little motivation to clean up failing tests. This poor hygiene became a constant source of frustrating technical debt: a small change that should have taken only an hour or two would often end up requiring a couple of days to a week to complete.
The outbound HTTP requests to destination endpoints during the test run were the primary cause of failing tests. Unrelated issues like expired credentials shouldn't fail the tests. We also knew from experience that some destination endpoints were much slower than others; some destinations took up to 5 minutes to run their tests. With over 140 destinations, our test suite could take up to an hour to run.
To solve for both of these issues, we created Traffic Recorder. Traffic Recorder is built on top of yakbak, and is responsible for recording and saving destinations' test traffic. Whenever a test runs for the first time, any requests and their corresponding responses are recorded to a file. On subsequent test runs, the request and response in the file are played back instead of requesting the destination's endpoint. These files are checked into the repo so that the tests are consistent across every change. Now that the test suite was no longer dependent on these HTTP requests over the internet, our tests became significantly more resilient, a must-have for the migration to a single repo.
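Traffic Recorder's internals aren't shown here, but yakbak itself works as a recording proxy: you point it at an upstream host, and it persists each request/response pair as a "tape" file that gets replayed on later runs. A minimal sketch of that pattern, with a hypothetical partner endpoint and tape directory:

```js
// Minimal record/replay proxy using yakbak (https://github.com/flickr/yakbak).
// The upstream host and tape directory below are hypothetical.
const http = require('http');
const path = require('path');
const yakbak = require('yakbak');

// First run: requests are forwarded to the destination endpoint and the
// responses are written to ./tapes as individual files (checked into the repo).
// Subsequent runs: matching requests are served straight from the tapes,
// so the tests never go over the internet.
const proxy = http.createServer(yakbak('https://api.partner.example.com', {
  dirname: path.join(__dirname, 'tapes'),
}));

proxy.listen(8080, () => {
  console.log('Recording proxy listening on http://localhost:8080');
});

// The destination's HTTP client is then pointed at http://localhost:8080
// instead of the real endpoint during tests.
```

Because the tapes live alongside the code, a failing test means the code changed, not that a partner's API happened to be slow or a credential had expired.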
I remember running the tests for every destination for the first time after we integrated Traffic Recorder. It took milliseconds to run the tests for all 140+ of our destinations. In the past, just one destination could have taken a couple of minutes to complete. It felt like magic.
Once the code for all destinations lived in a single repo, they could be merged into a single service. With every destination living in one service, our developer productivity substantially improved. We no longer had to deploy 140+ services for a change to one of the shared libraries. One engineer can deploy the service in a matter of minutes.
The proof was in the improved velocity. In 2016, when our microservice architecture was still in place, we made 32 improvements to our shared libraries. Just this year we’ve made 46 improvements. We’ve made more improvements to our libraries in the past 6 months than in all of 2016.
The change also benefited our operational story. With every destination living in one service, we had a good mix of CPU- and memory-intensive destinations, which made scaling the service to meet demand significantly easier. The large worker pool can absorb spikes in load, so we no longer get paged for destinations that process small amounts of load.
Moving from our microservice architecture to a monolith was a huge improvement overall. However, there are trade-offs:
- Fault isolation is difficult. With everything running in a monolith, if a bug is introduced in one destination that causes the service to crash, the service will crash for all destinations. We have comprehensive automated testing in place, but tests can only get you so far. We are currently working on a much more robust way to prevent one destination from taking down the entire service while still keeping all the destinations in a monolith.
- In-memory caching is less effective. Previously, with one service per destination, our low-traffic destinations only had a handful of processes, which meant their in-memory caches of control plane data would stay hot. Now that cache is spread thinly across 3,000+ processes, making it much less likely to be hit. We could use something like Redis to solve for this, but then that's another point of scaling for which we'd have to account. In the end, we accepted this loss of efficiency given the substantial operational benefits.
- Updating the version of a dependency may break multiple destinations. While moving everything to one repo solved the previous dependency mess we were in, it means that if we want to use the newest version of a library, we'll potentially have to update other destinations to work with the newer version. In our opinion though, the simplicity of this approach is worth the trade-off. And with our comprehensive automated test suite, we can quickly see what breaks with a newer dependency version.
Our initial microservice architecture worked for a time, solving the immediate performance issues in our pipeline by isolating the destinations from each other. However, we weren’t set up to scale. We lacked the proper tooling for testing and deploying the microservices when bulk updates were needed. As a result, our developer productivity quickly declined.
Moving to a monolith allowed us to rid our pipeline of operational issues while significantly increasing developer productivity. We didn't make this transition lightly, though, and knew there were things we had to consider if it was going to work:
- We needed a rock-solid testing suite to put everything into one repo. Without this, we would have been in the same situation as when we originally decided to break them apart. Constant failing tests hurt our productivity in the past, and we didn't want that happening again.
- We accepted the trade-offs inherent in a monolithic architecture and made sure we had a good story around each. We had to be comfortable with some of the sacrifices that came with this change.
When deciding between microservices and a monolith, there are different factors to consider with each. In some parts of our infrastructure, microservices work well, but our server-side destinations were a perfect example of how this popular trend can actually hurt productivity and performance. It turns out the solution for us was a monolith.