
How to plan an SMS MFA migration that affects thousands of users

In this blog, we share what went right and where we had to pivot in our plan to upgrade the API we use for SMS 2FA, affecting thousands of people.

September 28, 2022

By Jordan Kohl


I had been working at Twilio Segment for less than a year when I was tasked with upgrading the API we used for SMS two-factor authentication (2FA) in our main application. I’m a Staff Software Engineer on the Security Features Team, so this was something I could handle, but it was still nerve-wracking! Twilio Segment receives over one trillion events per month from thousands of customers that trust us to keep their data safe. At Twilio Segment, we believe that building good security into our software is just as essential as reliability or scalability.

Two-factor authentication, also called multi-factor authentication (MFA), is a critical part of how over 24,000 users authenticate to our app. I was nervous about pressing the wrong button at any point in the upgrade process, which could lock users out of their accounts entirely. Accidentally leaking the phone numbers of thousands of individuals was also a concern! I needed a rock-solid plan for this upgrade to work smoothly. So, I made a plan, but it all went wrong anyway. I had to switch to plan B and then plan C.

For multi-factor authentication, the Segment app uses Authy: an API service for verifying a user by sending a code via SMS. With the original Authy API, you create a user in Authy by sending their phone number, which is stored on Authy’s servers. Authy returns an Authy ID, which is stored by the Segment app. After the initial request, only the user’s Authy ID is used to initiate an SMS. This means that for all 24,000+ users with SMS MFA enabled, Segment has never stored their phone number.

The new version of Authy’s SMS API, called Twilio Verify (Twilio acquired Authy in 2015), provides a lot of improvements in error handling, logging, and validation. In addition, the Twilio Segment team was working towards a major milestone that required us to make the upgrade sooner rather than later. Twilio is also deprecating the old API starting in November 2022 and will disable it entirely after May 2023.

One major difference with the new API is that it takes the user’s phone number as input on every request. This means that the path to upgrade from the old Authy API to the new one requires exporting each user’s phone number to be stored in our local database.

Even though I was leading this task, I had a lot of support from my team. They provided feedback during the design phase, reviewed all of my pull requests, and helped run the migration scripts. Going forward, I’ll use “we” instead of “I” because it was a team effort! We needed to define a strategy to seamlessly migrate this bit of sensitive user data into our database and upgrade our code, without interrupting the login process for anyone. This turned out to be a bigger challenge than we imagined!

Why is it hard to solve?

As anyone who has done large-scale database migrations knows, timing is everything. You can’t remove a column from a table if there is code that depends on it. You also can’t create code that depends on a new field until that field has been populated. When dealing with a production database, nothing is static. Users are actively modifying the data, which means what counts as “all the data” is a moving target.

Even assuming the best case scenario of being able to export all the data from Authy and import it into our own system quickly, without error, there would still be a gap during which users are being added or removed because they are using our app and enabling or disabling SMS MFA on their own.

The basic process involves exporting 24K entries from an external API, then updating our internal database. With 24K entries, that process can take hours, during which users are still changing their SMS MFA settings, leading to an inconsistent dataset that has to be cleaned up. So our strategy had to deal with that gap. Ideally, every database migration also has a way to revert the changes in the event of a problem. Accounting for that scenario doubles your time window.

The risks

Not only is it time-consuming, but handling that much Personally Identifiable Information (PII) is very risky. You don’t want to expose any user data. You don’t want to miss any users. You don’t want to block development any longer than necessary while running the migration. You don’t want to make a mistake that prevents a user from securely authenticating to your system (as it turns out, we did just that).

The strategy

There are three basic steps to SMS MFA. First, the user, already signed in to their account, gives you their phone number. Second, a code is sent via SMS to that number. Third, the user proves they own the device by entering the code before it expires.

In order to accomplish this fairly large migration, we broke the project down into three broad phases:

Update the code so all new SMS MFA signups would use Twilio Verify
Migrate existing users
Remove older references to the Authy API from the codebase entirely

Stage one: add Twilio Verify

This stage was fairly simple. We had to:

switch our SMS setup methods to use Verify instead of Authy
split SMS send into: Verify or Authy depending on the user setup
split SMS verify into: Verify or Authy depending on the user setup

SMS setup updates

Since we didn’t want newly enrolled SMS MFA users to use the older API, we removed Authy from the SMS setup code entirely. All new setups for SMS MFA would go through the Twilio Verify API. To do this, we had to add a new column to our DB: the user's SMS phone number. Previously, this was sent directly to Authy so we never had to store it. But, since this is extremely private user information, we wanted to encrypt it at the row level of the database.

We ended up with a solution where users give us their phone number, we validate that it’s a North American phone number using the Twilio Lookup API, encrypt it, and store it in the database. When a user attempts to sign in with SMS MFA, our internal API retrieves the number from the database, decrypts it, and sends it to Twilio Verify. All of this takes place within our authentication service, so the user’s phone number isn’t passed around internally, and the unencrypted value only ever exists in memory.
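To make the encrypt-then-store step concrete, here is a minimal sketch using Node's built-in crypto module. The article does not say which cipher, key management, or storage format the team used, so AES-256-GCM, the helper names, and the iv + tag + ciphertext layout are all assumptions for illustration.

```javascript
const crypto = require('crypto');

// Assumed cipher choice; the article only says the number is encrypted per row.
const ALGORITHM = 'aes-256-gcm';

// Hypothetical helper: returns one base64 string laid out as iv + auth tag + ciphertext.
function encryptPhoneNumber(phoneNumber, key) {
  const iv = crypto.randomBytes(12); // fresh 96-bit nonce for every row
  const cipher = crypto.createCipheriv(ALGORITHM, key, iv);
  const ciphertext = Buffer.concat([cipher.update(phoneNumber, 'utf8'), cipher.final()]);
  const tag = cipher.getAuthTag(); // 16 bytes by default
  return Buffer.concat([iv, tag, ciphertext]).toString('base64');
}

// Hypothetical helper: reverses encryptPhoneNumber, throwing if the key or data is wrong.
function decryptPhoneNumber(stored, key) {
  const raw = Buffer.from(stored, 'base64');
  const iv = raw.subarray(0, 12);
  const tag = raw.subarray(12, 28);
  const ciphertext = raw.subarray(28);
  const decipher = crypto.createDecipheriv(ALGORITHM, key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString('utf8');
}
```

With a 32-byte key, `decryptPhoneNumber(encryptPhoneNumber(number, key), key)` returns the original number, and the stored value differs on every call because the nonce is random. In a real service the key would come from a secrets manager rather than being generated in process.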

SMS send updates

This is where it starts to get complicated. In order to support users with both APIs, we needed to create some conditional code that checks for the existence of a stored phone number before deciding to use the new Verify API or the old Authy API. It does add to the overall complexity of the code, but it supports existing users, allows us to migrate users at our own pace, and makes it simple to revert a user if needed.
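A rough sketch of that conditional, assuming a user record with hypothetical `encryptedPhoneNumber` and `authyId` fields and a hypothetical `clients` object that wraps the two APIs (none of these names come from the Segment codebase):

```javascript
// Decide which API serves this user. A stored phone number means the user
// has been set up on, or migrated to, Verify.
function chooseSmsProvider(user) {
  if (user.encryptedPhoneNumber) return 'verify';
  if (user.authyId) return 'authy';
  throw new Error('User has no SMS MFA enrollment');
}

// `clients` is a hypothetical pair of wrappers: { verify: { send }, authy: { send } }.
function sendSmsCode(user, clients) {
  const provider = chooseSmsProvider(user);
  return clients[provider].send(user);
}
```

Keying the decision on the stored phone number is what makes a revert cheap in this sketch: clearing the column sends the user back down the Authy path as long as their Authy ID is still on record. The same check would apply on the verify side.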

SMS verify updates

The code changes for verifying the code a user received by SMS are nearly identical. If the user has a stored phone number, we check the code with Verify; if they don’t, we know to check it with Authy.

Stage two: migrate users

Initially, we considered migrating everyone at once. While it would be nice to get the entire thing done and move on, there would be a high risk of missing customers in this process. Without shutting down SMS MFA, there would be no way to ensure new users didn’t sign up with Authy in the middle of the migration. Given the sheer numbers, it would also be a lengthy process that could break down partway through. Exporting users from Authy would take several hours, because requests have to be throttled to stay under the rate limits. Importing those phone numbers into our production database would also take several hours to loop through the update statements. Catching errors and preventing timeouts during that process would require a lot of temporary scripting. It would be a lot of effort with a lot of downsides.

An alternative, which is the plan we initially went with, would be to only migrate users in code — no manual migrations in the database. There are many different ways this could be accomplished, depending on your timeline. It was important that we speed up the process, so we decided to migrate users in the code every time they had an interaction with SMS. This meant that on send, instead of defaulting to sending via Authy, we would:

export the single user’s phone number via the Authy Export API
update the user record in our database
send it using the Verify API
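The three steps above can be sketched roughly as follows. The `deps` functions are hypothetical wrappers around the Authy export call, the database update, and the Verify send; the real service code is not shown in the article.

```javascript
// Migrate-on-send, as described in the plan: only users without a stored
// phone number trigger an export. All `deps` members are hypothetical.
async function sendWithLazyMigration(user, deps) {
  let current = user;
  if (!current.encryptedPhoneNumber) {
    // Step 1: export this single user's phone number from Authy.
    const phoneNumber = await deps.exportFromAuthy(current.authyId);
    // Step 2: store it (encrypted) on the user record.
    current = await deps.savePhoneNumber(current, phoneNumber);
  }
  // Step 3: send the code through Verify.
  return deps.sendViaVerify(current);
}
```

Note that this sketch has no fallback: if the export call rejects, the whole send fails, which is the weak point of running the export inside the login path.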

In theory, this would be a seamless transition for the user. It would avoid issues with mass exporting and importing thousands of users at once. Since we were planning to split the code with conditional logic anyway, this seemed like a relatively small addition.

In practice, this failed miserably.

Hitting bumps in the road

The problem is that it all depends on the Authy Export API, which is not available to customers by default. This makes sense, because it allows you to programmatically access the Personally Identifiable Information of all your users. Even as a customer, you only get access for a limited time, and the rate limits for this specific endpoint are far more restrictive than for the rest of the API. You can only export a user three times within a month.

If our code had worked perfectly, this wouldn’t have been a problem. But perfect, completely bug-free code requires a lot of time and testing. Unfortunately, the limits on the export API left us with little of either. Just doing a sample run for a single user meant we only had two more attempts for that user within the next 30 days.

It was made more difficult by the complex conditional code needed to handle both versions of the API. We thought the code was ready for production. What ended up happening is that our users ran into errors but kept trying, which caused them to hit the rate limit very quickly. Once a user submits their credentials to log in, we automatically initiate the MFA logic, which in this case included a call to the export API. When this call failed because they had hit the rate limit of three requests per month, they were blocked from logging in or authenticating with a different method. The only solution was to manually fix their account to unblock them. It was not a good situation to be in.

With a lot more testing and more robust planning, I still think this plan could work. But ultimately, the limitations on the export API make it less than ideal for use in production. It is meant to be something you use once and never again.
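One illustration of what "more robust planning" might look like is to track export attempts per user and keep the user on Authy once the budget is spent, instead of blocking the login. This is not what the team shipped; the function names, the attempt log, and the rolling 30-day window are assumptions made for the sketch.

```javascript
const MAX_EXPORTS = 3; // per-user limit quoted above
const WINDOW_MS = 30 * 24 * 60 * 60 * 1000; // assumed rolling 30-day window

// `attempts` is a hypothetical list of timestamps (ms) of earlier export calls for one user.
function canExport(attempts, now) {
  const recent = attempts.filter((t) => now - t < WINDOW_MS);
  return recent.length < MAX_EXPORTS;
}

// Pick a path for this send without ever spending an export the user no longer has.
function planSend(user, attempts, now) {
  if (user.encryptedPhoneNumber) return 'verify';
  return canExport(attempts, now) ? 'migrate-then-verify' : 'authy';
}
```

Because unmigrated users still have a working Authy ID, falling back to `'authy'` would keep them able to log in while the migration for their account waits for a later attempt.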

A new plan

After reverting to only using Verify for new SMS MFA setups, we had to come up with a new plan. This was plan C. We’d keep the code split, but we’d run multiple migrations in the background to move existing users over to the new API.

As mentioned earlier, migrating our entire user base in one go was not a good idea because of the amount of time it would take. It was too risky to attempt to do it all at once. So we started with a pilot of internal users only, about 300. After that was a success, we moved on to a plan of migrating 1,000 users at a time, about once a week as scheduling permitted. This allowed us to continue to work on other features. We weren’t blocked trying to finish this migration. It allowed us to handle any issues that came up with a smaller subset of users, in a staggered approach. We weren’t dealing with bugs that affected all 24K users. Instead, any issue affected only 4,000 to 5,000 at once.

We did have to write some custom migration scripts in Node. For example, the script to export users from Authy needed to:

loop through an array of Authy IDs
make a single request for each one
encrypt the exported phone number
throttle the next request to prevent hitting the rate limit
be able to resume should the script time out or hit an error
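A minimal sketch of a script with those properties might look like this. The injected helpers (`exportOne`, `encrypt`, `sleep`, `checkpoint`) and the one-second default delay are hypothetical; the article does not show the real script or the actual rate limit.

```javascript
// Export phone numbers one Authy ID at a time, throttled and resumable.
// `checkpoint.load()` returns a Map of authyId -> encrypted phone number
// saved by an earlier run; `checkpoint.save(map)` persists progress.
async function exportUsers(authyIds, { exportOne, encrypt, sleep, checkpoint, delayMs = 1000 }) {
  const done = await checkpoint.load();
  for (const authyId of authyIds) {
    if (done.has(authyId)) continue; // resume: never re-export a finished user
    const phoneNumber = await exportOne(authyId); // a single request per ID
    done.set(authyId, encrypt(phoneNumber)); // keep only the encrypted value
    await checkpoint.save(done); // persist before moving on
    await sleep(delayMs); // throttle the next request
  }
  return done;
}
```

Saving the checkpoint after every user matters here because each user can only be exported three times within a month: a restart that repeated finished exports would burn through that allowance.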

Similarly, our database update script needed to be split into chunks of one thousand users to avoid timing out in the middle. So far, this approach has been very successful with only a few hiccups along the way.
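The chunking itself is simple to sketch. Here `applyBatch` is a hypothetical function that runs the update statements for one chunk; the real update script is not shown in the article.

```javascript
// Split an array into consecutive chunks of at most `size` items.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Apply updates one chunk at a time so no single run is long enough to time out.
async function updateInChunks(rows, applyBatch, size = 1000) {
  let updated = 0;
  for (const batch of chunk(rows, size)) {
    await applyBatch(batch);
    updated += batch.length;
  }
  return updated;
}
```

Processing chunks sequentially also means a failure stops at a known boundary, so the remaining chunks can be retried without redoing the ones that already succeeded.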

Conclusion

Now that we’re using the Twilio Verify API in production, we’re enjoying the benefits of the upgrade. We’re able to use the Node.js SDK to reduce the amount of custom code. We have access to expand our verification methods to Voice, WhatsApp, email, and more. The new Twilio Verify console provides much more detailed logs and insights so we’re better able to track and troubleshoot our usage of SMS MFA.

What did we learn from this process? First, upgrading to a new API is hard, especially when you combine it with migrating sensitive user data. Second, it’s always good to have a plan B and C. Spending time upfront on a Software Design Document that accounts for hitting these potential roadblocks will help mitigate disasters when they happen. Having several people review the plan will also make it robust. Measure twice, cut once.
