Oct 13, 2025
Upgrades are an inevitable part of a database's lifecycle, and when they arrive, they often present a delicate situation. At Cabify, we currently manage over 180 RDS MySQL 8.0 instances (in addition to other types of databases). These instances did not always run on MySQL 8; we transitioned from version 5.7, and in the future, they may evolve further.
Like many other companies, Cabify adheres to strict Service Level Objectives (SLOs). Even minor downtimes during migration can have a significant overall impact. Therefore, we must upgrade our databases with minimal downtime (spoiler alert: we do this without any downtime).
This article aims to explain how we achieve this and how we keep our database instances consistently updated without impacting the business.
NOTE: If you’re only interested in the technical details of how we upgrade without downtime, feel free to jump to this section.
Our infrastructure has been in constant evolution, and so have our databases. Back in 2020, all our infrastructure was running on GCP. There was no defined process for database upgrades or migrations, which meant we tried to avoid them as much as possible, accepting downtime when necessary. However, at that time a decision was made to migrate everything to AWS. The details of this migration deserve an article of their own. One of the critical points was: how can we migrate all our databases from one provider to another without downtime and without losing any data? No public tool performed this task easily back then. So we decided to design a new tool utilizing the functionalities of ProxySQL. Since then, this tool has been an integral part of our infrastructure team and has evolved to cover more use cases. Nowadays, we use it as the main tool for making MySQL upgrades without downtime.
To make correct decisions, it’s essential to evaluate the trade-offs, which cannot be done without proper context. Therefore, let’s begin by explaining our current scenario in detail.
At Cabify we group all our services into Tier Levels (from 1 to 3), with Tier 1 being the most critical one:
Databases are not an exception. These are the number of MySQL instances by tier and version:
Each tier level has its own SLOs, allowing some downtime. That classification drives some design decisions: for example, RDS MySQL 8 instances in Tier 1 and Tier 2 are configured in a Multi-AZ setup (a secondary instance in another AZ, ready to provide failover at all times), while Tier 3 instances are not. This allows us to provide a better SLI when a database instance fails, keeping our expenses controlled and focused where it matters most.
Certain Tier 1 instances can tolerate downtime, typically because they are associated with business branches that only operate in specific regions, resulting in periods of minimal or no traffic.
Currently, we see that there are 32 instances that cannot afford downtime without impacting the core use cases of the company:
As we can see, the issue is not just a technical one; it also involves management. It is crucial to understand the impact of each database to mitigate risks. Therefore, at Cabify, we empower product teams with the responsibility of creating and managing their own databases, as no one knows better than they do how to tag and define the databases appropriately, always keeping the business perspective in mind.
Not all RDS instances have the same resource requirements. Some may demand more CPU, while others may require more disk space. Migrating an instance with 1TB of data is quite different from migrating one with just a few MBs. Even AWS-managed rollouts vary considerably based on these factors, see:
Automatic upgrades incur downtime. The length of the downtime depends on various factors, including the DB engine type and the size of the database.
It’s also important to note that having Multi-AZ enabled significantly affects the upgrade process. When Multi-AZ is enabled, the rollouts of minor versions follow this procedure (doc):
Downtime during the upgrade is limited to the duration it takes for one of the reader DB instances to become the new writer DB instance.
Let’s explore now the main reasons why we typically need to upgrade our MySQL8 instances. There are essentially four primary motivations:
Parameter changes, such as enabling gtid-mode.

Some of our databases are configured with read-only replicas for various reasons (explaining when to use a read-only replica is beyond the scope of this article). This means that when an upgrade occurs, the resulting scenario will include the main instance along with all related read-only replicas, all configured with the new desired settings.
Another critical aspect that is often overlooked is vendor lock-in. In the past, Cabify successfully migrated all its infrastructure from GCP to AWS without downtime (the main reason we started developing this MySQL migration tool). The future is uncertain, which is why it’s essential to avoid dependency on a single provider. While this topic could warrant a separate article, in summary, at Cabify we treat vendor lock-in with our providers as a risk in the decision-making process, just like cost or availability. That allows us to ask questions like “how would we migrate all our MySQL instances out of AWS if necessary?” This solution mitigates that risk and provides us with the flexibility to choose a new provider if needed.
Now, let’s explore the strategies we can employ to upgrade our RDS MySQL8 instances.
This is a straightforward solution that works for us in most cases. It offers a significant benefit: there’s no manual work to do, as we let AWS handle the upgrade process, which translates to substantial savings in man-hours. Tier 3 instances are upgraded using this technique. Essentially, we wait for AWS to upgrade the instances when the version’s end-of-support date is reached, accepting the associated downtime. The expected downtime here is a few minutes, depending largely on the instance size.
The same applies to Tier 1 instances that can accept downtime, as well as to Tier 2 instances. However, keep in mind that these instances have Multi-AZ enabled, so downtime is reduced to the failover time, which should be less than one minute (usually around 30 seconds). This downtime could potentially be reduced to approximately one second by using RDS Proxy, although that requires additional implementation work. Spoiler alert: this is similar to our current approach for achieving zero downtime.
This is another viable alternative. It is fast and easy to implement, but it comes with some clear drawbacks:
You can find an AWS migration guide here.
AWS offers a blue-green deployment strategy, which creates a staging environment that stays in sync with the main instance. Once the new staging instance is ready, you can promote it. However, this process is not atomic and does require some downtime (typically under a minute, although it was just a few seconds in the demos we ran).
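For reference, here is a minimal sketch of driving this flow through the AWS API with boto3; the identifiers, ARN, region and target engine version are illustrative, not our real values:

```python
import boto3

rds = boto3.client("rds", region_name="eu-west-1")  # illustrative region

# Create the green (staging) environment that replicates from the source instance.
bg = rds.create_blue_green_deployment(
    BlueGreenDeploymentName="orders-db-upgrade",                      # illustrative
    Source="arn:aws:rds:eu-west-1:123456789012:db:orders-db",         # illustrative ARN
    TargetEngineVersion="8.0.39",                                     # illustrative target
)["BlueGreenDeployment"]

# Once the green environment is available and caught up, the switchover promotes it.
# This is the non-atomic step that incurs the brief downtime mentioned above.
rds.switchover_blue_green_deployment(
    BlueGreenDeploymentIdentifier=bg["BlueGreenDeploymentIdentifier"],
    SwitchoverTimeout=300,  # seconds
)
```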
As of the time of writing, Terraform support for this approach is far from perfect (terraform doc). For example, if the upgrade fails for any reason, it does not clean up the resulting mess. Additionally, it does not support upgrading instances with read replicas, which is relevant for some of our databases.
In conclusion, while this method appears to be a promising tool for the future, it is not the best option for us at this time. Also, keep in mind the issue of vendor lock-in.
Now, let’s delve into the interesting part: how we handle Tier 1 MySQL RDS instance upgrades/migrations without downtime and without vendor lock-in.
Consider the following general overview (don’t worry, we’ll dive into the details in the sections that follow):
The main components involved in our solution are:
Coordinator: Orchestrates the migration, publishing the desired traffic-routing configuration at each step and verifying the state of the sync.
Connector: Continuously polls that configuration and applies it to ProxySQL.
ProxySQL: The proxy that all application traffic goes through, giving us full control over routing.
Data copy tooling for the initial sync (mysqldump or RDS snapshots).

We define two main steps in this approach:
The sync phase ensures that the new MySQL instance contains exactly the same data as the old one before we switch traffic over. This is critical for maintaining data consistency throughout the migration process. We have developed two distinct approaches to handle different database sizes and configurations:
Mysqldump Approach
This approach is suitable for smaller to medium-sized databases where the dump process can complete within a reasonable timeframe. The process works as follows:
Create a Read Replica: We begin by creating a read replica from the old instance. This allows us to perform the data dump without impacting the performance of the production instance that continues serving live traffic.
Execute Mysqldump: We perform a mysqldump operation, transferring all data from the read replica to the new instance. During this process, both the read replica and the new instance are isolated from production traffic, which continues to flow to the original instance.
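Here is a minimal sketch of this dump-and-load step wrapped in Python; hostnames are illustrative and the exact mysqldump flags we use may differ:

```python
import subprocess

# Illustrative endpoints; the read replica is created beforehand and carries no
# production traffic, so the dump does not impact the live instance.
# Credentials are expected via option files or environment and omitted here.
REPLICA = "orders-db-replica.example.internal"
NEW_INSTANCE = "orders-db-new.example.internal"

# Dump everything from the read replica...
dump = subprocess.Popen(
    ["mysqldump",
     "--host", REPLICA, "--user", "migrator",
     "--single-transaction",      # consistent snapshot without locking tables
     "--routines", "--triggers",
     "--all-databases"],
    stdout=subprocess.PIPE,
)

# ...and stream it straight into the new instance.
subprocess.run(
    ["mysql", "--host", NEW_INSTANCE, "--user", "migrator"],
    stdin=dump.stdout,
    check=True,
)
dump.stdout.close()
dump.wait()
```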
Snapshot Approach
For larger databases where mysqldump would take an impractical amount of time, we use the snapshot approach:
Create New Instance From Snapshot: We create the new instance directly from a snapshot of the old instance. This approach is significantly faster for large datasets as it leverages AWS’s underlying storage technology.
Important Limitation: The snapshot approach requires GTID_MODE to be enabled on both the old and new instances. Without GTID (Global Transaction Identifier), the new instance cannot determine the correct starting point for replication, making this approach unfeasible.
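A minimal sketch of the restore step with boto3; the identifiers, instance class and parameter group name are illustrative:

```python
import boto3

rds = boto3.client("rds", region_name="eu-west-1")  # illustrative region

# Snapshot the old instance and wait for the snapshot to become available.
rds.create_db_snapshot(
    DBInstanceIdentifier="orders-db",                 # illustrative
    DBSnapshotIdentifier="orders-db-migration",
)
rds.get_waiter("db_snapshot_available").wait(
    DBSnapshotIdentifier="orders-db-migration"
)

# Restore the new instance from that snapshot, applying the desired settings
# (e.g. a parameter group with gtid-mode enabled).
rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="orders-db-new",
    DBSnapshotIdentifier="orders-db-migration",
    DBInstanceClass="db.r6g.large",                   # illustrative
    DBParameterGroupName="mysql80-gtid-on",           # illustrative
    MultiAZ=True,
)
rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier="orders-db-new")
```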
Final Sync Configuration
Regardless of the approach used, the final step involves establishing replication between the old and new instances:
Configure Master-Slave Replication: The new instance is configured as a slave of the original instance. For setups without GTID enabled (only possible with the mysqldump approach), the coordinator retrieves the MasterLogFile and ReadMasterLogPos from the read replica to configure replication accurately.
Monitor Replication Lag: We continuously monitor the replication lag between the old and new instances. Once the coordinator verifies that the replication lag is zero, the sync phase is considered complete, and both instances are perfectly synchronized.
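A minimal sketch of how the coordinator could drive these two steps, assuming pymysql as the client library; hostnames, credentials and binlog coordinates are illustrative (RDS exposes replication control through the mysql.rds_* stored procedures):

```python
import time
import pymysql  # assumed client library

# Connect to the new instance (illustrative endpoint and credentials).
new = pymysql.connect(host="orders-db-new.example.internal",
                      user="migrator", password="replace-me", autocommit=True)

with new.cursor() as cur:
    # Point the new instance at the old one, using the binlog coordinates
    # previously read from the read replica (SHOW SLAVE STATUS).
    cur.execute(
        "CALL mysql.rds_set_external_master(%s, %s, %s, %s, %s, %s, %s)",
        ("orders-db.example.internal", 3306,
         "repl_user", "repl_password",
         "mysql-bin-changelog.000123", 4,  # illustrative coordinates
         0),                               # ssl_encryption off in this sketch
    )
    cur.execute("CALL mysql.rds_start_replication")

    # Wait until replication lag reaches zero before declaring the sync complete.
    while True:
        cur.execute("SHOW SLAVE STATUS")
        row = cur.fetchone()
        status = dict(zip([c[0] for c in cur.description], row))
        if status["Seconds_Behind_Master"] == 0:
            break
        time.sleep(5)
```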
This synchronization ensures that when we eventually switch traffic, no data will be lost, and the new instance will be an exact replica of the old one.
Refer to the following flow diagram for the sync process of the mysqldump approach:
The swap phase is the most critical part of our zero-downtime migration process. With both instances perfectly synchronized, we now need to seamlessly redirect all database traffic from the old instance to the new one. This process requires precise coordination and timing to ensure no data loss or service interruption.
ProxySQL Connection: We route all traffic through ProxySQL (Connector) to the old instance. We ask teams to modify their service clients so that they point to ProxySQL. From the service perspective, nothing changes, but this setup gives us complete control over traffic routing as the migration proceeds.
Traffic Suspension: The coordinator temporarily sets the configuration to stop all traffic routing. The connector, which continuously polls this configuration, immediately updates ProxySQL to set all database weights to 0. This creates a brief period (typically a few milliseconds) where no new traffic is routed to either instance.
Important: This traffic suspension does not constitute downtime from the application’s perspective. ProxySQL queues incoming connections during this brief period, resulting in a slight latency increase rather than connection failures.
Route to New Instance: The coordinator updates the configuration to direct all traffic to the new instance. The connector detects this change and configures ProxySQL to route 100% of database traffic to the new instance, setting its weight to 1 while keeping the old instance’s weight at zero.
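A minimal sketch of what the connector runs against the ProxySQL admin interface for these two steps, again assuming pymysql; hostnames, credentials and the single-hostgroup layout are illustrative:

```python
import pymysql  # assumed client library

# ProxySQL admin interface (default port 6032); credentials are illustrative.
admin = pymysql.connect(host="proxysql.example.internal", port=6032,
                        user="admin", password="replace-me", autocommit=True)

def set_weights(old_weight: int, new_weight: int) -> None:
    """Update routing weights and push the change to ProxySQL runtime."""
    with admin.cursor() as cur:
        cur.execute("UPDATE mysql_servers SET weight=%s WHERE hostname=%s",
                    (old_weight, "orders-db.example.internal"))
        cur.execute("UPDATE mysql_servers SET weight=%s WHERE hostname=%s",
                    (new_weight, "orders-db-new.example.internal"))
        cur.execute("LOAD MYSQL SERVERS TO RUNTIME")
        cur.execute("SAVE MYSQL SERVERS TO DISK")

# Traffic suspension: no new queries are routed while the last transactions
# replicate and replication is stopped on the old instance.
set_weights(old_weight=0, new_weight=0)

# Route everything to the new instance.
set_weights(old_weight=0, new_weight=1)
```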
WARNING: The swap process requires careful handling of existing ProxySQL connections. Active transactions, table locks, or long-running queries may prevent some connections from being immediately redirected to the new instance. These “sticky” connections could continue executing commands on the old instance even after replication has stopped, potentially causing data loss.
To mitigate this risk, we verify that all connections have multiplexing enabled before doing the switch.
For more details on this behavior, refer to the ProxySQL multiplexing documentation.
The entire swap process typically completes within a few hundred milliseconds, during which applications experience only a slight increase in database connection latency rather than any actual downtime. This approach allows us to achieve true zero-downtime database migrations while maintaining complete data consistency.
Here is a complete flow diagram of the swap process:
While there are several alternatives available for database migrations, we sought a solution that minimizes downtime, avoids vendor lock-in, and provides the flexibility to migrate between instances across different cloud providers or database types (e.g. RDS vs Aurora). The combination of ProxySQL and a bit of custom code allows us to achieve this.
The challenges of database migrations are far from over. We aim to empower product teams to upgrade their MySQL databases independently; however, that is a story for the future.