Why do you need Switchback for experimentation

Business is evolving fast. We, as engineers, build features we expect to improve business performance, but their actual impact is uncertain until measured. This is where experimentation comes in.

The key points of experimentation are:

Randomization of the sample into groups.
Independence of the groups.
Evaluation of the effect of the treatment in one or more target metrics.

If you develop software applications, you will likely have heard about A/B testing, which measures the impact of a change by comparing the performance of two versions of your application.

At Cabify, we run A/B experiments regularly. In the example shown below, users see different messages before ordering a journey. Those in the treatment group see a message like more than 6 minutes when the estimated wait time exceeds that threshold. The control group sees the standard message. The target metric here was the conversion rate. We wanted to check how much the ETA affected the user’s decision to request a journey. So we measured and compared the conversions of both groups.

ETA A/B experiment example

Although A/B testing is a powerful tool and something of a standard in the industry, there are situations where its limitations make it a bad choice. Notice that showing the ETA threshold to the treatment group does not affect the control group. Both groups are independent. What if we want to measure the impact on journey success rate (the percentage of journeys that complete successfully)? Journey success rate is a marketplace-level metric: it is affected by all users, not just those in one group. A higher conversion in the treatment group means more journeys created, which saturates the fleet and reduces supply for the control group, making their journeys less likely to complete. This cross-group interference is called cannibalization, and it makes A/B testing misleading for this type of metric. Switchback experimentation solves this issue.

Dependency example

Characteristics of Switchback testing

As its name suggests, a switchback experiment constantly switches all users between the two versions of the application. Unlike A/B testing, where group membership is fixed per user, here the group is determined by the time of the interaction, so a user may experience both versions over the course of the experiment.

Let’s go back to our example, this time using a switchback design, as illustrated below. We will define the group time interval as 53 minutes (a prime number, chosen to avoid cycling effects, as we will explain later). Once we deploy the feature, we will serve the treatment version of the application to all our users for 53 minutes. Then, we will serve the control version for 53 minutes. And so on. We can measure the performance during each interval for the whole experiment duration. Afterward, we will compare the performance of the two groups.

Switchback experiment example

This way, both groups remain independent because we are randomly taking snapshots of the marketplace. With some adjustments in the analysis and the experiment definition, which we will explain later, we are able to measure the impact on marketplace metrics.

Special considerations of a Switchback experiment

Although it might seem that the switchback experiment is a silver bullet, there are scenarios where it is not the best option. The obvious question is: should I use it for every experiment? The answer is no.

Take a look at the following table you can use to decide which type of experiment to use.

Question	Switchback	A/B
Are all users affected by the feature?	Yes, creating dependency	No, it can be isolated
Is the main metric at user-level or it is a global one?	Global metric for all user groups	Local metrics tied to the user group
Is randomization possible without interferences?	No, control and treatment would cannibalize each other	Yes, we can randomly get isolated groups of samples

So, the A/B test would be more effective to answer questions like “Do users who can directly check out make more orders?” On the other hand, switchback experiments are more effective to answer questions like “Is the marketplace more balanced when we assign the closest drivers to the riders?” or “What is the impact of applying a 10% price reduction for the regular ride-hailing product?”

Hence, the first thing to consider is which question you want to answer. Then, determine whether you can split users into independent groups.

If you find switchback experimentation a good fit for your experiment, there are some things you need to consider.

Experiment duration

Like any other kind of experiment, switchback experiments need a minimum duration to get enough samples. Otherwise, we cannot measure precisely the performance of the two groups.

This is no different from A/B testing. The statistics beneath the decision to stop the experiment are the same. This is where the Minimum Detectable Effect (MDE) comes into play. Basically, the tinier the difference you want to notice, the more samples you need and the longer the experiment will last. MDE is the smallest change in your metric that your experiment can detect with statistical significance.

To determine the MDE, you need to understand your baseline metrics and your expectations on statistical significance. Then, you can calculate the MDE using a calculator like this one. Theoretically, you can determine the sample size you need. However, there are three main factors to consider when setting the experiment duration:

The cost of running an experiment. The longer the experiment, the higher the cost.
The uncertainty regarding the sample generation rate. In the real world, the orders, bookings, etc. are not generated at a constant rate. So, the forecast of how long the experiment must last to reach the desired sample size will not be completely accurate.
Blocking experiments. Running several experiments at the same time is not always possible. You need to either block experiments or split them. For instance, in marketplace scenarios, it is common to split the population by region.

Lastly, take into account that, in switchback experiments, each interval is one sample, so you need longer experiments to get significant results.

Interval time

The defining characteristic of switchback experiments is that group assignment is based on time: all users are assigned to the same group for a fixed interval before switching. Therefore, our next question is: how long should this interval be?

Again, there is no magical number to solve this question. That said, the very nature of time-based group switching introduces a complication we need to address.

In our ride-hailing example, imagine we want to test a feature that increases the number of drivers available for a rider. We expect this to have a positive impact on the total number of journeys performed. If we are in the last minutes of the treatment interval, and a rider starts a journey, it may end after the control interval began. The rider will be affected by the control version. Moreover, the journey drop-off will be finished during the control group interval. So, even if the drop-off was made due to the treatment change, it is counted as a control success.

This effect of early treatment affecting later outputs is called Carryover effect. It affects the switchback interval duration because the more persistent the effect, the less frequently you should switch intervals (Bojinov, Simchi-Levi, and Zhao 2023). This affects our measurements, which is why we apply washout periods to remove the effect of the early treatment.

The interval also needs to be long enough to capture the full lifecycle of the actions you want to measure. For journey-level metrics, that means covering typical assignment and completion times; for shorter signals like driver acceptance rates, a shorter interval suffices.

With all of this said, a good interval duration should:

Contain samples from all possible day and hour combinations: peak hours, valley hours, etc.
Avoid cycling effects by distributing samples across all hours and days of the week.
Be coherent with the desired experiment duration. The higher the interval, the longer the experiment, because each interval is one sample.

Common choices are 58 minutes and 112 minutes. It is also common to use prime numbers like 53 minutes or 29 minutes, which work well to avoid those cycling effects. However, recent studies propose Poisson-distribution-based dynamical interval lengths to improve the accuracy of the analysis (Ruoxuan Xiong, Alex Chin and Sean J. Taylor, 2024).

Conclusion

Not every experiment fits the A/B model. When a feature affects all users at once and your key metrics operate at the marketplace level, group isolation is impossible, and A/B results will be misleading.

Switchback experimentation solves this by using time as the grouping dimension. It is a more complex tool though: interval length and experiment duration require deliberate planning, and the carryover effect needs to be accounted for in both the design and the analysis.

The decision is straightforward: if your metric is local to a user, use A/B. If it is global, consider switchback.

Provider	Purpose
Google Ads	A. Data processing based on consent: - Store or access information on a device - Build a personalised ad profile - Select personalised ads B. Based on legitimate interest: - Personalise content - Improve products - Measure ad performance - Select basic ads - Select personalised content - Use market research to generate audience insights Extra processing: - Match and combine offline data - Ensure security, prevent fraud and debug - Technically deliver ads or content - Link devices Google Advertising Products follow the IAB Transparency & Consent Framework. More in their Privacy Policy, business safety & privacy site and Terms of Service.
Facebook	Based on consent: store or access information on a device. Learn more in their Privacy Policy.
LinkedIn	Based on consent: - Store/access info - Build a personalised ad profile - Select personalised ads Based on legitimate interest: - Improve products - Measure ad performance - Select basic ads Extra processing: - Ensure security, prevent fraud and debug - Technically deliver LinkedIn content LinkedIn follows the IAB Framework. See their Privacy Policy.
X	Based on consent: store or access information on a device. More info in their Privacy Policy.
Taboola	Based on consent: store or access information on a device. Based on legitimate interest: - Personalise content and ads - Measure performance - Link devices - Improve products - Use offline data Taboola follows the IAB Framework. Read their Privacy Policy.
TikTok	Based on consent: - Store or access information - Build a personalised ad profile - Select personalised ads Based on legitimate interest: - Measure ad performance - Select basic ads - Improve products Extra processing: - Prevent fraud - Ensure security - Deliver ads TikTok follows the IAB Framework. See their Privacy Policy.
Microsoft Advertising	Uses the UET tag to track site usage and optimise campaigns. Helps personalise ads and measure effectiveness. Follows the IAB Framework. More in their Privacy Policy.
StackAdapt	Based on consent: uses cookies to uniquely identify users for retargeting, conversion tracking, and lookalike profiles. Tracks campaign performance and engagement. Follows the IAB Framework. Read their Privacy Policy.
Criteo	Based on consent: stores or accesses information on a device for personalised advertising and retargeting. Uses device identifiers and browsing data to show relevant ads based on your interests and previous shopping behaviour. This includes creating personalised advertising profiles and measuring ad performance. Criteo follows the IAB Transparency & Consent Framework. More info in their Privacy Policy.

Provider	Purpose
Google Analytics	Google’s tool for measuring site use. It uses cookies (like “_ga”) to track visits, without identifying individuals. Data may be used with advertising cookies to personalise and measure ads across Google and the web. More info. See also Google’s business safety & privacy site and Terms of Service.
Amplitude	Tracks how users navigate our site, what features they use, and what actions they take — all to help us improve. More info.
Microsoft Clarity	Behavioural analytics tool from Microsoft that uses heatmaps and session recordings to understand how users interact with our site. Tracks clicks, scrolling, mouse movements and navigation patterns to help us improve user experience and site design. More info.
Hotjar	Behavioural analytics tool that tracks user interactions like clicks and scrolling. It helps us improve usability and design. More info.

Why do you need Switchback for experimentation

Characteristics of Switchback testing

Special considerations of a Switchback experiment

Experiment duration

Interval time

Conclusion

Albert Rego

Bringing our culture to life through stories—discover it in our blog.

The CREP Experience: Building at scale as a fresh graduate

From Polling to gRPC Streaming at Cabify

From SQL Queries to Real-Time Search

Where is my domain?

Why do you need Switchback for experimentation

Cabify Ciudad: a proprietary, accessible, and expressive typeface

How design critiques elevate product quality at Cabify

An interview about the User Week