### Changing assignment weights with time-based confounders

by ALEXANDER WAKIM

There are two common reasons assignment weights may change during an OCE. The first is a strategy called

Although there are many ways to do ramp-up, we assume that once a user is assigned an arm, they will stay assigned to this arm for the duration of the experiment. When assignment weights change in a ramp-up experiment, there are periods of constant assignment weights that we define as epochs. We consider the assignment weights for each epoch to be the proportion of previously unassigned users that are assigned to each arm upon first visit during that epoch. The assignment weights within epochs may or may not add up to 100%. If they do not add up to 100%, we call this a partial traffic ramp-up and a user who visits during that epoch may be unassigned and could be assigned on a future visit in a later epoch. When they add up to 100%, we call this a full traffic ramp-up.

There is a lot of consideration that ought go into the design and analysis of a specific ramp-up strategy. Describing the precise decision criteria for different ramp-up designs is out of scope for this post. Instead, we focus on the case where an experimenter has decided to run a full traffic ramp-up experiment and wants to use the data from all of the epochs in the analysis. The fantasy football and video hosting examples, which we will discuss in more detail later, highlight situations where this design might be considered, despite potential complexity in the analysis.

Complexity arises in the presence of time-based confounders, when taking an unweighted average across data that is aggregated across epochs can lead to biased estimates (whether it’s a partial or full traffic ramp-up strategy). At a high level, this is because changing the assignment weights causes the distribution of the time-based confounder in each experiment arm to be different than the overall population. Using the fantasy football example, we will simulate a situation where the bias is so strong that an inferior decision on which arm to deploy is made. We will offer a weighted average solution that allows an experimenter to do full traffic ramp-up and aggregate data across epochs to get an unbiased estimate of the effect of each arm.

Sometimes a MAB is run to maximize reward during an OCE with the goal of eventually ending the MAB, determining which arm is better, and deploying the better arm. Other times a MAB is run perpetually to maximize reward and not to necessarily find a winning arm. Just as in ramp-up, making inferences while ignoring the complexity of time-based confounders that are present can lead to biased estimates. Moreover, certain MAB algorithms can be counterproductive towards maximizing reward if assumptions aren't met. This post will discuss how to use data from a MAB to get unbiased estimates. We will offer strategies, considerations, and things to think about when running a MAB to maximize reward.

However, when there are changing assignment weights, then an unweighted average of data across the epochs can be a biased estimate. The reason is that changing assignment weights introduces a dependence between time of first visit and probability of being assigned to an arm. If time of first visit is also correlated with reward, this introduces bias. Time of first visit and reward may be correlated due to time-based confounders, which we define as a variable that is correlated with assignment time and reward. To summarize, changing assignment weights causes the distribution of the time-based confounder for those assigned to each arm to be unrepresentative of the overall population of users who visit during the experiment. Possible time-based confounders include day-of-week, time-of-day, or frequent vs. infrequent users.

When there are changing assignment weights and time-based confounders, this complication must be considered either in the analysis or the experimental design. An option that can sometimes work is to use only the data from users assigned in the final epoch. This leads to valid inference so long as the final epoch is run long enough to be representative with respect to the time-based confounder. If the time-based confounder is something like day-of-week, then making sure the final epoch is run for at least a full 7-day cycle can do the trick. However, if the confounder is frequent/infrequent user, this option may not work if one is unwilling to reassign users as they ramp-up (there may be statistical or product reasons for not wanting to do this). This is because users assigned in the final epoch are less likely to be frequent users than the users who visit at any point in the experiment. In the case of full traffic ramp-up, users assigned in the final epoch must not have visited in earlier epochs and hence are less likely to be frequent users. In the case of partial traffic ramp-up, a proportion of users who visited during earlier epochs were assigned in those earlier epochs. If the proportion is large, users assigned in the final epoch are a lot less likely to have visited in earlier epochs. If the proportion is small, users assigned in the final epoch are only slightly less likely to have visited in earlier epochs and the bias may have little practical significance.

Using data from the final epoch is just one way to analyze experiments that use ramp-up strategies in the presence of time-based confounders — other experimental designs or ways to analyze the data may also provide valid inference. With that being said, full traffic ramp-up and combining data across epochs would sometimes be attractive, provided one could get unbiased estimates. We offer two examples where this may be the case.

For the first example, consider a small website that is a platform for content on personal finance. Imagine you run this site and want to test a new video hosting feature to hopefully increase engagement on your site. How willing your users are to engage with personal finance content depends on whether or not it’s the weekend. Here, day-of-week is a time-based confounder.

Although increasing engagement is the goal of this new video hosting feature, you also want to make sure your small website can handle this. So you initially give this new feature an assignment weight of 25%, then if all goes well 50%, then eventually 75%. Since so many users were given the feature in earlier epochs, there’s a lot of information being left on the table if you were to ignore these epochs. So combining data from epochs is attractive.

For the second example, we will go into more detail in the next section.

Imagine you are experimenting on a fantasy football site and you want to show some advanced player statistics in order to aid users in picking the players to play that week. This change gives users a teaser of what they would get if they signed up for a premium account. The goal of this change is to increase the number of non-premium users who sign up for a premium account. You decide to use a ramp-up strategy to mitigate the risk of your non-premium users having strong preferences for the traditional player statistics that are currently shown. However, ramp-up adds complications that you ought to consider. For example, some of your users visit frequently to tinker with their lineups, check scores, check rankings, etc. and other users visit infrequently. Since the frequent users visit more often, it’s more likely for their first visit to be earlier in the experiment. The frequent vs. infrequent user variable is a time-based confounder if, in addition to being correlated with time of first visit, it has an effect on the probability of signing up for a premium account.

Fantasy football follows a weekly schedule based on the real football games that are played on Thursday, Sunday, and Monday. We call a Tuesday through Monday cycle a “football week” to distinguish it from an arbitrary 7 day period that could span two football weeks. Due to the weekly schedule, both frequent and infrequent users will likely visit at least once during a football week, with each football week's visitors being representative of the overall population of users the site is interested in. So when running an experiment on a fantasy football site, it’s a good idea to consider data from a complete football week in order to have a representative sample of the population.

If you wanted to mitigate risk and do ramp-up, using full traffic ramp-up and combining data across epochs is an attractive option in part due to the weekly schedule of fantasy football. With this option, you could do ramp-up and run the experiment for just one football week. If instead you modified the ramp-up design and considered only the final epoch, earlier epochs would need to be planned such that the final epoch ran for a full football week. Although not insurmountable, this does add some complexity in the experimental design and could cause the experiment to take meaningfully longer than a week to finish.

We simulate data using this fantasy football example to illustrate how naive aggregation of data in ramp-up can lead to biased estimates. We do the exact same simulation twice, but in one simulation the hypothetical experimenter changes the assignment weights part way through and in the second they don’t. We will see that the bias caused by naive aggregation with changing assignment weights can be so strong that it changes the outcome of the experiment.

In this simulation, all users who visit during the experiment are randomly assigned an arm on their first visit and stay assigned to that arm for the duration of the experiment. The experiment will run for a full football week in order to get a representative sample and to make a decision before next week's games. We simulate 2,000,000 users who visit during the experiment. The number of simulated users is large enough so that the simulation gives nearly identical answers each time it's run. For this reason we don’t report uncertainty measures or statistical significance in the results of the simulation. Time of first visit and time of assignment for user i are represented by the same variable $Z_i$ and is simulated by a uniform distribution $$

Z_i \sim \mathrm{Uniform}(0, 1)

$$ where experiment duration is normalized to be between $0$ and $1$.

Let $A_i =1$ if user $i$ is a “frequent user” and $A_i =0$ otherwise. A “frequent user” is someone who frequently visits the website, so their first visit is more likely to come earlier within the football week than “infrequent users”. Hence, the frequent users are more likely to be assigned earlier. The relationship between these variables is simulated by $$

A_i \sim \mathrm{Bernoulli}(1- Z_i)

$$A user can sign up for a premium account only once. Whether or not user $i$ converts during the experiment is simulated by the probabilities in Fig 3.

Notice that the new arm performs better on frequent users and the current arm performs better on infrequent users. This is not an example of Simpson’s paradox since one arm doesn't uniformly perform better on both frequent and infrequent users.

The experimenter in this simulation wants to make sure the new arm isn’t going to alienate users who perhaps have strong preferences for the current player statistics that are shown. In order to mitigate risk, the experimenter decides to “play it safe” and not assign too many users to the new arm. So the experimenter decides to initially assign 90% of users to the current arm and 10% to the new arm.

Half-way through the experiment, early data suggests that the new arm isn’t a complete disaster. Consider two hypothetical experimenters: one that ramps-up the assignment weights and one that does not. These two experimenters will use the simulated data as described above.

T_i \sim \mathrm{Bernoulli}(0.1)\quad\mbox{for all $Z_i$}

$$ where $T_i=0$ means user i is assigned to the current arm and $T_i=1$ if assigned to the new arm. In this scenario, the following are sample conversion rates at the end of the experiment:

Here the experimenter decides to keep the current arm.

T_i \sim \mathrm{Bernoulli}(0.1)\quad\mbox{if $Z_i<0.5$}\\

T_i \sim \mathrm{Bernoulli}(0.5)\quad\mbox{if $Z_i>0.5$}

$$ where $T_i=0$ means user $i$ is assigned to the current arm and $T_i=1$ if assigned to the new arm. Notice that there is a dependence between which arm user $i$ is assigned to and when they are assigned. However, when they are assigned is also correlated with which arm they prefer. Thus the resulting sample conversion rates will be biased. Indeed, in our simulations we get the following sample conversion rates at the end of the experiment:

Here, the experimenter ends up deploying an inferior arm. The bias caused by changing assignment weights and not considering this in the analysis changes the experimenter’s decision.

To get an unbiased estimate for the average reward in arm $t$, we can infer the arm’s reward in each epoch and take a weighted average across the epochs with equal weights in the arms $$

\hat{\theta}_{t, \mathrm{adjusted}} = \sum_j \hat{E}[Y_i | T_i=t, E_i=j] P(E_i=j)

$$Here $Y_i$ is the observed reward for user $i$ and $\hat{E}$, $\hat{P}$ indicate estimates of the corresponding expected value and probability. Within an epoch, assignment weights are constant and so a sample mean using data from the appropriate epoch and arm is an unbiased estimate of $E[Y_i | T_i=t, E_i=j]$. Notice that the weights $P(E_i=j)$ do not depend on the assigned arm, so should be estimated from all users regardless of arm assignment. Intuitively the adjusted estimator reweights each epoch estimate in proportion to the percentage of the overall population that arrived in that epoch. Contrast this with an unadjusted estimate where each arm’s sample mean is only representative of the population of users assigned to the given arm. These populations are not the same across arms when there are changing assignment weights and time-based confounders.

The average treatment effect (ATE) is formally defined as the expected difference in potential outcomes between users assigned to different arms $$

ATE =E[Y_i(1)] - E[Y_i(0)]

$$ATE is important in experiments because it tells us about the causal effect of each arm. In a changing assignment weights scenario, we must rely on two assumptions to show that our adjusted estimate is unbiased for ATE [5]. The first is called consistency,

Consistency generally means that users assigned to an arm will experience the assigned arm and no other arm. In the simulation we are assuming that users can only see the version of the website to which they are assigned - hence, consistency. The second assumption is called conditional ignorability

This post discusses the solution in the context of an experiment where random assignment and reward both happen at the user level. The solution generalizes trivially if both random assignment and reward happen at some other unit of analysis. In a more complicated experiment where random assignment is at the user level and reward can happen several times for a user, the solution also generalizes.

We can formally define Thompson sampling in the case of a two-arm experiment. Let $T_i=1$ if user $i$ is assigned to arm 1 and $T_i=0$ if assigned to arm 0. With Thompson sampling, $T_i$ is determined by $$

T_i \sim \mathrm{Bernoulli}(\Phi_i)\\

\Phi_i = P(\theta_1> \theta_0 | Y_1 \cdots Y_{i-1})

$$where $\theta_0$ and $\theta_1$ are the average rewards for arm 0 and arm 1 respectively and $\Phi_i$ is the posterior probability that arm 1 has better reward than arm 0 given all the data observed thus far. Notice that this probability will change with each additional sample, so assignment weights will change many times throughout the MAB.

Although there are a wide variety of MAB algorithms, many MAB algorithms are like Thompson sampling and require models to produce probabilities of each arm being the best. These MAB algorithms are great at maximizing reward when the models are perfectly specified and probabilities are accurate. However, these probabilities can be sensitive to model misspecification. Inaccurate probabilities can cause the MAB to do a poor job of maximizing future reward. This is because the MAB maximizes future reward by using previous users to infer the reward of future users as if they were assigned to each arm. If there are time-based confounders, then users who visit earlier will not be representative of later users. If the model is missing these time-based confounders then the inference will be poor and the MAB will do a poor job of maximizing future reward.

Often a MAB will be run to maximize reward during an experiment. Here the MAB eventually ends, a winner is declared, and the winning arm is deployed. If one does inference to determine the winning arm while ignoring the complexity of time-based confounders that are present, then the inference will be biased. The reason is much the same as in ramp-up. Changing assignment weights causes the distribution of time-based confounders to be systematically different across arms. Since time-based confounders affect reward, not considering time-based confounders that are present will produce biased results on the effect of each arm.

We continue with the fantasy football example and see that if a model omits a time-based confounder then MAB algorithms like Thompson sampling can be counterproductive towards maximizing reward. We will also use the example to illustrate how the naive model can lead to inferior decisions on which arm to deploy at the end of the MAB.

The MAB experimenter will use Thompson sampling. Since Thompson sampling requires Bayesian posteriors, the hypothetical experimenters will infer the conversion rate for each arm j with the following simple Bayesian model\begin{align*}

Y_i &\sim \mathrm{Bernoulli}(\theta_j)\\

\theta_j &\sim \mathrm{Beta}(1, 1)

\end{align*}Although the model is perhaps overly simple in the real world, remember that the data generating process in the simulation is also quite simple. The data generating process differs from the model only by the omission of a single variable (frequent vs. infrequent user) that is correlated with both assignment time and reward. Omitting confounding variables isn’t usually a concern in a traditional A/B test because the law of large numbers and random assignment often make these variables equally distributed among the arms. As we'll see in the results, this intuition doesn't apply with time-based confounders in MABs, since the assignment depends on time.

Not only does the MAB experimenter usually pick the wrong arm, we see in Fig 6 that the MAB experimenter also has fewer total conversions than the traditional A/B experimenter in 267 out of 300 simulations.

*The bimodality in Fig 5 and Fig 6 is caused by whether or not the MAB overconfidently goes into full “exploit” mode by sending nearly all users to the new arm (the preferred arm among frequent users). Occasionally the MAB “explores” just enough to send users to the current arm and eventually becomes confident that it is the winning arm (the preferred arm among infrequent users).

P(Y_i(t) | \theta )= \int P(Y_i | Z_i=z, T_i=t, \theta) P(Z_i=z | \theta) dz

$$where are model parameters. Conditioning on continuous assignment time Zi, rather than discrete assignment epoch, is necessary because the assignment weights can change with each additional sample in a MAB. Notice that this means a model that conditions on continuous assignment time is required. Conditioning on continuous assignment time is much more difficult in practice than conditioning on discrete epoch. For example, what happens if one doesn’t have a good understanding of how continuous assignment time impacts reward? Will the inference still be solid?

This solves the bias problem but doesn’t fix the problem of maximizing reward during the experiment when the model omits time-based confounders. Since maximizing reward is the reason to use a MAB, fixing the bias problem isn’t reason enough to run a MAB if it can’t also maximize reward. For maximizing reward during the experiment when there are time-based confounders, one should consider a class of MABs called restless multi-armed bandits which don’t assume reward is constant over time [8]. However, restless multi-armed bandits add new assumptions about how reward changes over time. Considering how counterproductive things can be when assumptions are not met in Thompson sampling, one should thoroughly understand and test these assumptions before implementing.

When using MABs, an experimenter needs to critically assess whether or not the model is a good fit to the data at hand. Often simple models won’t work. Even when the model is mostly correct, but is missing just one time-based confounder, the MAB can lead to both bad inference and inferior reward. Moreover, if a MAB ends and one of the arms in consideration is deployed, then the MAB is being used like an A/B test. Given all this, perhaps one should focus on valid inference rather than taking on the complexities required to also maximize reward.

We offer a solution for an unbiased estimator for the average treatment effect (ATE) when an experimenter does full traffic ramp-up and aggregates data across epochs in the presence of changing assignment weights and time-based confounders. The solution is a straightforward weighted average of reward during epochs that allows the experimenter to use data across the entire experiment. For MABs, the strategy of conditioning on assignment time is a bit trickier and doesn’t fix the problem of maximizing reward. With this in mind, one needs to consider why they are using a MAB. If the goal is really to maximize reward, one should thoroughly understand and critically test the assumptions of the MAB algorithm and any models the algorithm may require before implementing. If the goal is to eventually deploy one of the arms in consideration, perhaps one shouldn’t be taking on the complexities required for using a MAB.

Although this blog post makes some specific points about changing assignment weights in an A/B experiment, there is a more general takeaway as well. A/B testing isn’t simple just because data is big — the law of large numbers doesn’t take care of everything! Even with big data, A/B tests require thinking deeply and critically about whether or not the assumptions made match the data. Simply trusting methods that have been used in the past, either by the experimenter or the industry, can often lead experimenters astray.

[2] Scott, Steven L. "Multi‐armed bandit experiments in the online service economy." Applied Stochastic Models in Business and Industry 31.1 (2015): 37-45.

[3] Hill, Daniel N., et al. "An efficient bandit algorithm for realtime multivariate optimization." Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017.

[4] Audibert, Jean-Yves, and Sébastien Bubeck. "Best arm identification in multi-armed bandits." 2010.

[5] Imbens, Guido W., and Donald B. Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.

[6] Thompson, William R. "On the likelihood that one unknown probability exceeds another in view of the evidence of two samples." Biometrika 25.3/4 (1933): 285-294.

[7] Thompson, William R. "On the theory of apportionment." American Journal of Mathematics 57.2 (1935): 450-456.

[8] Whittle, P. “Restless Bandits: Activity Allocation in a Changing World.” Journal of Applied Probability, vol. 25, no. A, 1988, pp. 287–298., doi:10.2307/3214163.

*Ramp-up and multi-armed bandits (MAB) are common strategies in online controlled experiments (OCE). These strategies involve changing assignment weights during an experiment. However, if one changes assignment weights when there are time-based confounders, then ignoring this complexity can lead to biased inference in an OCE. In the case of MABs, ignoring this complexity can also lead to poor total reward,**making it counterproductive towards its intended purpose. In this post we discuss the problem, a solution, and practical considerations.*

## Background

### Online controlled experiments

An online controlled experiment (OCE) randomly assigns different versions of a website or app to different users in order to see which version causes more of some desired action. In this post, these “versions” are called

**arms**and the desired action is called the**reward**(arms are often called “treatments” and reward is often called the “dependent variable” in other contexts). Examples of rewards for an OCE include the amount of product bought, the number of people who sign up for a newsletter, the number of people who start an account, etc. This post considers a common design for an OCE where a user may be randomly assigned an arm on their first visit during the experiment, with**assignment weights**referring to the proportion that are randomly assigned to each arm.There are two common reasons assignment weights may change during an OCE. The first is a strategy called

**ramp-up**and is advised by many experts in the field [1]. The second common reason is multi-armed bandit algorithms (MAB) that maximize reward by assigning more users to a winning arm sooner in order to take advantage of it sooner.

### Ramp-up

Ramp-up is when the experiment initially gives a small assignment weight to the new arm and, as the experiment continues, increases the new arm's assignment weight. One reason to do ramp-up is to mitigate the risk of never before seen arms. For example, imagine a fantasy football site is considering displaying advanced player statistics. A ramp-up strategy may mitigate the risk of upsetting the site’s loyal users who perhaps have strong preferences for the current statistics that are shown. Another reason to use ramp-up is to test if a website's infrastructure can handle deploying a new arm to all of its users. For example, consider a smaller website that is considering adding a video hosting feature to increase engagement on the site. The website wants to make sure they have the infrastructure to handle the feature while testing if engagement increases enough to justify the infrastructure.

Although there are many ways to do ramp-up, we assume that once a user is assigned an arm, they will stay assigned to this arm for the duration of the experiment. When assignment weights change in a ramp-up experiment, there are periods of constant assignment weights that we define as epochs. We consider the assignment weights for each epoch to be the proportion of previously unassigned users that are assigned to each arm upon first visit during that epoch. The assignment weights within epochs may or may not add up to 100%. If they do not add up to 100%, we call this a partial traffic ramp-up and a user who visits during that epoch may be unassigned and could be assigned on a future visit in a later epoch. When they add up to 100%, we call this a full traffic ramp-up.

Fig 1. A simple example of a full traffic ramp-up strategy where 90% of users are assigned the current arm and 10% are assigned the new arm in epoch 1. In epoch 2, 50% of previously unassigned users are assigned each the current and new arm. |

There is a lot of consideration that ought go into the design and analysis of a specific ramp-up strategy. Describing the precise decision criteria for different ramp-up designs is out of scope for this post. Instead, we focus on the case where an experimenter has decided to run a full traffic ramp-up experiment and wants to use the data from all of the epochs in the analysis. The fantasy football and video hosting examples, which we will discuss in more detail later, highlight situations where this design might be considered, despite potential complexity in the analysis.

Complexity arises in the presence of time-based confounders, when taking an unweighted average across data that is aggregated across epochs can lead to biased estimates (whether it’s a partial or full traffic ramp-up strategy). At a high level, this is because changing the assignment weights causes the distribution of the time-based confounder in each experiment arm to be different than the overall population. Using the fantasy football example, we will simulate a situation where the bias is so strong that an inferior decision on which arm to deploy is made. We will offer a weighted average solution that allows an experimenter to do full traffic ramp-up and aggregate data across epochs to get an unbiased estimate of the effect of each arm.

### Multi-armed bandits

Multi-armed bandits (MAB) are a class of algorithms that maximize reward by assigning more users to better performing arms sooner in order to take advantage of them sooner. MAB algorithms are popular across many of the large web companies. Companies like Google [2], Amazon [3], and Microsoft [4] have all published scholarly articles on this topic.Sometimes a MAB is run to maximize reward during an OCE with the goal of eventually ending the MAB, determining which arm is better, and deploying the better arm. Other times a MAB is run perpetually to maximize reward and not to necessarily find a winning arm. Just as in ramp-up, making inferences while ignoring the complexity of time-based confounders that are present can lead to biased estimates. Moreover, certain MAB algorithms can be counterproductive towards maximizing reward if assumptions aren't met. This post will discuss how to use data from a MAB to get unbiased estimates. We will offer strategies, considerations, and things to think about when running a MAB to maximize reward.

Fig 2. Multi-armed bandit algorithms are often thought of by imagining a gambler sitting at several slot machines, deciding which machine to play in order to maximize winnings. |

## Naive aggregation of data in ramp-up can lead to biased estimates

In an OCE, the average reward for each arm is often estimated by the average of the observed outcomes in that arm. In an OCE with constant assignment weights and a representative sample, this is an unbiased estimator.However, when there are changing assignment weights, then an unweighted average of data across the epochs can be a biased estimate. The reason is that changing assignment weights introduces a dependence between time of first visit and probability of being assigned to an arm. If time of first visit is also correlated with reward, this introduces bias. Time of first visit and reward may be correlated due to time-based confounders, which we define as a variable that is correlated with assignment time and reward. To summarize, changing assignment weights causes the distribution of the time-based confounder for those assigned to each arm to be unrepresentative of the overall population of users who visit during the experiment. Possible time-based confounders include day-of-week, time-of-day, or frequent vs. infrequent users.

When there are changing assignment weights and time-based confounders, this complication must be considered either in the analysis or the experimental design. An option that can sometimes work is to use only the data from users assigned in the final epoch. This leads to valid inference so long as the final epoch is run long enough to be representative with respect to the time-based confounder. If the time-based confounder is something like day-of-week, then making sure the final epoch is run for at least a full 7-day cycle can do the trick. However, if the confounder is frequent/infrequent user, this option may not work if one is unwilling to reassign users as they ramp-up (there may be statistical or product reasons for not wanting to do this). This is because users assigned in the final epoch are less likely to be frequent users than the users who visit at any point in the experiment. In the case of full traffic ramp-up, users assigned in the final epoch must not have visited in earlier epochs and hence are less likely to be frequent users. In the case of partial traffic ramp-up, a proportion of users who visited during earlier epochs were assigned in those earlier epochs. If the proportion is large, users assigned in the final epoch are a lot less likely to have visited in earlier epochs. If the proportion is small, users assigned in the final epoch are only slightly less likely to have visited in earlier epochs and the bias may have little practical significance.

Using data from the final epoch is just one way to analyze experiments that use ramp-up strategies in the presence of time-based confounders — other experimental designs or ways to analyze the data may also provide valid inference. With that being said, full traffic ramp-up and combining data across epochs would sometimes be attractive, provided one could get unbiased estimates. We offer two examples where this may be the case.

For the first example, consider a small website that is a platform for content on personal finance. Imagine you run this site and want to test a new video hosting feature to hopefully increase engagement on your site. How willing your users are to engage with personal finance content depends on whether or not it’s the weekend. Here, day-of-week is a time-based confounder.

Although increasing engagement is the goal of this new video hosting feature, you also want to make sure your small website can handle this. So you initially give this new feature an assignment weight of 25%, then if all goes well 50%, then eventually 75%. Since so many users were given the feature in earlier epochs, there’s a lot of information being left on the table if you were to ignore these epochs. So combining data from epochs is attractive.

For the second example, we will go into more detail in the next section.

### Example: fantasy football website

Fantasy football is a virtual game where users pick professional football players to play on their team at the beginning of the season. Before each week's games, a user chooses a subset of these players to “play” that week. The better the picked football players perform in a real football game, the more points the user gets.Imagine you are experimenting on a fantasy football site and you want to show some advanced player statistics in order to aid users in picking the players to play that week. This change gives users a teaser of what they would get if they signed up for a premium account. The goal of this change is to increase the number of non-premium users who sign up for a premium account. You decide to use a ramp-up strategy to mitigate the risk of your non-premium users having strong preferences for the traditional player statistics that are currently shown. However, ramp-up adds complications that you ought to consider. For example, some of your users visit frequently to tinker with their lineups, check scores, check rankings, etc. and other users visit infrequently. Since the frequent users visit more often, it’s more likely for their first visit to be earlier in the experiment. The frequent vs. infrequent user variable is a time-based confounder if, in addition to being correlated with time of first visit, it has an effect on the probability of signing up for a premium account.

Fantasy football follows a weekly schedule based on the real football games that are played on Thursday, Sunday, and Monday. We call a Tuesday through Monday cycle a “football week” to distinguish it from an arbitrary 7 day period that could span two football weeks. Due to the weekly schedule, both frequent and infrequent users will likely visit at least once during a football week, with each football week's visitors being representative of the overall population of users the site is interested in. So when running an experiment on a fantasy football site, it’s a good idea to consider data from a complete football week in order to have a representative sample of the population.

If you wanted to mitigate risk and do ramp-up, using full traffic ramp-up and combining data across epochs is an attractive option in part due to the weekly schedule of fantasy football. With this option, you could do ramp-up and run the experiment for just one football week. If instead you modified the ramp-up design and considered only the final epoch, earlier epochs would need to be planned such that the final epoch ran for a full football week. Although not insurmountable, this does add some complexity in the experimental design and could cause the experiment to take meaningfully longer than a week to finish.

We simulate data using this fantasy football example to illustrate how naive aggregation of data in ramp-up can lead to biased estimates. We do the exact same simulation twice, but in one simulation the hypothetical experimenter changes the assignment weights part way through and in the second they don’t. We will see that the bias caused by naive aggregation with changing assignment weights can be so strong that it changes the outcome of the experiment.

In this simulation, all users who visit during the experiment are randomly assigned an arm on their first visit and stay assigned to that arm for the duration of the experiment. The experiment will run for a full football week in order to get a representative sample and to make a decision before next week's games. We simulate 2,000,000 users who visit during the experiment. The number of simulated users is large enough so that the simulation gives nearly identical answers each time it's run. For this reason we don’t report uncertainty measures or statistical significance in the results of the simulation. Time of first visit and time of assignment for user i are represented by the same variable $Z_i$ and is simulated by a uniform distribution $$

Z_i \sim \mathrm{Uniform}(0, 1)

$$ where experiment duration is normalized to be between $0$ and $1$.

Let $A_i =1$ if user $i$ is a “frequent user” and $A_i =0$ otherwise. A “frequent user” is someone who frequently visits the website, so their first visit is more likely to come earlier within the football week than “infrequent users”. Hence, the frequent users are more likely to be assigned earlier. The relationship between these variables is simulated by $$

A_i \sim \mathrm{Bernoulli}(1- Z_i)

$$A user can sign up for a premium account only once. Whether or not user $i$ converts during the experiment is simulated by the probabilities in Fig 3.

Fig 3. Conversion rates used in the simulation for frequent and infrequent users for the current and new arm. |

Notice that the new arm performs better on frequent users and the current arm performs better on infrequent users. This is not an example of Simpson’s paradox since one arm doesn't uniformly perform better on both frequent and infrequent users.

The experimenter in this simulation wants to make sure the new arm isn’t going to alienate users who perhaps have strong preferences for the current player statistics that are shown. In order to mitigate risk, the experimenter decides to “play it safe” and not assign too many users to the new arm. So the experimenter decides to initially assign 90% of users to the current arm and 10% to the new arm.

Half-way through the experiment, early data suggests that the new arm isn’t a complete disaster. Consider two hypothetical experimenters: one that ramps-up the assignment weights and one that does not. These two experimenters will use the simulated data as described above.

### Experimenter 1: No ramp-up

Experimenter 1 keeps the assignment weights constant throughout the duration of the experiment. This is simulated by $$T_i \sim \mathrm{Bernoulli}(0.1)\quad\mbox{for all $Z_i$}

$$ where $T_i=0$ means user i is assigned to the current arm and $T_i=1$ if assigned to the new arm. In this scenario, the following are sample conversion rates at the end of the experiment:

Current Arm: 0.0390

New Arm: 0.0371

### Experimenter 2: Ramp-up

Experimenter 2 decides to ramp-up the assignment weights to the new arm and assigns 50% of users to the current arm and 50% to the new arm in the second epoch. This is simulated by $$T_i \sim \mathrm{Bernoulli}(0.1)\quad\mbox{if $Z_i<0.5$}\\

T_i \sim \mathrm{Bernoulli}(0.5)\quad\mbox{if $Z_i>0.5$}

$$ where $T_i=0$ means user $i$ is assigned to the current arm and $T_i=1$ if assigned to the new arm. Notice that there is a dependence between which arm user $i$ is assigned to and when they are assigned. However, when they are assigned is also correlated with which arm they prefer. Thus the resulting sample conversion rates will be biased. Indeed, in our simulations we get the following sample conversion rates at the end of the experiment:

Current Arm: 0.0362

New Arm: 0.0395

### Ramp-up solution: measure epoch and condition on its effect

If one wants to do full traffic ramp-up and use data from all epochs, they must use an adjusted estimator to get an unbiased estimate of the average reward in each arm. By conditioning on a user’s epoch, we can get an unbiased estimate without needing to know and observe all possible time-based confounders. Let’s formally define a user’s epoch.**Epoch**: If assignment weights are changed at times $Z^*_1, ..., Z^*_J$ then the assignment weights are constant during $[Z^*_j, Z^*_{j+1})$. The period of constant assignment weights $[Z^*_j, Z^*_{j+1})$ will be called

**epoch**$j$. The epoch of user $i$ is determined by their assignment time and will be denoted by $E_i$.

To get an unbiased estimate for the average reward in arm $t$, we can infer the arm’s reward in each epoch and take a weighted average across the epochs with equal weights in the arms $$

\hat{\theta}_{t, \mathrm{adjusted}} = \sum_j \hat{E}[Y_i | T_i=t, E_i=j] P(E_i=j)

$$Here $Y_i$ is the observed reward for user $i$ and $\hat{E}$, $\hat{P}$ indicate estimates of the corresponding expected value and probability. Within an epoch, assignment weights are constant and so a sample mean using data from the appropriate epoch and arm is an unbiased estimate of $E[Y_i | T_i=t, E_i=j]$. Notice that the weights $P(E_i=j)$ do not depend on the assigned arm, so should be estimated from all users regardless of arm assignment. Intuitively the adjusted estimator reweights each epoch estimate in proportion to the percentage of the overall population that arrived in that epoch. Contrast this with an unadjusted estimate where each arm’s sample mean is only representative of the population of users assigned to the given arm. These populations are not the same across arms when there are changing assignment weights and time-based confounders.

Indeed, when we use this solution in the simulation, we get back the same result as if we hadn't changed assignment weights. To use the solution, let’s first look at the observed conversion rates in each epoch

Next notice that 50% of all users in the experiment were assigned in Epoch 1 and 50% were assigned in Epoch 2 so $\hat{P}(E_i=1 )=0.5$ and $\hat{P}(E_i=2 )=0.5$ (first arrival time is uniform and the epochs are equal lengths). Combining all these together as described above,

We see with this solution that the changing assignment weights problem disappears. An experimenter who changes assignment weights gets the same answer as the experimenter who doesn’t change assignment weights (modulo some rounding errors) so long as they use the adjusted estimator.

Current Arm: $\hat{\theta}_{0, \mathrm{adjusted}} = 0.0295 \times 0.5 + 0.0483 \times 0.5 =0.0389$

New Arm: $\hat{\theta}_{1, \mathrm{adjusted}} = 0.0340 \times 0.5 + 0.0406 \times 0.5=0.0373$

Using sample proportions, as we do here, is a simple way to estimate $\hat{E}[Y_i | T_i=1,E_i=j]$ and $\hat{P}(E_i=j )$. In practice, one may want to use more complex models to make these estimates. For example, one may want to use a model that can pool the epoch estimates with each other via hierarchical modeling (a.k.a. partial pooling) if one has many short epochs.

### Justification of the solution

In the above simulation we see that when a hypothetical experimenter changes assignment weights and uses the proposed solution, they get the same answer as if they hadn’t changed assignment weights. Here we formally justify that when assignment weights are changed in an experiment, this solution is an unbiased estimator of what is called the average treatment effect (ATE). To formally define ATE, we first introduce potential outcomes,**Potential outcome**: Let $Y_i (t)$ denote the realized reward for user $i$ if they had been given arm $t$. Potential outcomes are counterfactual conditional statements about what would have happened if the user was assigned a particular arm.

The average treatment effect (ATE) is formally defined as the expected difference in potential outcomes between users assigned to different arms $$

ATE =E[Y_i(1)] - E[Y_i(0)]

$$ATE is important in experiments because it tells us about the causal effect of each arm. In a changing assignment weights scenario, we must rely on two assumptions to show that our adjusted estimate is unbiased for ATE [5]. The first is called consistency,

**Consistency**: the observed outcome if assigned to arm $t$ is the same as the potential outcome if assigned to arm $t$, i.e $Y_i=Y_i(t)$ if $T_i=t$.

Consistency generally means that users assigned to an arm will experience the assigned arm and no other arm. In the simulation we are assuming that users can only see the version of the website to which they are assigned - hence, consistency. The second assumption is called conditional ignorability

**Conditional ignorability:**The potential outcomes of user $i$ if they had been assigned to arm $t$ are conditionally independent of which arm is actually assigned given some other variable, i.e $Y_i(t)$ is conditionally independent of $T_i$ given some other variable.

In the simulation, the potential outcome is conditionally independent of arm assignment given epoch, since assignment weights are constant within an epoch. Thus we have conditional ignorability no matter how many time-based confounders are present. With both consistency and conditional ignorability

\begin{align*}

E[Y_i(t)] &= \sum_j E[Y_i(t) | E_i=j] P(E_i=j) \quad \mbox{(law of total probability)}\\

&= \sum_j E[Y_i(t)| E_i=j] P(E_i=j)\quad (Y \perp T | E) \\

&= \sum_j E[Y_i| E_i=j, T_i=t]P(E_i=j)\quad\mbox{(consistency)}

\end{align*} So to get an unbiased estimate of the expected potential outcome one can combine unbiased estimates for $E[Y_i | T_i=t, E_i=j]$ and $P(E_i=j)$ as we do in $\hat{\theta}_{t, \mathrm{adjusted}}$. From a Bayesian perspective, one can combine joint posterior samples for $E[Y_i | T_i=t, E_i=j]$ and $P(E_i=j)$, which provides a measure of uncertainty around the estimate.

\begin{align*}

E[Y_i(t)] &= \sum_j E[Y_i(t) | E_i=j] P(E_i=j) \quad \mbox{(law of total probability)}\\

&= \sum_j E[Y_i(t)| E_i=j] P(E_i=j)\quad (Y \perp T | E) \\

&= \sum_j E[Y_i| E_i=j, T_i=t]P(E_i=j)\quad\mbox{(consistency)}

\end{align*} So to get an unbiased estimate of the expected potential outcome one can combine unbiased estimates for $E[Y_i | T_i=t, E_i=j]$ and $P(E_i=j)$ as we do in $\hat{\theta}_{t, \mathrm{adjusted}}$. From a Bayesian perspective, one can combine joint posterior samples for $E[Y_i | T_i=t, E_i=j]$ and $P(E_i=j)$, which provides a measure of uncertainty around the estimate.

## Naive implementation of multi-armed bandits can lead to biased estimates and inferior reward

In addition to ramp-up, multi-armed bandits (MABs) are a common reason for changing assignment weights. MABs are a class of algorithms that maximize reward (conversion rate, sales, etc) by assigning more users to better performing arms sooner in order to take advantage of them sooner. One such MAB algorithm is Thompson sampling which assigns a user to an arm according to the probability that the arm is best given the data seen thus far [6], [7].We can formally define Thompson sampling in the case of a two-arm experiment. Let $T_i=1$ if user $i$ is assigned to arm 1 and $T_i=0$ if assigned to arm 0. With Thompson sampling, $T_i$ is determined by $$

T_i \sim \mathrm{Bernoulli}(\Phi_i)\\

\Phi_i = P(\theta_1> \theta_0 | Y_1 \cdots Y_{i-1})

$$where $\theta_0$ and $\theta_1$ are the average rewards for arm 0 and arm 1 respectively and $\Phi_i$ is the posterior probability that arm 1 has better reward than arm 0 given all the data observed thus far. Notice that this probability will change with each additional sample, so assignment weights will change many times throughout the MAB.

Although there are a wide variety of MAB algorithms, many MAB algorithms are like Thompson sampling and require models to produce probabilities of each arm being the best. These MAB algorithms are great at maximizing reward when the models are perfectly specified and probabilities are accurate. However, these probabilities can be sensitive to model misspecification. Inaccurate probabilities can cause the MAB to do a poor job of maximizing future reward. This is because the MAB maximizes future reward by using previous users to infer the reward of future users as if they were assigned to each arm. If there are time-based confounders, then users who visit earlier will not be representative of later users. If the model is missing these time-based confounders then the inference will be poor and the MAB will do a poor job of maximizing future reward.

Often a MAB will be run to maximize reward during an experiment. Here the MAB eventually ends, a winner is declared, and the winning arm is deployed. If one does inference to determine the winning arm while ignoring the complexity of time-based confounders that are present, then the inference will be biased. The reason is much the same as in ramp-up. Changing assignment weights causes the distribution of time-based confounders to be systematically different across arms. Since time-based confounders affect reward, not considering time-based confounders that are present will produce biased results on the effect of each arm.

We continue with the fantasy football example and see that if a model omits a time-based confounder then MAB algorithms like Thompson sampling can be counterproductive towards maximizing reward. We will also use the example to illustrate how the naive model can lead to inferior decisions on which arm to deploy at the end of the MAB.

### Example: fantasy football website

Let’s go back to the fantasy football example and use the same simulation as described above. But now consider one hypothetical experimenter who runs a MAB to maximize reward during the experiment and the other who runs a traditional A/B experiment with constant and equal assignment weights. For the experimenter who runs a MAB, there’s more sampling variation than in the simulation for ramp-up because of the path dependence between what happens in the beginning and end of the experiment. For this reason, we run the simulation with 2,000,000 users 300 times.The MAB experimenter will use Thompson sampling. Since Thompson sampling requires Bayesian posteriors, the hypothetical experimenters will infer the conversion rate for each arm j with the following simple Bayesian model\begin{align*}

Y_i &\sim \mathrm{Bernoulli}(\theta_j)\\

\theta_j &\sim \mathrm{Beta}(1, 1)

\end{align*}Although the model is perhaps overly simple in the real world, remember that the data generating process in the simulation is also quite simple. The data generating process differs from the model only by the omission of a single variable (frequent vs. infrequent user) that is correlated with both assignment time and reward. Omitting confounding variables isn’t usually a concern in a traditional A/B test because the law of large numbers and random assignment often make these variables equally distributed among the arms. As we'll see in the results, this intuition doesn't apply with time-based confounders in MABs, since the assignment depends on time.

Just as in the previous ramp-up simulation, changing assignment weights causes some serious issues. Fig 5 shows the estimated ATE for each of the 300 simulations. We see that the MAB experimenter incorrectly picks the new arm as the winner in 267 out of 300 simulations. In fact, even when the MAB experimenter correctly picks the current arm as the winner, it dramatically misestimates the ATE relative to the experimenter who runs a traditional A/B test.

Not only does the MAB experimenter usually pick the wrong arm, we see in Fig 6 that the MAB experimenter also has fewer total conversions than the traditional A/B experimenter in 267 out of 300 simulations.

*The bimodality in Fig 5 and Fig 6 is caused by whether or not the MAB overconfidently goes into full “exploit” mode by sending nearly all users to the new arm (the preferred arm among frequent users). Occasionally the MAB “explores” just enough to send users to the current arm and eventually becomes confident that it is the winning arm (the preferred arm among infrequent users).

In this simulation, changing assignment weights with a MAB caused biased results and usually changed the outcome of the experiment. Additionally, the MAB experimenter usually had inferior reward than the traditional A/B experimenter who wasn’t even concerned about maximizing reward during the experiment!

### MAB solution: measure assignment time and condition on its effect

If one wants valid inference after using a MAB, one could use a similar strategy as the one described for ramp-up. Just as in the case of ramp-up, one can get valid inference without needing to know and observe all possible time-based confounders by conditioning on continuous assignment time and using a weighted average over assignment time as follows$$P(Y_i(t) | \theta )= \int P(Y_i | Z_i=z, T_i=t, \theta) P(Z_i=z | \theta) dz

$$where are model parameters. Conditioning on continuous assignment time Zi, rather than discrete assignment epoch, is necessary because the assignment weights can change with each additional sample in a MAB. Notice that this means a model that conditions on continuous assignment time is required. Conditioning on continuous assignment time is much more difficult in practice than conditioning on discrete epoch. For example, what happens if one doesn’t have a good understanding of how continuous assignment time impacts reward? Will the inference still be solid?

This solves the bias problem but doesn’t fix the problem of maximizing reward during the experiment when the model omits time-based confounders. Since maximizing reward is the reason to use a MAB, fixing the bias problem isn’t reason enough to run a MAB if it can’t also maximize reward. For maximizing reward during the experiment when there are time-based confounders, one should consider a class of MABs called restless multi-armed bandits which don’t assume reward is constant over time [8]. However, restless multi-armed bandits add new assumptions about how reward changes over time. Considering how counterproductive things can be when assumptions are not met in Thompson sampling, one should thoroughly understand and test these assumptions before implementing.

When using MABs, an experimenter needs to critically assess whether or not the model is a good fit to the data at hand. Often simple models won’t work. Even when the model is mostly correct, but is missing just one time-based confounder, the MAB can lead to both bad inference and inferior reward. Moreover, if a MAB ends and one of the arms in consideration is deployed, then the MAB is being used like an A/B test. Given all this, perhaps one should focus on valid inference rather than taking on the complexities required to also maximize reward.

## Conclusion

Changing assignment weights during an experiment is a common practice. Experts in the field often recommend ramping up assignment to never-before-seen arms [1] and MAB algorithms are popular in industry [2], [3], [4]. This post shows that naive implementations of ramp-up and MABs can break the inferential promises of randomized experiments and, in the case of MABs, can lead to inferior reward relative to a simple A/B testing design with constant assignment weights.We offer a solution for an unbiased estimator for the average treatment effect (ATE) when an experimenter does full traffic ramp-up and aggregates data across epochs in the presence of changing assignment weights and time-based confounders. The solution is a straightforward weighted average of reward during epochs that allows the experimenter to use data across the entire experiment. For MABs, the strategy of conditioning on assignment time is a bit trickier and doesn’t fix the problem of maximizing reward. With this in mind, one needs to consider why they are using a MAB. If the goal is really to maximize reward, one should thoroughly understand and critically test the assumptions of the MAB algorithm and any models the algorithm may require before implementing. If the goal is to eventually deploy one of the arms in consideration, perhaps one shouldn’t be taking on the complexities required for using a MAB.

Although this blog post makes some specific points about changing assignment weights in an A/B experiment, there is a more general takeaway as well. A/B testing isn’t simple just because data is big — the law of large numbers doesn’t take care of everything! Even with big data, A/B tests require thinking deeply and critically about whether or not the assumptions made match the data. Simply trusting methods that have been used in the past, either by the experimenter or the industry, can often lead experimenters astray.

### References

[1] Kohavi, Ron, Randal M. Henne, and Dan Sommerfield. "Practical guide to controlled experiments on the web: listen to your customers not to the hippo." Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining. 2007.[2] Scott, Steven L. "Multi‐armed bandit experiments in the online service economy." Applied Stochastic Models in Business and Industry 31.1 (2015): 37-45.

[3] Hill, Daniel N., et al. "An efficient bandit algorithm for realtime multivariate optimization." Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017.

[4] Audibert, Jean-Yves, and Sébastien Bubeck. "Best arm identification in multi-armed bandits." 2010.

[5] Imbens, Guido W., and Donald B. Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.

[6] Thompson, William R. "On the likelihood that one unknown probability exceeds another in view of the evidence of two samples." Biometrika 25.3/4 (1933): 285-294.

[7] Thompson, William R. "On the theory of apportionment." American Journal of Mathematics 57.2 (1935): 450-456.

[8] Whittle, P. “Restless Bandits: Activity Allocation in a Changing World.” Journal of Applied Probability, vol. 25, no. A, 1988, pp. 287–298., doi:10.2307/3214163.

## Comments

## Post a Comment