Algorithms for Adaptive Experiments that Trade-off Statistical Analysis with Reward: Combining Uniform Random Assignment and Reward Maximization
Posted on January 6, 2022
- Top-Two TS
- It seems that Top-Two TS is very similar to TS, except that it will not always choose the arm with the greatest expected reward. Instead, a hyperparameter beta (usually 50%) governs how it chooses between the two arms that currently look best: with probability beta it plays the arm that wins the posterior draw, and with probability 1 - beta it plays the runner-up (a code sketch follows after these bullets).
- This implies that early in the experiment, Top-Two TS can explore more options than TS, giving it a chance to become more certain about which arm is best. However, this behavior does not change over time: once it has a good estimate of the two best true means, it will not focus on exploiting the better one, but will keep choosing between the two of them at random instead.
- This also translates into much higher regret than TS, since it never fully commits to the optimal arm.
- It seems that the beauty of Top-Two TS is in finding the best arm (i.e., over many experiments, Top-Two TS will identify the best arm more often than TS will). The trade-off is that it fails to truly exploit that arm, which in the context of adaptive experiments means it is good at providing greater confidence about which intervention is better, at the cost of not giving most users the best one. Still, even though this might sound like the same conclusion as for UR (uniform random assignment), under Top-Two TS most users get assigned to the best two arms rather than to any arm at random.
- Perhaps an approach to build on top of Top-Two TS would be to define an 'exploration threshold' (e.g., when the top two arms have stayed the same for X% of the time, with a considerable estimated effect size, take this as a hint that the best arm has been found and exploit it).
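To make the mechanism above concrete, here is a minimal sketch of Top-Two TS for Bernoulli arms with Beta posteriors, written as I understand the algorithm: with probability beta, play the arm that wins a posterior draw; otherwise, resample until a different arm wins and play that challenger instead. The resampling cap, the function names, and the toy means are my own illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def top_two_ts_choose(alpha, beta, b=0.5, max_resamples=100):
    """One Top-Two TS decision for Bernoulli arms with Beta(alpha, beta) posteriors.

    With probability b, play the arm that wins a posterior draw (the 'top' arm);
    otherwise keep redrawing until a different arm wins and play that challenger.
    """
    top = int(np.argmax(rng.beta(alpha, beta)))
    if rng.random() < b:
        return top
    for _ in range(max_resamples):            # look for a challenger arm
        challenger = int(np.argmax(rng.beta(alpha, beta)))
        if challenger != top:
            return challenger
    return top                                # posterior already very concentrated

# Toy run: arm 2 is truly best, yet pulls stay split between the top two arms.
true_means = np.array([0.4, 0.5, 0.6])
alpha, beta = np.ones(3), np.ones(3)
for t in range(3000):
    arm = top_two_ts_choose(alpha, beta)
    reward = rng.random() < true_means[arm]
    alpha[arm] += reward
    beta[arm] += 1 - reward
print("pulls per arm:", (alpha + beta - 2).astype(int))
```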
- PostDiff
- The Posterior Probability Difference (PostDiff) is an MAB approach that increases exploration (i.e., assigns an arm uniformly at random) when the empirical effect size is small, accepting that the evidence so far is not conclusive about a meaningful difference in which arm is best. Conversely, if it does detect such a difference, it allocates the next action following an adaptive strategy.
- TS PostDiff is such a strategy, in which the adaptive allocation used when the estimated effect size is considerable is Thompson Sampling.
- To determine whether the difference between arms is not large enough to act on (i.e., to choose UR over TS), PostDiff calculates the posterior probability that the difference in expected reward between the two arms is less than a threshold c in [0, 1]; this is the quantity phi_t referred to in the questions below (see the sketch right after this bullet).
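As a sketch of how that calculation and decision could look for two Bernoulli arms with Beta posteriors: phi_t = P(|p1 - p2| < c | data) is estimated by Monte Carlo, and the algorithm then explores uniformly at random with probability phi_t, falling back to a regular Thompson Sampling draw otherwise. The probabilistic mixing on phi_t is my reading of the approach (a hard threshold on phi_t would be another possibility), and all names and toy numbers here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(alpha, beta, c, n_samples=10_000):
    """Monte Carlo estimate of phi_t = P(|p1 - p2| < c | data so far),
    the posterior probability that the two arm means differ by less than c."""
    p1 = rng.beta(alpha[0], beta[0], n_samples)
    p2 = rng.beta(alpha[1], beta[1], n_samples)
    return np.mean(np.abs(p1 - p2) < c)

def ts_postdiff_choose(alpha, beta, c):
    """With probability phi_t, assign uniformly at random (UR);
    otherwise take a regular Thompson Sampling draw."""
    if rng.random() < phi(alpha, beta, c):
        return int(rng.integers(2))               # UR step: evidence of a difference is weak
    return int(np.argmax(rng.beta(alpha, beta)))  # TS step: exploit the posterior

# Toy run: arm 1 is truly better (0.6 vs 0.5), with threshold c = 0.1.
true_means = np.array([0.5, 0.6])
alpha, beta, c = np.ones(2), np.ones(2), 0.1
for t in range(2000):
    arm = ts_postdiff_choose(alpha, beta, c)
    reward = rng.random() < true_means[arm]
    alpha[arm] += reward
    beta[arm] += 1 - reward
print("pulls per arm:", (alpha + beta - 2).astype(int))
```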
- False Positive Rate (FPR)
- Probability of concluding there is a difference in the arms' means, when there is none.
- Statistical Power
- Probability of concluding there is a difference in the arms' means, when there is one (a simulation sketch of how both rates can be estimated follows below).
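To make these two definitions concrete, here is a hedged simulation sketch: the FPR is estimated by running many simulated experiments in which the arm means are truly equal, and the power by running them where the means truly differ, counting in each case how often a test on the collected data declares a difference. The two-proportion z-test, the horizon, and the effect size below are illustrative assumptions on my part, not the choices made in the paper.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def two_prop_z_pvalue(s1, n1, s2, n2):
    """Two-sided p-value of a pooled two-proportion z-test (an illustrative test choice)."""
    p_pool = (s1 + s2) / (n1 + n2)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (s1 / n1 - s2 / n2) / se
    return 2 * (1 - norm.cdf(abs(z)))

def run_experiment(true_means, horizon, choose_arm):
    """Collect data with an adaptive policy; return successes and pull counts per arm."""
    alpha, beta = np.ones(2), np.ones(2)
    for _ in range(horizon):
        arm = choose_arm(alpha, beta)
        reward = rng.random() < true_means[arm]
        alpha[arm] += reward
        beta[arm] += 1 - reward
    return alpha - 1, alpha + beta - 2          # successes, counts

def rejection_rate(true_means, horizon, choose_arm, n_sims=500, level=0.05):
    """Fraction of simulated experiments in which the test declares a difference."""
    rejections = 0
    for _ in range(n_sims):
        (s1, s2), (n1, n2) = run_experiment(true_means, horizon, choose_arm)
        if min(n1, n2) == 0:                    # an arm was never pulled; count as no rejection
            continue
        rejections += two_prop_z_pvalue(s1, n1, s2, n2) < level
    return rejections / n_sims

ts = lambda a, b: int(np.argmax(rng.beta(a, b)))       # plain Thompson Sampling policy
print("FPR   ~", rejection_rate([0.5, 0.5], 400, ts))  # arms truly equal
print("Power ~", rejection_rate([0.5, 0.6], 400, ts))  # arms truly differ
```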
- Questions
- Posterior probability (TS PostDiff)
- Why is calculating the stated probability (phi_t) equivalent to simply checking whether |p1 - p2| < c?