Open Access · Posted Content

A Decision Theoretic Approach to A/B Testing

TLDR
The results suggest that the 0.05 p-value threshold may be too conservative in some settings, but that its widespread use may reflect an ad-hoc means of controlling multiplicity in the common case of repeatedly testing variants of an experiment when the threshold is not reached.
Abstract
A/B testing is ubiquitous within the machine learning and data science operations of internet companies. Generically, the idea is to perform a statistical test of the hypothesis that a new feature is better than the existing platform---for example, it results in higher revenue. If the p value for the test is below some pre-defined threshold---often, 0.05---the new feature is implemented. The difficulty of choosing an appropriate threshold has been noted before, particularly because dependent tests are often done sequentially, leading some to propose control of the false discovery rate (FDR) rather than use of a single, universal threshold. However, it is still necessary to make an arbitrary choice of the level at which to control FDR. Here we suggest a decision-theoretic approach to determining whether to adopt a new feature, which enables automated selection of an appropriate threshold. Our method has the basic ingredients of any decision-theory problem: a loss function, action space, and a notion of optimality, for which we choose Bayes risk. However, the loss function and the action space differ from the typical choices made in the literature, which has focused on the theory of point estimation. We give some basic results for Bayes-optimal thresholding rules for the feature adoption decision, and give some examples using eBay data. The results suggest that the 0.05 p-value threshold may be too conservative in some settings, but that its widespread use may reflect an ad-hoc means of controlling multiplicity in the common case of repeatedly testing variants of an experiment when the threshold is not reached.
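
The abstract names the ingredients of the decision problem (a loss function, an action space, and Bayes risk) but not their exact forms. The sketch below is a minimal illustration under assumed ingredients only: a flat prior giving a normal posterior on the treatment lift, a linear loss with an optional switch cost, and Monte Carlo evaluation of posterior expected loss. The function name, loss form, and cost terms are assumptions, not the paper's specification.

    import numpy as np

    rng = np.random.default_rng(0)

    def adopt_feature(delta_hat, se, switch_cost=0.0, n_draws=100_000):
        """Compare posterior expected losses of adopting vs. keeping the status quo.

        delta_hat   : estimated lift (e.g., difference in mean revenue per user)
        se          : standard error of the estimate
        switch_cost : assumed fixed cost of rolling out the change
        Assumes a flat prior, so the posterior on the true lift is Normal(delta_hat, se^2).
        """
        delta = rng.normal(delta_hat, se, size=n_draws)  # posterior draws of the lift

        # Adopting: pay the switch cost, and lose |delta| when the feature is worse.
        loss_adopt = switch_cost + np.maximum(-delta, 0.0)
        # Keeping the status quo: forgo the lift when the feature is better.
        loss_keep = np.maximum(delta, 0.0)

        return loss_adopt.mean() < loss_keep.mean()

    # Example: a 0.3% lift measured with a 0.2% standard error.
    print(adopt_feature(delta_hat=0.003, se=0.002, switch_cost=0.0005))

Under such a loss, the implied cutoff on the test statistic depends on the prior and the cost terms rather than on a universal p-value threshold such as 0.05.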


Citations
Journal Article

A/B Testing with Fat Tails

TL;DR: The theoretical results, along with an empirical analysis of Microsoft Bing’s EXP platform, suggest that simple changes to business practices could increase innovation productivity.
Journal Article

Empirical Bayes Estimation of Treatment Effects with Many A/B Tests: An Overview

TL;DR: This is a practical guide on how to use treatment effect estimates from a large number of experiments to improve estimates of the effects of each experiment.
Posted Content

Empirical Bayes for Large-scale Randomized Experiments: a Spectral Approach

TL;DR: A spectral maximum likelihood estimate based on a Fourier series representation is developed; it can be computed efficiently via convex optimization and is used to select hyperparameters and compare models.
Proceedings Article

On Post-Selection Inference in A/B Tests

TL;DR: This paper explores two seemingly unrelated paths, one based on supervised machine learning and the other on empirical Bayes, and proposes post-selection inferential approaches that combine the strengths of both.
Posted Content

Optimal Testing in the Experiment-rich Regime

TL;DR: In this article, the authors propose a new experimental design framework for the setting where potential experiments are abundant (i.e., many hypotheses are available to test), and observations are costly; they refer to this as the experiment-rich regime.
References
Journal Article

Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper)

Andrew Gelman, 01 Sep 2006
TL;DR: In this paper, a folded noncentral t family of conditionally conjugate priors for hierarchical standard deviation parameters is proposed, and weakly informative priors in this family are considered.
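
For a rough sense of the prior family summarized there, the snippet below draws from a half-Cauchy prior (the folded noncentral t with one degree of freedom and zero noncentrality) for a group-level standard deviation, using SciPy; the scale of 5 is an arbitrary choice for illustration.

    import numpy as np
    from scipy import stats

    # Half-Cauchy(scale=5) prior for a hierarchical standard deviation:
    # a special case of the folded noncentral-t family (1 df, zero noncentrality).
    prior = stats.halfcauchy(scale=5.0)
    draws = prior.rvs(size=100_000, random_state=1)

    print(np.median(draws))          # most mass at moderate values ...
    print(np.quantile(draws, 0.99))  # ... with a heavy right tail
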
Journal Article

Controlled experiments on the web: survey and practical guide

TL;DR: This work provides a practical guide to conducting online experiments, and shares key lessons that will help practitioners in running trustworthy controlled experiments, including statistical power, sample size, and techniques for variance reduction.
Journal Article

A modern Bayesian look at the multi-armed bandit

TL;DR: A heuristic for managing multi-armed bandits called randomized probability matching is described, which randomly allocates observations to arms according to the Bayesian posterior probability that each arm is optimal.
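
Below is a minimal sketch of randomized probability matching (often called Thompson sampling) for two Bernoulli arms with conjugate Beta posteriors; the uniform Beta(1, 1) priors and the simulated conversion rates are assumptions for illustration, not taken from the paper.

    import numpy as np

    rng = np.random.default_rng(42)
    true_rates = [0.04, 0.05]   # assumed conversion rates of the two arms
    successes = np.ones(2)      # Beta(1, 1) priors on each arm's rate
    failures = np.ones(2)

    for _ in range(10_000):
        # Sample a rate from each arm's posterior and play the arm that wins,
        # i.e., allocate in proportion to the posterior probability of being optimal.
        sampled = rng.beta(successes, failures)
        arm = int(np.argmax(sampled))
        reward = rng.random() < true_rates[arm]
        successes[arm] += reward
        failures[arm] += 1 - reward

    print(successes + failures - 2)  # pulls per arm; allocation shifts toward the better arm
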
Proceedings Article

Online controlled experiments at large scale

TL;DR: This work discusses why negative experiments, which degrade the user experience short term, should be run, given the learning value and long-term benefits, and designs a highly scalable system able to handle data at massive scale: hundreds of concurrent experiments, each containing millions of users.
Proceedings Article

Practical guide to controlled experiments on the web: listen to your customers not to the hippo

TL;DR: This work provides a practical guide to conducting online experiments, and shares key lessons that will help practitioners in running trustworthy controlled experiments, including statistical power, sample size, and techniques for variance reduction.