Proceedings ArticleDOI

Challenges, Best Practices and Pitfalls in Evaluating Results of Online Controlled Experiments

TLDR
In this tutorial, challenges, best practices, and pitfalls in evaluating experiment results are discussed, focusing on both lessons learned and practical guidelines, as well as open research questions.
Abstract
A/B testing is the gold standard for estimating the causal relationship between a change in a product and its impact on key outcome measures. It is widely used in industry to test changes ranging from simple copy or UI changes to more complex changes such as using machine learning models to personalize the user experience. A key aspect of A/B testing is the evaluation of experiment results. Designing the right set of metrics - correct outcome measures, data quality indicators, guardrails that prevent harm to the business, and a comprehensive set of supporting metrics to understand the "why" behind the key movements - is the #1 challenge practitioners face when trying to scale their experimentation program [18, 22]. On the technical side, improving the sensitivity of experiment metrics is a hard problem and an active research area, with large practical implications as more and more small and medium-sized businesses try to adopt A/B testing and suffer from insufficient statistical power. In this tutorial we will discuss challenges, best practices, and pitfalls in evaluating experiment results, focusing on both lessons learned and practical guidelines, as well as open research questions.
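As a rough illustration of the power problem mentioned in the abstract (not part of the tutorial itself), the Python sketch below estimates the minimum detectable effect of a conversion-rate metric under a standard two-sample z-test; the baseline rate, sample size, alpha, and power are assumed example values.

# Illustrative sketch: minimum detectable effect (MDE) of a conversion-rate
# metric under a two-sided two-sample z-test (normal approximation).
# All inputs below are assumed example values.
import numpy as np
from scipy import stats

def minimum_detectable_effect(baseline_rate, n_per_group, alpha=0.05, power=0.8):
    """Absolute lift detectable with the given power."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_power = stats.norm.ppf(power)           # quantile for the desired power
    se = np.sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_group)
    return (z_alpha + z_power) * se

# Example: a 5% baseline conversion rate with 10,000 users per variant
# gives an MDE of roughly 0.86 percentage points.
print(minimum_detectable_effect(0.05, 10_000))

With only 10,000 users per variant, lifts much smaller than about 0.9 percentage points are unlikely to reach significance, which is the sensitivity gap the abstract points to for smaller businesses.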


Citations
Posted Content

How to Measure Your App: A Couple of Pitfalls and Remedies in Measuring App Performance in Online Controlled Experiments

TL;DR: Several scalable methods are introduced, including user-level performance metric calculation and imputation and matching for missing metric values, to address pitfalls that arise from strong heterogeneity in both mobile devices and user engagement and from self-selection bias caused by post-treatment changes in user engagement.
Proceedings ArticleDOI

How to Measure Your App: A Couple of Pitfalls and Remedies in Measuring App Performance in Online Controlled Experiments

TL;DR: In this article, the authors discuss two major pitfalls in this industry-standard practice of measuring performance for mobile apps: strong heterogeneity in both mobile devices and user engagement, and the self-selection bias caused by post-treatment user engagement changes.
Proceedings ArticleDOI

User Sentiment as a Success Metric: Persistent Biases Under Full Randomization

TL;DR: It is shown that a simple mean comparison produces biased population-level estimates, and a set of consistent estimators for the average and local treatment effects on treated and respondent users is proposed.
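For intuition only, here is a small simulated example (assumed numbers, not taken from the cited paper) of the bias described in the TL;DR above: when treatment changes who responds to a sentiment survey, a simple comparison of respondent means is biased even under full randomization.

# Illustrative simulation: treatment affects response propensity but not
# sentiment, yet the naive respondent-only comparison looks negative.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
treated = rng.integers(0, 2, n).astype(bool)

sentiment = rng.normal(0.0, 1.0, n)          # latent sentiment, true effect is zero
# Assumption: treatment makes unhappy users more likely to respond.
respond_prob = np.where(treated, 0.3 - 0.1 * sentiment, 0.3)
respond = rng.random(n) < np.clip(respond_prob, 0, 1)

naive = sentiment[treated & respond].mean() - sentiment[~treated & respond].mean()
print(naive)   # clearly negative, even though the true treatment effect is zero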
Book ChapterDOI

Performance Comparison for E-Learning and Tools in Twenty-First Century with Legacy System Using Classification Approach

TL;DR: The research work also covers the implications and challenges faced by universities while implementing these technologies; performance was observed, and various inferences were discussed regarding the effective delivery of teaching material and its issues.
References
Proceedings ArticleDOI

Trustworthy online controlled experiments: five puzzling outcomes explained

TL;DR: The topics covered include the OEC (Overall Evaluation Criterion), click tracking, effect trends, experiment length and power, and carryover effects; the explanations should help readers increase the trustworthiness of the results coming out of controlled experiments.
Journal ArticleDOI

Large-scale validation and analysis of interleaved search evaluation

TL;DR: This paper provides a comprehensive analysis of interleaving using data from two major commercial search engines and a retrieval system for scientific literature, and analyzes the agreement of interleaving with manual relevance judgments and observational implicit feedback measures.
Proceedings ArticleDOI

From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social Networks

TL;DR: The experimentation platform at LinkedIn is described in depth, along with how it is built to handle each step of the A/B testing process, from designing and deploying experiments to analyzing them.
Proceedings ArticleDOI

Improving the sensitivity of online controlled experiments by utilizing pre-experiment data

TL;DR: This work proposes an approach (CUPED) that utilizes data from the pre-experiment period to reduce metric variability and hence achieve better sensitivity in experiments; the method is applicable to a wide variety of key business metrics.
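To make the CUPED idea concrete, here is a minimal sketch (assumed data layout and variable names, not the authors' implementation) of the standard covariate adjustment that uses each user's pre-experiment metric to reduce the variance of the in-experiment metric.

# Minimal CUPED-style variance reduction: adjust the in-experiment metric y
# with the same user's pre-experiment metric x.
import numpy as np

def cuped_adjust(y, x):
    """Return the adjusted metric y - theta * (x - mean(x))."""
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

rng = np.random.default_rng(0)
x = rng.normal(10, 3, 5_000)                 # pre-experiment metric per user (assumed)
y = 0.8 * x + rng.normal(0, 1, 5_000)        # correlated in-experiment metric (assumed)
y_adj = cuped_adjust(y, x)
print(np.var(y, ddof=1), np.var(y_adj, ddof=1))  # adjusted variance is much smaller

Because the adjustment only recenters y by a zero-mean quantity, the treatment-versus-control comparison is unchanged in expectation while its variance shrinks by roughly the squared correlation between x and y.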
Proceedings ArticleDOI

Measuring the user experience on a large scale: user-centered metrics for web applications

TL;DR: The HEART framework for user-centered metrics is described, along with a process for mapping product goals to metrics; these have generalized across enough of the company's own products that teams in other organizations should be able to reuse or adapt them.
Trending Questions (1)
What are best practices for online controlled experiments?

The paper discusses challenges, best practices, and pitfalls in evaluating experiment results, but does not explicitly enumerate best practices for online controlled experiments.