Proceedings ArticleDOI

Challenges, Best Practices and Pitfalls in Evaluating Results of Online Controlled Experiments

TLDR
In this tutorial, challenges, best practices, and pitfalls in evaluating experiment results are discussed, focusing on both lessons learned and practical guidelines, as well as open research questions.
Abstract
A/B testing is the gold standard for estimating the causal relationship between a change in a product and its impact on key outcome measures. It is widely used in industry to test changes ranging from simple copy or UI changes to more complex changes such as using machine learning models to personalize the user experience. A key aspect of A/B testing is the evaluation of experiment results. Designing the right set of metrics - correct outcome measures, data quality indicators, guardrails that prevent harm to the business, and a comprehensive set of supporting metrics to understand the "why" behind the key movements - is the #1 challenge practitioners face when trying to scale their experimentation program [18, 22]. On the technical side, improving the sensitivity of experiment metrics is a hard problem and an active research area, with large practical implications as more and more small and medium-sized businesses try to adopt A/B testing and suffer from insufficient statistical power. In this tutorial we will discuss challenges, best practices, and pitfalls in evaluating experiment results, focusing on both lessons learned and practical guidelines, as well as open research questions.
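As a rough illustration of the power problem mentioned in the abstract (not part of the tutorial itself), the Python sketch below estimates the minimum detectable effect of a conversion-rate metric under a standard two-sample z-test; the baseline rate, sample size, alpha, and power are assumed example values.

# Illustrative sketch: minimum detectable effect (MDE) of a conversion-rate
# metric under a two-sided two-sample z-test (normal approximation).
# All inputs below are assumed example values.
import numpy as np
from scipy import stats

def minimum_detectable_effect(baseline_rate, n_per_group, alpha=0.05, power=0.8):
    """Absolute lift detectable with the given power."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_power = stats.norm.ppf(power)           # quantile for the desired power
    se = np.sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_group)
    return (z_alpha + z_power) * se

# Example: a 5% baseline conversion rate with 10,000 users per variant
# gives an MDE of roughly 0.86 percentage points.
print(minimum_detectable_effect(0.05, 10_000))

With only 10,000 users per variant, lifts much smaller than about 0.9 percentage points are unlikely to reach significance, which is the sensitivity gap the abstract points to for smaller businesses.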


Citations
Posted Content

How to Measure Your App: A Couple of Pitfalls and Remedies in Measuring App Performance in Online Controlled Experiments

TL;DR: Several scalable methods are introduced, including user-level performance metric calculation and imputation and matching for missing metric values, to address pitfalls that arise from strong heterogeneity in both mobile devices and user engagement and from self-selection bias caused by post-treatment changes in user engagement.
Proceedings ArticleDOI

How to Measure Your App: A Couple of Pitfalls and Remedies in Measuring App Performance in Online Controlled Experiments

TL;DR: In this article, the authors discuss two major pitfalls in this industry-standard practice of measuring performance for mobile apps: strong heterogeneity in both mobile devices and user engagement, and the self-selection bias caused by post-treatment user engagement changes.
Proceedings ArticleDOI

User Sentiment as a Success Metric: Persistent Biases Under Full Randomization

TL;DR: It is shown that a simple mean comparison produces biased population-level estimates, and a set of consistent estimators for the average and local treatment effects on treated and respondent users is proposed.
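For intuition only, here is a small simulated example (assumed numbers, not taken from the cited paper) of the bias described in the TL;DR above: when treatment changes who responds to a sentiment survey, a simple comparison of respondent means is biased even under full randomization.

# Illustrative simulation: treatment affects response propensity but not
# sentiment, yet the naive respondent-only comparison looks negative.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
treated = rng.integers(0, 2, n).astype(bool)

sentiment = rng.normal(0.0, 1.0, n)          # latent sentiment, true effect is zero
# Assumption: treatment makes unhappy users more likely to respond.
respond_prob = np.where(treated, 0.3 - 0.1 * sentiment, 0.3)
respond = rng.random(n) < np.clip(respond_prob, 0, 1)

naive = sentiment[treated & respond].mean() - sentiment[~treated & respond].mean()
print(naive)   # clearly negative, even though the true treatment effect is zero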
Book ChapterDOI

Performance Comparison for E-Learning and Tools in Twenty-First Century with Legacy System Using Classification Approach

TL;DR: The research work also covers the implications and challenges faced by universities while implementing these technologies; performance was observed, and various inferences were discussed regarding the effective delivery of teaching material and its issues.
References
Proceedings ArticleDOI

Trustworthy online controlled experiments: five puzzling outcomes explained

TL;DR: The topics covered include the OEC (Overall Evaluation Criterion), click tracking, effect trends, experiment length and power, and carryover effects; the explanations should help readers increase the trustworthiness of the results coming out of controlled experiments.
Journal ArticleDOI

Large-scale validation and analysis of interleaved search evaluation

TL;DR: This paper provides a comprehensive analysis of interleaving using data from two major commercial search engines and a retrieval system for scientific literature, and analyzes the agreement of interleaving with manual relevance judgments and observational implicit feedback measures.
Proceedings ArticleDOI

From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social Networks

TL;DR: The experimentation platform at LinkedIn is described in depth, along with how it is built to handle each step of the A/B testing process, from designing and deploying experiments to analyzing them.
Proceedings ArticleDOI

Improving the sensitivity of online controlled experiments by utilizing pre-experiment data

TL;DR: This work proposes an approach (CUPED) that utilizes data from the pre-experiment period to reduce metric variability and hence achieve better sensitivity in experiments; the method is applicable to a wide variety of key business metrics.
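To make the CUPED idea concrete, here is a minimal sketch (assumed data layout and variable names, not the authors' implementation) of the standard covariate adjustment that uses each user's pre-experiment metric to reduce the variance of the in-experiment metric.

# Minimal CUPED-style variance reduction: adjust the in-experiment metric y
# with the same user's pre-experiment metric x.
import numpy as np

def cuped_adjust(y, x):
    """Return the adjusted metric y - theta * (x - mean(x))."""
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

rng = np.random.default_rng(0)
x = rng.normal(10, 3, 5_000)                 # pre-experiment metric per user (assumed)
y = 0.8 * x + rng.normal(0, 1, 5_000)        # correlated in-experiment metric (assumed)
y_adj = cuped_adjust(y, x)
print(np.var(y, ddof=1), np.var(y_adj, ddof=1))  # adjusted variance is much smaller

Because the adjustment only recenters y by a zero-mean quantity, the treatment-versus-control comparison is unchanged in expectation while its variance shrinks by roughly the squared correlation between x and y.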
Proceedings ArticleDOI

Measuring the user experience on a large scale: user-centered metrics for web applications

TL;DR: The HEART framework for user-centered metrics is described, along with a process for mapping product goals to metrics; these have generalized across enough of the company's own products that teams in other organizations should be able to reuse or adapt them.
Trending Questions (1)
What are best practices for online controlled experiments?

The paper discusses challenges, best practices, and pitfalls in evaluating experiment results, but does not explicitly enumerate best practices for online controlled experiments.