Leakage in data mining: Formulation, detection, and avoidance

doi:10.1145/2382577.2382579

Journal Article

Shachar Kaufman, +3 more

- 18 Dec 2012 -

ACM Transactions on Knowledge Discovery ...

- Vol. 6, Iss: 4, pp 15

TLDR

It is shown that it is possible to avoid leakage with a simple specific approach to data management followed by what is called a learn-predict separation, and several ways of detecting leakage when the modeler has no control over how the data have been collected are presented.

Abstract:

Deemed “one of the top ten data mining mistakes”, leakage is the introduction of information about the data mining target that should not be legitimately available to mine from. In addition to our own industry experience with real-life projects, controversies around several major public data mining competitions held recently such as the INFORMS 2010 Data Mining Challenge and the IJCNN 2011 Social Network Challenge are evidence that this issue is as relevant today as it has ever been. While acknowledging the importance and prevalence of leakage in both synthetic competitions and real-life data mining projects, existing literature has largely left this idea unexplored. What little has been said turns out not to be broad enough to cover more complex cases of leakage, such as those where the classical independently and identically distributed (i.i.d.) assumption is violated, that have been recently documented. In our new approach, these cases and others are explained by explicitly defining modeling goals and analyzing the broader framework of the data mining problem. The resulting definition enables us to derive general methodology for dealing with the issue. We show that it is possible to avoid leakage with a simple specific approach to data management followed by what we call a learn-predict separation, and present several ways of detecting leakage when the modeler has no control over how the data have been collected. We also offer an alternative point of view on leakage that is based on causal graph modeling concepts.
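The idea of keeping information that is not legitimately available away from the model can be illustrated with a small sketch. It assumes every stored value carries a timestamp and that only values recorded before the prediction date may be used as features. The `Observation` class, the `legitimate_features` helper, the field names, and the dates are all hypothetical illustrations and are not taken from the paper.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Observation:
    """One stored value plus a hypothetical legitimacy tag (when it was recorded)."""
    name: str
    value: float
    observed_on: date


def legitimate_features(observations, prediction_date):
    """Keep only values recorded strictly before the prediction date."""
    return {
        o.name: o.value
        for o in observations
        if o.observed_on < prediction_date
    }


# Toy data (made up): the last value is recorded after the prediction date,
# so using it as a feature would leak information about the outcome.
observations = [
    Observation("visits_last_month", 12.0, date(2012, 5, 31)),
    Observation("account_age_days", 420.0, date(2012, 6, 1)),
    Observation("refund_issued", 1.0, date(2012, 7, 15)),
]

features = legitimate_features(observations, prediction_date=date(2012, 6, 15))
print(sorted(features))  # ['account_age_days', 'visits_last_month']
```

Applying the same filter when building both the training set and the inputs at prediction time is one way to keep the two stages consistent under these assumptions.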

Citations

Proceedings Article

"Why Should I Trust You?": Explaining the Predictions of Any Classifier

Marco Tulio Ribeiro, +2 more

TL;DR: In this article, the authors propose LIME, a method to explain models by presenting representative individual predictions and their explanations in a non-redundant way, framing the task as a submodular optimization problem.

Journal Article

Opportunities and obstacles for deep learning in biology and medicine.

Travers Ching, +38 more

- 01 Apr 2018 -

Journal of the Royal Society Interface

TL;DR: It is found that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art.

Proceedings Article

Anchors: High-Precision Model-Agnostic Explanations

Marco Tulio Ribeiro, +2 more

TL;DR: This work introduces a novel model-agnostic system that explains the behavior of complex models with high-precision rules called anchors, representing local, “sufficient” conditions for predictions, and proposes an algorithm to efficiently compute these explanations for any black-box model with high probability guarantees.

Proceedings Article

Leakage in data mining: formulation, detection, and avoidance

Shachar Kaufman, +2 more

TL;DR: It is shown that it is possible to avoid leakage with a simple specific approach to data management followed by what the authors call a learn-predict separation, and several ways of detecting leakage when the modeler has no control over how the data have been collected are presented.

Journal Article

Auditing black-box models for indirect influence

Philip Adler, +7 more

- 01 Jan 2018 -

Knowledge and Information Systems

TL;DR: In this article, the authors present a technique for auditing black-box models, which lets them study the extent to which existing models take advantage of particular features in the data set, without knowing how the models work.

References

Journal Article

Co-integration and Error Correction: Representation, Estimation and Testing

Robert F. Engle, +1 more

- 01 Mar 1987 -

Econometrica

TL;DR: The relationship between co-integration and error correction models, first suggested in Granger (1981), is here extended and used to develop estimation procedures, tests, and empirical examples.

Book

The Elements of Statistical Learning: Data Mining, Inference, and Prediction

Trevor Hastie, +2 more

TL;DR: In this paper, the authors describe the important ideas in these areas in a common conceptual framework, and the emphasis is on concepts rather than mathematics, with a liberal use of color graphics.

Monograph

Causality: models, reasoning, and inference

Judea Pearl

- 14 Sep 2009 -

Tijdschrift Voor Filosofie

TL;DR: The art and science of cause and effect have long been studied in the social sciences; see, e.g., the theory of inferred causation, causal diagrams, and the identification of causal effects.

Journal Article