Journal Article

A rate of convergence for mixture proportion estimation, with application to learning from noisy labels

01 Jan 2015 - Journal of Machine Learning Research (Microtome Publishing) - Vol. 38, pp. 838-846
TL;DR: This work establishes a rate of convergence for mixture proportion estimation under an appropriate distributional assumption, and argues that this rate of convergence is useful for analyzing weakly supervised learning algorithms that build on MPE.
Abstract: Mixture proportion estimation (MPE) is a fundamental tool for solving a number of weakly supervised learning problems – supervised learning problems where label information is noisy or missing. Previous work on MPE has established a universally consistent estimator. In this work we establish a rate of convergence for mixture proportion estimation under an appropriate distributional assumption, and argue that this rate of convergence is useful for analyzing weakly supervised learning algorithms that build on MPE. To illustrate this idea, we examine an algorithm for classification in the presence of noisy labels based on surrogate risk minimization, and show that the rate of convergence for MPE enables proof of the algorithm’s consistency. Finally, we provide a practical implementation of mixture proportion estimation and demonstrate its efficacy in classification with noisy labels.
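The following is a hedged illustration of the quantity MPE targets, not the estimator analyzed in the paper: given samples from the mixture F = (1 − κ)G + κH and from the component H, it approximates κ* = inf_S F(S)/H(S) over upper-level sets of a one-dimensional score. The function name, the quantile grid, and the min_mass floor are illustrative choices.

```python
import numpy as np

def estimate_mixture_proportion(scores_mixture, scores_component,
                                n_thresholds=20, min_mass=0.05):
    """Illustrative estimate of kappa in F = (1 - kappa) G + kappa H.

    Approximates kappa* = inf_S F(S) / H(S) over upper-level sets of a
    1-D score, keeping only sets that carry at least `min_mass` of the
    component sample so the ratio stays stable.  (Sketch only; not the
    estimator whose convergence rate the paper establishes.)
    """
    thresholds = np.quantile(scores_component,
                             np.linspace(0.0, 1.0 - min_mass, n_thresholds))
    ratios = []
    for t in thresholds:
        f_mass = np.mean(scores_mixture >= t)    # empirical F(S_t)
        h_mass = np.mean(scores_component >= t)  # empirical H(S_t)
        if h_mass >= min_mass:
            ratios.append(f_mass / h_mass)
    return min(ratios) if ratios else 1.0
```

In practice the scores could come from any classifier trained to distinguish the two samples; the mass floor is one crude bias-variance knob, and quantifying that trade-off is exactly where a rate of convergence for MPE becomes useful.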
Citations
Journal ArticleDOI
TL;DR: It is proved that any surrogate loss function can be used for classification with noisy labels by using importance reweighting, with consistency assurance that the label noise does not ultimately hinder the search for the optimal classifier of the noise-free sample.
Abstract: In this paper, we study a classification problem in which sample labels are randomly corrupted. In this scenario, there is an unobservable sample with noise-free labels. However, before being observed, the true labels are independently flipped with a probability $\rho \in [0,0.5)$ , and the random label noise can be class-conditional. Here, we address two fundamental problems raised by this scenario. The first is how to best use the abundant surrogate loss functions designed for the traditional classification problem when there is label noise. We prove that any surrogate loss function can be used for classification with noisy labels by using importance reweighting, with consistency assurance that the label noise does not ultimately hinder the search for the optimal classifier of the noise-free sample. The other is the open problem of how to obtain the noise rate $\rho$ . We show that the rate is upper bounded by the conditional probability $P(\hat{Y}|X)$ of the noisy sample. Consequently, the rate can be estimated, because the upper bound can be easily reached in classification problems. Experimental results on synthetic and real datasets confirm the efficiency of our methods.
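As a rough sketch of the reweighting idea in this abstract (assuming the standard class-conditional noise model with flip rates ρ₊ = P(Ỹ = −1 | Y = +1) and ρ₋ = P(Ỹ = +1 | Y = −1), and a calibrated estimate of the noisy posterior; the helper below is illustrative, not the authors' implementation):

```python
import numpy as np

def importance_weights(eta_noisy, y_noisy, rho_pos, rho_neg):
    """Per-example weights beta(x, y~) = P(Y = y~ | x) / P(Y~ = y~ | x).

    eta_noisy : estimated P(Y~ = +1 | x) from a probabilistic classifier
                trained on the noisy sample (assumed well calibrated).
    y_noisy   : observed noisy labels in {+1, -1}.
    rho_pos   : P(Y~ = -1 | Y = +1);  rho_neg : P(Y~ = +1 | Y = -1).
    Assumes rho_pos + rho_neg < 1.
    """
    denom = 1.0 - rho_pos - rho_neg
    eta_clean = np.clip((eta_noisy - rho_neg) / denom, 0.0, 1.0)
    w_pos = eta_clean / np.clip(eta_noisy, 1e-12, None)                # y~ = +1
    w_neg = (1.0 - eta_clean) / np.clip(1.0 - eta_noisy, 1e-12, None)  # y~ = -1
    return np.where(y_noisy == 1, w_pos, w_neg)
```

The returned weights would multiply the surrogate loss of each noisy example, so that minimizing the reweighted empirical risk targets the clean distribution.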

744 citations


Cites methods from "A rate of convergence for mixture p..."

  • ...In Section 4, we discuss how to perform classification in the presence of RCN and benefit from the abundant surrogate loss functions and algorithms designed for the traditional classification problem....

    [...]

Posted Content
TL;DR: A comprehensive review of 62 state-of-the-art robust training methods, all of which are categorized into five groups according to their methodological difference, followed by a systematic comparison of six properties used to evaluate their superiority.
Abstract: Deep learning has achieved remarkable success in numerous domains with help from large amounts of big data. However, the quality of data labels is a concern because of the lack of high-quality labels in many real-world scenarios. As noisy labels severely degrade the generalization performance of deep neural networks, learning from noisy labels (robust training) is becoming an important task in modern deep learning applications. In this survey, we first describe the problem of learning with label noise from a supervised learning perspective. Next, we provide a comprehensive review of 46 state-of-the-art robust training methods, all of which are categorized into seven groups according to their methodological difference, followed by a systematic comparison of six properties used to evaluate their superiority. Subsequently, we summarize the typically used evaluation methodology, including public noisy datasets and evaluation metrics. Finally, we present several promising research directions that can serve as a guideline for future studies.

474 citations


Cites background from "A rate of convergence for mixture p..."

  • ..., anchor points [117], [153]; an example x with its label i is defined as an anchor point if p(y = i|x) = 1 and p(y = k|x) = 0 for k ≠ i....

    [...]
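The excerpt above defines anchor points. As a minimal sketch of how they are typically used (assuming class-conditional noise and a model fit on noisy data that outputs posteriors p(ỹ = j | x); the helper and the way anchors are picked are hypothetical, not taken from either paper):

```python
import numpy as np

def transition_from_anchors(noisy_posteriors, anchor_index):
    """Read off T[i][j] ~= p(y~ = j | y = i) at anchor points.

    noisy_posteriors : (n, K) array of p(y~ = j | x) from a model fit on
                       noisy labels.
    anchor_index     : dict mapping class i to the index of an example
                       treated as an anchor point for class i, i.e. one
                       with p(y = i | x) ~= 1 (e.g. the example with the
                       highest predicted probability for class i).

    Assumes the noise is independent of x given y, so that at an anchor
    point of class i, p(y~ = j | x) collapses to T[i][j].
    """
    k = noisy_posteriors.shape[1]
    T = np.zeros((k, k))
    for i in range(k):
        T[i] = noisy_posteriors[anchor_index[i]]
    return T
```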

Journal ArticleDOI
TL;DR: A survey of the current state of the art in PU learning proposes seven key research questions that commonly arise in this field and provides a broad overview of how the field has tried to address them.
Abstract: Learning from positive and unlabeled data or PU learning is the setting where a learner only has access to positive examples and unlabeled data. The assumption is that the unlabeled data can contain both positive and negative examples. This setting has attracted increasing interest within the machine learning literature as this type of data naturally arises in applications such as medical diagnosis and knowledge base completion. This article provides a survey of the current state of the art in PU learning. It proposes seven key research questions that commonly arise in this field and provides a broad overview of how the field has tried to address them.

291 citations


Cites background from "A rate of convergence for mixture p..."

  • ...2015b; Scott 2015)....

    [...]

  • ...Unfortunately, this is an ill-defined problem because it is not identifiable: the absence of a label can be explained by either a small prior probability for the positive class or a low label frequency [90]. In order for the class prior to be identifiable, additional assumptions are necessary. This section gives an overview on possible assumptions, listed from strongest to strictly weaker. 1. Separable Cl...

    [...]

  • ...Unfortunately, this is an ill-defined problem because it is not identifiable: the absence of a label can be explained by either a small prior probability for the positive class or a low label frequency (Scott 2015)....

    [...]

  • ...ra assumptions, infinite examples are required for convergence. The stricter positive subdomain assumption allows for practical algorithms. Scott [90] implements this idea by building a conditional probability classifier. The same idea is approached from a different angle by Jain et al. [42,40]. They use kernel density estimation to approximate the...

    [...]

  • ...et Instead of requiring no overlap between the distributions, it suffices to require a subset of the instance space defined by partial attribute assignment (called the anchor set), to be purely positive [2, 65, 83, 90]. The ratio of labeled examples in this subdomain is equal to the label frequency, while in other parts of the positive distribution, the ratio can be lower. 3. Positive function/separability This is ...

    [...]

Posted Content
TL;DR: This work combines the principles of pruning, counting, and ranking, building on the assumption of a classification noise process to directly estimate the joint distribution between noisy (given) labels and uncorrupted (unknown) labels, resulting in a generalized CL which is provably consistent and experimentally performant.
Abstract: Learning exists in the context of data, yet notions of confidence typically focus on model predictions, not label quality. Confident learning (CL) is an alternative approach which focuses instead on label quality by characterizing and identifying label errors in datasets, based on the principles of pruning noisy data, counting with probabilistic thresholds to estimate noise, and ranking examples to train with confidence. Whereas numerous studies have developed these principles independently, here, we combine them, building on the assumption of a classification noise process to directly estimate the joint distribution between noisy (given) labels and uncorrupted (unknown) labels. This results in a generalized CL which is provably consistent and experimentally performant. We present sufficient conditions where CL exactly finds label errors, and show CL performance exceeds seven state-of-the-art approaches for learning with noisy labels on the CIFAR dataset. The CL framework is not coupled to a specific data modality or model: we use CL to find errors in the presumed error-free MNIST dataset and improve sentiment classification on text data in Amazon Reviews. We also employ CL on ImageNet to quantify ontological class overlap (e.g. finding approximately 645 missile images are mislabeled as their parent class projectile), and moderately increase model accuracy (e.g. for ResNet) by cleaning data prior to training. These results are replicable using the open-source cleanlab release.
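As a simplified, hedged sketch of the "counting with probabilistic thresholds" step described above (this is not the cleanlab implementation; it assumes out-of-sample predicted probabilities and that every class occurs among the given labels):

```python
import numpy as np

def confident_joint(pred_probs, given_labels):
    """Count (given label, suspected true label) pairs with thresholds.

    pred_probs   : (n, K) out-of-sample predicted class probabilities.
    given_labels : (n,) observed, possibly noisy labels in {0, ..., K-1}.

    For each class j, the threshold t_j is the mean predicted probability
    of class j over examples whose *given* label is j.  An example with
    given label i is counted in cell (i, j) when class j is its
    highest-scoring class among those exceeding their thresholds.
    """
    n, k = pred_probs.shape
    thresholds = np.array([pred_probs[given_labels == j, j].mean()
                           for j in range(k)])
    counts = np.zeros((k, k), dtype=int)
    for x in range(n):
        above = np.where(pred_probs[x] >= thresholds)[0]
        if len(above) == 0:
            continue
        j = above[np.argmax(pred_probs[x, above])]
        counts[given_labels[x], j] += 1
    return counts
```

Off-diagonal cells of the count matrix correspond to examples whose given label disagrees with a confidently predicted class; pruning and ranking start from these counts.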

272 citations


Cites background from "A rate of convergence for mixture p..."

  • ...For example, Scott et al. (2013); Scott (2015) developed a theoretical and practical convergence criterion in the binary setting....

    [...]

Proceedings Article
06 Jul 2015
TL;DR: This paper uses class-probability estimation to study corruption processes belonging to the mutually contaminated distributions framework, with three conclusions: balanced error and AUC can be optimised without knowledge of the corruption parameters; given estimates of the corruption parameters, a range of classification risks can be minimised; and the corruption parameters themselves can be estimated via a class-probability estimator trained solely on corrupted data.
Abstract: Many supervised learning problems involve learning from samples whose labels are corrupted in some way. For example, each label may be flipped with some constant probability (learning with label noise), or one may have a pool of unlabelled samples in lieu of negative samples (learning from positive and unlabelled data). This paper uses class-probability estimation to study these and other corruption processes belonging to the mutually contaminated distributions framework (Scott et al., 2013), with three conclusions. First, one can optimise balanced error and AUC without knowledge of the corruption parameters. Second, given estimates of the corruption parameters, one can minimise a range of classification risks. Third, one can estimate corruption parameters via a class-probability estimator (e.g. kernel logistic regression) trained solely on corrupted data. Experiments on label noise tasks corroborate our analysis.
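A minimal sketch of the third conclusion, estimating corruption parameters from a class-probability estimator trained on corrupted data, under the extra assumption that the clean posterior comes arbitrarily close to 0 and 1 somewhere in the data (the helper is illustrative, not the paper's estimator):

```python
import numpy as np

def estimate_noise_rates(eta_corrupted):
    """Range-based noise-rate estimates from corrupted class probabilities.

    Under class-conditional label noise,
        P(Y~ = +1 | x) = rho_neg + (1 - rho_pos - rho_neg) * P(Y = +1 | x),
    so if P(Y = +1 | x) attains values near 0 and 1 on the sample (an
    assumption), the corrupted posterior's minimum and maximum recover
    rho_neg and 1 - rho_pos respectively.
    """
    eta = np.asarray(eta_corrupted)
    rho_neg_hat = float(eta.min())        # estimate of P(Y~ = +1 | Y = -1)
    rho_pos_hat = float(1.0 - eta.max())  # estimate of P(Y~ = -1 | Y = +1)
    return rho_pos_hat, rho_neg_hat
```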

213 citations


Cites background from "A rate of convergence for mixture p..."

  • ...Sanderson & Scott (2014); Scott (2015) explored a practical estimator along these lines....

    [...]

References
Proceedings ArticleDOI
24 Jul 1998
TL;DR: A PAC-style analysis is provided for a problem setting motivated by the task of learning to classify web pages, in which the description of each example can be partitioned into two distinct views, to allow inexpensive unlabeled data to augment a much smaller set of labeled examples.
Abstract: We consider the problem of using a large unlabeled sample to boost performance of a learning algorithm when only a small set of labeled examples is available. In particular, we consider a problem setting motivated by the task of learning to classify web pages, in which the description of each example can be partitioned into two distinct views. For example, the description of a web page can be partitioned into the words occurring on that page, and the words occurring in hyperlinks that point to that page. We assume that either view of the example would be sufficient for learning if we had enough labeled data, but our goal is to use both views together to allow inexpensive unlabeled data to augment a much smaller set of labeled examples. Specifically, the presence of two distinct views of each example suggests strategies in which two learning algorithms are trained separately on each view, and then each algorithm's predictions on new unlabeled examples are used to enlarge the training set of the other. Our goal in this paper is to provide a PAC-style analysis for this setting, and, more broadly, a PAC-style framework for the general problem of learning from both labeled and unlabeled data. We also provide empirical results on real web-page data indicating that this use of unlabeled examples can lead to significant improvement of hypotheses in practice.
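A hedged sketch of the two-view strategy this abstract describes is below. It is simplified (each round adds the most confident unlabeled examples regardless of class, rather than fixed numbers of positives and negatives) and assumes scikit-learn-style classifiers with fit/predict_proba:

```python
import numpy as np

def co_train(clf1, clf2, X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab,
             rounds=10, per_round=5):
    """Minimal co-training loop: in each round, each view's classifier
    labels its most confident unlabeled examples, and those examples
    (with their predicted labels) are added to the shared training set.

    X1_*/X2_* hold the two feature views of the same examples.
    """
    X1, X2, y = list(X1_lab), list(X2_lab), list(y_lab)
    pool = list(range(len(X1_unlab)))              # still-unlabeled indices
    for _ in range(rounds):
        if not pool:
            break
        clf1.fit(np.asarray(X1), np.asarray(y))
        clf2.fit(np.asarray(X2), np.asarray(y))
        for clf, Xu in ((clf1, X1_unlab), (clf2, X2_unlab)):
            if not pool:
                break
            probs = clf.predict_proba(np.asarray([Xu[i] for i in pool]))
            picked = np.argsort(probs.max(axis=1))[-per_round:]
            for p in sorted(picked, reverse=True):  # pop from the back first
                i = pool.pop(p)
                X1.append(X1_unlab[i])
                X2.append(X2_unlab[i])
                y.append(clf.classes_[probs[p].argmax()])
    return clf1, clf2
```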

5,840 citations


"A rate of convergence for mixture p..." refers background or methods in this paper

  • ...Finally, we provide a practical implementation of mixture proportion estimation and demonstrate its efficacy in classification with noisy labels....

    [...]

  • ...relevant WSL problem is binary classification with label noise, when the label noise is assumed to be independent of the observed feature vector (Blum and Mitchell, 1998; Lawrence and Schölkopf, 2001; Bouveyron and Girard, 2009; Stempfel and Ralaivola, 2009; Long and Servedio, 2010; Manwani and…

    [...]

  • ...These include crowdsourcing (Raykar et al., 2010), multiple instance learning (Blum and Kalai, 1998), co-training (Blum and Mitchell, 1998), and learning from partial labels (Cour et al., 2011)....

    [...]

Book
12 Aug 2008
TL;DR: This book explains the principles that make support vector machines (SVMs) a successful modelling and prediction tool for a variety of applications and provides a unique in-depth treatment of both fundamental and recent material on SVMs that so far has been scattered in the literature.
Abstract: This book explains the principles that make support vector machines (SVMs) a successful modelling and prediction tool for a variety of applications. The authors present the basic ideas of SVMs together with the latest developments and current research questions in a unified style. They identify three reasons for the success of SVMs: their ability to learn well with only a very small number of free parameters, their robustness against several types of model violations and outliers, and their computational efficiency compared to several other methods. Since their appearance in the early nineties, support vector machines and related kernel-based methods have been successfully applied in diverse fields of application such as bioinformatics, fraud detection, construction of insurance tariffs, direct marketing, and data and text mining. As a consequence, SVMs now play an important role in statistical machine learning and are used not only by statisticians, mathematicians, and computer scientists, but also by engineers and data analysts. The book provides a unique in-depth treatment of both fundamental and recent material on SVMs that so far has been scattered in the literature. The book can thus serve as both a basis for graduate courses and an introduction for statisticians, mathematicians, and computer scientists. It further provides a valuable reference for researchers working in the field. The book covers all important topics concerning support vector machines such as: loss functions and their role in the learning process; reproducing kernel Hilbert spaces and their properties; a thorough statistical analysis that uses both traditional uniform bounds and more advanced localized techniques based on Rademacher averages and Talagrand's inequality; a detailed treatment of classification and regression; a detailed robustness analysis; and a description of some of the most recent implementation techniques. To make the book self-contained, an extensive appendix is added which provides the reader with the necessary background from statistics, probability theory, functional analysis, convex analysis, and topology.

4,664 citations


"A rate of convergence for mixture p..." refers background in this paper

  • ...Generalizing the above, for any $\alpha \in (0,1)$ we can define the $\alpha$-cost-sensitive $P$-risk for any $f \in \mathcal{M}$, $R_{P,\alpha}(f) := \mathbb{E}_{(X,Y)\sim P}\big[(1-\alpha)\,\mathbf{1}_{\{Y=1\}}\mathbf{1}_{\{f(X)\le 0\}} + \alpha\,\mathbf{1}_{\{Y=0\}}\mathbf{1}_{\{f(X)>0\}}\big]$....

    [...]

  • ...Let $\epsilon > 0$, and let $f_\epsilon \in H$ be such that $R_{\tilde P, L_\alpha}(f_\epsilon) \le R^*_{\tilde P, L_\alpha} + \epsilon/2$, which is possible since the reproducing kernel associated with $H$ is universal (Steinwart and Christmann, 2008)....

    [...]

  • ...If (B) holds, then for any $f \in \mathcal{M}$, $R_P(f) - R^*_P = 2(1-\pi_1-\pi_0)\big(R_{\tilde P,\alpha}(f) - R^*_{\tilde P,\alpha}\big)$ (8), where $\alpha = (\tfrac{1}{2}-\pi_0)/(1-\pi_1-\pi_0)$....

    [...]

  • ...We will assume that the reproducing kernel k associated with H is universal and bounded (Steinwart and Christmann, 2008)....

    [...]

Book
01 Jan 1996
TL;DR: The Bayes error and Vapnik-Chervonenkis theory are applied as guides for empirical classifier selection, on the basis of explicit specification and explicit enforcement of the maximum likelihood principle.
Abstract: Preface * Introduction * The Bayes Error * Inequalities and alternate distance measures * Linear discrimination * Nearest neighbor rules * Consistency * Slow rates of convergence * Error estimation * The regular histogram rule * Kernel rules * Consistency of the k-nearest neighbor rule * Vapnik-Chervonenkis theory * Combinatorial aspects of Vapnik-Chervonenkis theory * Lower bounds for empirical classifier selection * The maximum likelihood principle * Parametric classification * Generalized linear discrimination * Complexity regularization * Condensed and edited nearest neighbor rules * Tree classifiers * Data-dependent partitioning * Splitting the data * The resubstitution estimate * Deleted estimates of the error probability * Automatic kernel rules * Automatic nearest neighbor rules * Hypercubes and discrete spaces * Epsilon entropy and totally bounded sets * Uniform laws of large numbers * Neural networks * Other error estimates * Feature extraction * Appendix * Notation * References * Index

3,598 citations


"A rate of convergence for mixture p..." refers methods in this paper

  • ...The estimator κ̂ of Blanchard et al. (2010) relies on VC theory (Devroye et al., 1996)....

    [...]

  • ...Finally, we provide a practical implementation of mixture proportion estimation and demonstrate its efficacy in classification with noisy labels....

    [...]

  • ...In particular, if $F = (1-\kappa)G + \kappa H$ holds, then any alternate decomposition of the form $F = (1-\kappa+\delta)G' + (\kappa-\delta)H$, with $G' = (1-\kappa+\delta)^{-1}\big((1-\kappa)G + \delta H\big)$, and $\delta \in [0, \kappa)$, is also valid....

    [...]

  • ...It is well known (Devroye et al., 1996) that for any $f \in \mathcal{M}$, the excess $P$-risk satisfies $R_P(f) - R^*_P = 2\,\mathbb{E}_X\big[\mathbf{1}_{\{u(f(X)) \ne u(\eta(X)-\frac{1}{2})\}}\,|\eta(X)-\tfrac{1}{2}|\big]$ (6), where $\eta(x) := P(Y = 1$...

    [...]

Book
27 Mar 1992
TL;DR: In this article, the authors provide a systematic account of the subject area, concentrating on the most recent advances in the field and discuss theoretical and practical issues in statistical image analysis, including regularized discriminant analysis and bootstrap-based assessment of the performance of a sample-based discriminant rule.
Abstract: Provides a systematic account of the subject area, concentrating on the most recent advances in the field. While the focus is on practical considerations, both theoretical and practical issues are explored. Among the advances covered are: regularized discriminant analysis and bootstrap-based assessment of the performance of a sample-based discriminant rule and extensions of discriminant analysis motivated by problems in statistical image analysis. Includes over 1,200 references in the bibliography.

2,999 citations


"A rate of convergence for mixture p..." refers methods in this paper

  • ...Finally, we provide a practical implementation of mixture proportion estimation and demonstrate its efficacy in classification with noisy labels....

    [...]

  • ...Finally, we remark that MPE had been studied prior to Blanchard et al. (2010), but under parametric modeling assumptions (McLachlan, 1992; Bouveyron and Girard, 2009)....

    [...]

Book
17 Aug 2012
TL;DR: This graduate-level textbook introduces fundamental concepts and methods in machine learning, and provides the theoretical underpinnings of these algorithms, and illustrates key aspects for their application.
Abstract: This graduate-level textbook introduces fundamental concepts and methods in machine learning. It describes several important modern algorithms, provides the theoretical underpinnings of these algorithms, and illustrates key aspects for their application. The authors aim to present novel theoretical tools and concepts while giving concise proofs even for relatively advanced topics. Foundations of Machine Learning fills the need for a general textbook that also offers theoretical details and an emphasis on proofs. Certain topics that are often treated with insufficient attention are discussed in more detail here; for example, entire chapters are devoted to regression, multi-class classification, and ranking. The first three chapters lay the theoretical foundation for what follows, but each remaining chapter is mostly self-contained. The appendix offers a concise probability review, a short introduction to convex optimization, tools for concentration bounds, and several basic properties of matrices and norms used in the book. The book is intended for graduate students and researchers in machine learning, statistics, and related areas; it can be used either as a textbook or as a reference text for a research seminar.

2,511 citations


"A rate of convergence for mixture p..." refers methods in this paper

  • ...$= x) = \Pr(Y = 1 \mid \tilde{Y} = 1, X = x)\,\tilde{\eta}(x) + \Pr(Y = 1 \mid \tilde{Y} = 0, X = x)\,(1 - \tilde{\eta}(x)) = (1-\pi_1)\tilde{\eta}(x) + \pi_0(1-\tilde{\eta}(x)) = (1-\pi_0-\pi_1)\tilde{\eta}(x) + \pi_0$....

    [...]

  • ...The first and last terms can be bounded, with probability at least $1 - 1/n$, by $\frac{2DBM_n}{\sqrt{n}} + 2BM_n\sqrt{\frac{\ln 2n}{2n}}$ using Rademacher complexity analysis for balls in a RKHS (Mohri et al., 2012)....

    [...]