# A rate of convergence for mixture proportion estimation, with application to learning from noisy labels

##### Citations

744 citations

### Cites methods from "A rate of convergence for mixture p..."

...In Section 4, we discuss how to perform classification in the presence of RCN and benefit from the abundant surrogate loss functions and algorithms designed for the traditional classification problem....

[...]

474 citations

### Cites background from "A rate of convergence for mixture p..."

..., anchor points [117], [153]; an example x with its label i is defined as an anchor point if p(y = i|x) = 1 and p(y = k|x) = 0 for k ≠ i....

[...]

291 citations

### Cites background from "A rate of convergence for mixture p..."

...2015b; Scott 2015)....

[...]

...Unfortunately, this is an ill-defined problem because it is not identifiable: the absence of a label can be explained by either a small prior probability for the positive class or a low label frequency [90]. In order for the class prior to be identifiable, additional assumptions are necessary. This section gives an overview of possible assumptions, listed from strongest to strictly weaker. 1. Separable Cl...

[...]

...Unfortunately, this is an ill-defined problem because it is not identifiable: the absence of a label can be explained by either a small prior probability for the positive class or a low label frequency (Scott 2015)....

[...]

...ra assumptions, infinite examples are required for convergence. The stricter positive subdomain assumption allows for practical algorithms. Scott [90] implements this idea by building a conditional probability classifier. The same idea is approached from a different angle by Jain et al. [42, 40]. They use k-kernel density estimation to approximate the...

[...]

...et Instead of requiring no overlap between the distributions, it suffices to require a subset of the instance space defined by partial attribute assignment (called the anchor set) to be purely positive [2, 65, 83, 90]. The ratio of labeled examples in this subdomain is equal to the label frequency, while in other parts of the positive distribution, the ratio can be lower. 3. Positive function/separability This is ...

[...]

272 citations

### Cites background from "A rate of convergence for mixture p..."

...For example, Scott et al. (2013); Scott (2015) developed a theoretical and practical convergence criterion in the binary setting....

[...]

213 citations

### Cites background from "A rate of convergence for mixture p..."

...Sanderson & Scott (2014); Scott (2015) explored a practical estimator along these lines....

[...]

##### References

5,840 citations

### "A rate of convergence for mixture p..." refers background or methods in this paper

...Finally, we provide a practical implementation of mixture proportion estimation and demonstrate its efficacy in classification with noisy labels....

[...]

...relevant WSL problem is binary classification with label noise, when the label noise is assumed to be independent of the observed feature vector (Blum and Mitchell, 1998; Lawrence and Schölkopf, 2001; Bouveyron and Girard, 2009; Stempfel and Ralaivola, 2009; Long and Servedio, 2010; Manwani and…...

[...]

...These include crowdsourcing (Raykar et al., 2010), multiple instance learning (Blum and Kalai, 1998), co-training (Blum and Mitchell, 1998), and learning from partial labels (Cour et al., 2011)....

[...]

4,664 citations

### "A rate of convergence for mixture p..." refers background in this paper

...Generalizing the above, for any $\alpha \in (0, 1)$ we can define the $\alpha$-cost-sensitive $P$-risk for any $f \in M$, $R_{P,\alpha}(f) := \mathbb{E}_{(X,Y)\sim P}\big[(1-\alpha)\,1_{\{Y=1\}}\,1_{\{f(X)\le 0\}} + \alpha\,1_{\{Y=0\}}\,1_{\{f(X)>0\}}\big]$....
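
The definition quoted above can be illustrated with its empirical counterpart. The following is a minimal sketch, assuming a finite sample of scores $f(x_i)$ and labels $y_i \in \{0, 1\}$; the function name `cost_sensitive_risk` and the toy data are hypothetical and do not come from the cited paper.

```python
def cost_sensitive_risk(scores, labels, alpha):
    """Empirical alpha-cost-sensitive risk of a real-valued decision function.

    scores: values f(x_i); a point is predicted positive when f(x_i) > 0.
    labels: true labels y_i in {0, 1}.
    alpha:  weight in (0, 1); a false negative costs (1 - alpha),
            a false positive costs alpha.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if len(scores) != len(labels) or len(scores) == 0:
        raise ValueError("scores and labels must be non-empty and equally long")

    total = 0.0
    for score, label in zip(scores, labels):
        if label == 1 and score <= 0:
            total += 1.0 - alpha   # positive example predicted negative
        elif label == 0 and score > 0:
            total += alpha         # negative example predicted positive
    return total / len(scores)


# Toy sample (assumed for illustration): two false negatives, one false positive.
scores = [0.8, -0.3, -0.1, 0.4]
labels = [1, 1, 1, 0]
risk_half = cost_sensitive_risk(scores, labels, 0.5)
risk_quarter = cost_sensitive_risk(scores, labels, 0.25)
```

With $\alpha = 1/2$ both error types cost the same, so the value is half the ordinary misclassification rate; lowering $\alpha$ makes false negatives more expensive relative to false positives.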

[...]

...Let $\epsilon > 0$, and let $f_\epsilon \in H$ be such that $R_{\tilde P, L_\alpha}(f_\epsilon) \le R^*_{\tilde P, L_\alpha} + \epsilon/2$, which is possible since the reproducing kernel associated with $H$ is universal (Steinwart and Christmann, 2008)....

[...]

...If (B) holds, then for any $f \in M$, $R_P(f) - R_P^* = 2(1 - \pi_1 - \pi_0)\,\big(R_{\tilde P,\alpha}(f) - R^*_{\tilde P,\alpha}\big)$ (8) where $\alpha = (\tfrac{1}{2} - \pi_0)/(1 - \pi_1 - \pi_0)$....

[...]

...We will assume that the reproducing kernel k associated with H is universal and bounded (Steinwart and Christmann, 2008)....

[...]

3,598 citations

### "A rate of convergence for mixture p..." refers methods in this paper

...The estimator κ̂ of Blanchard et al. (2010) relies on VC theory (Devroye et al., 1996)....

[...]

...Finally, we provide a practical implementation of mixture proportion estimation and demonstrate its efficacy in classification with noisy labels....

[...]

...In particular, if $F = (1-\kappa)G + \kappa H$ holds, then any alternate decomposition of the form $F = (1-\kappa+\delta)G' + (\kappa-\delta)H$, with $G' = (1-\kappa+\delta)^{-1}\big((1-\kappa)G + \delta H\big)$ and $\delta \in [0, \kappa)$, is also valid....
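
The non-identifiability statement above can be checked numerically. The sketch below assumes, purely for illustration, that $G$ and $H$ are discrete distributions on three points; the helper names `mix` and `alternate_g` and the chosen values of $\kappa$ and $\delta$ are hypothetical.

```python
def mix(weight_h, g, h):
    """Return the mixture (1 - weight_h) * g + weight_h * h, pointwise."""
    return [(1.0 - weight_h) * gi + weight_h * hi for gi, hi in zip(g, h)]


def alternate_g(kappa, delta, g, h):
    """Build G' = ((1 - kappa) * G + delta * H) / (1 - kappa + delta)."""
    if not 0.0 <= delta < kappa:
        raise ValueError("delta must satisfy 0 <= delta < kappa")
    scale = 1.0 - kappa + delta
    return [((1.0 - kappa) * gi + delta * hi) / scale for gi, hi in zip(g, h)]


# Assumed toy distributions over three outcomes.
G = [0.5, 0.3, 0.2]
H = [0.1, 0.2, 0.7]
kappa, delta = 0.4, 0.1

F = mix(kappa, G, H)                      # F = (1 - kappa) G + kappa H
G_alt = alternate_g(kappa, delta, G, H)   # alternate component G'
F_alt = mix(kappa - delta, G_alt, H)      # (1 - kappa + delta) G' + (kappa - delta) H

# Largest pointwise difference between the two decompositions of F.
max_gap = max(abs(a - b) for a, b in zip(F, F_alt))
```

Both decompositions reproduce the same $F$ up to floating-point error, and `G_alt` still sums to one, so the mixture proportion cannot be recovered from $F$ and $H$ alone without further assumptions on $G$.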

[...]

...It is well known (Devroye et al., 1996) that for any $f \in M$, the excess $P$-risk satisfies $R_P(f) - R_P^* = 2\,\mathbb{E}_X\big[1_{\{u(f(X)) \ne u(\eta(X) - \frac{1}{2})\}}\,|\eta(X) - \tfrac{1}{2}|\big]$, (6) where $\eta(x) := P(Y = 1$...

[...]

2,999 citations

### "A rate of convergence for mixture p..." refers methods in this paper

...Finally, we provide a practical implementation of mixture proportion estimation and demonstrate its efficacy in classification with noisy labels....

[...]

...Finally, we remark that MPE had been studied prior to Blanchard et al. (2010), but under parametric modeling assumptions (McLachlan, 1992; Bouveyron and Girard, 2009)....

[...]

2,511 citations

### "A rate of convergence for mixture p..." refers methods in this paper

...= x) $= \Pr(Y = 1 \mid \tilde Y = 1, X = x)\,\tilde\eta(x) + \Pr(Y = 1 \mid \tilde Y = 0, X = x)\,(1 - \tilde\eta(x)) = (1 - \pi_1)\,\tilde\eta(x) + \pi_0\,(1 - \tilde\eta(x)) = (1 - \pi_0 - \pi_1)\,\tilde\eta(x) + \pi_0$....
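
The chain of equalities above maps the noisy posterior $\tilde\eta(x)$ to the clean posterior through an affine function. The sketch below evaluates both forms and also derives the threshold on $\tilde\eta$ at which the clean posterior equals $1/2$, which coincides with the expression for $\alpha$ in equation (8) quoted earlier on this page. It assumes $1 - \pi_0 - \pi_1 > 0$; the function names and numeric values are hypothetical.

```python
def clean_posterior(eta_tilde, pi0, pi1):
    """Clean posterior as an affine function of the noisy posterior eta_tilde."""
    return (1.0 - pi0 - pi1) * eta_tilde + pi0


def noisy_threshold(pi0, pi1):
    """Value of eta_tilde at which the clean posterior equals 1/2.

    Solving (1 - pi0 - pi1) * t + pi0 = 1/2 for t gives
    t = (1/2 - pi0) / (1 - pi0 - pi1), assuming 1 - pi0 - pi1 > 0.
    """
    denom = 1.0 - pi0 - pi1
    if denom <= 0.0:
        raise ValueError("requires pi0 + pi1 < 1")
    return (0.5 - pi0) / denom


# Assumed illustrative values.
pi0, pi1 = 0.1, 0.2
eta_tilde = 0.6

# Two-term form from the middle of the derivation.
two_term = (1.0 - pi1) * eta_tilde + pi0 * (1.0 - eta_tilde)
# Collapsed affine form from the end of the derivation.
one_term = clean_posterior(eta_tilde, pi0, pi1)

threshold = noisy_threshold(pi0, pi1)
posterior_at_threshold = clean_posterior(threshold, pi0, pi1)
```

Because the map is increasing when $\pi_0 + \pi_1 < 1$, thresholding the noisy posterior at `threshold` gives the same decisions as thresholding the clean posterior at $1/2$.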

[...]

...The first and last terms can be bounded, with probability at least $1 - 1/n$, by $\frac{2 D B M_n}{\sqrt{n}} + 2 B M_n \sqrt{\frac{\ln 2n}{2n}}$ using Rademacher complexity analysis for balls in a RKHS (Mohri et al., 2012)....

[...]