Learning from massive noisy labeled data for image classification

TLDR
A general framework to train CNNs with only a limited number of clean labels and millions of easily obtained noisy labels is introduced; the relationships between images, class labels, and label noise are modeled with a probabilistic graphical model, which is further integrated into an end-to-end deep learning system.


Learning from Massive Noisy Labeled Data for Image Classification
Tong Xiao¹, Tian Xia², Yi Yang², Chang Huang², and Xiaogang Wang¹
¹The Chinese University of Hong Kong    ²Baidu Research
Abstract
Large-scale supervised datasets are crucial to train convolutional neural networks (CNNs) for various computer vision problems. However, obtaining a massive amount of well-labeled data is usually very expensive and time consuming. In this paper, we introduce a general framework to train CNNs with only a limited number of clean labels and millions of easily obtained noisy labels. We model the relationships between images, class labels and label noises with a probabilistic graphical model and further integrate it into an end-to-end deep learning system. To demonstrate the effectiveness of our approach, we collect a large-scale real-world clothing classification dataset with both noisy and clean labels. Experiments on this dataset indicate that our approach can better correct the noisy labels and improves the performance of trained CNNs.
1. Introduction
Deep learning with large-scale supervised training datasets has recently shown very impressive improvement on multiple image recognition challenges, including image classification [12], attribute learning [29], and scene classification [8]. While state-of-the-art results have been continuously reported [23, 25, 28], all these methods require reliable annotations from millions of images [6], which are often expensive and time-consuming to obtain, preventing deep models from being quickly trained on new image recognition problems. It is thus necessary to develop new efficient labeling and training frameworks for deep learning.
One possible solution is to automatically collect a large amount of annotations from Internet web images [10] (i.e., extracting tags from the surrounding texts or keywords from search engines) and directly use them as ground truth to train deep models. Unfortunately, these labels are extremely unreliable due to various types of noise (e.g., labeling mistakes from annotators or computing errors from extraction algorithms). Many studies have shown that such noisy labels can adversely impact the classification accuracy of the induced classifiers [20, 22, 31]. Various label noise-robust algorithms have been developed, but experiments show that the performance of classifiers inferred by robust algorithms is still affected by label noise [3, 26]. Other data cleansing algorithms have been proposed [2, 5, 17], but these approaches have difficulty distinguishing informative hard examples from harmful mislabeled ones.

Figure 1. Overview of our approach. Labels of web images often suffer from different types of noise. A label noise model is proposed to detect and correct the wrong labels, and the corrected labels are used to train the underlying CNNs.
Although annotating all the data is costly, it is often easy to obtain a small amount of clean labels. Based on the observed transferability of deep neural networks, a common practice is to initialize parameters with a model pretrained on a larger yet related dataset [12] and then finetune it on the smaller dataset of the specific task [1, 7, 21]. Such methods may better avoid overfitting and exploit the relationships between the two datasets. However, we find that training a CNN from scratch with limited clean labels and massive noisy labels is better than finetuning it only on clean labels. Other approaches treat the problem as semi-supervised learning, where the noisy labels are discarded [30]. These algorithms usually suffer from high model complexity and thus cannot be applied to large-scale datasets. It is therefore necessary to develop a better way of using the huge amount of noisy labeled data.
Our goal is to build an end-to-end deep learning system that is capable of training with both limited clean labels and massive noisy labels more effectively. Figure 1 shows the framework of our approach. We collect 1,000,000 clothing images from online shopping websites. Each image is automatically assigned a noisy label according to the keywords in its surrounding text. We manually refine 72,409 image labels, which constitute a clean sub-dataset. All the data are then used to train CNNs, while the major challenge is to identify and correct wrong labels during the training process.
To cope with this challenge, we extend CNNs with a novel probabilistic model, which infers the true labels and uses them to supervise the training of the network. Our work is inspired by [24], which modifies a CNN by inserting a linear layer on top of the softmax layer to map clean labels to noisy labels. However, [24] assumes that noisy labels are conditionally independent of input images given clean labels. When examining our collected dataset, we find that this assumption is too strong to fit the real-world data well. For example, in Figure 2, all the images should belong to "Hoodie". The top five are correct, while the bottom five are mislabeled as either "Windbreaker" or "Jacket". Since different sellers have their own biases on different categories, they may provide wrong keywords for similar clothes. We observe these visual patterns and hypothesize that they are important for estimating how likely an image is to be mislabeled. Based on these observations, we further introduce two types of label noise:

Confusing noise makes the noisy label reasonably wrong. It usually occurs when the image content is confusing (e.g., the samples marked with "?" in Figure 1).

Pure random noise makes the noisy label totally wrong. It is often caused by either a mismatch between an image and its surrounding text, or a false conversion from the text to the label (e.g., the samples marked with "×" in Figure 1).
Our proposed probabilistic model captures the relations among images, noisy labels, ground truth labels, and noise types, where the latter two are treated as latent variables. We use the Expectation-Maximization (EM) algorithm to solve the problem and integrate it into the training process of CNNs. Experiments on our real-world clothing dataset indicate that our model can better detect and correct the noisy labels.

Figure 2. Mislabeled images often share similar visual patterns.
Our contributions fall into three aspects. First, we study the cause of noisy labels in real-world data and describe it with a novel probabilistic model. Second, we integrate the model into a deep learning framework and explore different training strategies to make the CNNs learn from better supervision. Finally, we collect a large-scale clothing dataset with both noisy and clean labels, which will be released for academic use.
2. Related Work
Learning with noisy labeled training data has been extensively studied in the machine learning and computer vision literature. For most of the related work, including the effect of label noise, the taxonomy of label noise, robust algorithms, and noise cleaning algorithms for learning with noisy data, we refer to [9] for a comprehensive review.
Direct learning with noisy labels: Many studies have shown that label noise can adversely impact the classification accuracy of induced classifiers [31]. To better handle label noise, some approaches rely on training classifiers with label noise-robust algorithms [4, 15]. However, Bartlett et al. [3] prove that most loss functions are not completely robust to label noise, and experiments in [26] show that the classifiers inferred by label noise-robust algorithms are still affected by label noise. These methods seem to be adequate only when label noise can be safely managed by overfitting avoidance [9]. On the other hand, some label noise cleansing methods have been proposed to remove or correct mislabeled instances [2, 5, 17], but these approaches have difficulty distinguishing informative hard examples from harmful mislabeled ones. Thus they might remove too many instances, and such over-cleansing can reduce the performance of classifiers [16].
Semi-supervised learning: Apart from direct learning with label noise, some semi-supervised learning algorithms have been developed to utilize weakly labeled or even unlabeled data. The Label Propagation method [30] explicitly used the ground truth of well-labeled data to classify unlabeled samples. However, it suffered from computing pairwise distances, which has quadratic complexity in the number of samples and thus cannot be applied to large-scale datasets. Weston et al. [27] proposed to embed a pairwise loss in the middle layers of a deep neural network, which benefits the learning of discriminative features. But they needed extra information about whether a pair of unlabeled images belongs to the same class, which cannot be obtained in our problem.
Transfer learning: The success of CNNs lies in their capability of learning rich and hierarchical image features. However, the model parameters cannot be properly learned when training data is insufficient. Researchers have proposed to conquer this problem by first initializing the CNN parameters with a model pretrained on a larger yet related dataset and then finetuning it on the smaller dataset of the specific task [1, 7, 12, 21]. Nevertheless, this transfer learning scheme can be suboptimal when the two tasks are only loosely related. In our case of clothing classification, we find that training a CNN from scratch with limited clean labels and massive noisy labels is better than finetuning it only on the clean labels.
Noise modeling with deep learning: Various methods have been proposed to handle label noise in different problem settings, but there are very few works on deep learning from noisy labels [13, 18, 24]. Mnih and Hinton [18] built a simple noise model for aerial images but only considered binary classification. Larsen et al. [13] assumed that label noise is independent of the true class label, which is a simple and special case. Sukhbaatar et al. [24] generalized these works by considering multi-class classification and modeling class-dependent noise, but they assumed the noise to be conditionally independent of the image content, ignoring the varying difficulty of labeling images of different confusing levels. Our work can be viewed as a generalization of [19, 24], and our model is flexible enough to handle not only class-dependent but also image-dependent noise.
3. Label Noise Model
We aim to learn a classifier from a set of images with noisy labels. To be specific, we have a noisy labeled dataset D_η = {(x^(1), ỹ^(1)), …, (x^(N), ỹ^(N))}, where x^(n) is the n-th image and ỹ^(n) ∈ {1, …, L} is its corresponding noisy label, with L the number of classes. We describe how the noisy label is generated by using the probabilistic graphical model shown in Figure 3.
Besides the observed image x and the noisy label ỹ, we exploit two discrete latent variables, y and z, to represent the true label and the label noise type, respectively. Both ỹ and y are L-dimensional binary random variables in a 1-of-L fashion, i.e., only one element is equal to 1 while all the others are 0.

Figure 3. Probabilistic graphical model of label noise.

Figure 4. Predicting noise types of four different "T-shirt" images. The top two can be recognized with little ambiguity, while the bottom two are easily confused with the class "Chiffon". Image content can affect the probability of it being mislabeled.

The label noise type z is a 1-of-3 binary random variable. It is associated with three semantic meanings:
1. The label is noise free, i.e., ỹ should be equal to y.
2. The label suffers from a pure random noise, i.e., ỹ can take any possible value other than y.
3. The label suffers from a confusing noise, i.e., ỹ can take several values that are confusing with y.
Following this assignment rule, we define the conditional probability of the noisy label as

$$
p(\tilde{y} \mid y, z) =
\begin{cases}
\tilde{y}^\top I\, y & \text{if } z_1 = 1,\\
\frac{1}{L-1}\, \tilde{y}^\top (U - I)\, y & \text{if } z_2 = 1,\\
\tilde{y}^\top C\, y & \text{if } z_3 = 1,
\end{cases}
\tag{1}
$$

where I is the identity matrix, U is the unit matrix (all of whose elements are ones), and C is a sparse stochastic matrix with tr(C) = 0, whose entry C_ij denotes the confusion probability between classes i and j. From Figure 3 we can then derive the joint distribution of ỹ, y, and z conditioned on x,

$$
p(\tilde{y}, y, z \mid x) = p(\tilde{y} \mid y, z)\, p(y \mid x)\, p(z \mid x).
\tag{2}
$$
While the class label probability distribution p(y|x) is comprehensible, the semantic meaning of p(z|x) needs extra clarification: it represents how confusing the image content is. Specific to our clothing classification problem, p(z|x) can be affected by different factors, including background clutter, image resolution, and the style and material of the clothes. Some examples are shown in Figure 4.

To illustrate the relations between noisy and true labels, we derive their conditional probability from Eq. (2),

$$
p(\tilde{y} \mid y, x) = \sum_{z} p(\tilde{y}, z \mid y, x) = \sum_{z} p(\tilde{y} \mid y, z)\, p(z \mid x),
\tag{3}
$$

which can be interpreted as a mixture model. Given an input image x, the conditional probability p(z|x) can be seen as the prior of each mixture component. This marks a key difference between our work and [24], which assumes that ỹ is conditionally independent of x given y: all images share the same noise model in [24], while in our approach each data sample has its own.
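As an illustration of the mixture in Eq. (3), the sketch below combines the three components with a per-image prior p(z | x) to obtain the image-specific noise distribution p(ỹ | y, x). The priors are hypothetical toy values that simply echo the clean and confusing example images of Figure 4; in the full system p(z | x) is predicted by a CNN.

```python
import numpy as np

def noisy_label_distribution(p_z_given_x, C):
    """Mixture of Eq. (3): p(y_tilde | y, x) = sum_z p(y_tilde | y, z) p(z | x).

    p_z_given_x : shape (3,), probabilities of [noise free, pure random, confusing].
    C           : shape (L, L), confusing-noise matrix with zero diagonal,
                  columns indexed by the true label.
    Returns an (L, L) matrix whose (i, j) entry is p(y_tilde = i | y = j, x).
    """
    L = C.shape[0]
    I = np.eye(L)
    R = (np.ones((L, L)) - I) / (L - 1)  # pure random noise component
    return p_z_given_x[0] * I + p_z_given_x[1] * R + p_z_given_x[2] * C

# Hypothetical priors p(z | x): a visually clean image vs. a confusing one.
C = np.array([[0.0, 0.7, 0.4],
              [0.9, 0.0, 0.6],
              [0.1, 0.3, 0.0]])
print(noisy_label_distribution(np.array([0.91, 0.02, 0.07]), C))
print(noisy_label_distribution(np.array([0.24, 0.18, 0.58]), C))
```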
3.1. Learning the Parameters
We exploit two CNNs to model p(y|x) and p(z|x) separately, and denote their parameter sets by θ_1 and θ_2. Our goal is to find the optimal θ = θ_1 ∪ θ_2 that maximizes the incomplete log-likelihood log p(ỹ|x; θ). The EM algorithm is used to solve this problem iteratively.

For any probability distribution q(y, z|ỹ, x), we can derive a lower bound of the incomplete log-likelihood,

$$
\log p(\tilde{y} \mid x; \theta) = \log \sum_{y,z} p(\tilde{y}, y, z \mid x; \theta)
\;\ge\; \sum_{y,z} q(y, z \mid \tilde{y}, x) \log \frac{p(\tilde{y}, y, z \mid x; \theta)}{q(y, z \mid \tilde{y}, x)}.
\tag{4}
$$
E-Step. The difference between log p(ỹ|x; θ) and its lower bound is the Kullback-Leibler divergence KL(q(y, z|ỹ, x) || p(y, z|ỹ, x; θ)), which is equal to zero if and only if q(y, z|ỹ, x) = p(y, z|ỹ, x; θ). Therefore, in each iteration t, we first compute the posterior of the latent variables using the current parameters θ^(t),

$$
p(y, z \mid \tilde{y}, x; \theta^{(t)})
= \frac{p(\tilde{y}, y, z \mid x; \theta^{(t)})}{p(\tilde{y} \mid x; \theta^{(t)})}
= \frac{p(\tilde{y} \mid y, z; \theta^{(t)})\, p(y \mid x; \theta^{(t)})\, p(z \mid x; \theta^{(t)})}
       {\sum_{y', z'} p(\tilde{y} \mid y', z'; \theta^{(t)})\, p(y' \mid x; \theta^{(t)})\, p(z' \mid x; \theta^{(t)})}.
\tag{5}
$$

Then the expected complete log-likelihood can be written as

$$
Q(\theta; \theta^{(t)}) = \sum_{y,z} p(y, z \mid \tilde{y}, x; \theta^{(t)}) \log p(\tilde{y}, y, z \mid x; \theta).
\tag{6}
$$
M-Step. We exploit two CNNs to model the probabilities p(y|x; θ_1) and p(z|x; θ_2), respectively. Thus the gradient of Q w.r.t. θ can be decoupled into two parts:

$$
\frac{\partial Q}{\partial \theta}
= \sum_{y,z} p(y, z \mid \tilde{y}, x; \theta^{(t)}) \frac{\partial}{\partial \theta} \log p(\tilde{y}, y, z \mid x; \theta)
= \sum_{y} p(y \mid \tilde{y}, x; \theta^{(t)}) \frac{\partial}{\partial \theta_1} \log p(y \mid x; \theta_1)
+ \sum_{z} p(z \mid \tilde{y}, x; \theta^{(t)}) \frac{\partial}{\partial \theta_2} \log p(z \mid x; \theta_2).
\tag{7}
$$

The M-step above is equivalent to minimizing the cross entropy between the estimated ground truth distribution and the prediction of the classifier.
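To spell out that equivalence, the following sketch (our own notation, not the paper's code) turns the E-step posterior into soft targets and computes the resulting cross-entropy gradients with respect to the two CNNs' softmax inputs; up to the sign convention (loss minimization versus likelihood maximization), this matches the decoupling of Eq. (7).

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def m_step_gradients(posterior, logits_y, logits_z):
    """Cross-entropy view of Eq. (7) for one image.

    posterior : (L, 3) table p(y, z | y_tilde, x) from the E-step.
    logits_y  : (L,) pre-softmax scores of the CNN that predicts p(y | x).
    logits_z  : (3,) pre-softmax scores of the CNN that predicts p(z | x).
    Returns the gradients of the two cross-entropy losses w.r.t. the logits.
    """
    q_y = posterior.sum(axis=1)  # marginal posterior over true labels (soft target)
    q_z = posterior.sum(axis=0)  # marginal posterior over noise types (soft target)
    # For a soft target q: d/d_logits of -sum(q * log softmax(logits)) = softmax - q.
    return softmax(logits_y) - q_y, softmax(logits_z) - q_z

# Toy usage with a hypothetical posterior (sums to one) and zero logits.
post = np.array([[0.05, 0.01, 0.02],
                 [0.60, 0.02, 0.20],
                 [0.05, 0.01, 0.04]])
grad_y, grad_z = m_step_gradients(post, np.zeros(3), np.zeros(3))
print(grad_y, grad_z)
```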
3.2. Estimating Matrix C
Notice that we do not attach learnable parameters to the conditional probability p(ỹ|y, z) in Eq. (1); it is kept unchanged during the learning process. Without other regularizations, learning all three parts of Eq. (2) could lead to trivial solutions: for example, the network could always predict y_1 = 1 and z_3 = 1, with the matrix C learned to make C_1i = 1 for all i. To avoid such degeneration, we estimate C on a relatively small dataset D_c = {(x, y, ỹ)}_N, where we have N images with both clean and noisy labels. As prior information about z is not available, we solve the following optimization problem:

$$
\max_{C,\, z^{(1)}, \ldots, z^{(N)}} \;\; \sum_{i=1}^{N} \log p\big(\tilde{y}^{(i)} \mid y^{(i)}, z^{(i)}\big).
\tag{8}
$$

Obviously, sample i contributes nothing to the optimal C if y^(i) and ỹ^(i) are equal, so we discard those samples and rewrite the problem in another form by exploiting Eq. (1):
$$
\max_{C,\, t} \; E = \sum_{i=1}^{N'} \Big[\, t_i \log \alpha + (1 - t_i) \log\big(\tilde{y}^{(i)\top} C\, y^{(i)}\big) \Big],
\quad \text{subject to } C \text{ being an } L \times L \text{ stochastic matrix and } t \in \{0, 1\}^{N'},
\tag{9}
$$

where α = 1/(L−1) and N' is the number of remaining samples. The semantic meaning of the above formulation is that we need to assign each (y, ỹ) pair the optimal noise type, while finding the optimal C simultaneously.
Next, we show that the problem can be solved by a simple yet efficient algorithm with O(N' + L²) time complexity. Denote the optimal solution by C* and t*.

Theorem 1. C*_ij = 0 or C*_ij > α, for all i, j ∈ {1, …, L}.

Proof. Suppose there exist some i, j such that 0 < C*_ij ≤ α. Then we conduct the following operations. First, we set C*_ij = 0 while adding its original value to other elements in column j. Second, for all samples n where ỹ^(n)_i = 1 and y^(n)_j = 1, we set t_n to 1. The resulting E increases, which leads to a contradiction.
Theorem 2. (ỹ^(i), y^(i)) = (ỹ^(j), y^(j)) implies t*_i = t*_j, for all i, j ∈ {1, …, N'}.

Proof. Suppose ỹ^(i)_k = ỹ^(j)_k = 1 and y^(i)_l = y^(j)_l = 1 but t*_i ≠ t*_j. From Theorem 1 we know that each element of C* is either 0 or greater than α. If C*_kl = 0, we can set t*_i = t*_j = 1; otherwise we can set t*_i = t*_j = 0. In either case the resulting E increases, which leads to a contradiction.
Theorem 3. ỹ^(i)ᵀ C* y^(i) > α implies t*_i = 0, and ỹ^(i)ᵀ C* y^(i) = 0 implies t*_i = 1, for all i ∈ {1, …, N'}.

Proof. The first part is straightforward. For the second part, t*_i = 1 implies ỹ^(i)ᵀ C* y^(i) ≤ α; by Theorem 1, we then know that ỹ^(i)ᵀ C* y^(i) = 0.
Notice that if the true label is class i while the noisy label is class j, then the sample can only affect the value of C_ij. Thus each column of C can be optimized separately. Theorem 2 further shows that samples with the same pair (ỹ, y) share the same noise type, so what really matters is the frequency of each of the L × L pairs (ỹ, y). Consider a particular column c and suppose there are M samples affecting this column. We can count the frequencies of noisy label classes 1 to L as m_1, …, m_L, and may as well assume m_1 ≥ m_2 ≥ ⋯ ≥ m_L. The problem is then converted to
$$
\max_{c,\, t} \; E = \sum_{k=1}^{L} m_k \Big[\, t_k \log \alpha + (1 - t_k) \log c_k \Big],
\quad \text{subject to } c \in [0, 1]^L,\; \sum_{k=1}^{L} c_k = 1,\; t \in \{0, 1\}^L.
\tag{10}
$$
Due to the rearrangement inequality, we can prove that in the optimal solution,

$$
\max(\alpha, c_1) \ge \max(\alpha, c_2) \ge \cdots \ge \max(\alpha, c_L).
\tag{11}
$$

Then, by using Theorem 3, there must exist a k* ∈ {1, …, L} such that

$$
t_i = 0,\; i = 1, \ldots, k^*; \qquad t_i = 1,\; i = k^* + 1, \ldots, L.
\tag{12}
$$

This also implies that only the first k* elements of c* have nonzero values (greater than α, in fact). Furthermore, if k* is known, finding the optimal c* amounts to solving the following problem:
$$
\max_{c} \; E = \sum_{k=1}^{k^*} m_k \log c_k,
\quad \text{subject to } c \in [0, 1]^L,\; \sum_{k=1}^{k^*} c_k = 1,
\tag{13}
$$

whose solution is

$$
c_i^* = \frac{m_i}{\sum_{k=1}^{k^*} m_k},\; i = 1, \ldots, k^*; \qquad
c_i^* = 0,\; i = k^* + 1, \ldots, L.
\tag{14}
$$
The above analysis leads to a simple algorithm. We enumerate k* from 1 to L. For each k*, t* and c* are computed by using Eq. (12) and (14), respectively. Then we evaluate the objective function E and record the best solution.
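A possible implementation of this per-column enumeration (variable names are ours, and the loop below is a straightforward rather than asymptotically optimal version): given the counts m_1, …, m_L of noisy-label classes for one column, it tries every k*, builds c via Eq. (14), scores it with the objective of Eq. (10), and keeps the best solution.

```python
import numpy as np

def estimate_column(counts, alpha):
    """Optimize one column of C (Eq. 10) for a fixed true label.

    counts[i] : number of samples whose noisy label is class i (samples with
                matching clean and noisy labels are discarded beforehand).
    alpha     : 1 / (L - 1).
    """
    L = len(counts)
    order = np.argsort(counts)[::-1]          # sort classes by decreasing count
    m = np.asarray(counts, dtype=float)[order]

    best_E, best_c = -np.inf, None
    for k in range(1, L + 1):                 # enumerate k*: first k entries nonzero
        head = m[:k]
        if head.sum() == 0:
            continue
        c_head = head / head.sum()            # Eq. (14)
        # Eq. (10): first k samples use log c, the remaining ones use log alpha.
        E = np.sum(head * np.log(np.maximum(c_head, 1e-12))) + m[k:].sum() * np.log(alpha)
        if E > best_E:
            best_E = E
            c_sorted = np.concatenate([c_head, np.zeros(L - k)])
            best_c = np.empty(L)
            best_c[order] = c_sorted          # undo the sorting
    return best_c

# Toy usage: L = 4 classes, hypothetical mislabel counts for one true class.
print(estimate_column([50, 30, 5, 1], alpha=1.0 / 3))
```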
4. Deep Learning from Noisy Labels
We integrate the proposed label noise model into a deep learning framework. As demonstrated in Figure 5, we predict the probabilities p(y|x) and p(z|x) using two independent CNNs, and we append a label noise model layer at the end, which takes as input the CNNs' prediction scores and the observed noisy label. Stochastic gradient ascent with backpropagation is used to approximately optimize the whole network. In each forward pass, the label noise model layer computes the posterior of the latent variables according to Eq. (5), while in the backward pass it computes the gradients according to Eq. (7).
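A schematic sketch of such a layer for a single example (numpy only, ours rather than the authors' implementation): the forward pass computes the posterior of Eq. (5) from the two CNNs' predicted distributions and the observed noisy label, and the backward pass returns the marginal posteriors that serve as soft targets in Eq. (7).

```python
import numpy as np

class LabelNoiseModelLayer:
    """Schematic label noise model layer for one image."""

    def __init__(self, C):
        L = C.shape[0]
        I = np.eye(L)
        R = (np.ones((L, L)) - I) / (L - 1)
        self.components = [I, R, C]  # p(y_tilde | y, z) for z = 1, 2, 3

    def forward(self, p_y, p_z, y_tilde):
        """E-step (Eq. 5): posterior over (y, z) given the observed noisy label."""
        lik = np.stack([comp[y_tilde] for comp in self.components], axis=1)
        joint = lik * p_y[:, None] * p_z[None, :]
        self.posterior = joint / joint.sum()
        return self.posterior

    def backward(self):
        """Soft targets for the two CNNs (Eq. 7): marginal posteriors."""
        return self.posterior.sum(axis=1), self.posterior.sum(axis=0)

# Toy usage with hypothetical CNN outputs for one image.
layer = LabelNoiseModelLayer(C=np.array([[0.0, 0.7, 0.4],
                                         [0.9, 0.0, 0.6],
                                         [0.1, 0.3, 0.0]]))
post = layer.forward(p_y=np.array([0.2, 0.7, 0.1]),
                     p_z=np.array([0.5, 0.1, 0.4]), y_tilde=2)
soft_y, soft_z = layer.backward()
```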
Directly training the whole network with random initialization is impractical, because the posterior computation could be totally wrong. Therefore, we need to pretrain each CNN component with strongly supervised data. Images and their ground truth labels in the dataset D_c are used to train the CNN that predicts p(y|x). On the other hand, the optimal solutions of z^(1), …, z^(N) in Eq. (8) are used to train the CNN that predicts p(z|x).
After both CNN components are properly pretrained, we can start to train the whole network with massive noisy labeled data. However, some practical issues need further discussion. First, if we merely use noisy labels, we will lose the precious knowledge that we gained before, and the model could drift. Therefore, we need to mix the data with clean labels into our training set, which is depicted in Figure 5 as the extra supervision for the two CNNs. Each CNN then receives two kinds of gradients, one from the clean labels and the other from the noisy labels, denoted by Δ_c and Δ_n, respectively. A potential problem is that |Δ_c| ≪ |Δ_n|, because clean data is much scarcer than noisy data. To deal with this problem, we bootstrap the data with clean labels by upsampling it so that the numbers of clean and noisy samples are of the same order.
References
ImageNet Classification with Deep Convolutional Neural Networks. A deep convolutional network with five convolutional layers, some followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax, which achieved state-of-the-art image classification performance.

Very Deep Convolutional Networks for Large-Scale Image Recognition. Investigates the effect of convolutional network depth on accuracy in the large-scale image recognition setting using very small convolution filters, showing that a significant improvement over prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.

ImageNet: A large-scale hierarchical image database. Introduces ImageNet, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than previous image datasets.

Going deeper with convolutions. Presents the Inception architecture, a deep convolutional neural network that set a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Frequently Asked Questions

Q1. What contributions have the authors mentioned in the paper "Learning from massive noisy labeled data for image classification"?

The authors introduce a general framework to train CNNs with only a limited number of clean labels and millions of easily obtained noisy labels. They model the relationships between images, class labels, and label noise with a probabilistic graphical model and integrate it into an end-to-end deep learning system. To demonstrate the effectiveness of the approach, they collect a large-scale real-world clothing classification dataset with both noisy and clean labels; experiments on this dataset indicate that the approach can better correct noisy labels and improves the performance of the trained CNNs.

Additional points noted in the paper:

An image is assigned a noisy label if its surrounding text contains only the keywords of that label; otherwise the image is discarded to reduce ambiguity.

The performance of the classifier drops significantly without upsampling the clean data, but it is not sensitive to the upsampling ratio as long as the numbers of clean and noisy samples are of the same order.

The sizes of the training sets are |D_c| = 47,570 and |D_η| = 10^6, while the validation and test sets have 14,313 and 10,526 images, respectively.

A confusion matrix Q between clean labels and noisy labels is first randomly generated, and the training labels are then corrupted according to it.

Figure 5 (referenced in Section 4) depicts the whole framework: one CNN (five convolutional layers plus three fully-connected layers) predicts p(y | x) over the 14 clothing classes, a second CNN predicts the three-way noise-type distribution p(z | x), and a label noise model layer combines their predictions with the observed noisy label, with clean labeled data providing extra supervision.