Learning from massive noisy labeled data for image classification

TLDR
A general framework to train CNNs with only a limited number of clean labels and millions of easily obtained noisy labels is introduced; the relationships between images, class labels, and label noise are modeled with a probabilistic graphical model, which is further integrated into an end-to-end deep learning system.


Learning from Massive Noisy Labeled Data for Image Classification
Tong Xiao¹, Tian Xia², Yi Yang², Chang Huang², and Xiaogang Wang¹
¹The Chinese University of Hong Kong    ²Baidu Research
Abstract
Large-scale supervised datasets are crucial to train convolutional neural networks (CNNs) for various computer vision problems. However, obtaining a massive amount of well-labeled data is usually very expensive and time consuming. In this paper, we introduce a general framework to train CNNs with only a limited number of clean labels and millions of easily obtained noisy labels. We model the relationships between images, class labels and label noises with a probabilistic graphical model and further integrate it into an end-to-end deep learning system. To demonstrate the effectiveness of our approach, we collect a large-scale real-world clothing classification dataset with both noisy and clean labels. Experiments on this dataset indicate that our approach can better correct the noisy labels and improves the performance of trained CNNs.
1. Introduction
Deep learning with large-scale supervised training datasets has recently shown very impressive improvement on multiple image recognition challenges, including image classification [12], attribute learning [29], and scene classification [8]. While state-of-the-art results have been continuously reported [23, 25, 28], all these methods require reliable annotations from millions of images [6], which are often expensive and time-consuming to obtain, preventing deep models from being quickly trained on new image recognition problems. It is thus necessary to develop new efficient labeling and training frameworks for deep learning.
One possible solution is to automatically collect a large amount of annotations from Internet web images [10] (i.e., extracting tags from the surrounding texts or keywords from search engines) and directly use them as ground truth to train deep models. Unfortunately, these labels are extremely unreliable due to various types of noise (e.g., labeling mistakes from annotators or computing errors from extraction algorithms). Many studies have shown that such noisy labels can adversely impact the classification accuracy of the induced classifiers [20, 22, 31]. Various label noise-robust algorithms have been developed, but experiments show that the performance of classifiers inferred by robust algorithms is still affected by label noise [3, 26]. Other data cleansing algorithms have been proposed [2, 5, 17], but these approaches have difficulty distinguishing informative hard examples from harmful mislabeled ones.

Figure 1. Overview of our approach. Labels of web images often suffer from different types of noise. A label noise model is proposed to detect and correct the wrong labels, and the corrected labels are used to train the underlying CNNs.
Although annotating all the data is costly, it is often easy to obtain a small amount of clean labels. Based on the observed transferability of deep neural networks, a common practice is to initialize parameters with a model pretrained on a larger yet related dataset [12] and then finetune it on the smaller dataset of the specific task [1, 7, 21]. Such methods may better avoid overfitting and exploit the relationships between the two datasets. However, we find that training a CNN from scratch with limited clean labels and massive noisy labels is better than finetuning it only on clean labels. Other approaches treat the problem as semi-supervised learning, where the noisy labels are discarded [30]. These algorithms usually suffer from high model complexity and thus cannot be applied to large-scale datasets. It is therefore necessary to develop a better way of using the huge amount of noisy labeled data.
Our goal is to build an end-to-end deep learning system that is capable of training with both limited clean labels and massive noisy labels more effectively. Figure 1 shows the framework of our approach. We collect 1,000,000 clothing images from online shopping websites. Each image is automatically assigned a noisy label according to the keywords in its surrounding text. We manually refine 72,409 image labels, which constitute a clean sub-dataset. All the data are then used to train CNNs, while the major challenge is to identify and correct wrong labels during the training process.
To cope with this challenge, we extend CNNs with a novel probabilistic model, which infers the true labels and uses them to supervise the training of the network. Our work is inspired by [24], which modifies a CNN by inserting a linear layer on top of the softmax layer to map clean labels to noisy labels. However, [24] assumes that noisy labels are conditionally independent of input images given clean labels. When examining our collected dataset, we find that this assumption is too strong to fit the real-world data well. For example, in Figure 2, all the images should belong to "Hoodie". The top five are correct, while the bottom five are mislabeled as either "Windbreaker" or "Jacket". Since different sellers have their own biases on different categories, they may provide wrong keywords for similar clothes. We observe these visual patterns and hypothesize that they are important for estimating how likely an image is to be mislabeled. Based on these observations, we further introduce two types of label noise:

Confusing noise makes the noisy label reasonably wrong. It usually occurs when the image content is confusing (e.g., the samples marked with "?" in Figure 1).

Pure random noise makes the noisy label totally wrong. It is often caused by either a mismatch between an image and its surrounding text, or a false conversion from the text to the label (e.g., the samples marked with "×" in Figure 1).
Our proposed probabilistic model captures the relations among images, noisy labels, ground truth labels, and noise types, where the latter two are treated as latent variables. We use the Expectation-Maximization (EM) algorithm to solve the problem and integrate it into the training process of CNNs. Experiments on our real-world clothing dataset indicate that our model can better detect and correct the noisy labels.

Figure 2. Mislabeled images often share similar visual patterns.
Our contributions fall into three aspects. First, we study the cause of noisy labels in real-world data and describe it with a novel probabilistic model. Second, we integrate the model into a deep learning framework and explore different training strategies to make the CNNs learn from better supervision. Finally, we collect a large-scale clothing dataset with both noisy and clean labels, which will be released for academic use.
2. Related Work
Learning with noisy labeled training data has been extensively studied in the machine learning and computer vision literature. For most of the related work, including the effect of label noise, the taxonomy of label noise, robust algorithms, and noise cleaning algorithms for learning with noisy data, we refer to [9] for a comprehensive review.
Direct learning with noisy labels: Many studies have shown that label noise can adversely impact the classification accuracy of induced classifiers [31]. To better handle label noise, some approaches rely on training classifiers with label noise-robust algorithms [4, 15]. However, Bartlett et al. [3] prove that most loss functions are not completely robust to label noise, and experiments in [26] show that the classifiers inferred by label noise-robust algorithms are still affected by label noise. These methods seem to be adequate only when label noise can be safely managed by overfitting avoidance [9]. On the other hand, some label noise cleansing methods have been proposed to remove or correct mislabeled instances [2, 5, 17], but these approaches have difficulty distinguishing informative hard examples from harmful mislabeled ones. Thus they might remove too many instances, and such over-cleansing can reduce the performance of classifiers [16].
Semi-supervised learning: Apart from direct learning with label noise, some semi-supervised learning algorithms have been developed to utilize weakly labeled or even unlabeled data. The Label Propagation method [30] explicitly used the ground truth of well-labeled data to classify unlabeled samples. However, it suffered from computing pairwise distances, which has quadratic complexity in the number of samples and thus cannot be applied to large-scale datasets. Weston et al. [27] proposed to embed a pairwise loss in the middle layers of a deep neural network, which benefits the learning of discriminative features. But they needed extra information about whether a pair of unlabeled images belongs to the same class, which cannot be obtained in our problem.
Transfer learning: The success of CNNs lies in their capability of learning rich and hierarchical image features. However, the model parameters cannot be properly learned when training data is insufficient. Researchers have proposed to conquer this problem by first initializing the CNN parameters with a model pretrained on a larger yet related dataset and then finetuning it on the smaller dataset of the specific task [1, 7, 12, 21]. Nevertheless, this transfer learning scheme can be suboptimal when the two tasks are only loosely related. In our case of clothing classification, we find that training a CNN from scratch with limited clean labels and massive noisy labels is better than finetuning it only on the clean labels.
Noise modeling with deep learning: Various methods have been proposed to handle label noise in different problem settings, but there are very few works on deep learning from noisy labels [13, 18, 24]. Mnih and Hinton [18] built a simple noise model for aerial images but only considered binary classification. Larsen et al. [13] assumed that label noise is independent of the true class label, which is a simple and special case. Sukhbaatar et al. [24] generalized these works by considering multi-class classification and modeling class-dependent noise, but they assumed the noise to be conditionally independent of the image content, ignoring the varying difficulty of labeling images of different confusing levels. Our work can be viewed as a generalization of [19, 24], and our model is flexible enough to handle not only class-dependent but also image-dependent noise.
3. Label Noise Model
We aim to learn a classifier from a set of images with noisy labels. To be specific, we have a noisy labeled dataset D_η = {(x^(1), ỹ^(1)), …, (x^(N), ỹ^(N))}, where x^(n) is the n-th image and ỹ^(n) ∈ {1, …, L} is its corresponding noisy label, with L the number of classes. We describe how the noisy label is generated by using the probabilistic graphical model shown in Figure 3.
Besides the observed image x and the noisy label ỹ, we exploit two discrete latent variables, y and z, to represent the true label and the label noise type, respectively. Both ỹ and y are L-dimensional binary random variables in a 1-of-L fashion, i.e., only one element is equal to 1 while all the others are 0.

Figure 3. Probabilistic graphical model of label noise.

Figure 4. Predicting noise types of four different "T-shirt" images. The top two can be recognized with little ambiguity, while the bottom two are easily confused with the class "Chiffon". Image content can affect the probability of it being mislabeled.

The label noise type z is a 1-of-3 binary random variable. It is associated with three semantic meanings:
1. The label is noise free, i.e., ỹ should be equal to y.
2. The label suffers from a pure random noise, i.e., ỹ can take any possible value other than y.
3. The label suffers from a confusing noise, i.e., ỹ can take several values that are confusing with y.
Following this assignment rule, we define the conditional probability of the noisy label as

$$
p(\tilde{y} \mid y, z) =
\begin{cases}
\tilde{y}^\top I\, y & \text{if } z_1 = 1,\\
\frac{1}{L-1}\, \tilde{y}^\top (U - I)\, y & \text{if } z_2 = 1,\\
\tilde{y}^\top C\, y & \text{if } z_3 = 1,
\end{cases}
\tag{1}
$$

where I is the identity matrix, U is the unit matrix (all of whose elements are ones), and C is a sparse stochastic matrix with tr(C) = 0, whose entry C_ij denotes the confusion probability between classes i and j. From Figure 3 we can then derive the joint distribution of ỹ, y, and z conditioned on x,

$$
p(\tilde{y}, y, z \mid x) = p(\tilde{y} \mid y, z)\, p(y \mid x)\, p(z \mid x).
\tag{2}
$$
While the class label probability distribution p(y|x) is comprehensible, the semantic meaning of p(z|x) needs extra clarification: it represents how confusing the image content is. Specific to our clothing classification problem, p(z|x) can be affected by different factors, including background clutter, image resolution, and the style and material of the clothes. Some examples are shown in Figure 4.

To illustrate the relations between noisy and true labels, we derive their conditional probability from Eq. (2),

$$
p(\tilde{y} \mid y, x) = \sum_{z} p(\tilde{y}, z \mid y, x) = \sum_{z} p(\tilde{y} \mid y, z)\, p(z \mid x),
\tag{3}
$$

which can be interpreted as a mixture model. Given an input image x, the conditional probability p(z|x) can be seen as the prior of each mixture component. This marks a key difference between our work and [24], which assumes that ỹ is conditionally independent of x given y: all images share the same noise model in [24], while in our approach each data sample has its own.
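As an illustration of the mixture in Eq. (3), the sketch below combines the three components with a per-image prior p(z | x) to obtain the image-specific noise distribution p(ỹ | y, x). The priors are hypothetical toy values that simply echo the clean and confusing example images of Figure 4; in the full system p(z | x) is predicted by a CNN.

```python
import numpy as np

def noisy_label_distribution(p_z_given_x, C):
    """Mixture of Eq. (3): p(y_tilde | y, x) = sum_z p(y_tilde | y, z) p(z | x).

    p_z_given_x : shape (3,), probabilities of [noise free, pure random, confusing].
    C           : shape (L, L), confusing-noise matrix with zero diagonal,
                  columns indexed by the true label.
    Returns an (L, L) matrix whose (i, j) entry is p(y_tilde = i | y = j, x).
    """
    L = C.shape[0]
    I = np.eye(L)
    R = (np.ones((L, L)) - I) / (L - 1)  # pure random noise component
    return p_z_given_x[0] * I + p_z_given_x[1] * R + p_z_given_x[2] * C

# Hypothetical priors p(z | x): a visually clean image vs. a confusing one.
C = np.array([[0.0, 0.7, 0.4],
              [0.9, 0.0, 0.6],
              [0.1, 0.3, 0.0]])
print(noisy_label_distribution(np.array([0.91, 0.02, 0.07]), C))
print(noisy_label_distribution(np.array([0.24, 0.18, 0.58]), C))
```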
3.1. Learning the Parameters
We exploit two CNNs to model p(y|x) and p(z|x) separately, and denote their parameter sets by θ_1 and θ_2. Our goal is to find the optimal θ = θ_1 ∪ θ_2 that maximizes the incomplete log-likelihood log p(ỹ|x; θ). The EM algorithm is used to solve this problem iteratively.

For any probability distribution q(y, z|ỹ, x), we can derive a lower bound of the incomplete log-likelihood,

$$
\log p(\tilde{y} \mid x; \theta) = \log \sum_{y,z} p(\tilde{y}, y, z \mid x; \theta)
\;\ge\; \sum_{y,z} q(y, z \mid \tilde{y}, x) \log \frac{p(\tilde{y}, y, z \mid x; \theta)}{q(y, z \mid \tilde{y}, x)}.
\tag{4}
$$
E-Step. The difference between log p(ỹ|x; θ) and its lower bound is the Kullback-Leibler divergence KL(q(y, z|ỹ, x) || p(y, z|ỹ, x; θ)), which is equal to zero if and only if q(y, z|ỹ, x) = p(y, z|ỹ, x; θ). Therefore, in each iteration t, we first compute the posterior of the latent variables using the current parameters θ^(t),

$$
p(y, z \mid \tilde{y}, x; \theta^{(t)})
= \frac{p(\tilde{y}, y, z \mid x; \theta^{(t)})}{p(\tilde{y} \mid x; \theta^{(t)})}
= \frac{p(\tilde{y} \mid y, z; \theta^{(t)})\, p(y \mid x; \theta^{(t)})\, p(z \mid x; \theta^{(t)})}
       {\sum_{y', z'} p(\tilde{y} \mid y', z'; \theta^{(t)})\, p(y' \mid x; \theta^{(t)})\, p(z' \mid x; \theta^{(t)})}.
\tag{5}
$$

Then the expected complete log-likelihood can be written as

$$
Q(\theta; \theta^{(t)}) = \sum_{y,z} p(y, z \mid \tilde{y}, x; \theta^{(t)}) \log p(\tilde{y}, y, z \mid x; \theta).
\tag{6}
$$
M-Step. We exploit two CNNs to model the probabilities p(y|x; θ_1) and p(z|x; θ_2), respectively. Thus the gradient of Q w.r.t. θ can be decoupled into two parts:

$$
\frac{\partial Q}{\partial \theta}
= \sum_{y,z} p(y, z \mid \tilde{y}, x; \theta^{(t)}) \frac{\partial}{\partial \theta} \log p(\tilde{y}, y, z \mid x; \theta)
= \sum_{y} p(y \mid \tilde{y}, x; \theta^{(t)}) \frac{\partial}{\partial \theta_1} \log p(y \mid x; \theta_1)
+ \sum_{z} p(z \mid \tilde{y}, x; \theta^{(t)}) \frac{\partial}{\partial \theta_2} \log p(z \mid x; \theta_2).
\tag{7}
$$

The M-step above is equivalent to minimizing the cross entropy between the estimated ground truth distribution and the prediction of the classifier.
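To spell out that equivalence, the following sketch (our own notation, not the paper's code) turns the E-step posterior into soft targets and computes the resulting cross-entropy gradients with respect to the two CNNs' softmax inputs; up to the sign convention (loss minimization versus likelihood maximization), this matches the decoupling of Eq. (7).

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def m_step_gradients(posterior, logits_y, logits_z):
    """Cross-entropy view of Eq. (7) for one image.

    posterior : (L, 3) table p(y, z | y_tilde, x) from the E-step.
    logits_y  : (L,) pre-softmax scores of the CNN that predicts p(y | x).
    logits_z  : (3,) pre-softmax scores of the CNN that predicts p(z | x).
    Returns the gradients of the two cross-entropy losses w.r.t. the logits.
    """
    q_y = posterior.sum(axis=1)  # marginal posterior over true labels (soft target)
    q_z = posterior.sum(axis=0)  # marginal posterior over noise types (soft target)
    # For a soft target q: d/d_logits of -sum(q * log softmax(logits)) = softmax - q.
    return softmax(logits_y) - q_y, softmax(logits_z) - q_z

# Toy usage with a hypothetical posterior (sums to one) and zero logits.
post = np.array([[0.05, 0.01, 0.02],
                 [0.60, 0.02, 0.20],
                 [0.05, 0.01, 0.04]])
grad_y, grad_z = m_step_gradients(post, np.zeros(3), np.zeros(3))
print(grad_y, grad_z)
```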
3.2. Estimating Matrix C
Notice that we do not attach learnable parameters to the conditional probability p(ỹ|y, z) in Eq. (1); it is kept unchanged during the learning process. Without other regularizations, learning all three parts of Eq. (2) could lead to trivial solutions: for example, the network could always predict y_1 = 1 and z_3 = 1, with the matrix C learned to make C_1i = 1 for all i. To avoid such degeneration, we estimate C on a relatively small dataset D_c = {(x, y, ỹ)}_N, where we have N images with both clean and noisy labels. As prior information about z is not available, we solve the following optimization problem:

$$
\max_{C,\, z^{(1)}, \ldots, z^{(N)}} \;\; \sum_{i=1}^{N} \log p\big(\tilde{y}^{(i)} \mid y^{(i)}, z^{(i)}\big).
\tag{8}
$$

Obviously, sample i contributes nothing to the optimal C if y^(i) and ỹ^(i) are equal, so we discard those samples and rewrite the problem in another form by exploiting Eq. (1):
$$
\max_{C,\, t} \; E = \sum_{i=1}^{N'} \Big[\, t_i \log \alpha + (1 - t_i) \log\big(\tilde{y}^{(i)\top} C\, y^{(i)}\big) \Big],
\quad \text{subject to } C \text{ being an } L \times L \text{ stochastic matrix and } t \in \{0, 1\}^{N'},
\tag{9}
$$

where α = 1/(L−1) and N' is the number of remaining samples. The semantic meaning of the above formulation is that we need to assign each (y, ỹ) pair the optimal noise type, while finding the optimal C simultaneously.
Next, we show that the problem can be solved by a simple yet efficient algorithm with O(N' + L²) time complexity. Denote the optimal solution by C* and t*.

Theorem 1. C*_ij = 0 or C*_ij > α, for all i, j ∈ {1, …, L}.

Proof. Suppose there exist some i, j such that 0 < C*_ij ≤ α. Then we conduct the following operations. First, we set C*_ij = 0 while adding its original value to other elements in column j. Second, for all samples n where ỹ^(n)_i = 1 and y^(n)_j = 1, we set t_n to 1. The resulting E increases, which leads to a contradiction.
Theorem 2. (ỹ^(i), y^(i)) = (ỹ^(j), y^(j)) implies t*_i = t*_j, for all i, j ∈ {1, …, N'}.

Proof. Suppose ỹ^(i)_k = ỹ^(j)_k = 1 and y^(i)_l = y^(j)_l = 1 but t*_i ≠ t*_j. From Theorem 1 we know that each element of C* is either 0 or greater than α. If C*_kl = 0, we can set t*_i = t*_j = 1; otherwise we can set t*_i = t*_j = 0. In either case the resulting E increases, which leads to a contradiction.
Theorem 3. ỹ^(i)ᵀ C* y^(i) > α implies t*_i = 0, and ỹ^(i)ᵀ C* y^(i) = 0 implies t*_i = 1, for all i ∈ {1, …, N'}.

Proof. The first part is straightforward. For the second part, t*_i = 1 implies ỹ^(i)ᵀ C* y^(i) ≤ α; by Theorem 1, we then know that ỹ^(i)ᵀ C* y^(i) = 0.
Notice that if the true label is class i while the noisy label is class j, then the sample can only affect the value of C_ij. Thus each column of C can be optimized separately. Theorem 2 further shows that samples with the same pair (ỹ, y) share the same noise type, so what really matters is the frequency of each of the L × L pairs (ỹ, y). Consider a particular column c and suppose there are M samples affecting this column. We can count the frequencies of noisy label classes 1 to L as m_1, …, m_L, and may as well assume m_1 ≥ m_2 ≥ ⋯ ≥ m_L. The problem is then converted to
$$
\max_{c,\, t} \; E = \sum_{k=1}^{L} m_k \Big[\, t_k \log \alpha + (1 - t_k) \log c_k \Big],
\quad \text{subject to } c \in [0, 1]^L,\; \sum_{k=1}^{L} c_k = 1,\; t \in \{0, 1\}^L.
\tag{10}
$$
Due to the rearrangement inequality, we can prove that in the optimal solution,

$$
\max(\alpha, c_1) \ge \max(\alpha, c_2) \ge \cdots \ge \max(\alpha, c_L).
\tag{11}
$$

Then, by using Theorem 3, there must exist a k* ∈ {1, …, L} such that

$$
t_i = 0,\; i = 1, \ldots, k^*; \qquad t_i = 1,\; i = k^* + 1, \ldots, L.
\tag{12}
$$

This also implies that only the first k* elements of c* have nonzero values (greater than α, in fact). Furthermore, if k* is known, finding the optimal c* amounts to solving the following problem:
$$
\max_{c} \; E = \sum_{k=1}^{k^*} m_k \log c_k,
\quad \text{subject to } c \in [0, 1]^L,\; \sum_{k=1}^{k^*} c_k = 1,
\tag{13}
$$

whose solution is

$$
c_i^* = \frac{m_i}{\sum_{k=1}^{k^*} m_k},\; i = 1, \ldots, k^*; \qquad
c_i^* = 0,\; i = k^* + 1, \ldots, L.
\tag{14}
$$
The above analysis leads to a simple algorithm. We enumerate k* from 1 to L. For each k*, t* and c* are computed by using Eq. (12) and (14), respectively. Then we evaluate the objective function E and record the best solution.
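A possible implementation of this per-column enumeration (variable names are ours, and the loop below is a straightforward rather than asymptotically optimal version): given the counts m_1, …, m_L of noisy-label classes for one column, it tries every k*, builds c via Eq. (14), scores it with the objective of Eq. (10), and keeps the best solution.

```python
import numpy as np

def estimate_column(counts, alpha):
    """Optimize one column of C (Eq. 10) for a fixed true label.

    counts[i] : number of samples whose noisy label is class i (samples with
                matching clean and noisy labels are discarded beforehand).
    alpha     : 1 / (L - 1).
    """
    L = len(counts)
    order = np.argsort(counts)[::-1]          # sort classes by decreasing count
    m = np.asarray(counts, dtype=float)[order]

    best_E, best_c = -np.inf, None
    for k in range(1, L + 1):                 # enumerate k*: first k entries nonzero
        head = m[:k]
        if head.sum() == 0:
            continue
        c_head = head / head.sum()            # Eq. (14)
        # Eq. (10): first k samples use log c, the remaining ones use log alpha.
        E = np.sum(head * np.log(np.maximum(c_head, 1e-12))) + m[k:].sum() * np.log(alpha)
        if E > best_E:
            best_E = E
            c_sorted = np.concatenate([c_head, np.zeros(L - k)])
            best_c = np.empty(L)
            best_c[order] = c_sorted          # undo the sorting
    return best_c

# Toy usage: L = 4 classes, hypothetical mislabel counts for one true class.
print(estimate_column([50, 30, 5, 1], alpha=1.0 / 3))
```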
4. Deep Learning from Noisy Labels
We integrate the proposed label noise model into a deep learning framework. As demonstrated in Figure 5, we predict the probabilities p(y|x) and p(z|x) using two independent CNNs, and we append a label noise model layer at the end, which takes as input the CNNs' prediction scores and the observed noisy label. Stochastic gradient ascent with backpropagation is used to approximately optimize the whole network. In each forward pass, the label noise model layer computes the posterior of the latent variables according to Eq. (5), while in the backward pass it computes the gradients according to Eq. (7).
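A schematic sketch of such a layer for a single example (numpy only, ours rather than the authors' implementation): the forward pass computes the posterior of Eq. (5) from the two CNNs' predicted distributions and the observed noisy label, and the backward pass returns the marginal posteriors that serve as soft targets in Eq. (7).

```python
import numpy as np

class LabelNoiseModelLayer:
    """Schematic label noise model layer for one image."""

    def __init__(self, C):
        L = C.shape[0]
        I = np.eye(L)
        R = (np.ones((L, L)) - I) / (L - 1)
        self.components = [I, R, C]  # p(y_tilde | y, z) for z = 1, 2, 3

    def forward(self, p_y, p_z, y_tilde):
        """E-step (Eq. 5): posterior over (y, z) given the observed noisy label."""
        lik = np.stack([comp[y_tilde] for comp in self.components], axis=1)
        joint = lik * p_y[:, None] * p_z[None, :]
        self.posterior = joint / joint.sum()
        return self.posterior

    def backward(self):
        """Soft targets for the two CNNs (Eq. 7): marginal posteriors."""
        return self.posterior.sum(axis=1), self.posterior.sum(axis=0)

# Toy usage with hypothetical CNN outputs for one image.
layer = LabelNoiseModelLayer(C=np.array([[0.0, 0.7, 0.4],
                                         [0.9, 0.0, 0.6],
                                         [0.1, 0.3, 0.0]]))
post = layer.forward(p_y=np.array([0.2, 0.7, 0.1]),
                     p_z=np.array([0.5, 0.1, 0.4]), y_tilde=2)
soft_y, soft_z = layer.backward()
```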
Directly training the whole network with random initialization is impractical, because the posterior computation could be totally wrong. Therefore, we need to pretrain each CNN component with strongly supervised data. Images and their ground truth labels in the dataset D_c are used to train the CNN that predicts p(y|x). On the other hand, the optimal solutions of z^(1), …, z^(N) in Eq. (8) are used to train the CNN that predicts p(z|x).
After both CNN components are properly pretrained, we can start to train the whole network with massive noisy labeled data. However, some practical issues need further discussion. First, if we merely use noisy labels, we will lose the precious knowledge that we gained before, and the model could drift. Therefore, we need to mix the data with clean labels into our training set, which is depicted in Figure 5 as the extra supervision for the two CNNs. Each CNN then receives two kinds of gradients, one from the clean labels and the other from the noisy labels, denoted by Δ_c and Δ_n, respectively. A potential problem is that |Δ_c| ≪ |Δ_n|, because clean data is much scarcer than noisy data. To deal with this problem, we bootstrap the data with clean labels by upsampling it so that the numbers of clean and noisy samples are of the same order.
References
ImageNet Classification with Deep Convolutional Neural Networks. A deep convolutional network with five convolutional layers, some followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax, which achieved state-of-the-art image classification performance.

Very Deep Convolutional Networks for Large-Scale Image Recognition. Investigates the effect of convolutional network depth on accuracy in the large-scale image recognition setting using very small convolution filters, showing that a significant improvement over prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.

ImageNet: A large-scale hierarchical image database. Introduces ImageNet, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than previous image datasets.

Going deeper with convolutions. Presents the Inception architecture, a deep convolutional neural network that set a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Frequently Asked Questions

Q1. What contributions have the authors mentioned in the paper "Learning from massive noisy labeled data for image classification"?

The authors introduce a general framework to train CNNs with only a limited number of clean labels and millions of easily obtained noisy labels. They model the relationships between images, class labels, and label noise with a probabilistic graphical model and integrate it into an end-to-end deep learning system. To demonstrate the effectiveness of the approach, they collect a large-scale real-world clothing classification dataset with both noisy and clean labels; experiments on this dataset indicate that the approach can better correct noisy labels and improves the performance of the trained CNNs.

Additional points noted in the paper:

An image is assigned a noisy label if its surrounding text contains only the keywords of that label; otherwise the image is discarded to reduce ambiguity.

The performance of the classifier drops significantly without upsampling the clean data, but it is not sensitive to the upsampling ratio as long as the numbers of clean and noisy samples are of the same order.

The sizes of the training sets are |D_c| = 47,570 and |D_η| = 10^6, while the validation and test sets have 14,313 and 10,526 images, respectively.

A confusion matrix Q between clean labels and noisy labels is first randomly generated, and the training labels are then corrupted according to it.

Figure 5 (referenced in Section 4) depicts the whole framework: one CNN (five convolutional layers plus three fully-connected layers) predicts p(y | x) over the 14 clothing classes, a second CNN predicts the three-way noise-type distribution p(z | x), and a label noise model layer combines their predictions with the observed noisy label, with clean labeled data providing extra supervision.