
Image compression with Stochastic Winner-Take-All Auto-Encoder

TL;DR: This paper addresses the problem of image compression using sparse representations with a variant of auto-encoder called the Stochastic Winner-Take-All Auto-Encoder (SWTA AE), which performs variable rate image compression for images of any size after a single training, a property that is fundamental for compression.
Abstract: This paper addresses the problem of image compression using sparse representations. We propose a variant of autoencoder called Stochastic Winner-Take-All Auto-Encoder (SWTA AE). “Winner-Take-All” means that image patches compete with one another when computing their sparse representation and “Stochastic” indicates that a stochastic hyperparameter rules this competition during training. Unlike auto-encoders, SWTA AE performs variable rate image compression for images of any size after a single training, which is fundamental for compression. For comparison, we also propose a variant of Orthogonal Matching Pursuit (OMP) called Winner-Take-All Orthogonal Matching Pursuit (WTA OMP). In terms of rate-distortion trade-off, SWTA AE outperforms auto-encoders but it is worse than WTA OMP. Besides, SWTA AE can compete with JPEG in terms of rate-distortion.

Summary (2 min read)

Introduction

  • Image compression, sparse representations, auto-encoders, Orthogonal Matching Pursuit.
  • Auto-encoders are powerful tools for reducing the dimensionality of data.
  • But all image patches then have the same rate and therefore different distortions, because texture complexity varies from patch to patch.
  • This work has been supported by the French Defense Procurement Agency (DGA).
  • Therefore, during training, the WTA parameter that controls the rate is stochastically driven.

1.1. Notation

  • Vectors are denoted by bold lower case letters and matrices by upper case ones.
  • The authors now present their Stochastic Winner-Take-All Auto-Encoder (SWTA AE), whose architecture is shown in Figure 1.
  • The authors justify below two of the most critical choices for the SWTA AE architecture.

2.1. Strided convolution

  • A compression algorithm must process images of various sizes.
  • Networks with fully-connected layers require a fixed input size, which would impose training one architecture per image size.
  • Each layer i ∈ J1, 4K consists in convolving the layer input with the bank of filters W(i), adding the biases b(i) and applying a mapping g(i), producing the layer output.
  • For the borders of the layer input, zero-padding of width p(i) is used.
  • Indeed, if the encoder contains a maxpooling layer, the locations of maximum activations selected during pooling operations must be recorded and transmitted to the corresponding unpooling layer in the decoder [12, 13].

2.2. Semi-sparse bottleneck

  • The authors propose to apply a global sparsity constraint that provides control over the coding cost of Z. gα only applies to the output of the convolution in the second layer involving the first 64 filters in W^{(2)}, producing the first 64 sparse feature maps in Z. Figure 1 displays these sparse feature maps in orange.
  • Varying α leads to various coding costs of Z. Note that [14] uses WTA, but their WTA rule is different and gα does not apply to specific dimensions of its input tensor as this constraint is not relevant for image compression.
  • The authors have noticed that, during the training in Section 4.2, SWTA AE learns to store in the last feature map a subsampled version of its input image.

2.3. Bitstream generation

  • The coefficients of the non-sparse feature map in Z are uniformly quantized over 8-bits and coded with a Huffman code.
  • The position along z is coded with a fixed-length code and, for each pair (x, y), the number of non-zero coefficients along z is coded with a Huffman code.
  • The difference is that SWTA AE computes the sparse representation of an image by alternating convolutions and mappings whereas OMP runs an iterative decomposition of the image patches over a dictionary.
  • For the sake of comparison, the authors build a variant of OMP called Winner-Take-All Orthogonal Matching Pursuit (WTA OMP).
  • The support of the sparse representation of each patch has therefore been changed.

4.1. Training data extraction

  • The RGB color space is transformed into YCbCr and the authors only keep the luminance channel.
  • For SWTA AE, the luminance images are resized to 321×321. σ ∈ R∗+ is the mean of the standard deviations over all luminance images.
  • The authors remove the DC component from each patch.

4.2. SWTA AE training

  • If α is fixed during training, all the filters and the biases of SWTA AE are learned for one rate.
  • This justifies the prefix “Stochastic” in SWTA AE.
  • The training objective is to minimize the mean squared error between these cropped images and their reconstruction plus l2-norm weights decay.
  • The authors' implementation is based on Caffe [18].

4.3. Dictionary learning for WTA OMP

  • The dictionary learning problem (4), whose global sparsity constraint is Σ_{j=1}^η ‖Z_j‖_0 ≤ γ × n × η, is solved by Algorithm 2, which alternates between sparse coding steps that involve WTA OMP and dictionary updates that use stochastic gradient descent.
  • For SWTA AE, the same values for m and n are used for training D via Algorithm 2.
  • After training in Section 4, the authors compare the rate-distortion curves of OMP, WTA OMP, SWTA AE, JPEG and JPEG2000 on test luminance images.

5.1. Image CODEC for SWTA AE

  • Each input test luminance image is pre-processed similarly to the training in Section 4.1.
  • The mean learned image M is interpolated to match the size of the input image.
  • Then, the interpolated mean image is subtracted from the input image, and the result is divided by the learned σ.

5.2. Image CODEC for OMP and WTA OMP

  • A luminance image is split into 8×8 non-overlapping patches.
  • The DC component is removed from each patch.
  • The DC components are uniformly quantized over 8-bits and coded with a fixed-length code.
  • OMP (or WTA OMP) finds the coefficients of the sparse decompositions of the image patches over D ′ (or D).
  • The non-zero coefficients are uniformly quantized over 8-bits and coded with a Huffman code while their position is coded with a fixed-length code.

5.3. Comparison of rate-distortion curves

  • In the literature, there is no reference rate-distortion curve for auto-encoders.
  • Furthermore, the authors compare SWTA AE with its non-sparse Auto-Encoder counterpart (AE).


HAL Id: hal-01493137
https://hal.archives-ouvertes.fr/hal-01493137
Submitted on 21 Mar 2017
Image Compression with Stochastic Winner-Take-All
Auto-Encoder
Thierry Dumas, Aline Roumy, Christine Guillemot
To cite this version:
Thierry Dumas, Aline Roumy, Christine Guillemot. Image Compression with Stochastic Winner-Take-
All Auto-Encoder. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP 2017), Mar 2017, New Orleans, United States. ⟨hal-01493137⟩

IMAGE COMPRESSION WITH STOCHASTIC WINNER-TAKE-ALL AUTO-ENCODER
Thierry Dumas, Aline Roumy, Christine Guillemot
INRIA Rennes Bretagne-Atlantique
thierry.dumas@inria.fr, aline.roumy@inria.fr, christine.guillemot@inria.fr
ABSTRACT
This paper addresses the problem of image compression using sparse representations. We propose a variant of auto-encoder called Stochastic Winner-Take-All Auto-Encoder (SWTA AE). “Winner-Take-All” means that image patches compete with one another when computing their sparse representation and “Stochastic” indicates that a stochastic hyperparameter rules this competition during training. Unlike auto-encoders, SWTA AE performs variable rate image compression for images of any size after a single training, which is fundamental for compression. For comparison, we also propose a variant of Orthogonal Matching Pursuit (OMP) called Winner-Take-All Orthogonal Matching Pursuit (WTA OMP). In terms of rate-distortion trade-off, SWTA AE outperforms auto-encoders but it is worse than WTA OMP. Besides, SWTA AE can compete with JPEG in terms of rate-distortion.
Index Terms: Image compression, sparse representations, auto-encoders, Orthogonal Matching Pursuit.
1. INTRODUCTION
Auto-encoders are powerful tools for reducing the dimensionality of data. Deep fully-connected auto-encoders [1] are traditionally used for this task. However, two issues have so far prevented them from becoming efficient image compression algorithms: they can only be trained for one image size and one compression rate [2, 3].
[4] attempts to solve both issues. The authors train an auto-encoder on image patches so that images of various sizes can be compressed. Their auto-encoder is a recurrent [5] residual auto-encoder that performs variable rate image compression after a single training. But all image patches then have the same rate and therefore different distortions, because texture complexity varies from patch to patch. In addition, recurrence, which is equivalent to scalability in image compression, is not optimal in terms of rate-distortion trade-off [6, 7].
Instead, we propose to perform learning on whole images under a global rate-distortion constraint. This is done through Winner-Take-All (WTA), which can be viewed as a competition between image patches when computing their representation. Furthermore, the auto-encoder architecture must adapt to different rates. Therefore, during training, the WTA parameter that controls the rate is stochastically driven. These contributions give rise to the Stochastic Winner-Take-All Auto-Encoder (SWTA AE).
This work has been supported by the French Defense Procurement Agency (DGA).
1.1. Notation
Vectors are denoted by bold lower case letters and matrices by upper case ones. X_j denotes the j-th column of a matrix X. ‖X‖_F is the Frobenius norm of X. ‖X‖_0 counts the number of non-zero elements in X. The support of a vector x is supp(x) = {i | x_i ≠ 0}.
2. STOCHASTIC WINNER-TAKE-ALL AUTO-ENCODER (SWTA AE)
We now present our Stochastic Winner-Take-All Auto-Encoder (SWTA AE), whose architecture is shown in Figure 1. SWTA AE is a type of auto-encoder. An auto-encoder is a neural network that takes an input and provides a reconstruction of this input. We justify below two of the most critical choices for the SWTA AE architecture.
2.1. Strided convolution
A compression algorithm must process images of various sizes. However, the most efficient neural networks [8, 9] require that all images have the same size. Indeed, they include both convolutional layers and fully-connected layers, and the number of parameters of the latter directly depends on the image size. This would impose training one architecture per image size. That is why our proposed SWTA AE only contains convolutional layers. Its encoder has two convolutional layers and its decoder has two deconvolutional layers [10]. Each layer i ∈ ⟦1, 4⟧ consists in convolving the layer input with the bank of filters W^{(i)}, adding the biases b^{(i)} and applying a mapping g^{(i)}, producing the layer output. For the borders of the layer input, zero-padding of width p^{(i)} is used.
Fig. 1: SWTA AE architecture.

Max-pooling is a core component of neural networks [11] that downsamples its input representation by applying a max filter to non-overlapping sub-regions. But max-pooling increases the rate. Indeed, if the encoder contains a max-pooling layer, the locations of maximum activations selected during pooling operations must be recorded and transmitted to the corresponding unpooling layer in the decoder [12, 13]. Instead, for i ∈ ⟦1, 2⟧, we downsample using a fixed stride s^{(i)} > 1 for convolution, which does not need any signaling.
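To make concrete why strided convolution needs no signaling, here is a minimal single-filter sketch in NumPy. It is only an illustration of the downsampling mechanism, not the authors' Caffe layers; the filter, bias, stride and padding values below are placeholders.

import numpy as np

def strided_conv2d(x, w, b, stride, pad):
    # Zero-pad the borders of the input, as in each SWTA AE layer.
    x = np.pad(x, pad)
    kh, kw = w.shape
    out_h = (x.shape[0] - kh) // stride + 1
    out_w = (x.shape[1] - kw) // stride + 1
    y = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            y[i, j] = np.sum(patch * w) + b
    # A stride > 1 reduces the spatial resolution without any side
    # information, unlike max-pooling whose argmax locations would have
    # to be transmitted to the decoder.
    return y

# Example: a 9x9 filter with stride 2 and padding 4, purely illustrative.
rng = np.random.default_rng(0)
image = rng.standard_normal((64, 64))
out = strided_conv2d(image, rng.standard_normal((9, 9)), 0.0, stride=2, pad=4)
print(out.shape)  # (32, 32)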
2.2. Semi-sparse bottleneck
The bottleneck is the stack of feature maps denoted Z ∈ R^{h×w×65} in Figure 1. Z is the representation of the input image that is processed in Section 2.3 to give the bitstream.
We propose to apply a global sparsity constraint that provides control over the coding cost of Z. This is called Winner-Take-All (WTA). Let us define WTA via a mapping g_α : R^{h×w×64} → R^{h×w×64}, where α ∈ ]0, 1[ is the WTA parameter. g_α keeps the α × h × w × 64 most representative coefficients in its input tensor, i.e. those whose absolute values are the largest, and sets the rest to 0. g_α only applies to the output of the convolution in the second layer involving the first 64 filters in W^{(2)}, producing the first 64 sparse feature maps in Z. Figure 1 displays these sparse feature maps in orange. Varying α leads to various coding costs of Z. Note that [14] uses WTA, but our WTA rule is different and g_α does not apply to specific dimensions of its input tensor as this constraint is not relevant for image compression.
A patch of the input image might be represented by a portion of the first 64 sparse feature maps in Z that only contains zeros. We want to ensure that each image patch has a minimum code in Z to guarantee a sufficient quality of reconstruction per patch. That is why the last feature map in Z is not sparse. Figure 1 displays it in red. We have noticed that, during the training in Section 4.2, SWTA AE learns to store in the last feature map a subsampled version of its input image.
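As a rough NumPy illustration of the mapping g_α described above (not the authors' Caffe implementation), the sketch below keeps the α × h × w × 64 coefficients with largest absolute value and zeroes the rest; ties at the threshold are kept rather than broken.

import numpy as np

def g_alpha(z, alpha):
    # z has shape (h, w, 64): the outputs of the first 64 filters of the
    # second layer. Keep the alpha fraction of coefficients with largest
    # absolute value, set the others to 0.
    n_keep = int(round(alpha * z.size))
    if n_keep == 0:
        return np.zeros_like(z)
    flat = np.abs(z).ravel()
    # Value of the n_keep-th largest absolute coefficient.
    threshold = np.partition(flat, flat.size - n_keep)[flat.size - n_keep]
    return z * (np.abs(z) >= threshold)

# Example with alpha = 0.2 on a random tensor.
rng = np.random.default_rng(0)
z = rng.standard_normal((32, 32, 64))
z_sparse = g_alpha(z, 0.2)
print(np.count_nonzero(z_sparse) / z.size)  # roughly 0.2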
2.3. Bitstream generation
The coefficients of the non-sparse feature map in Z are uniformly quantized over 8 bits and coded with a Huffman code. The non-zero coefficients of the 64 sparse feature maps in Z are uniformly quantized over 8 bits and coded with a Huffman code while their position is coded as explained hereafter. Figure 1 defines a coordinate system (x, y, z) for Z. The non-zero coefficients in Z are scanned along (x, y, z) where z changes the fastest. The position along z is coded with a fixed-length code and, for each pair (x, y), the number of non-zero coefficients along z is coded with a Huffman code. This unequivocally characterizes the position of each non-zero coefficient in Z. We have observed that this processing is effective in encoding the position of the non-zero coefficients.
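Assuming the sparse part of Z is available as an (h, w, 64) array, the following sketch extracts the two symbol streams described above: for each pair (x, y), the number of non-zero coefficients along z (intended for the Huffman code), and the z index of every non-zero coefficient (intended for the fixed-length code, 6 bits since z ranges over 64 maps). The entropy coders themselves are omitted.

import numpy as np

def position_symbols(z_sparse):
    # Scan the non-zero coefficients along (x, y, z) with z changing fastest.
    counts, z_indices = [], []
    h, w, _ = z_sparse.shape
    for x in range(h):
        for y in range(w):
            nz = np.flatnonzero(z_sparse[x, y, :])
            counts.append(len(nz))           # symbols for the Huffman code
            z_indices.extend(nz.tolist())    # symbols for the fixed-length code
    return counts, z_indices

# Together, the per-(x, y) counts and the ordered z indices characterize
# the position of every non-zero coefficient unambiguously.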
3. WINNER-TAKE-ALL ORTHOGONAL MATCHING PURSUIT (WTA OMP)
SWTA AE is similar to Orthogonal Matching Pursuit (OMP) [15], a common algorithm for image compression using sparse representations [16]. The difference is that SWTA AE computes the sparse representation of an image by alternating convolutions and mappings whereas OMP runs an iterative decomposition of the image patches over a dictionary. More precisely, let x ∈ R^m be an image patch. Given x and a dictionary D ∈ R^{m×n}, OMP finds a vector of coefficients y ∈ R^n with k < m non-zero coefficients so that Dy approximately equals x.
For the sake of comparison, we build a variant of OMP called Winner-Take-All Orthogonal Matching Pursuit (WTA OMP). More precisely, let X ∈ R^{m×p} be a matrix whose columns are formed by p image patches of dimension m and Y ∈ R^{n×p} be a matrix whose columns are formed by p vectors of coefficients of dimension n. WTA OMP first decomposes each image patch over D, see (1). Then, it keeps the γ × n × p coefficients with largest absolute value for the n-length sparse representation of the p patches and sets the rest to 0, see (2). The support of the sparse representation of each patch has therefore been changed. Hence the need for a final least-squares minimization, see (3).
Algorithm 1: WTA OMP
Inputs: X ∈ R^{m×p}, D ∈ R^{m×n}, k < m and γ ∈ ]0, 1[.
  For each j ∈ ⟦1, p⟧, Y_j = OMP(X_j, D, k)   (1)
  I = f_γ(Y)   (2)
  For each j ∈ ⟦1, p⟧, Z_j = min_{z ∈ R^n} ‖X_j − Dz‖_2^2  s.t. supp(z) = supp(I_j)   (3)
Output: Z ∈ R^{n×p}.
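A NumPy sketch of Algorithm 1 follows; the OMP inner routine is a naive textbook version written out for self-containment, since the paper does not specify a particular OMP implementation. Shapes follow the notation above: X is (m, p), D is (m, n), and the returned Z is (n, p).

import numpy as np

def omp(x, D, k):
    # Greedy OMP: select at most k atoms of D to approximate x.
    y = np.zeros(D.shape[1])
    residual, support = x.astype(float).copy(), []
    for _ in range(k):
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        coeffs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        y[:] = 0.0
        y[support] = coeffs
        residual = x - D @ y
    return y

def wta_omp(X, D, k, gamma):
    m, p = X.shape
    n = D.shape[1]
    # (1) OMP decomposition of every patch.
    Y = np.column_stack([omp(X[:, j], D, k) for j in range(p)])
    # (2) Global WTA: keep the gamma * n * p largest-magnitude coefficients.
    n_keep = max(1, int(round(gamma * n * p)))
    flat = np.abs(Y).ravel()
    threshold = np.partition(flat, flat.size - n_keep)[flat.size - n_keep]
    I = Y * (np.abs(Y) >= threshold)
    # (3) Least-squares refit of each patch on its new support.
    Z = np.zeros_like(Y)
    for j in range(p):
        support = np.flatnonzero(I[:, j])
        if support.size:
            coeffs, *_ = np.linalg.lstsq(D[:, support], X[:, j], rcond=None)
            Z[support, j] = coeffs
    return Z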
4. TRAINING
Before moving on to the image compression experiment in Section 5, SWTA AE needs training. Similarly, a dictionary D ∈ R^{m×n} must be learned for WTA OMP.
4.1. Training data extraction
We extract 1.0 × 10^5 RGB images from the ILSVRC2012 ImageNet dataset [17]. The RGB color space is transformed into YCbCr and we only keep the luminance channel.
For SWTA AE, the luminance images are resized to 321 × 321. M ∈ R^{321×321} denotes the mean of all luminance images. σ ∈ R∗+ is the mean of the standard deviations over all luminance images. M is subtracted from each luminance image, which is then divided by σ. These images are concatenated into a training set in R^{321×321×(1.0×10^5)}.
For WTA OMP, η = 1.2 × 10^6 image patches of size √m × √m are randomly sampled from the luminance images. We remove the DC component from each patch. These patches are concatenated into a training set Γ ∈ R^{m×η}.
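A sketch of the patch extraction for the WTA OMP training set, assuming the ITU-R BT.601 luma weights for the Y channel (the text does not state which YCbCr convention is used) and 8×8 patches, i.e. m = 64.

import numpy as np

def rgb_to_luminance(rgb):
    # BT.601 luma; an assumption, the exact transform is not given in the text.
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]

def sample_patches(luminance_images, n_patches, patch_size, rng):
    # Randomly sample square patches and remove the DC component of each,
    # then stack them as the columns of Gamma (shape (m, eta)).
    patches = []
    for _ in range(n_patches):
        img = luminance_images[rng.integers(len(luminance_images))]
        i = rng.integers(img.shape[0] - patch_size + 1)
        j = rng.integers(img.shape[1] - patch_size + 1)
        patch = img[i:i + patch_size, j:j + patch_size].astype(float).ravel()
        patches.append(patch - patch.mean())
    return np.column_stack(patches)

rng = np.random.default_rng(0)
images = [rng.random((321, 321)) for _ in range(4)]   # stand-ins for real luminance images
Gamma = sample_patches(images, n_patches=1000, patch_size=8, rng=rng)
print(Gamma.shape)  # (64, 1000)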
4.2. SWTA AE training
As explained in Section 2.2, α tunes the coding cost of Z. If α is fixed during training, all the filters and the biases of SWTA AE are learned for one rate. That is why we turn α into a stochastic hyperparameter during training. This justifies the prefix “Stochastic” in SWTA AE. Since there is no reason to favor some rates during training, we sample α according to the uniform distribution U[µ − ∆, µ + ∆], where µ − ∆ > 0 and µ + ∆ < 1. We select µ = 1.8 × 10^{-1} and ∆ = 1.7 × 10^{-1} to make the support of α large. At each training epoch, α is drawn for each training image.
As shown in Section 2.1, SWTA AE can process images of various sizes. During training, we feed SWTA AE with random crops of size 49 × 49 of the training images. This accelerates training considerably. The training objective is to minimize the mean squared error between these cropped images and their reconstruction, plus an l2-norm weight decay. We use stochastic gradient descent. The gradient descent learning rate is fixed to 2.0 × 10^{-5}, the momentum is 0.9 and the size of mini-batches is 5. The weight decay coefficient is 5.0 × 10^{-4}. Our implementation is based on Caffe [18]. It adds to Caffe the tools introduced in Sections 2.2 and 2.3.
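A minimal sketch of the stochastic WTA hyperparameter, under the reading of the degraded text adopted above: α drawn uniformly on [µ − ∆, µ + ∆] with µ = 0.18 and ∆ = 0.17, anew for every training image at every epoch. The symbol ∆ and these bounds are taken from that reconstruction, not verified against the original PDF.

import numpy as np

rng = np.random.default_rng(0)
mu, delta = 0.18, 0.17   # reconstructed values; support of alpha is [0.01, 0.35]

def draw_alphas(n_images):
    # One alpha per training image, redrawn at every epoch.
    return rng.uniform(mu - delta, mu + delta, size=n_images)

# Example: alphas for one mini-batch of 5 random 49x49 crops.
alphas = draw_alphas(5)
print(alphas)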
4.3. Dictionary learning for WTA OMP
Given Γ, k < m and γ ∈ ]0, 1[, the dictionary learning problem is formulated as (4).

  min_{D, Z_1, ..., Z_η} (1/η) Σ_{j=1}^{η} ‖Γ_j − D Z_j‖_2^2
  s.t. ∀j ∈ ⟦1, η⟧, ‖Z_j‖_0 ≤ k
  s.t. Σ_{j=1}^{η} ‖Z_j‖_0 ≤ γ × n × η   (4)
(4) is solved by Algorithm 2, which alternates between sparse coding steps that involve WTA OMP and dictionary updates that use stochastic gradient descent. Given Γ and p ∈ N∗+, let φ be a function that randomly partitions Γ into η_p = η / p mini-batches {X^{(1)}, ..., X^{(η_p)}}, where, for i ∈ ⟦1, η_p⟧, X^{(i)} ∈ R^{m×p}. Mini-batches make learning very fast [19].
Algorithm 2: dictionary learning for WTA OMP.
Inputs: Γ ∈ R^{m×η}, k < m, γ ∈ ]0, 1[, p ∈ N∗+ and ε ∈ R∗+.
  D ∈ R^{m×n} is randomly initialized.
  ∀j ∈ ⟦1, n⟧, D_j ← D_j / ‖D_j‖_2
  For several epochs do:
    [X^{(1)}, ..., X^{(η_p)}] = φ(Γ, p)
    ∀i ∈ ⟦1, η_p⟧:
      Z^{(i)} = WTA OMP(X^{(i)}, D, k, γ)
      D ← D − ε ∇_D ‖X^{(i)} − D Z^{(i)}‖_F^2
      ∀j ∈ ⟦1, n⟧, D_j ← D_j / ‖D_j‖_2
Output: D ∈ R^{m×n}.
For OMP, given Γ, a dictionary D′ ∈ R^{m×n} is learned using K-SVD [16]¹, and the parameters m and n are optimized with an exhaustive search. This leads to m = 64 and n = 1024. For SWTA AE, the same values for m and n are used for training D via Algorithm 2. Moreover, k = 15, γ = 4.5 × 10^{-3}, p = 10 and ε = 2.0 × 10^{-2}.
¹ K-SVD code: http://www.cs.technion.ac.il/~elad/software/
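A sketch of one epoch of Algorithm 2 in NumPy, reusing the wta_omp sketch given after Algorithm 1. The gradient of ‖X − DZ‖_F^2 with respect to D is −2(X − DZ)Zᵀ; the factor 2 is absorbed into ε here. This illustrates the alternation only and is not the authors' implementation.

import numpy as np
# wta_omp is the sketch defined after Algorithm 1.

def dictionary_learning_epoch(Gamma, D, k, gamma, p, eps, rng):
    m, eta = Gamma.shape
    perm = rng.permutation(eta)              # random partition into mini-batches (phi)
    for start in range(0, eta - p + 1, p):
        X = Gamma[:, perm[start:start + p]]  # mini-batch of p patches
        Z = wta_omp(X, D, k, gamma)          # sparse coding step
        D += eps * (X - D @ Z) @ Z.T         # gradient step on ||X - DZ||_F^2
        D /= np.linalg.norm(D, axis=0, keepdims=True)  # renormalize each atom
    return D

# Example values reported in the text: k = 15, gamma = 4.5e-3, p = 10,
# eps = 2.0e-2, with m = 64 and n = 1024.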

Fig. 2: Evolution of PSNR with the rate.
(a) LENA luminance 512 × 512. (b) BARBARA luminance 480 × 384.
5. IMAGE COMPRESSION EXPERIMENT
After training in Section 4, we compare the rate-distortion curves of OMP, WTA OMP, SWTA AE, JPEG and JPEG2000 on test luminance images.
5.1. Image CODEC for SWTA AE
Each input test luminance image is pre-processed similarly to the training in Section 4.1. The learned mean image M is interpolated to match the size of the input image. Then, this interpolated mean image is subtracted from the input image, and the result is divided by the learned σ. The encoder of SWTA AE computes Z. The bitstream is obtained by processing Z as detailed in Section 2.3.
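A sketch of this pre-processing, assuming bilinear interpolation of the learned mean image (the interpolation method is not specified in the text) via scipy.ndimage.zoom.

import numpy as np
from scipy.ndimage import zoom

def preprocess(luminance, M, sigma):
    # M is the 321x321 mean image learned in Section 4.1, sigma the learned scalar.
    h, w = luminance.shape
    M_resized = zoom(M, (h / M.shape[0], w / M.shape[1]), order=1)  # bilinear
    return (luminance.astype(float) - M_resized) / sigma

# Example: pre-process a 512x512 test image with placeholder statistics.
rng = np.random.default_rng(0)
test_image = rng.random((512, 512))
M, sigma = rng.random((321, 321)), 0.25
x = preprocess(test_image, M, sigma)
print(x.shape)  # (512, 512)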
5.2. Image CODEC for OMP and WTA OMP
A luminance image is split into 8×8 non-overlapping patches. The DC component is removed from each patch. The DC components are uniformly quantized over 8 bits and coded with a fixed-length code. OMP (or WTA OMP) finds the coefficients of the sparse decompositions of the image patches over D′ (or D). The non-zero coefficients are uniformly quantized over 8 bits and coded with a Huffman code while their position is coded with a fixed-length code. Then, for WTA OMP only, the number of non-zero coefficients of the sparse decomposition of each patch over D is coded with a Huffman code.
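A sketch of the patch splitting and the uniform 8-bit quantization used in this CODEC; the quantizer range [v_min, v_max] is a free choice here (the text does not say how it is set), so it is passed explicitly.

import numpy as np

def split_patches_8x8(image):
    # Non-overlapping 8x8 patches as columns; the DC component of each
    # patch is removed and returned separately.
    h, w = image.shape
    patches, dc = [], []
    for i in range(0, h - h % 8, 8):
        for j in range(0, w - w % 8, 8):
            patch = image[i:i + 8, j:j + 8].astype(float).ravel()
            dc.append(patch.mean())
            patches.append(patch - patch.mean())
    return np.column_stack(patches), np.array(dc)

def quantize_uniform_8bit(values, v_min, v_max):
    # Map [v_min, v_max] onto 256 uniform levels; returns the 8-bit indices
    # (the symbols that are entropy coded) and their dequantized values.
    step = (v_max - v_min) / 255.0
    indices = np.clip(np.round((values - v_min) / step), 0, 255).astype(np.uint8)
    return indices, v_min + indices * step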
5.3. Comparison of rate-distortion curves
In the literature, there is no reference rate-distortion curve for auto-encoders. We compare SWTA AE with JPEG and JPEG2000², even though the image CODEC of SWTA AE is less optimized. Furthermore, we compare SWTA AE with its non-sparse Auto-Encoder counterpart (AE). AE has the same architecture as SWTA AE but its Z only contains non-sparse feature maps. Note that, to draw a new point in the AE rate-distortion curve, AE must first be re-trained with a different number of feature maps in Z.
Figure 2 shows the rate-distortion curves of OMP, WTA OMP, AE, SWTA AE, JPEG and JPEG2000 for two of the most common images: LENA and BARBARA. In terms of rate-distortion trade-off, SWTA AE outperforms AE and WTA OMP is better than OMP. This highlights the value of WTA for image compression. When we compare SWTA AE with WTA OMP, we see that iterative decomposition is more efficient for image compression using sparse representations. Moreover, SWTA AE can compete with JPEG. We also ran this image compression experiment on several crops of LENA and BARBARA and observed that the relative positions of the six rate-distortion curves were comparable to those in Figure 2. The size of the test image does not affect the performance of SWTA AE. More simulation results and a complexity analysis for OMP, WTA OMP and SWTA AE can be found on the web page https://www.irisa.fr/temics/demos/NeuralNets/AutoEncoders/swtaAE.htm.
² JPEG and JPEG2000 code: http://www.imagemagick.org/script/index.php
6. CONCLUSIONS AND FUTURE WORK
We have shown that SWTA AE is better adapted to image compression than auto-encoders, as it performs variable rate image compression for any image size after a single training and provides better rate-distortion trade-offs.
So far, our work has focused on the layer of auto-encoders which is dedicated to coding. Yet, many avenues of research are still to be explored to improve auto-encoders for image compression. For instance, [20] proves that removing a max-pooling layer and increasing the stride of the previous convolution, as we do, harms neural networks. This has to be addressed.

Citations
Proceedings ArticleDOI
14 May 2017
TL;DR: In this study, asymmetric autoencoders are explored — unequal number of encoders and decoders — and found to be more accurate compared to traditional symmetrically stacked autoencoders for classification accuracy and also yield slightly better results on compression problems.
Abstract: Traditional stacked autoencoders have an equal number of encoders and decoders. However, while fine-tuned as a deep neural network the decoder portion is detached and never used. This begs the question: ‘do we need equal number of decoders and encoders’? In this study we explore asymmetric autoencoders — unequal number of encoders and decoders. We specifically address two tasks — 1. Classification capacity as deep neural network and 2. Compressibility of stacked autoencoder. For both the problems, our asymmetric autoencoders have several encoders but a single decoder. We find that such autoencoders are more accurate compared to traditional symmetrically stacked autoencoders for classification accuracy and also yield slightly better results on compression problems.

25 citations


Cites background or methods from "Image compression with Stochastic W..."

  • ...In [7], convolutional autoencoder has been used for image compression....


  • ...Very recent studies [6, 7], have shown how autoencoders can be used for compression....


  • ...Let us reiterate that autoencoder based compression is a new topic, both [6, 7] are yet to be published....


Journal ArticleDOI
TL;DR: The results show that the proposed model can better capture the interactive driving behaviour and outperforms the state-of-the-art methods in root-weighted square error of displacement and velocity.
Abstract: Autonomous vehicles need to have the ability to predict the motion of surrounding vehicles, which will help to avoid potential accidents and make the best decision to ensure safety and comfort. The interactions among vehicles and those between them and the uncertainty of driving intention make trajectory prediction a challenging task. This study presents a long short-term memory (LSTM) model for the task of trajectory prediction to account for both the mutual information and the multi-modal intention. The model consists of a data fusion encoder and a multi-modal decoder. The data fusion encoder summarises the mutual information by multi-LSTM with shared parameters and the multi-modal decoder generates trajectories based on driving intention. In addition, mixture density network is added to output a probabilistic prediction which improves the reliability of prediction results. NGSIM data set is used for training and testing. The results show that the proposed model can better capture the interactive driving behaviour and outperforms the state-of-the-art methods in root-weighted square error of displacement and velocity.

17 citations

Journal ArticleDOI
TL;DR: In this paper, the authors classify ML-based image compression frameworks into subgroups based on their architectures, including variational auto-encoders (VAEs), CNNs, recurrent neural networks (RNNs), long short-term memory (LSTMs), gated recurrent units (GRUs), generative adversarial networks (GANs), transformers, principal component analysis (PCA), and fuzzy means clustering.

9 citations

Book ChapterDOI
13 Dec 2018
TL;DR: The neural image encoding approach has various low-level image processing applications ranging from image encoding, image compression and image denoising to image resampling and image completion and its superiority over standard baselines is demonstrated.
Abstract: We propose a deep neural network for mapping the 2D pixel coordinates in an image to the corresponding RGB color values. The neural network is termed CocoNet, i.e. coordinates-to-color network. During the training process, the neural network learns to encode the input image within its layers, i.e. it learns a continuous function that approximates the discrete RGB values sampled over the discrete 2D pixel locations. At test time, given a 2D pixel coordinate, the neural network will output the RGB values of the corresponding pixel. By considering every 2D pixel location, the network can actually reconstruct the entire learned image. We note that we have to train an individual neural network for each input image, i.e. one network encodes a single image. Our neural image encoding approach has various low-level image processing applications ranging from image denoising to image resampling and image completion. Our code is available at https://github.com/paubric/python-fuse-coconet.

9 citations


Cites background from "Image compression with Stochastic W..."

  • ...The neural models that map pixels to pixels are usually applied on tasks such as image compression [1,2,4,6, 21,24,34], image denoising and restoration [20,38,41], image super-resolution [5, 14, 18, 20, 28, 29, 38, 39], image completion [12, 40] and image generation [11, 35]....


  • ...[6] address image compression using sparse representations, by proposing a stochastic winner-takes-all autoencoder in which image patches compete with one another when their sparse representation is computed....


Posted Content
TL;DR: CocoNet as discussed by the authors is a deep neural network approach for mapping the 2D pixel coordinates in an image to the corresponding Red-Green-Blue (RGB) color values, which is termed as coordinates-to-color network.
Abstract: In this paper, we propose a deep neural network approach for mapping the 2D pixel coordinates in an image to the corresponding Red-Green-Blue (RGB) color values. The neural network is termed CocoNet, i.e. coordinates-to-color network. During the training process, the neural network learns to encode the input image within its layers. More specifically, the network learns a continuous function that approximates the discrete RGB values sampled over the discrete 2D pixel locations. At test time, given a 2D pixel coordinate, the neural network will output the approximate RGB values of the corresponding pixel. By considering every 2D pixel location, the network can actually reconstruct the entire learned image. It is important to note that we have to train an individual neural network for each input image, i.e. one network encodes a single image only. To the best of our knowledge, we are the first to propose a neural approach for encoding images individually, by learning a mapping from the 2D pixel coordinate space to the RGB color space. Our neural image encoding approach has various low-level image processing applications ranging from image encoding, image compression and image denoising to image resampling and image completion. We conduct experiments that include both quantitative and qualitative results, demonstrating the utility of our approach and its superiority over standard baselines, e.g. bilateral filtering or bicubic interpolation. Our code is available at this https URL.

8 citations

References
Proceedings Article
03 Dec 2012
TL;DR: The state-of-the-art performance of CNNs was achieved by Deep Convolutional Neural Networks (DCNNs) as discussed by the authors, which consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

73,978 citations

Proceedings ArticleDOI
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei
20 Jun 2009
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Abstract: The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.

49,639 citations


"Image compression with Stochastic W..." refers methods in this paper

  • ...We extract 1.0 × 10^5 RGB images from the ILSVRC2012 ImageNet dataset [17]....


Journal ArticleDOI
28 Jul 2006-Science
TL;DR: In this article, an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data is described.
Abstract: High-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors. Gradient descent can be used for fine-tuning the weights in such "autoencoder" networks, but this works well only if the initial weights are close to a good solution. We describe an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data.

16,717 citations

Journal ArticleDOI
TL;DR: Quantitative assessments show that SegNet provides good performance with competitive inference time and most efficient inference memory-wise as compared to other architectures, including FCN and DeconvNet.
Abstract: We present a novel and practical deep fully convolutional neural network architecture for semantic pixel-wise segmentation termed SegNet. This core trainable segmentation engine consists of an encoder network, a corresponding decoder network followed by a pixel-wise classification layer. The architecture of the encoder network is topologically identical to the 13 convolutional layers in the VGG16 network [1] . The role of the decoder network is to map the low resolution encoder feature maps to full input resolution feature maps for pixel-wise classification. The novelty of SegNet lies is in the manner in which the decoder upsamples its lower resolution input feature map(s). Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need for learning to upsample. The upsampled maps are sparse and are then convolved with trainable filters to produce dense feature maps. We compare our proposed architecture with the widely adopted FCN [2] and also with the well known DeepLab-LargeFOV [3] , DeconvNet [4] architectures. This comparison reveals the memory versus accuracy trade-off involved in achieving good segmentation performance. SegNet was primarily motivated by scene understanding applications. Hence, it is designed to be efficient both in terms of memory and computational time during inference. It is also significantly smaller in the number of trainable parameters than other competing architectures and can be trained end-to-end using stochastic gradient descent. We also performed a controlled benchmark of SegNet and other architectures on both road scenes and SUN RGB-D indoor scene segmentation tasks. These quantitative assessments show that SegNet provides good performance with competitive inference time and most efficient inference memory-wise as compared to other architectures. We also provide a Caffe implementation of SegNet and a web demo at http://mi.eng.cam.ac.uk/projects/segnet/ .

13,468 citations


"Image compression with Stochastic W..." refers background in this paper

  • ...Indeed, if the encoder contains a maxpooling layer, the locations of maximum activations selected during pooling operations must be recorded and transmitted to the corresponding unpooling layer in the decoder [12, 13]....


Book ChapterDOI
06 Sep 2014
TL;DR: A novel visualization technique is introduced that gives insight into the function of intermediate feature layers and the operation of the classifier in large Convolutional Network models, used in a diagnostic role to find model architectures that outperform Krizhevsky et al on the ImageNet classification benchmark.
Abstract: Large Convolutional Network models have recently demonstrated impressive classification performance on the ImageNet benchmark Krizhevsky et al. [18]. However there is no clear understanding of why they perform so well, or how they might be improved. In this paper we explore both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. Used in a diagnostic role, these visualizations allow us to find model architectures that outperform Krizhevsky et al on the ImageNet classification benchmark. We also perform an ablation study to discover the performance contribution from different model layers. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256 datasets.

12,783 citations


"Image compression with Stochastic W..." refers background in this paper

  • ...Indeed, if the encoder contains a maxpooling layer, the locations of maximum activations selected during pooling operations must be recorded and transmitted to the corresponding unpooling layer in the decoder [12, 13]....

