What are the contributions in "Image compression with stochastic winner-take-all auto-encoder" ?

This paper addresses the problem of image compression using sparse representations. The authors propose a variant of autoencoder called Stochastic Winner-Take-All Auto-Encoder ( SWTA AE ). For comparison, the authors also propose a variant of Orthogonal Matching Pursuit ( OMP ) called Winner-Take-All Orthogonal Matching Pursuit ( WTA OMP ).

What is the coding objective of the algorithm?

Given $\Gamma$ and $p \in \mathbb{N}_+^*$, let $\phi$ be a function that randomly partitions $\Gamma$ into $\eta_p = \eta / p$ mini-batches $\{X^{(1)}, \dots, X^{(\eta_p)}\}$, where, for $i \in [\![1, \eta_p]\!]$, $X^{(i)} \in \mathbb{R}^{m \times p}$.

What is the structure of the layer input?

Each layer i ∈ J1, 4K consists in convolving the layer input with the bank of filters W(i), adding the biases b(i) and applying a mapping g(i), producing the layer output.

What is the code for the WTA OMP?

For WTA OMP only, the number of non-zero coefficients of the sparse decomposition of each patch over D is coded with a Huffman code.

What is the simplest way to decompose a matrix?

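For each $j \in [\![1, p]\!]$, $Y_j = \mathrm{OMP}(X_j, D, k)$ (1). $I = f_\gamma(Y)$ (2). For each $j \in [\![1, p]\!]$, $Z_j = \arg\min_{z \in \mathbb{R}^n} \|X_j - Dz\|_2^2$ s.t. $\mathrm{supp}(z) = \mathrm{supp}(I_j)$ (3). Output: $Z \in \mathbb{R}^{n \times p}$.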

Who is the author of this article?

The authors are Thierry Dumas, Aline Roumy and Christine Guillemot (INRIA Rennes Bretagne-Atlantique).

What is the coding objective of the problem?

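Given $\Gamma$, $k < m$ and $\gamma \in ]0, 1[$, the dictionary learning problem is formulated as (4): $\min_{D, Z_1, \dots, Z_\eta} \frac{1}{\eta} \sum_{j=1}^{\eta} \|\Gamma_j - D Z_j\|_2^2$ s.t. $\forall j \in [\![1, \eta]\!]$, $\|Z_j\|_0 \le k$ and s.t. $\sum_{j=1}^{\eta} \|Z_j\|_0 \le \gamma \times n \times \eta$.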

(Open Access) Image compression with Stochastic Winner-Take-All Auto-Encoder (2017) | Thierry Dumas

Q: What is the code for the last feature map in Z?

The position along z is coded with a fixed-length code and, for each pair (x, y), the number of non-zero coefficients along z is coded with a Huffman code.

Q: What is the objective of the training?

The training objective is to minimize the mean squared error between these cropped images and their reconstruction plus l2-norm weights decay.

Q: What is the definition of a coding constraint?

Max-pooling is a core component of neural networks [11] that downsamples its input representation by applying a max filter to non-overlapping sub-regions.

Q: What is the solution to the coding problem?

Problem (4), whose global constraint is $\sum_{j=1}^{\eta} \|Z_j\|_0 \le \gamma \times n \times \eta$, is solved by Algorithm 2, which alternates between sparse coding steps that involve WTA OMP and dictionary updates that use stochastic gradient descent.

Q: What is the code for the non-zero coefficients?

The non-zero coefficients are uniformly quantized over 8-bits and coded with a Huffman code while their position is coded with a fixed-length code.

HAL Id: hal-01493137

https://hal.archives-ouvertes.fr/hal-01493137

Submitted on 21 Mar 2017


Image Compression with Stochastic Winner-Take-All Auto-Encoder

Thierry Dumas, Aline Roumy, Christine Guillemot

To cite this version:

Thierry Dumas, Aline Roumy, Christine Guillemot. Image Compression with Stochastic Winner-Take-All Auto-Encoder. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017), Mar 2017, New Orleans, United States. hal-01493137

IMAGE COMPRESSION WITH STOCHASTIC WINNER-TAKE-ALL AUTO-ENCODER

Thierry Dumas, Aline Roumy, Christine Guillemot

INRIA Rennes Bretagne-Atlantique

thierry.dumas@inria.fr, aline.roumy@inria.fr, christine.guillemot@inria.fr

ABSTRACT

This paper addresses the problem of image compression using sparse representations. We propose a variant of auto-encoder called Stochastic Winner-Take-All Auto-Encoder (SWTA AE). “Winner-Take-All” means that image patches compete with one another when computing their sparse representation and “Stochastic” indicates that a stochastic hyperparameter rules this competition during training. Unlike auto-encoders, SWTA AE performs variable rate image compression for images of any size after a single training, which is fundamental for compression. For comparison, we also propose a variant of Orthogonal Matching Pursuit (OMP) called Winner-Take-All Orthogonal Matching Pursuit (WTA OMP). In terms of rate-distortion trade-off, SWTA AE outperforms auto-encoders but it is worse than WTA OMP. Besides, SWTA AE can compete with JPEG in terms of rate-distortion.

Index Terms: Image compression, sparse representations, auto-encoders, Orthogonal Matching Pursuit.

1. INTRODUCTION

Auto-encoders are powerful tools for reducing the dimensionality of data. Deep fully-connected auto-encoders [1] are traditionally used for this task. However, two issues have so far prevented them from becoming efficient image compression algorithms: they can only be trained for one image size and one compression rate [2, 3].

[4] attempts to solve both issues. The authors train an auto-encoder on image patches so that images of various sizes can be compressed. Their auto-encoder is a recurrent [5] residual auto-encoder that performs variable rate image compression after a single training. But all image patches have the same rate and therefore different distortions, because texture complexity varies across image patches. In addition, recurrence, which is equivalent to scalability in image compression, is not optimal in terms of rate-distortion trade-off [6, 7].

Instead, we propose to perform learning on whole images under a global rate-distortion constraint. This is done through Winner-Take-All (WTA), which can be viewed as a competition between image patches when computing their representation. Furthermore, the auto-encoder architecture must adapt to different rates. Therefore, during training, the WTA parameter that controls the rate is stochastically driven. These contributions give rise to Stochastic Winner-Take-All Auto-Encoder (SWTA AE).

This work has been supported by the French Defense Procurement Agency (DGA).

1.1. Notation

Vectors are denoted by bold lower case letters and matrices by upper case ones. $X_j$ denotes the $j$-th column of a matrix $X$. $\|X\|_F$ is the Frobenius norm of $X$. $\|X\|_0$ counts the number of non-zero elements in $X$. The support of a vector $x$ is $\mathrm{supp}(x) = \{i \mid x_i \neq 0\}$.

2. STOCHASTIC WINNER-TAKE-ALL AUTO-ENCODER (SWTA AE)

We now present our Stochastic Winner-Take-All Auto-Encoder (SWTA AE) whose architecture is shown in Figure 1. SWTA AE is a type of auto-encoder. An auto-encoder is a neural network that takes an input and provides a reconstruction of this input. We justify below two of the most critical choices for the SWTA AE architecture.

2.1. Strided convolution

A compression algorithm must process images of various sizes. However, the most efficient neural networks [8, 9] require that all images have the same size. Indeed, they include both convolutional layers and fully-connected layers, and the number of parameters of the latter directly depends on the image size. This requires training one architecture per image size. That is why our proposed SWTA AE only contains convolutional layers. Its encoder has two convolutional layers and its decoder has two deconvolutional layers [10]. Each layer $i \in [\![1, 4]\!]$ consists in convolving the layer input with the bank of filters $W^{(i)}$, adding the biases $b^{(i)}$ and applying a mapping $g^{(i)}$, producing the layer output. For the borders of the layer input, zero-padding of width $p^{(i)}$ is used.

Fig. 1: SWTA AE architecture.

Max-pooling is a core component of neural networks [11] that downsamples its input representation by applying a max filter to non-overlapping sub-regions. But max-pooling increases the rate. Indeed, if the encoder contains a max-pooling layer, the locations of maximum activations selected during pooling operations must be recorded and transmitted to the corresponding unpooling layer in the decoder [12, 13]. Instead, for $i \in [\![1, 2]\!]$, we downsample using a fixed stride $s^{(i)} > 1$ for convolution, which does not need any signaling.
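To illustrate why a purely convolutional encoder with fixed strides handles any image size without side information, the sketch below computes the spatial size of the bottleneck from the standard convolution output-size formula. The filter widths, strides and paddings in `layers` are hypothetical values chosen for illustration; they are not taken from the paper.

```python
def conv_output_size(n, f, s, p):
    """Size after convolving an n-wide input with an f-wide filter,
    stride s and zero-padding of width p on each border."""
    return (n + 2 * p - f) // s + 1


def encoder_output_size(n, layers):
    """Chain the formula over the encoder layers, given as (f, s, p) triples."""
    for f, s, p in layers:
        n = conv_output_size(n, f, s, p)
    return n


# Hypothetical two-layer encoder (illustrative values, not the paper's).
layers = [(5, 2, 2), (5, 2, 2)]

# The same filters apply to a 49-wide training crop or to a full test image.
sizes = {n: encoder_output_size(n, layers) for n in (49, 321, 512)}
```

The decoder can invert this downsampling from the strides alone, whereas an unpooling layer would additionally need the transmitted locations of the maxima.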

2.2. Semi-sparse bottleneck

The bottleneck is the stack of feature maps denoted $Z \in \mathbb{R}^{h \times w \times 65}$ in Figure 1. Z is the representation of the input image that is processed in Section 2.3 to give the bitstream.

We propose to apply a global sparse constraint that provides control over the coding cost of Z. This is called Winner-Take-All (WTA). Let us define WTA via a mapping $g_\alpha : \mathbb{R}^{h \times w \times 64} \to \mathbb{R}^{h \times w \times 64}$, where $\alpha \in ]0, 1[$ is the WTA parameter. $g_\alpha$ keeps the $\alpha \times h \times w \times 64$ most representative coefficients in its input tensor, i.e. those whose absolute values are the largest, and sets the rest to 0. $g_\alpha$ only applies to the output of the convolution in the second layer involving the first 64 filters in $W^{(2)}$, producing the first 64 sparse feature maps in Z. Figure 1 displays these sparse feature maps in orange. Varying $\alpha$ leads to various coding costs of Z. Note that [14] uses WTA, but our WTA rule is different and $g_\alpha$ does not apply to specific dimensions of its input tensor as this constraint is not relevant for image compression.
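A minimal NumPy sketch of the WTA mapping described above: it keeps the largest-magnitude entries of a tensor and zeroes the rest. The function name `wta`, the rounding of $\alpha \times h \times w \times 64$ to the nearest integer and the arbitrary handling of ties are assumptions made for illustration.

```python
import numpy as np


def wta(t, alpha):
    """Keep the round(alpha * t.size) entries of largest magnitude, zero the rest."""
    n_keep = int(round(alpha * t.size))
    out = np.zeros_like(t)
    if n_keep == 0:
        return out
    magnitudes = np.abs(t).ravel()  # C-order flattening, matches .flat below
    # Indices of the n_keep largest magnitudes (order among them is irrelevant).
    kth = magnitudes.size - n_keep
    keep = np.argpartition(magnitudes, kth)[kth:]
    out.flat[keep] = t.flat[keep]
    return out
```

Because the selection is global over the whole tensor, a textured region of the image can retain many coefficients while a flat region retains few or none, which is the competition between patches that the text refers to.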

A patch of the input image might be represented by a portion of the first 64 sparse feature maps in Z that only contains zeros. We want to ensure that each image patch has a minimum code in Z to guarantee a sufficient quality of reconstruction per patch. That is why the last feature map in Z is not sparse. Figure 1 displays it in red. We have noticed that, during the training in Section 4.2, SWTA AE learns to store in the last feature map a subsampled version of its input image.

2.3. Bitstream generation

The coefficients of the non-sparse feature map in Z are uniformly quantized over 8-bits and coded with a Huffman code. The non-zero coefficients of the 64 sparse feature maps in Z are uniformly quantized over 8-bits and coded with a Huffman code while their position is coded as explained hereafter. Figure 1 defines a coordinate system (x, y, z) for Z. The non-zero coefficients in Z are scanned along (x, y, z) where z changes the fastest. The position along z is coded with a fixed-length code and, for each pair (x, y), the number of non-zero coefficients along z is coded with a Huffman code. This unequivocally characterizes the position of each non-zero coefficient in Z. We have observed that this processing is effective in encoding the position of the non-zero coefficients.
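The position coding above can be sketched as follows: the per-(x, y) counts are the symbols that would go to the Huffman coder, and the z indices are the symbols that would be written with a fixed-length code. The function names, the mapping of array axes 0 and 1 to the pair (x, y) and the inverse `rebuild_mask` are illustrative assumptions; the entropy coder itself is not shown.

```python
import numpy as np


def position_symbols(z_sparse):
    """z_sparse has shape (h, w, c). Returns the number of non-zeros per
    (x, y) pair and the z index of every non-zero, with z scanned fastest."""
    mask = z_sparse != 0
    counts = mask.sum(axis=2).ravel()  # one symbol per (x, y) pair
    z_idx = np.nonzero(mask)[2]        # C-order scan: last axis changes fastest
    return counts, z_idx


def fixed_length_bits(z_idx, c):
    """Bits spent on z positions with a fixed-length code over c feature maps."""
    return z_idx.size * (c - 1).bit_length()


def rebuild_mask(counts, z_idx, shape):
    """Decoder side: recover the positions of the non-zeros from the symbols."""
    h, w, c = shape
    mask = np.zeros((h * w, c), dtype=bool)
    pair = np.repeat(np.arange(h * w), counts)  # (x, y) pair of each non-zero
    mask[pair, z_idx] = True
    return mask.reshape(shape)
```

With 64 sparse feature maps, each z index costs 6 bits, and the counts alone tell the decoder how many of those indices belong to each pair (x, y).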

3. WINNER-TAKE-ALL ORTHOGONAL MATCHING PURSUIT (WTA OMP)

SWTA AE is similar to Orthogonal Matching Pursuit (OMP) [15], a common algorithm for image compression using sparse representations [16]. The difference is that SWTA AE computes the sparse representation of an image by alternating convolutions and mappings whereas OMP runs an iterative decomposition of the image patches over a dictionary. More precisely, let $x \in \mathbb{R}^m$ be an image patch. Given $x$ and a dictionary $D \in \mathbb{R}^{m \times n}$, OMP finds a vector of coefficients $y \in \mathbb{R}^n$ with $k < m$ non-zero coefficients so that $Dy$ approximates $x$.

For the sake of comparison, we build a variant of OMP called Winner-Take-All Orthogonal Matching Pursuit (WTA OMP). More precisely, let $X \in \mathbb{R}^{m \times p}$ be a matrix whose columns are formed by p image patches of dimension m and $Y \in \mathbb{R}^{n \times p}$ be a matrix whose columns are formed by p vectors of coefficients of dimension n. WTA OMP first decomposes each image patch over D, see (1). Then, it keeps the $\gamma \times n \times p$ coefficients with largest absolute value for the n-length sparse representation of the p patches and sets the rest to 0, see (2). The support of the sparse representation of each patch has therefore been changed. Hence the need for a final least-square minimization, see (3).

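Algorithm 1: WTA OMP

Inputs: $X \in \mathbb{R}^{m \times p}$, $D \in \mathbb{R}^{m \times n}$, $k < m$ and $\gamma \in ]0, 1[$.

For each $j \in [\![1, p]\!]$, $Y_j = \mathrm{OMP}(X_j, D, k)$ (1)

$I = f_\gamma(Y)$ (2)

For each $j \in [\![1, p]\!]$, $Z_j = \arg\min_{z \in \mathbb{R}^n} \|X_j - Dz\|_2^2$ s.t. $\mathrm{supp}(z) = \mathrm{supp}(I_j)$ (3)

Output: $Z \in \mathbb{R}^{n \times p}$.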
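The three steps of Algorithm 1 can be sketched in NumPy as below. This is an illustrative reading under stated assumptions, not the authors' implementation: `omp` is a plain greedy OMP written for this example, the atoms of `D` are assumed to have unit norm, and the number of kept coefficients is taken as $\gamma \times n \times p$ rounded to the nearest integer.

```python
import numpy as np


def omp(x, D, k):
    """Greedy OMP: approximate x with at most k atoms (columns) of D."""
    support, coef, residual = [], np.zeros(0), x.copy()
    for _ in range(k):
        if np.linalg.norm(residual) < 1e-10:
            break
        j = int(np.argmax(np.abs(D.T @ residual)))  # most correlated atom
        if j in support:
            break
        support.append(j)
        # Re-fit all selected coefficients by least squares.
        coef = np.linalg.lstsq(D[:, support], x, rcond=None)[0]
        residual = x - D[:, support] @ coef
    y = np.zeros(D.shape[1])
    if support:
        y[support] = coef
    return y


def wta_omp(X, D, k, gamma):
    n, p = D.shape[1], X.shape[1]
    # (1) decompose each patch independently over D
    Y = np.column_stack([omp(X[:, j], D, k) for j in range(p)])
    # (2) keep the gamma * n * p largest-magnitude coefficients across all patches
    n_keep = int(round(gamma * n * p))
    I = np.zeros_like(Y)
    if n_keep > 0:
        keep = np.argsort(np.abs(Y), axis=None)[-n_keep:]
        I.flat[keep] = Y.flat[keep]
    # (3) least-squares re-fit of each patch on its new support
    Z = np.zeros_like(Y)
    for j in range(p):
        s = np.flatnonzero(I[:, j])
        if s.size:
            Z[s, j] = np.linalg.lstsq(D[:, s], X[:, j], rcond=None)[0]
    return Z
```

Step (2) lets patches compete: a patch whose coefficients are all small may end up with an empty support, while a complex patch keeps up to k atoms.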

4. TRAINING

Before moving on to the image compression experiment in Section 5, SWTA AE needs training. Similarly, a dictionary $D \in \mathbb{R}^{m \times n}$ must be learned for WTA OMP.

4.1. Training data extraction

We extract $1.0 \times 10^{[\ldots]}$ RGB images from the ILSVRC2012 ImageNet dataset [17]. The RGB color space is transformed into YCbCr and we only keep the luminance channel.

For SWTA AE, the luminance images are resized to 321 × 321. $M \in \mathbb{R}^{321 \times 321}$ denotes the mean of all luminance images. $\sigma \in \mathbb{R}_+^*$ is the mean of the standard deviation over all luminance images. M is subtracted from each luminance image, and the result is divided by $\sigma$. These images are concatenated into a training set $\Delta \in \mathbb{R}^{321 \times 321 \times (1.0 \times 10^{[\ldots]})}$.

For WTA OMP, $\eta = 1.2 \times 10^{[\ldots]}$ image patches of size $\sqrt{m} \times \sqrt{m}$ are randomly sampled from the luminance images. We remove the DC component from each patch. These patches are concatenated into a training set $\Gamma \in \mathbb{R}^{m \times \eta}$.
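The two normalizations above can be sketched as follows. The function names are hypothetical, and $\sigma$ is read here as the mean of the per-image standard deviations, which is one plausible interpretation of the text.

```python
import numpy as np


def normalize_images(images):
    """images: stack of luminance images with shape (N, H, W).
    Returns the normalized stack, the mean image M and the scalar sigma."""
    M = images.mean(axis=0)
    # Assumed reading: one standard deviation per image, then their mean.
    sigma = images.std(axis=(1, 2)).mean()
    return (images - M) / sigma, M, sigma


def remove_dc(patches):
    """patches: (m, eta) matrix with one vectorized patch per column.
    Returns the zero-mean patches and the removed DC value of each patch."""
    dc = patches.mean(axis=0, keepdims=True)
    return patches - dc, dc.ravel()
```

The mean image `M` and `sigma` are kept after training because the codec in Section 5.1 reuses them on test images.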

4.2. SWTA AE training

As explained in Section 2.2, $\alpha$ tunes the coding cost of Z. If $\alpha$ is fixed during training, all the filters and the biases of SWTA AE are learned for one rate. That is why we turn $\alpha$ into a stochastic hyperparameter during training. This justifies the prefix “Stochastic” in SWTA AE. Since there is no reason to favor some rates during training, we sample $\alpha$ according to the uniform distribution $U[\mu - \epsilon, \mu + \epsilon]$, where $\mu - \epsilon > 0$ and $\mu + \epsilon < 1$. We select $\mu = 1.8 \times 10^{-1}$ and $\epsilon = 1.7 \times 10^{-1}$ to make the support of $\alpha$ large. At each training epoch, $\alpha$ is drawn for each training image of $\Delta$.
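With the values given above, $\alpha$ ranges over roughly [0.01, 0.35]. A short sketch of the per-image draw, assuming NumPy's random generator and a hypothetical helper name `sample_alpha`:

```python
import numpy as np

MU, EPS = 0.18, 0.17  # values of mu and epsilon stated in the text


def sample_alpha(rng, n_images):
    """Draw one WTA parameter per training image from U[mu - eps, mu + eps]."""
    return rng.uniform(MU - EPS, MU + EPS, size=n_images)


rng = np.random.default_rng(0)
alphas = sample_alpha(rng, 5)  # redrawn at every epoch
```

Each drawn value would then set how many coefficients the WTA mapping keeps for that image during the forward pass.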

As shown in Section 2.1, SWTA AE can process images of various sizes. During training, we feed SWTA AE with random crops of size 49 × 49 of the training images of $\Delta$. This accelerates training considerably. The training objective is to minimize the mean squared error between these cropped images and their reconstruction plus $l_2$-norm weights decay. We use stochastic gradient descent. The gradient descent learning rate is fixed to $2.0 \times 10^{-5}$, the momentum is 0.9 and the size of mini-batches is 5. The weights decay coefficient is $5.0 \times 10^{-4}$. Our implementation is based on Caffe [18]. It adds to Caffe the tools introduced in Sections 2.2 and 2.3.

4.3. Dictionary learning for WTA OMP

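Given $\Gamma$, $k < m$ and $\gamma \in ]0, 1[$, the dictionary learning problem is formulated as (4).

$$\min_{D, Z_1, \dots, Z_\eta} \frac{1}{\eta} \sum_{j=1}^{\eta} \|\Gamma_j - D Z_j\|_2^2 \quad \text{s.t.} \quad \forall j \in [\![1, \eta]\!],\ \|Z_j\|_0 \le k \quad \text{s.t.} \quad \sum_{j=1}^{\eta} \|Z_j\|_0 \le \gamma \times n \times \eta \tag{4}$$

(4) is solved by Algorithm 2, which alternates between sparse coding steps that involve WTA OMP and dictionary updates that use stochastic gradient descent. Given $\Gamma$ and $p \in \mathbb{N}_+^*$, let $\phi$ be a function that randomly partitions $\Gamma$ into $\eta_p = \eta / p$ mini-batches $\{X^{(1)}, \dots, X^{(\eta_p)}\}$, where, for $i \in [\![1, \eta_p]\!]$, $X^{(i)} \in \mathbb{R}^{m \times p}$. Mini-batches make learning very fast [19].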

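Algorithm 2: dictionary learning for WTA OMP.

Inputs: $\Gamma \in \mathbb{R}^{m \times \eta}$, $k < m$, $\gamma \in ]0, 1[$, $p \in \mathbb{N}_+^*$ and $\varepsilon \in \mathbb{R}_+^*$.

$D \in \mathbb{R}^{m \times n}$ is randomly initialized.

$\forall j \in [\![1, n]\!]$, $D_j \leftarrow D_j / \|D_j\|_2$

For several epochs do:

$\{X^{(1)}, \dots, X^{(\eta_p)}\} = \phi(\Gamma, p)$

$\forall i \in [\![1, \eta_p]\!]$:

$Z^{(i)} = \text{WTA OMP}(X^{(i)}, D, k, \gamma)$

$D \leftarrow D - \varepsilon \, \partial \|X^{(i)} - D Z^{(i)}\|_F^2 / \partial D$

$\forall j \in [\![1, n]\!]$, $D_j \leftarrow D_j / \|D_j\|_2$

Output: $D \in \mathbb{R}^{m \times n}$.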
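The loop of Algorithm 2 can be sketched as below, using the closed form $\partial \|X - DZ\|_F^2 / \partial D = -2 (X - DZ) Z^\top$ for the gradient. To keep the example self-contained, the sparse coder is passed in as a function; `top1_code` is a deliberately simple stand-in (one atom per patch) and is NOT WTA OMP. All function names and the Gaussian initialization are assumptions.

```python
import numpy as np


def normalize_columns(D):
    """Rescale every atom (column) of D to unit l2 norm."""
    return D / np.linalg.norm(D, axis=0, keepdims=True)


def learn_dictionary(Gamma, n, p, eps, n_epochs, sparse_code, rng):
    m, eta = Gamma.shape
    D = normalize_columns(rng.standard_normal((m, n)))  # random initialization
    for _ in range(n_epochs):
        order = rng.permutation(eta)                    # random partition phi
        for start in range(0, eta - p + 1, p):          # eta // p mini-batches
            X = Gamma[:, order[start:start + p]]
            Z = sparse_code(X, D)                       # sparse coding step
            grad = -2.0 * (X - D @ Z) @ Z.T             # d||X - DZ||_F^2 / dD
            D = normalize_columns(D - eps * grad)       # gradient step + renorm
    return D


def top1_code(X, D):
    """Stand-in coder for the demo: keep the single most correlated atom per
    patch, with its least-squares coefficient (valid for unit-norm atoms)."""
    C = D.T @ X
    Z = np.zeros_like(C)
    best = np.argmax(np.abs(C), axis=0)
    cols = np.arange(X.shape[1])
    Z[best, cols] = C[best, cols]
    return Z
```

In the setting of the text, `sparse_code` would wrap WTA OMP with the chosen k and $\gamma$, so that the dictionary is trained under the same competition it faces at test time.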

For OMP, given $\Gamma$, a dictionary $D_o \in \mathbb{R}^{m \times n}$ is learned using K-SVD [16], and the parameters m and n are optimized with an exhaustive search. This leads to m = 64 and n = 1024. For WTA OMP, the same values for m and n are used for training D via Algorithm 2. Moreover, k = 15, $\gamma = 4.5 \times 10^{-3}$, p = 10 and $\varepsilon = 2.0 \times 10^{-2}$.

K-SVD code: http://www.cs.technion.ac.il/~elad/software/

Fig. 2: Evolution of PSNR with the rate. (a) LENA luminance 512 × 512. (b) BARBARA luminance 480 × 384.

5. IMAGE COMPRESSION EXPERIMENT

After training in Section 4, we compare the rate-distortion curves of OMP, WTA OMP, SWTA AE, JPEG and JPEG2000 on test luminance images.

5.1. Image CODEC for SWTA AE

Each input test luminance image is pre-processed similarly to the training in Section 4.1. The mean learned image M is interpolated to match the size of the input image. Then, this interpolated mean image is subtracted from the input image and the result is divided by the learned $\sigma$. The encoder of SWTA AE computes Z. The bitstream is obtained by processing Z as detailed in Section 2.3.

5.2. Image CODEC for OMP and WTA OMP

A luminance image is split into 8 × 8 non-overlapping patches. The DC component is removed from each patch. The DC components are uniformly quantized over 8-bits and coded with a fixed-length code. OMP (or WTA OMP) finds the coefficients of the sparse decompositions of the image patches over $D_o$ (or D). The non-zero coefficients are uniformly quantized over 8-bits and coded with a Huffman code while their position is coded with a fixed-length code.

Then, for WTA OMP only, the number of non-zero coefficients of the sparse decomposition of each patch over D is coded with a Huffman code.
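A minimal sketch of uniform 8-bit quantization as used for the coefficients above. The text does not specify the quantizer range, so this example assumes the range spans the minimum and maximum of the values being coded; the function names are hypothetical and the Huffman stage is not shown.

```python
import numpy as np


def quantize_uniform(values, n_bits=8):
    """Map values to integer indices in [0, 2**n_bits - 1] on a uniform grid.
    Assumption: the grid spans [min(values), max(values)]."""
    lo, hi = float(values.min()), float(values.max())
    levels = 2 ** n_bits - 1
    step = (hi - lo) / levels if hi > lo else 1.0
    idx = np.round((values - lo) / step).astype(np.int64)
    return idx, lo, step


def dequantize_uniform(idx, lo, step):
    """Decoder side: map indices back to reconstruction values."""
    return lo + idx * step
```

The indices `idx` are the symbols that an entropy coder would compress, and the reconstruction error of each value is at most half a quantization step.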

5.3. Comparison of rate-distortion curves

In the literature, there is no reference rate-distortion curve for auto-encoders. We compare SWTA AE with JPEG and JPEG2000 even though the image CODEC of SWTA AE is less optimized. Furthermore, we compare SWTA AE with its non-sparse Auto-Encoder counterpart (AE). AE has the same architecture as SWTA AE but its Z only contains non-sparse feature maps. Note that, to draw a new point in the AE rate-distortion curve, AE must be first re-trained with a different number of feature maps in Z.

JPEG and JPEG2000 code: http://www.imagemagick.org/script/index.php

Figure 2 shows the rate-distortion curves of OMP, WTA OMP, AE, SWTA AE, JPEG and JPEG2000 for two of the most common images: LENA and BARBARA. In terms of rate-distortion trade-off, SWTA AE outperforms AE and WTA OMP is better than OMP. This highlights the value of WTA for image compression. When we compare SWTA AE with WTA OMP, we see that iterative decomposition is more efficient for image compression using sparse representations. Moreover, SWTA AE can compete with JPEG. We also ran this image compression experiment on several crops of LENA and BARBARA and observed that the relative position of the six rate-distortion curves was comparable to the relative positioning in Figure 2. The size of the test image does not affect the performance of SWTA AE. More simulation results and a complexity analysis for OMP, WTA OMP and SWTA AE can be found on the web page https://www.irisa.fr/temics/demos/NeuralNets/AutoEncoders/swtaAE.htm.
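For reference, one point on a rate-distortion curve such as those in Figure 2 pairs a rate with a PSNR value. The sketch below assumes 8-bit luminance images (peak value 255) and a rate expressed in bits per pixel; both are assumptions, since the text does not state the units used in the figure.

```python
import numpy as np


def psnr(reference, reconstruction, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    diff = reference.astype(np.float64) - reconstruction.astype(np.float64)
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)


def rate_bpp(n_bits, height, width):
    """Rate in bits per pixel for a bitstream of n_bits."""
    return n_bits / (height * width)
```

Sweeping $\alpha$ for SWTA AE (or $\gamma$ for WTA OMP) and recording these two quantities for each setting traces one curve.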

6. CONCLUSIONS AND FUTURE WORK

We have shown that SWTA AE is better suited to image compression than auto-encoders, as it performs variable rate image compression for any size of image after a single training and provides better rate-distortion trade-offs.

So far, our work has focused on the layer of auto-encoders which is dedicated to coding. Yet, many avenues of research are still to be explored to improve auto-encoders for image compression. For instance, [20] proves that removing a max-pooling layer and increasing the stride of the previous convolution, as we do, harms neural networks. This has to be addressed.

