Proceedings Article

A butterfly structured design of the hybrid transform coding scheme

Jingning Han, Yaowu Xu, Debargha Mukherjee
01 Dec 2013, pp. 17-20

TL;DR: This work devises a novel ADST-like transform whose kernel is consistent with that of the DCT, thereby enabling a butterfly structured computation flow, while largely retaining the performance advantages of the hybrid transform coding scheme in terms of compression efficiency.

Abstract: The hybrid transform coding scheme that alternates amongst the asymmetric discrete sine transform (ADST) and the discrete cosine transform (DCT), depending on the boundary prediction conditions, is an efficient tool for video and image compression. It optimally exploits the statistical characteristics of the prediction residual, thereby achieving significant coding performance gains over the conventional DCT-based approach. A practical concern lies in the intrinsic conflict between the transform kernels of ADST and DCT, which prevents a butterfly structured implementation for parallel computing. Hence the hybrid transform coding scheme has to rely on matrix multiplication, which presents a speed-up barrier due to under-utilization of the hardware, especially for larger block sizes. In this work, we devise a novel ADST-like transform whose kernel is consistent with that of DCT, thereby enabling a butterfly structured computation flow, while largely retaining the performance advantages of the hybrid transform coding scheme in terms of compression efficiency. A prototype implementation of the proposed butterfly structured hybrid transform coding scheme is available in the VP9 codec repository.

Topics: Transform coding (66%), Discrete sine transform (64%), Lapped transform (64%), S transform (63%)

Introduction

• Transform coding is a central component in video and image compression.
• In fact, methods along this line are typically limited to smaller transform dimensions.
• It is noteworthy that larger block size transforms provide higher transform coding gains for stationary signals and have been experimentally shown to improve compression efficiency in various video codecs.
• The authors hence use this btf-ADST to replace the original ADST in the hybrid transform coding scheme.

II. SPATIAL PREDICTION AND TRANSFORM CODING

• The authors revisit the mathematical theory that derived the original ADST, in the context of a 1-D first-order Gauss-Markov model, given a partial prediction boundary [1], which leads to the btf-ADST proposed in this work.
• In practice, all the computations are performed in the integer format for speed reasons.
• This irregularity complicates an analytic derivation of the eigenvalues and eigenvectors of P1.
• The approximation clearly holds for ρ → 1, which is indeed a common approximation that describes the spatial correlation of video/image signals.

III. BUTTERFLY STRUCTURED VARIANT OF ADST

• A key observation of the above derived ADST is that the rows of T_S (i.e., basis functions of the transform) possess smaller values in the beginning (closer to the known boundary), and larger values towards the other end.
• This effectively exploits the fact that pixels closer to the known boundary are better predicted and hence have statistically smaller variance than those at far end.
• It inspires their search for a unitary sinusoidal transform that resembles the compression performance of the ADST, to overcome the intricacy of butterfly design of ADST and hence hybrid transform coding for parallel computing.
• Clearly, it also possesses the property of asymmetric basis functions, but has the denominator of the kernel argument, 4N, consistent with that of DCT, thereby allowing a butterfly structured implementation.
• In practice, all these computations are performed in the integer format, which inevitably incurs rounding effects accumulated through every stage.

IV. QUANTITATIVE ANALYSIS

• The authors quantitatively evaluate the performance of the btf-ADST, original ADST, and DCT, against the KLT (of y in Sec. II) in terms of coding gains [7] under the assumed signal model, at different correlation coefficient values.
• This bit-allocation problem is addressed by the water-filling algorithm of [7].
• The coding gain G_A thus provides a comparison of the average distortion incurred with and without the transformation A. Note that for any given A (including the btf-ADST, ADST, DCT, and KLT of y), computing R_zz, and hence the variances σ²_{z_i}, does not require making any approximations for P_1.
• Clearly the original ADST well approximates KLT at various values of the correlation coefficient ρ.
• The maximum gap between ADST and KLT, or the maximum loss of optimality, is less than 0.05 dB.

V. EXPERIMENTAL RESULTS

• The proposed btf-ADST was employed to replace the original ADST in the hybrid transform coding scheme.
• This btf-ADST/DCT hybrid transform coding scheme was implemented in the VP9 codec [8].
• Fig. 2 demonstrates the rate-distortion performance comparison for sequence harbour at CIF resolution.
• Similar results were observed over a wide variety of sequences and resolutions.
• The authors compare the runtime of the btf-ADST/DCT and the original ADST/DCT hybrid transform schemes, in terms of the average CPU cycles, as shown in Fig. 3.

VI. CONCLUSIONS

• This work devised a novel variant of the ADST whose kernel approximates the original ADST basis-wise and is consistent with the DCT kernel, thereby enabling a butterfly structured implementation.
• The proposed scheme allows efficient hardware utilization for significant codec speed-up, while largely retaining the advantageous compression performance of hybrid transform coding scheme.


A Butterfly Structured Design of the Hybrid Transform Coding Scheme
Jingning Han, Yaowu Xu, and Debargha Mukherjee
I. INTRODUCTION
Transform coding is a central component in video and image compression. Many research efforts have been devoted to optimizing the transform kernel to fully exploit signal correlation for compression gains. A recent approach that jointly optimized spatial prediction and the choice of the subsequent transform for video and image compression was developed in [1] and [2], where it was shown that the optimal Karhunen-Loeve transform (KLT) given available, partial boundary information is well approximated by a close relative of the discrete sine transform (DST), with basis vectors that tend to vanish at the known boundary and maximize energy at the unknown boundary. The overall intra coding scheme thus switches between this variant of DST, named asymmetric DST (ADST), and the conventional DCT, depending on the prediction direction and boundary information. This adaptive prediction-transform approach, namely the hybrid transform coding scheme, was experimentally shown to significantly outperform the DCT-based intra-frame prediction-transform coding.
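The switching idea can be sketched in a few lines of Python. This is a minimal illustration rather than the codec's implementation: the function names, the flags `top_known` and `left_known`, and the rule "use the ADST along a direction whose boundary is known, otherwise the DCT" are assumptions made for the example, and floating-point matrices stand in for the integer arithmetic a real codec would use.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II; row k holds the k-th frequency basis function."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                     for i in range(n)])
    return rows

def adst_matrix(n):
    """ADST with kernel sin((2j-1) i pi / (2N+1)), j, i = 1..N, unit-norm rows."""
    scale = 2.0 / math.sqrt(2 * n + 1)
    return [[scale * math.sin((2 * j - 1) * i * math.pi / (2 * n + 1))
             for i in range(1, n + 1)] for j in range(1, n + 1)]

def matmul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def hybrid_transform_2d(block, top_known, left_known):
    """Pick a 1-D transform per direction, then apply them separably."""
    n = len(block)
    vertical = adst_matrix(n) if top_known else dct_matrix(n)
    horizontal = adst_matrix(n) if left_known else dct_matrix(n)
    # Left-multiplication transforms the columns, right-multiplication the rows.
    return matmul(matmul(vertical, block), transpose(horizontal))

# Toy 4x4 residual block (made-up values).
residual = [[((r + 1) * (c + 2)) % 7 - 3 for c in range(4)] for r in range(4)]
coeffs = hybrid_transform_2d(residual, top_known=True, left_known=False)

# Both 1-D transforms are orthonormal, so the block energy is preserved.
energy_in = sum(v * v for row in residual for v in row)
energy_out = sum(v * v for row in coeffs for v in row)
print(energy_in, round(energy_out, 9))
```

Because each 1-D transform is orthonormal, whichever combination is selected leaves the residual energy unchanged; only its distribution across coefficients differs.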
On the hardware design side, the transform module typically contributes a large portion of codec computational complexity, and hence a butterfly structured implementation that allows parallel computing via single instruction multiple data (SIMD) operations [3] is highly desirable. The design of the butterfly structured discrete cosine transform can be traced back to the 1970s (e.g., [4]). A more recent development on fast transforms using an integer transform was proposed in [5], which approximates the DCT element-wise using a matrix whose entries are all small integers. The computation hence only involves additions and shifts. A similar principle was also applied to the ADST in [2], both targeting the 4×4 block size. In fact, methods along this line are typically limited to smaller transform dimensions. When it comes to transform dimensions of 8 × 8 or above, it is in general difficult to find an orthogonal matrix whose elements are small integers and that approximates the DCT closely. Hence, the advantage of simple computation gradually disappears as the transform size grows. It is noteworthy that larger block size transforms provide higher transform coding gains for stationary signals and have been experimentally shown to improve compression efficiency in various video codecs.
Challenges arise in the design of a fast ADST, and hence of the hybrid transform coding scheme, for any block size. The original ADST kernel was derived as $\sin\left(\frac{n(2k-1)\pi}{2N+1}\right)$ in [1], where $N$ is the block dimension, and $n$ and $k$ denote the time and frequency indexes, respectively, both ranging from 1 to $N$. The DCT kernel, on the other hand, is of the form $\cos\left(\frac{(2n-1)(k-1)\pi}{2N}\right)$. Butterfly structured implementations of sinusoidal transforms exist if and only if the denominator of the kernel argument, i.e., $(2N+1)$ for the ADST and $2N$ for the DCT, is a composite number (and ideally can be decomposed into a product of small integers). For this reason, most block-based video (and image) codecs are designed to make the block size a power of two, e.g., $N = 4, 8, 16$, etc., for efficient computation of the DCT. This, however, makes the original ADST incapable of fast implementation. For example, when $N = 8$, $(2N+1)$ turns out to be 17, which is a prime number that precludes the possibility of a butterfly structure.
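The arithmetic behind this argument is easy to tabulate. The following sketch (the helper name `prime_factors` and the dictionary layout are illustrative choices, not part of the paper) factors the kernel denominators $2N$, $2N+1$ and $4N$ for a few power-of-two block sizes.

```python
def prime_factors(m):
    """Return the prime factorization of m as a sorted list."""
    factors, p = [], 2
    while p * p <= m:
        while m % p == 0:
            factors.append(p)
            m //= p
        p += 1
    if m > 1:
        factors.append(m)
    return factors

denominators = {}
for n in (4, 8, 16, 32):
    denominators[n] = {
        "dct_2n": prime_factors(2 * n),             # DCT kernel denominator
        "adst_2n_plus_1": prime_factors(2 * n + 1),  # original ADST denominator
        "sin_4n": prime_factors(4 * n),             # denominator 4N
    }
    print(n, denominators[n])
```

For power-of-two $N$, the denominators $2N$ and $4N$ factor entirely into twos, whereas $2N+1$ is odd and therefore never shares that structure; for $N = 8$ it is the prime 17.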
This work resolves this intrinsic conflict between DCT and the original ADST by devising a variant of the ADST whose kernel is of the form $\sin\left(\frac{(2n-1)(2k-1)\pi}{4N}\right)$. Clearly the denominator of the kernel argument, $4N$, is consistent with that of DCT, in that if $2N$ is a power of two, so is $4N$. Therefore, it can be implemented in a butterfly structure, and is henceforth referred to as btf-ADST. A prototype butterfly structured implementation of btf-ADST was initially provided in [6]. The btf-ADST is a basis-wise approximation to the original ADST, and well preserves its superior coding performance given the boundary condition. We hence use this btf-ADST to replace the original ADST in the hybrid transform coding scheme. The overall scheme thus selects the appropriate 1-D transforms amongst the btf-ADST and DCT depending on the prediction direction to form the 2-D transformation. Further note that while there are many variants of butterfly design for the btf-ADST and DCT, most of them target only a reduction in the number of multiplications, without considering the rounding effects of intermediate steps¹, which potentially affect the accuracy of the computation and hence incur round-trip error. In our implementation, we use the structure that has more rotation (which involves more multiplication) steps in the first few stages, so as to reduce the accuracy impact due to rounding error. It is experimentally demonstrated that our proposed approach, in conjunction with SIMD, significantly reduces the runtime of the hybrid transform coding scheme in terms of CPU cycles, while largely retaining its compression gains.
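To make the rounding concern concrete, the sketch below applies a single 2×2 rotation-type step in integer arithmetic and measures the round-trip error. Everything specific here is an assumption for illustration: the 14-bit fractional precision, the helper names, the test angle, and the use of the self-inverse form [[cos, sin], [sin, -cos]] so that applying the step twice should return the input.

```python
import math

FRAC_BITS = 14  # assumed fixed-point precision, chosen only for illustration

def to_fixed(value):
    """Quantize a real constant to an integer with FRAC_BITS fractional bits."""
    return int(round(value * (1 << FRAC_BITS)))

def round_shift(value):
    """Divide by 2**FRAC_BITS, rounding to the nearest integer."""
    return (value + (1 << (FRAC_BITS - 1))) >> FRAC_BITS

def rotate_fixed(x, y, angle):
    """Apply [[cos, sin], [sin, -cos]] to (x, y) using integers only."""
    c, s = to_fixed(math.cos(angle)), to_fixed(math.sin(angle))
    return round_shift(x * c + y * s), round_shift(x * s - y * c)

# The matrix is its own inverse, so two applications form a round trip.
worst = 0
for x in range(-255, 256, 5):
    for y in range(-255, 256, 5):
        u, v = rotate_fixed(x, y, math.pi / 8)
        x2, y2 = rotate_fixed(u, v, math.pi / 8)
        worst = max(worst, abs(x2 - x), abs(y2 - y))
print("worst round-trip error:", worst)
```

A single step stays within one integer unit of the input over this range; in a multi-stage butterfly each stage contributes its own rounding, which is why the ordering of the multiplication stages matters.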
II. SPATIAL PREDICTION AND TRANSFORM CODING
We revisit the mathematical theory that derived the original ADST, in the context of a 1-D first-order Gauss-Markov model, given a partial prediction boundary [1], which leads to our btf-ADST proposed in this work.

¹ In practice, all the computations are performed in the integer format for speed reasons.
Consider a zero-mean, unit variance, first-order Gauss-Markov sequence
$$x_k = \rho x_{k-1} + e_k, \tag{1}$$
where $\rho$ is the correlation coefficient, and $e_k$ is a white Gaussian noise process with variance $1 - \rho^2$. Let $x = [x_1, x_2, \cdots, x_N]^T$ denote the random vector to be encoded given $x_0$ as the available (one-sided) boundary. The superscript $T$ denotes matrix transposition. The recursion (1) translates into the following set of equations:
$$\begin{aligned} x_1 &= \rho x_0 + e_1 \\ x_2 - \rho x_1 &= e_2 \\ &\;\;\vdots \\ x_N - \rho x_{N-1} &= e_N, \end{aligned} \tag{2}$$
or in compact notation:
$$Qx = b + e \tag{3}$$
where
$$Q = \begin{bmatrix} 1 & 0 & 0 & 0 & \cdots \\ -\rho & 1 & 0 & 0 & \cdots \\ 0 & -\rho & 1 & 0 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 0 & \cdots & 0 & -\rho & 1 \end{bmatrix}, \tag{4}$$
and $b = [\rho x_0, 0, \cdots, 0]^T$ and $e = [e_1, e_2, \cdots, e_N]^T$ capture the boundary information and innovation process, respectively. It can be shown that $Q$ is invertible, and thus:
$$x = Q^{-1}b + Q^{-1}e, \tag{5}$$
where the superscript $-1$ indicates matrix inversion. As expected, the "boundary response" or prediction, $Q^{-1}b$, in (5) satisfies
$$Q^{-1}b = [\rho x_0, \rho^2 x_0, \cdots, \rho^N x_0]^T. \tag{6}$$
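The prediction in (6) can be checked by solving $Qx = b$ with forward substitution. The snippet below is a small numerical check; the function name and the sample values of $\rho$, $x_0$ and $N$ are arbitrary choices for the example.

```python
def solve_lower_bidiagonal(rho, rhs):
    """Solve Q x = rhs, where Q has 1 on the diagonal and -rho just below it."""
    x = []
    prev = 0.0
    for value in rhs:
        prev = value + rho * prev  # row k of Q x = rhs reads x_k - rho * x_{k-1} = rhs_k
        x.append(prev)
    return x

rho, x0, n = 0.95, 10.0, 8
b = [rho * x0] + [0.0] * (n - 1)           # boundary term b = [rho*x0, 0, ..., 0]^T
prediction = solve_lower_bidiagonal(rho, b)
expected = [rho ** k * x0 for k in range(1, n + 1)]
print(prediction)
```

The solved vector matches $[\rho x_0, \rho^2 x_0, \cdots, \rho^N x_0]^T$: the influence of the boundary sample decays geometrically with distance.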
The prediction residual
$$y = Q^{-1}e \tag{7}$$
is to be compressed and transmitted, which motivates the derivation of its KLT. The autocorrelation matrix of $y$ is given by:
$$R_{yy} = E\{yy^T\} = Q^{-1}E\{ee^T\}(Q^T)^{-1} = (1 - \rho^2)\,Q^{-1}(Q^T)^{-1}. \tag{8}$$
Thus, the KLT for $y$ is a unitary matrix that diagonalizes $Q^{-1}(Q^T)^{-1}$, and hence also the more convenient:
$$P_1 = Q^TQ = \begin{bmatrix} 1+\rho^2 & -\rho & 0 & 0 & \cdots \\ -\rho & 1+\rho^2 & -\rho & 0 & \cdots \\ 0 & -\rho & 1+\rho^2 & -\rho & \cdots \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 0 & \cdots & -\rho & 1+\rho^2 & -\rho \\ 0 & \cdots & 0 & -\rho & 1 \end{bmatrix}. \tag{9}$$
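The structure of $P_1 = Q^TQ$ in (9) can be confirmed by building $Q$ and forming the product directly. This is a short numerical check; the helper names and the values of $N$ and $\rho$ are illustrative.

```python
def build_q(n, rho):
    """Q from (4): ones on the diagonal, -rho on the first sub-diagonal."""
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        q[i][i] = 1.0
        if i > 0:
            q[i][i - 1] = -rho
    return q

def gram(q):
    """Return Q^T Q."""
    n = len(q)
    return [[sum(q[k][i] * q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

rho, n = 0.9, 5
p1 = gram(build_q(n, rho))
for row in p1:
    print(["%.2f" % v for v in row])
```

The product is tridiagonal with $1+\rho^2$ on the diagonal and $-\rho$ beside it, except for the bottom-right entry, which equals 1.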
Although $P_1$ is nearly Toeplitz, note that the element at the bottom right corner is different from all the other elements on the principal diagonal, i.e., it is not $1+\rho^2$. This irregularity complicates an analytic derivation of the eigenvalues and eigenvectors of $P_1$. As a subterfuge, we approximate $P_1$ with
$$\hat{P}_1 = \begin{bmatrix} 1+\rho^2 & -\rho & 0 & 0 & \cdots \\ -\rho & 1+\rho^2 & -\rho & 0 & \cdots \\ 0 & -\rho & 1+\rho^2 & -\rho & \cdots \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 0 & \cdots & -\rho & 1+\rho^2 & -\rho \\ 0 & \cdots & 0 & -\rho & 1+\rho^2-\rho \end{bmatrix} \tag{10}$$
which is obtained by replacing the bottom-right corner element with $1+\rho^2-\rho$. The approximation clearly holds for $\rho \to 1$, which is indeed a common approximation that describes the spatial correlation of video/image signals. The unitary matrix $T_S$ that diagonalizes $\hat{P}_1$, and hence an approximation for the required KLT of $y$, can be shown to be the following relative of the common DST:
$$[T_S]_{j,i} = \frac{2}{\sqrt{2N+1}} \sin\left(\frac{(2j-1)\,i\,\pi}{2N+1}\right) \tag{11}$$
where $j, i \in \{1, 2, \cdots, N\}$ are the frequency and time indexes of the transform kernel respectively. Needless to say, the constant matrix $T_S$ is independent of the statistics of the innovation $e_k$, and can be used as an approximation for KLT when boundary information $x_0$ is available.
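As a numerical sanity check of this claim, the sketch below builds $T_S$ from (11) and the approximated matrix from (10), and verifies that $T_S$ is orthonormal and diagonalizes it. The helper names are illustrative; the closed-form eigenvalue $1+\rho^2-2\rho\cos\frac{(2j-1)\pi}{2N+1}$ used for comparison is not stated in the text and is included here as a derived check from the tridiagonal structure.

```python
import math

def adst(n):
    """T_S from (11); rows are frequency index j, columns time index i."""
    scale = 2.0 / math.sqrt(2 * n + 1)
    return [[scale * math.sin((2 * j - 1) * i * math.pi / (2 * n + 1))
             for i in range(1, n + 1)] for j in range(1, n + 1)]

def p1_hat(n, rho):
    """Tridiagonal matrix of (10) with corner element 1 + rho^2 - rho."""
    p = [[0.0] * n for _ in range(n)]
    for i in range(n):
        p[i][i] = 1 + rho * rho
        if i > 0:
            p[i][i - 1] = p[i - 1][i] = -rho
    p[n - 1][n - 1] -= rho
    return p

def matmul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

n, rho = 8, 0.95
t = adst(n)
gram = matmul(t, transpose(t))                          # should be the identity
diag = matmul(matmul(t, p1_hat(n, rho)), transpose(t))  # should be diagonal

identity_err = max(abs(gram[i][j] - (1.0 if i == j else 0.0))
                   for i in range(n) for j in range(n))
off_diag = max(abs(diag[i][j]) for i in range(n) for j in range(n) if i != j)
eig_err = max(abs(diag[j][j] - (1 + rho * rho - 2 * rho *
                                math.cos((2 * j + 1) * math.pi / (2 * n + 1))))
              for j in range(n))
print(identity_err, off_diag, eig_err)
```

All three printed quantities are at the level of floating-point round-off, consistent with $T_S$ being an exact eigenbasis of the approximated matrix for any $\rho$.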
III. BUTTERFLY STRUCTURED VARIANT OF ADST
A key observation of the above derived ADST is that the rows of $T_S$ (i.e., basis functions of the transform) possess smaller values in the beginning (closer to the known boundary), and larger values towards the other end. For instance, consider the row with $j = 1$ (i.e., the basis function with the lowest frequency). In the case where $N \gg 1$, the first sample ($i = 1$) is $\frac{2}{\sqrt{2N+1}} \sin\frac{\pi}{2N+1} \approx 0$, while the last sample ($i = N$) takes the maximum value $\frac{2}{\sqrt{2N+1}} \sin\frac{N\pi}{2N+1} \approx \frac{2}{\sqrt{2N+1}}$. This effectively exploits the fact that pixels closer to the known boundary are better predicted and hence have statistically smaller variance than those at the far end. It inspires our search for a unitary sinusoidal transform that resembles the compression performance of the ADST, to overcome the intricacy of butterfly design of the ADST and hence of hybrid transform coding for parallel computing. We hence devise a new variant of DST as an alternative:
$$[T_{btf}]_{j,i} = \sqrt{\frac{2}{N}} \sin\left(\frac{(2j-1)(2i-1)\pi}{4N}\right), \tag{12}$$
where $j, i \in \{1, 2, \cdots, N\}$ denote the frequency and time indexes respectively. Clearly, it also possesses the property of asymmetric basis functions, but has the denominator of the kernel argument, $4N$, consistent with that of DCT, thereby allowing the butterfly structured implementation. We refer to it as btf-ADST henceforth.
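The properties claimed for (12) can be checked numerically. The following sketch (function and variable names are illustrative) builds the btf-ADST matrix for $N = 8$ and tests that its rows are orthonormal and that the lowest-frequency basis function grows monotonically away from the known boundary.

```python
import math

def btf_adst(n):
    """T_btf from (12): sqrt(2/N) * sin((2j-1)(2i-1) pi / (4N)), j, i = 1..N."""
    scale = math.sqrt(2.0 / n)
    return [[scale * math.sin((2 * j - 1) * (2 * i - 1) * math.pi / (4 * n))
             for i in range(1, n + 1)] for j in range(1, n + 1)]

n = 8
t = btf_adst(n)

# Orthonormality: inner products of rows should form the identity matrix.
gram_err = max(abs(sum(t[a][k] * t[b][k] for k in range(n))
                   - (1.0 if a == b else 0.0))
               for a in range(n) for b in range(n))

# Asymmetry: the first basis function increases from the known boundary outward.
first_basis = t[0]
increasing = all(first_basis[i] < first_basis[i + 1] for i in range(n - 1))

# The kernel is symmetric in (j, i), so the matrix equals its own transpose.
symmetric = all(abs(t[a][b] - t[b][a]) < 1e-12
                for a in range(n) for b in range(n))
print(gram_err, increasing, symmetric)
```

Since the matrix is both orthonormal and symmetric, the same routine can serve as forward and inverse transform in floating point.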
A butterfly structure of btf-ADST was initially provided in [6], where it assumed no precision loss in the intermediate steps that involve multiplications by irrational numbers. In practice, all these computations are performed in the integer format, which inevitably incurs rounding effects accumulated through every stage. To minimize the round-trip error, we modify the structure to make the initial stages consist of more multiplications, so that the rounding errors are less magnified. In keeping with the conventions used in [6], let $\bar{I}$ and $D_N$ denote the re-ordering operations:
$$\bar{I} = \begin{bmatrix} & & 1 \\ & \cdots & \\ 1 & & \end{bmatrix}, \qquad D_N = \mathrm{diag}\{1, -1, 1, \cdots, -1\}. \tag{13}$$
Let $P_J$ be the permutation matrix that moves the first half of the vector entries to the even-numbered positions, and the second half of the entries to the odd-numbered positions but in reversed order:
$$P_J = \begin{bmatrix} 1 & 0 & \cdots & & \\ 0 & \cdots & & 0 & 1 \\ 0 & 1 & \cdots & & \\ 0 & \cdots & & 1 & 0 \\ \cdots & & \cdots & & \cdots \end{bmatrix}, \tag{14}$$
where $J$ is the height of the matrix. It formulates a second permutation:
$$H_N = P_N \cdot \mathrm{diag}\{P_{N/2}, P_{N/2}\} \cdots \mathrm{diag}\{P_4, I_4, P_4, I_4, \cdots, P_4, I_4, P_4, I_4\}. \tag{15}$$
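Similarly the permutation operator $Q_J$ moves the odd-numbered entries to be in reversed order:
$$Q_J = \begin{bmatrix} 1 & 0 & \cdots & & & \\ 0 & \cdots & & & 0 & 1 \\ 0 & 0 & 1 & \cdots & & \\ 0 & \cdots & & & 1 & 0 \\ \cdots & & & \cdots & & \cdots \\ 0 & 1 & 0 & \cdots & & 0 \end{bmatrix}. \tag{16}$$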
Let $J = \log_2 N$. We define the following building blocks that formulate the butterfly structure.
Type 1 Translational Operators: matrices $U_N(j)$, $j = 1, 2, \cdots, J-1$, are defined as
$$U_N(j) = \mathrm{diag}\{B(j), B(j), \cdots, B(j)\}, \tag{17}$$
where
$$B(j) = \begin{bmatrix} I_{2^j} & I_{2^j} \\ I_{2^j} & -I_{2^j} \end{bmatrix}. \tag{18}$$
Type 2 Rotational Operators: $V_N(j)$, $j = 1, 2, \cdots, J-1$, are block diagonal matrices:
$$V_N(j) = \mathrm{diag}\{I_{2^j}, E(j), I_{2^j}, \cdots, E(j)\}, \tag{19}$$
where $E(j) = \mathrm{diag}\{T_{1/2^{j+1}}, T_{5/2^{j+1}}, \cdots, T_{(2^{j+1}-3)/2^{j+1}}\}$ and
$$T_r = \begin{bmatrix} \cos r\pi & \sin r\pi \\ \sin r\pi & -\cos r\pi \end{bmatrix}. \tag{20}$$
As a special case,
$$V_N(J) = \mathrm{diag}\{T_{1/4N}, T_{5/4N}, \cdots, T_{(2N-3)/4N}\}. \tag{21}$$
Given the above established building blocks, the btf-ADST can be decomposed as:
$$T_{btf} = D_N \cdot H_N^T \cdot V_N(1) \cdot U_N(1) \cdot V_N(2) \cdots U_N(J-1) \cdot V_N(J) \cdot Q_N \cdot \bar{I}_N, \tag{22}$$
which directly translates into a butterfly graph. The theoretical coding performance of the two ADST variants and DCT compared against the KLT will be provided next, followed by the experimental codec evaluation.
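One way to see why a $4N$ denominator fits a DCT-style butterfly is that the kernel $\sin\frac{(2j-1)(2i-1)\pi}{4N}$ is that of the type-IV DST, which equals a type-IV DCT with the input order reversed and alternate outputs negated. The sketch below checks this standard identity numerically. It illustrates that relation only, not the stage-by-stage factorization in (22), and the function names are hypothetical.

```python
import math

def dst4(n):
    """sqrt(2/N) * sin((2j+1)(2i+1) pi / (4N)) with zero-based j, i."""
    scale = math.sqrt(2.0 / n)
    return [[scale * math.sin((2 * j + 1) * (2 * i + 1) * math.pi / (4 * n))
             for i in range(n)] for j in range(n)]

def dct4(n):
    """sqrt(2/N) * cos((2j+1)(2i+1) pi / (4N)) with zero-based j, i."""
    scale = math.sqrt(2.0 / n)
    return [[scale * math.cos((2 * j + 1) * (2 * i + 1) * math.pi / (4 * n))
             for i in range(n)] for j in range(n)]

n = 8
s4, c4 = dst4(n), dct4(n)

# Reverse the column (input) order of the DCT-IV and negate every other row.
rebuilt = [[(-1) ** j * c4[j][n - 1 - i] for i in range(n)] for j in range(n)]

max_err = max(abs(rebuilt[j][i] - s4[j][i]) for j in range(n) for i in range(n))
print(max_err)
```

The reversal and the sign flips cost no multiplications, so any fast algorithm for the cosine kernel with denominator $4N$ carries over directly to the sine kernel.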
IV. QUANTITATIVE ANALYSIS
We quantitatively evaluate the performance of the btf-ADST, original ADST, and DCT, against the KLT (of $y$ in Sec. II) in terms of coding gains [7] under the assumed signal model, at different correlation coefficient values.
Let the prediction residual, $y$, be transformed to
$$z = Ay = [z_1, z_2, \ldots, z_N]^T \tag{23}$$
with an $N \times N$ unitary matrix $A$. The objective of the encoder is to distribute a fixed number of bits to the different elements of $z$ such that the average distortion is minimized. This bit-allocation problem is addressed by the water-filling algorithm of [7]. Under assumptions such as a Gaussian source, high quantizer resolution, negligible quantizer overload, and with non-integer bit-allocation allowed, it can be shown that the minimum distortion (mean squared error) obtainable is proportional to the geometric mean of the transform domain sample variances $\sigma^2_{z_i}$, i.e.,
$$D_A \propto \left( \prod_{i=1}^{N} \sigma^2_{z_i} \right)^{1/N}, \tag{24}$$
where for a Gaussian source the proportionality coefficient is independent of the transform $A$. These variances can be obtained as the diagonal elements of the autocorrelation matrix of $z$:
$$R_{zz} = E[zz^T] = A\,E[yy^T]\,A^T = (1 - \rho^2)\,A P_1^{-1} A^T \tag{25}$$
where we have used (8) and (9). The coding gain in dB of any transform $A$ is now defined as:
$$G_A = 10 \log_{10}(D_I / D_A). \tag{26}$$
Here $I$ is the $N \times N$ identity matrix, and hence $D_I$ is the distortion resulting from direct quantization of the untransformed vector $y$. The coding gain $G_A$ thus provides a comparison of the average distortion incurred with and without the transformation $A$. Note that for any given $A$ (including the btf-ADST, ADST, DCT, and KLT of $y$), computing $R_{zz}$, and hence $\sigma^2_{z_i}$, does not require making any approximations for $P_1$. When $A$ is the KLT of $y$, $R_{zz}$ is a diagonal matrix (with diagonal elements equal to the eigenvalues of $P_1^{-1}$, scaled by $1 - \rho^2$), the transform coefficients $z_i$ are uncorrelated, and the coding gain reaches its maximum.
Fig. 1 depicts the coding gains relative to KLT; specifically, it plots $G_{T_{btf}} - G_{KLT}$, $G_{T_S} - G_{KLT}$, and $G_{T_C} - G_{KLT}$ versus the correlation coefficient $\rho$. Clearly the original ADST approximates the KLT well at all values of the correlation coefficient $\rho$. The maximum gap between ADST and KLT, or the maximum loss of optimality, is less than 0.05 dB. The proposed btf-ADST closely resembles the performance of ADST and KLT, at a maximum loss of 0.15 dB. In comparison, DCT performs poorly (by about 0.55 dB loss) for the practically relevant case of high correlation ($\rho \approx 0.95$). At low correlation ($\rho \to 0$), the autocorrelation matrix of the prediction residual $R_{yy} \to I$, and hence any unitary matrix, including ADST and DCT, will function as a KLT. The block length used in obtaining the results of Fig. 1 is $N = 8$.
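The coding-gain comparison can be reproduced in outline with a short script. This is a sketch under the signal model of Sec. II, not the authors' evaluation code: all function names are illustrative, and rather than running an eigen-solver it uses the fact that $\det(Q) = 1$, so the geometric mean of the KLT coefficient variances equals $1 - \rho^2$.

```python
import math

def dct(n):
    return [[math.sqrt((1.0 if k == 0 else 2.0) / n) *
             math.cos((2 * i + 1) * k * math.pi / (2 * n))
             for i in range(n)] for k in range(n)]

def adst(n):
    s = 2.0 / math.sqrt(2 * n + 1)
    return [[s * math.sin((2 * j - 1) * i * math.pi / (2 * n + 1))
             for i in range(1, n + 1)] for j in range(1, n + 1)]

def btf_adst(n):
    s = math.sqrt(2.0 / n)
    return [[s * math.sin((2 * j - 1) * (2 * i - 1) * math.pi / (4 * n))
             for i in range(1, n + 1)] for j in range(1, n + 1)]

def residual_autocorrelation(n, rho):
    """R_yy = (1 - rho^2) Q^{-1} Q^{-T}; [Q^{-1}]_{i,k} = rho^(i-k) for i >= k."""
    q_inv = [[rho ** (i - k) if i >= k else 0.0 for k in range(n)]
             for i in range(n)]
    return [[(1 - rho * rho) * sum(q_inv[i][k] * q_inv[j][k] for k in range(n))
             for j in range(n)] for i in range(n)]

def geo_mean_variance(a, r):
    """Geometric mean of the diagonal of A R A^T, as in (24)."""
    n = len(r)
    log_sum = 0.0
    for row in a:
        var = sum(row[i] * r[i][j] * row[j] for i in range(n) for j in range(n))
        log_sum += math.log(var)
    return math.exp(log_sum / n)

n, rho = 8, 0.95
r = residual_autocorrelation(n, rho)
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
d_identity = geo_mean_variance(identity, r)

# KLT: product of eigenvalues = det(R_yy) = (1 - rho^2)^N because det(Q) = 1.
gain_klt = 10 * math.log10(d_identity / (1 - rho * rho))

loss = {}
for name, a in (("dct", dct(n)), ("adst", adst(n)), ("btf_adst", btf_adst(n))):
    gain = 10 * math.log10(d_identity / geo_mean_variance(a, r))
    loss[name] = gain - gain_klt  # in dB, never positive
    print(name, round(loss[name], 3))
```

Sweeping `rho` over the interval from 0 to 1 and plotting the three `loss` values gives curves of the kind described for Fig. 1.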
Fig. 1. Coding gains (dB) of the btf-ADST, ADST, and DCT relative to KLT, all applied to the prediction residuals, plotted versus the inter pixel correlation coefficient; the block dimension is 8 × 8.
V. EXPERIMENTAL RESULTS
The proposed btf-ADST was employed to replace the original ADST in the hybrid transform coding scheme. This btf-ADST/DCT hybrid transform coding scheme was implemented in the VP9 codec [8]. We first evaluated its compression efficiency as compared to the original ADST/DCT hybrid transform and the conventional 2D-DCT coding scheme. Fig. 2 demonstrates the rate-distortion performance comparison for sequence harbour at CIF resolution. The two hybrid transform schemes both significantly outperform the conventional 2D-DCT approach. Similar results were observed over a wide variety of sequences and resolutions.
We next consider the computational efficiency of the proposed btf-ADST. It admits a butterfly structured implementation, which in conjunction with SIMD operations provides significant speed-up as compared to the original ADST operating via matrix multiplication. We compare the runtime of the btf-ADST/DCT and the original ADST/DCT hybrid transform schemes, in terms of the average CPU cycles, as shown in Fig. 3. The implementation used streaming SIMD extension 2 (SSE2) and the experiments were run on a 64-bit platform. Evidently the proposed btf-ADST allows efficient hardware utilization and thereby substantial codec speed-up, while closely resembling the compression performance of the original ADST/DCT hybrid scheme.
VI. CONCLUSIONS
This work devised a novel variant of the ADST whose kernel approximates the original ADST basis-wise and is consistent with the DCT kernel, thereby enabling the butterfly structured implementation. The proposed scheme allows efficient hardware utilization for significant codec speed-up, while largely retaining the advantageous compression performance of the hybrid transform coding scheme.
Fig. 2. Coding performance comparison of 2D-DCT, ADST/DCT hybrid transform, and btf-ADST/DCT hybrid transform for sequence harbour at CIF resolution (PSNR in dB versus bit-rate in kbit/s).
Fig. 3. Computational complexity in terms of CPU cycles versus transform dimension. The btf-ADST was implemented using streaming SIMD extension 2 and the experiments were run on a 64-bit platform.
REFERENCES
[1] J. Han, A. Saxena, and K. Rose, “Towards jointly optimal spatial prediction and adaptive transform in video/image coding,” IEEE Proc. ICASSP, pp. 726–729, Mar. 2010.
[2] J. Han, A. Saxena, V. Melkote, and K. Rose, “Jointly optimized spatial prediction and block transform for video and image coding,” IEEE Trans. on Image Processing, vol. 21, pp. 1874–1884, April 2012.
[3] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 4th Ed., Morgan Kaufmann, 2007.
[4] W.-H. Chen, C. Smith, and S. Fralick, “A fast computational algorithm for the discrete cosine transform,” IEEE Trans. on Communications, vol. 25, pp. 1004–1009, Sep. 1977.
[5] H. S. Malvar, A. Hallapuro, M. Karczewicz, and L. Kerofsky, “Low-complexity transform and quantization in H.264/AVC,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 13, pp. 598–603, July 2003.
[6] Z. Wang, “Fast algorithms for the discrete W transform and for the discrete Fourier transform,” IEEE Trans. on Acoustics, Speech, and Signal Proc., vol. 32, no. 4, pp. 803–816, Aug. 1984.
[7] N. S. Jayant and P. Noll, Digital Coding of Waveforms, Englewood Cliffs, NJ: Prentice-Hall, 1984.
[8] open source, http://www.webmproject.org/code/.
Citations

Journal ArticleDOI

1,003 citations

Proceedings ArticleDOI
, Hui Su1, Zoe Liu1, Yaowu Xu1
TL;DR: A set of new experimental coding tools have already been added to baseline VP9 to achieve modest coding gains over a large enough test set, and this paper provides a technical overview of these coding tools.
Abstract: Google started an open-source project, entitled the WebM Project, in 2010 to develop royalty-free video codecs for the web. The present generation codec developed in the WebM project, called VP9, was finalized in mid-2013 and is currently being served extensively by YouTube, resulting in billions of views per day. Even though adoption of VP9 outside Google is still in its infancy, the WebM project has already embarked on an ambitious project to develop a next edition codec, VP10, that achieves at least a generational bitrate reduction over the current generation codec VP9. Although the project is still in early stages, a set of new experimental coding tools have already been added to baseline VP9 to achieve modest coding gains over a large enough test set. This paper provides a technical overview of these coding tools.

26 citations

Journal ArticleDOI

Abstract: The principal component analysis (PCA) is widely used for data decorrelation and dimensionality reduction. However, the use of PCA may be impractical in real-time applications, or in situations where energy and computing constraints are severe. In this context, the discrete cosine transform (DCT) becomes a low-cost alternative to data decorrelation. This paper presents a method to derive computationally efficient approximations to the DCT. The proposed method aims at the minimization of the angle between the rows of the exact DCT matrix and the rows of the approximated transformation matrix. The resulting transformation matrices are orthogonal and have extremely low arithmetic complexity. Considering popular performance measures, one of the proposed transformation matrices outperforms the best competitors in both matrix error and coding capabilities. Practical applications in image and video coding demonstrate the relevance of the proposed transformation. In fact, we show that the proposed approximate DCT can outperform the exact DCT for image encoding under certain compression ratios. The proposed transform and its direct competitors are also physically realized as digital prototype circuits using FPGA technology.

17 citations

Journal ArticleDOI
TL;DR: It is shown that Haar units (Givens rotations with angle $\pi /4$) can be used to reduce GFT computation cost when the graph is bipartite or satisfies certain symmetry properties based on node pairing.
Abstract: The graph Fourier transform (GFT) is an important tool for graph signal processing, with applications ranging from graph-based image processing to spectral clustering. However, unlike the discrete Fourier transform, the GFT typically does not have a fast algorithm. In this work, we develop new approaches to accelerate the GFT computation. In particular, we show that Haar units (Givens rotations with angle $\pi /4$ ) can be used to reduce GFT computation cost when the graph is bipartite or satisfies certain symmetry properties based on node pairing. We also propose a graph decomposition method based on graph topological symmetry, which allows us to identify and exploit butterfly structures in stages. This method is particularly useful for graphs that are nearly regular or have some specific structures, e.g., line graphs, cycle graphs, grid graphs, and human skeletal graphs. Though butterfly stages based on graph topological symmetry cannot be used for general graphs, they are useful in applications, including video compression and human action analysis, where symmetric graphs, such as symmetric line graphs and human skeletal graphs, are used. Our proposed fast GFT implementations are shown to reduce computation costs significantly, in terms of both number of operations and empirical runtimes.

13 citations

Cites background or methods from "A butterfly structured design of th..."

• ...2 as Haar unit, as opposed to general Givens rotations, which are often referred to as “butterflies” [21], [25], [33]....

[...]

• ...An n dimensional Givens rotation [30], commonly referred to as a butterfly [20], [21], [25], is a linear transformation that applies a rotation of angle θ to two coordinates, denoted as p and q....

[...]

• ...This means that those sub-GFTs can also be implemented using fast DCT and ADST algorithms [23]–[25]....

[...]

• ...Because of the availability of fast algorithms, DCT and Type-4 DST have been adopted in 1053-587X © 2019 IEEE....

[...]

• ...We also note that, for any steerable DFT with a length n that is a multiple of 4, the GFTs of G++c and G−+c are Type-2 DCT and Type-4 DST, respectively....

[...]

Journal ArticleDOI
, Bohan Li1, Hui Su1, Sai Deng1, Yue Chen1, Yaowu Xu1
26 Feb 2021
TL;DR: A technical overview of the AV1 codec design that enables the compression performance gains with considerations for hardware feasibility is provided.
Abstract: The AV1 video compression format is developed by the Alliance for Open Media consortium. It achieves more than a 30% reduction in bit rate compared to its predecessor VP9 for the same decoded video quality. This article provides a technical overview of the AV1 codec design that enables the compression performance gains with considerations for hardware feasibility.

11 citations

Cites methods from "A butterfly structured design of th..."

• ...Note that since the original ADST derived in [33] cannot be decomposed for the butterfly structure, a variant of it, as introduced in [36] and also as shown in Figure 27, is adopted by AV1 for transform block sizes of 8× 8 and above....

[...]

References

Book
01 Dec 1989
TL;DR: This best-selling title, considered for over a decade to be essential reading for every serious student and practitioner of computer design, has been updated throughout to address the most important trends facing computer designers today.
Abstract: This best-selling title, considered for over a decade to be essential reading for every serious student and practitioner of computer design, has been updated throughout to address the most important trends facing computer designers today. In this edition, the authors bring their trademark method of quantitative analysis not only to high-performance desktop machine design, but also to the design of embedded and server systems. They have illustrated their principles with designs from all three of these domains, including examples from consumer electronics, multimedia and Web technologies, and high-performance computing.

11,485 citations

"A butterfly structured design of th..." refers background in this paper

• ...On the hardware design side, the transform module typically contributes a large portion of codec computational complexity, and hence a butterfly structured implementation that allows parallel computing via single instruction multiple data (SIMD) operations [3] is highly desirable....

[...]

01 Jan 1986

1,694 citations

Journal ArticleDOI
Wen-Hsiung Chen, C. Smith1
TL;DR: A Fast Discrete Cosine Transform algorithm has been developed which provides a factor of six improvement in computational complexity when compared to conventional Discrete Cosine Transform algorithms using the Fast Fourier Transform.
Abstract: A Fast Discrete Cosine Transform algorithm has been developed which provides a factor of six improvement in computational complexity when compared to conventional Discrete Cosine Transform algorithms using the Fast Fourier Transform. The algorithm is derived in the form of matrices and illustrated by a signal-flow graph, which may be readily translated to hardware or software implementations.

1,272 citations

Journal ArticleDOI

1,003 citations

"A butterfly structured design of th..." refers background in this paper

• ...II) in terms of coding gains [7] under the assumed signal model, at different correlation coefficient values....

[...]

• ...This bit-allocation problem is addressed by water filling algorithm of [7]....

[...]

Journal ArticleDOI
TL;DR: The 4×4 transforms in H.264 can be computed exactly in integer arithmetic, thus avoiding inverse transform mismatch problems and minimizing computational complexity, especially for low-end processors.
Abstract: This paper presents an overview of the transform and quantization designs in H.264. Unlike the popular 8×8 discrete cosine transform used in previous standards, the 4×4 transforms in H.264 can be computed exactly in integer arithmetic, thus avoiding inverse transform mismatch problems. The new transforms can also be computed without multiplications, just additions and shifts, in 16-bit arithmetic, thus minimizing computational complexity, especially for low-end processors. By using short tables, the new quantization formulas use multiplications but avoid divisions.

720 citations

"A butterfly structured design of th..." refers methods in this paper

• ...) A recent development on fast transform using integer transform was proposed in [5], where it approximates the DCT transform element-wisely using a matrix whose entries are all small integers....

[...]