IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 22, NO. 12, DECEMBER 2012 1685
HEVC Complexity and Implementation Analysis
Frank Bossen, Member, IEEE, Benjamin Bross, Student Member, IEEE, Karsten Sühring, and David Flynn
Abstract—Advances in video compression technology have
been driven by ever-increasing processing power available in
software and hardware. The emerging High Efficiency Video
Coding (HEVC) standard aims to provide a doubling in coding
efficiency with respect to the H.264/AVC high profile, delivering
the same video quality at half the bit rate. In this paper,
complexity-related aspects that were considered in the standard-
ization process are described. Furthermore, profiling of reference
software and optimized software gives an indication of where
HEVC may be more complex than its predecessors and where it
may be simpler. Overall, the complexity of HEVC decoders does
not appear to be significantly different from that of H.264/AVC
decoders; this makes HEVC decoding in software very practical
on current hardware. HEVC encoders are expected to be several
times more complex than H.264/AVC encoders and will be a
subject of research in years to come.
Index Terms—High Efficiency Video Coding (HEVC), video
coding.
I. Introduction
THIS PAPER gives an overview of complexity and im-
plementation issues in the context of the emerging High
Efficiency Video Coding (HEVC) standard. The HEVC project
is conducted by the Joint Collaborative Team on Video Coding
(JCT-VC), and is a joint effort between ITU-T and ISO/IEC.
Reference software, called the HEVC test model (HM), is
being developed along with the draft standard. At the time of
writing, the current version of HM is 8.0, which corresponds
to the HEVC text specification draft 8 [1]. It is assumed that
the reader has some familiarity with the draft HEVC standard,
an overview of which can be found in [2].
Complexity assessment is a complex topic in itself, and one
aim of this paper is to highlight some aspects of the HEVC
design where some notion of complexity was considered. This
is the topic of Section II.
A second aim of this paper is to provide and discuss data
resulting from profiling existing software implementations of
HEVC. Sections III and IV present results obtained with
Manuscript received May 26, 2012; revised August 19, 2012; accepted
August 24, 2012. Date of publication October 2, 2012; date of current
version January 8, 2013. This paper was recommended by Associate Editor
H. Gharavi.
F. Bossen is with DOCOMO Innovations, Inc., Palo Alto, CA 94304 USA
(e-mail: bossen@docomoinnovations.com).
B. Bross and K. Sühring are with the Image and Video Coding
Group, Fraunhofer Institute for Telecommunications–Heinrich Hertz In-
stitute, Berlin 10587, Germany (e-mail: benjamin.bross@hhi.fraunhofer.de;
karsten.suehring@hhi.fraunhofer.de).
D. Flynn is with Research In Motion, Ltd., Waterloo, ON N2L 3W8, Canada
(e-mail: dflynn@iee.org).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSVT.2012.2221255
the HM encoder and decoder, and Section V discusses an
optimized implementation of a decoder.
II. Design Aspects
A. Quadtree-Based Block Partitioning
HEVC retains the basic hybrid coding architecture of prior
video coding standards, such as H.264/AVC [3]. A significant
difference lies in the use of a more adaptive quadtree structure
based on a coding tree unit (CTU) instead of a macroblock.
In principle, the quadtree coding structure is described by
means of blocks and units. A block defines an array of
samples and sizes thereof, whereas a unit encapsulates one
luma and corresponding chroma blocks together with syntax
needed to code these. Consequently, a CTU includes coding
tree blocks (CTB) and syntax specifying coding data and
further subdivision. This subdivision results in coding unit
(CU) leaves with coding blocks (CB). Each CU incorporates
more entities for the purpose of prediction, so-called prediction
units (PU), and of transform, so-called transform units (TU).
Similarly, each CB is split into prediction blocks (PB) and
transform blocks (TB). This variable-size, adaptive approach
is particularly suited to larger resolutions, such as 4k × 2k,
which is a target resolution for some HEVC applications. An
exemplary CB and TB quadtree structure is given in Fig. 1. All
partitioning modes specifying how to split a CB into PBs are
depicted in Fig. 2. The decoding of the quadtree structures
is not much of an additional burden because the quadtrees
can be easily traversed in a depth-first fashion using a z-
scan order. Partitioning modes for inter picture coded CUs
feature nonsquare PUs. Support for these nonsquare shapes
requires additional logic in a decoder as multiple conversions
between z-scan and raster scan orders may be required. At the
encoder side, simple tree-pruning algorithms exist to estimate
the optimal partitioning in a rate-distortion sense [4], [5].
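
To make the traversal concrete, the following minimal C sketch walks a coding quadtree in depth-first z-scan order. The split decision and the leaf handling are toy placeholders (a fixed split rule and a printout) rather than HEVC syntax parsing, and picture-boundary handling is omitted.

```c
#include <stdio.h>

/* Toy split decision standing in for the parsed split_cu_flag:
 * here every block larger than 16x16 is split. */
static int split_cu_flag(int x0, int y0, int log2_size)
{
    (void)x0; (void)y0;
    return log2_size > 4;
}

/* Depth-first traversal of the coding quadtree. Visiting the four child
 * quadrants top-left, top-right, bottom-left, bottom-right yields the
 * z-scan order in which CU leaves are decoded. */
static void traverse_cu_tree(int x0, int y0, int log2_size, int log2_min_cb)
{
    if (log2_size > log2_min_cb && split_cu_flag(x0, y0, log2_size)) {
        int half = 1 << (log2_size - 1);
        traverse_cu_tree(x0,        y0,        log2_size - 1, log2_min_cb);
        traverse_cu_tree(x0 + half, y0,        log2_size - 1, log2_min_cb);
        traverse_cu_tree(x0,        y0 + half, log2_size - 1, log2_min_cb);
        traverse_cu_tree(x0 + half, y0 + half, log2_size - 1, log2_min_cb);
    } else {
        printf("CU leaf at (%d,%d), size %dx%d\n",
               x0, y0, 1 << log2_size, 1 << log2_size);
    }
}

int main(void)
{
    traverse_cu_tree(0, 0, 6, 3);  /* one 64x64 CTB, 8x8 minimum CU */
    return 0;
}
```

Running it for a 64 × 64 CTB with an 8 × 8 minimum CU size prints the CU leaves in the same z-scan order in which a decoder would visit them.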
Sections below describe various tools of HEVC and review
complexity aspects that were considered in the development
of the HEVC specification, using H.264/AVC as a reference
where appropriate.
B. Intra Picture Prediction
Intra picture prediction in HEVC is quite similar to
H.264/AVC. Samples are predicted from reconstructed sam-
ples of neighboring blocks. The mode categories remain iden-
tical: DC, plane, horizontal/vertical, and directional; although
the nomenclature has somewhat changed with planar and
angular, respectively, corresponding to H.264/AVC’s plane and
directional modes. A significant change comes from the intro-
duction of larger block sizes, where intra picture prediction

Fig. 1. Detail of 4k × 2k Traffic sequence showing the coding block (white)
and nested transform block (red) structure resulting from recursive quadtree
partitioning.
Fig. 2. All prediction block partitioning modes. Inter picture coded CUs can
apply all modes, while intra picture coded CUs can apply only the first two.
using one of 35 modes may be performed for blocks of size
up to 32 × 32 samples. The smallest block size is unchanged
from H.264/AVC at 4 × 4 and remains a complexity bottleneck
because of the serial nature of intra picture prediction.
For the DC, horizontal, and vertical modes an additional
postprocess is defined in HEVC wherein a row and/or column
is filtered such as to maintain continuity across block bound-
aries. This addition is not expected to have an impact on the
worst case complexity since these three modes are the simplest
to begin with.
In the case of the planar mode, the generating
equations are probably not an adequate measure of complexity,
as it is possible to easily incrementally compute predicted sam-
ple values. For the H.264/AVC plane mode it is expected that
one 16-bit addition, one 16-bit shift, and one clip to the 8-bit
range are required per sample. For the HEVC planar mode,
this becomes three 16-bit additions and one 16-bit shift. These
two modes are thus expected to have similar complexities.
The angular modes of HEVC are more complex than the
directional H.264/AVC modes as multiplication is required.
Each predicted sample is computed as ((32 - w) · x_i + w · x_{i+1} + 16) >> 5, where x_i are reference samples and w is a
weighting factor. The weighting factor remains constant across
a predicted row or column, which facilitates single-instruction
multiple-data (SIMD) implementations. A single function may
be used to cover all 33 prediction angles, thereby reducing the
amount of code needed to implement this feature.
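
As an illustration, the core of the angular prediction for one row can be sketched as below. The derivation of the reference index and the weight w from the prediction angle is omitted, so the helper and its arguments are assumptions rather than the normative process; the point is that w is constant within the row, which keeps the inner loop SIMD-friendly.

```c
#include <stdint.h>

/* Weighted interpolation at the core of angular intra prediction:
 * pred = ((32 - w) * ref[i] + w * ref[i + 1] + 16) >> 5.
 * 'idx' and 'w' are assumed to have been derived from the intra mode
 * angle by the caller; only the per-row filter is shown. */
static void predict_angular_row(uint8_t *pred_row, int width,
                                const uint8_t *ref, int idx, int w)
{
    for (int x = 0; x < width; x++) {
        int a = ref[idx + x];
        int b = ref[idx + x + 1];
        pred_row[x] = (uint8_t)(((32 - w) * a + w * b + 16) >> 5);
    }
}
```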
As in H.264/AVC, reference samples may be smoothed prior
to prediction. The smoothing process is the same although it is
applied more selectively, depending upon the prediction mode.
From an encoding perspective, the increased number of
prediction modes (35 in HEVC versus 9 in H.264/AVC)
will require good mode selection heuristics to maintain a
reasonable search complexity.
C. Inter Picture Prediction
Inter picture prediction, or motion compensation, is concep-
tually very simple in HEVC, but comes with some overhead
compared to H.264/AVC. The use of a separable 8-tap filter
for luma sub-pel positions leads to an increase in memory
bandwidth and in the number of multiply-accumulate opera-
tions required for motion compensation. Filter coefficients are
limited to the 7-bit signed range to minimize hardware cost.
In software, motion compensation of an N × N block consists
of 8 + 56/N 8-bit multiply-accumulate operations per sample
and eight 16-bit multiply-accumulate operations per sample.
For chroma sub-pel positions, a separable 4-tap filter with the
same limitations as for the luma filter coefficients is applied.
This also increases the memory bandwidth and the number
of operations compared to H.264/AVC, where bilinear interpolation is used for chroma sub-pel positions.
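
The operation counts above follow from the separable structure: the horizontal pass filters N + 7 rows of 8-bit input, and the vertical pass filters N rows of the 16-bit intermediate. A simplified C sketch of this two-pass luma interpolation is given below; the coefficients are supplied by the caller, rounding and intermediate scaling assume 8-bit video, and reference-block padding is the caller's responsibility, so this is illustrative rather than the normative filter.

```c
#include <stdint.h>

#define NTAPS 8

/* Two-pass separable 8-tap luma interpolation for an n x n block
 * (n <= 64). The horizontal pass consumes (n + 7) x n 8-bit samples
 * and produces a 16-bit intermediate, i.e., 8 + 56/n 8-bit MACs per
 * output sample; the vertical pass consumes the intermediate, i.e.,
 * eight 16-bit MACs per sample. In HEVC the coefficients are 7-bit
 * signed values summing to 64, so both passes together are undone by
 * the final shift of 12. */
static void mc_luma_8tap(uint8_t *dst, int dst_stride,
                         const uint8_t *src, int src_stride, int n,
                         const int8_t coef_h[NTAPS],
                         const int8_t coef_v[NTAPS])
{
    int16_t tmp[(64 + NTAPS - 1) * 64];
    const uint8_t *s = src - 3 * src_stride - 3;  /* 3 samples above/left */

    /* Horizontal pass: n + 7 rows of 8-bit input. */
    for (int y = 0; y < n + NTAPS - 1; y++)
        for (int x = 0; x < n; x++) {
            int acc = 0;
            for (int k = 0; k < NTAPS; k++)
                acc += coef_h[k] * s[y * src_stride + x + k];
            tmp[y * n + x] = (int16_t)acc;  /* fits 16 bits for HEVC-like filters */
        }

    /* Vertical pass: 16-bit intermediate in, 8-bit prediction out. */
    for (int y = 0; y < n; y++)
        for (int x = 0; x < n; x++) {
            int acc = 0;
            for (int k = 0; k < NTAPS; k++)
                acc += coef_v[k] * tmp[(y + k) * n + x];
            int val = (acc + (1 << 11)) >> 12;  /* simplified rounding for 8-bit video */
            dst[y * dst_stride + x] =
                (uint8_t)(val < 0 ? 0 : val > 255 ? 255 : val);
        }
}
```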
Another area where the implementation cost is increased is
the intermediate storage buffers, particularly in the bipredictive
case. Indeed, two 16-bit buffers are required to hold data,
whereas in H.264/AVC, one 8-bit buffer and one 16-bit buffer
are sufficient. In an HEVC implementation these buffers do
not necessarily need to be increased to reflect the maximum
PB size of 64 × 64. Motion compensation of larger blocks
may be decomposed into, and processed in, smaller blocks to
achieve a desired trade-off between memory requirements and
the number of operations.
H.264/AVC defines restrictions on motion data that are
aimed at reducing memory bandwidth. For example, the num-
ber of motion vectors used in two consecutive macroblocks
is limited. HEVC adopts a different approach and defines
restrictions that are much simpler for an encoder to conform
to: the smallest motion compensation blocks are of luma
size 4 × 8 and 8 × 4, thereby prohibiting 4 × 4 inter picture
prediction, and are constrained to make use of only the first
reference picture list (i.e., no biprediction for 4 × 8 and 8 × 4
luma blocks).
HEVC introduces a so-called merge mode, which sets all
motion parameters of an inter picture predicted block equal
to the parameters of a merge candidate [6]. The merge mode
and the motion vector prediction process optionally allow a
picture to reuse motion vectors of prior pictures for motion
vector coding, in essence similar to the H.264/AVC temporal
direct mode. While H.264/AVC downsamples motion vectors
to the 8 × 8 level, HEVC further reduces memory requirements
by keeping a single motion vector per 16 × 16 block.
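
A minimal sketch of such a compressed temporal motion vector store is shown below; the structure layout and accessor are hypothetical, but they illustrate how a full-resolution sample position maps to a single entry per 16 × 16 luma area.

```c
#include <stdint.h>

typedef struct { int16_t mvx, mvy; int8_t ref_idx; } MvField;

/* Temporal motion data compressed to one entry per 16x16 luma area:
 * every sample position (x, y) in the reference picture maps to the
 * same stored field. Which motion vector of the 16x16 area is kept,
 * and the exact field layout, are illustrative choices here, not the
 * normative rule. */
static const MvField *fetch_temporal_mv(const MvField *mv_grid,
                                        int grid_stride, int x, int y)
{
    return &mv_grid[(y >> 4) * grid_stride + (x >> 4)];
}
```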
HEVC offers more ways to split a picture into motion-
compensated partition patterns. While this does not signifi-
cantly impact a decoder, it leaves an encoder with many more

choices. This additional freedom is expected to increase the
complexity of encoders that fully leverage the capabilities of
HEVC.
D. Transforms and Quantization
H.264/AVC features 4-point and 8-point transforms that
have a very low implementation cost. This low cost is achieved
by relying on simple sequences of shift and add operations.
This design strategy does not easily extend to larger transform
sizes, such as 16- and 32-point. HEVC thus takes a different
approach and simply defines transforms (of size 4 × 4, 8 × 8,
16 × 16, and 32 × 32) as straightforward fixed-point matrix
multiplications. The matrix multiplications for the vertical and
horizontal component of the inverse transform are shown in
(1) and (2), respectively
Y = s(C^T · T) (1)
R = Y^T · T (2)
where s() is a scaling and saturating function that guarantees
that values of Y can be represented using 16 bits. Each factor
in the transform matrix T is represented using signed 8-bit
numbers. Operations are defined such that 16-bit signed coef-
ficients C are multiplied with the factors and, hence, greater
than 16-bit accumulation is required. As the transforms are
integer approximations of a discrete cosine transform, they
retain the symmetry properties thereof, thereby enabling a
partial butterfly implementation. For the 4-point transform, an
alternative transform approximating a discrete sine transform
is also defined.
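
The following C sketch implements (1) and (2) directly as nested loops over an N × N block. The shift values are left as parameters (their normative values depend on bit depth), and the symmetry-based partial butterfly mentioned above is deliberately not exploited, so this reflects the straightforward matrix-multiplication view rather than an optimized implementation.

```c
#include <stdint.h>

/* Inverse transform as two matrix multiplies: Y = s(C^T * T) followed
 * by R = Y^T * T, where T holds signed 8-bit factors and C holds
 * 16-bit coefficients. clip16() plays the role of the scaling and
 * saturating function s(), keeping the intermediate within 16 bits. */
static int16_t clip16(int v)
{
    return (int16_t)(v < -32768 ? -32768 : v > 32767 ? 32767 : v);
}

static void inverse_transform(int16_t *res, const int16_t *coef,
                              const int8_t *t, int n,
                              int shift1, int shift2)
{
    int16_t y[32 * 32];  /* intermediate Y, 16-bit by construction */

    /* First stage, Y = s(C^T * T). */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            int acc = 0;  /* greater-than-16-bit accumulation */
            for (int k = 0; k < n; k++)
                acc += coef[k * n + i] * t[k * n + j];
            y[i * n + j] = clip16((acc + (1 << (shift1 - 1))) >> shift1);
        }

    /* Second stage, R = Y^T * T. */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            int acc = 0;
            for (int k = 0; k < n; k++)
                acc += y[k * n + i] * t[k * n + j];
            res[i * n + j] = clip16((acc + (1 << (shift2 - 1))) >> shift2);
        }
}
```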
Although there has been some concern about the implemen-
tation complexity of the 32-point transform, data given in [7]
indicates 158 cycles for an 8 × 8 inverse transform, 861 cycles
for a 16 × 16 inverse transform, and 4696 cycles for a 32 × 32
inverse transform on an Intel processor. Normalizing these values by the associated block sizes yields 2.47, 3.36, and 4.59 cycles per sample, respectively.
sample of a 32 × 32 inverse transform is thus less than twice
that of an 8 × 8 inverse transform. Furthermore, the cycle
count for larger transforms may often be reduced by taking
advantage of the fact that the most high-frequency coefficients
are typically zero. Determining which bounding subblock of
coefficients is nonzero is facilitated by using a 4 × 4 coding
structure for the entropy coding of transform coefficients. The
bounding subblock may thus be determined at a reasonable
granularity (4 × 4) without having to consider the position of
each nonzero coefficient.
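
A sketch of this bound derivation from per-sub-block coded flags is given below; the flag array layout is an assumption, but it shows how the nonzero region can be found at 4 × 4 granularity without touching individual coefficients.

```c
#include <stdint.h>

/* Given per-4x4-sub-block coded flags (1 if the sub-block contains any
 * nonzero coefficient), compute the bounding region of nonzero
 * coefficients in sub-block units. An inverse transform can then skip
 * the all-zero high-frequency rows and columns. */
static void nonzero_bounds(const uint8_t *sb_coded, int sb_w, int sb_h,
                           int *last_col, int *last_row)
{
    *last_col = -1;
    *last_row = -1;
    for (int y = 0; y < sb_h; y++)
        for (int x = 0; x < sb_w; x++)
            if (sb_coded[y * sb_w + x]) {
                if (x > *last_col) *last_col = x;
                if (y > *last_row) *last_row = y;
            }
    /* A result of -1/-1 means the block is entirely zero. */
}
```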
It should also be noted that the transform order is changed
with respect to H.264/AVC. HEVC defines a column–row
order for the inverse transform. Due to the regular uniform
structure of the matrix multiplication and partial butterfly
designs, this approach may be preferred in both hardware and
software. In software it is preferable to transform rows, as one
entire row of coefficients may easily be held in registers (a row
of thirty-two 32-bit accumulators requires eight 128-bit regis-
ters, which is implementable on several architectures without
register spilling). This property is not necessarily maintained
with more irregular but fully decomposed transform designs,
which look attractive in terms of primitive operation counts,
but require a greater number of registers and software op-
erations to implement. As can be seen from (1), applying
the transpose to the coefficients C allows implementations to
transform rows only. Note that the transpose can be integrated
in the inverse scan without adding complexity.
E. Entropy Coding
Unlike the H.264/AVC specification that features CAVLC
and CABAC [8] entropy coders, HEVC defines CABAC as
the single entropy coding method. CABAC incorporates three
stages: binarization of syntax elements, context modeling, and
binary arithmetic coding. While the acronym and the core
arithmetic coding engine remain the same as in H.264/AVC,
there are a number of differences in context modeling and
binarization as described below.
In the development of HEVC, a substantial amount of effort
has been devoted to reducing the number of contexts. While
version 1.0 of the HM featured in excess of 700 contexts,
version 8.0 has only 154. This number compares favorably to
H.264/AVC, where 299 contexts are used, assuming support
for frame coding in the 4:2:0 color format (progressive high
profile). 237 of these 299 contexts are involved in residual
signal coding whereas HEVC uses 112 of the 154 for this
purpose. When comparing the reduction of 53% in residual
coding with the reduction of 32% for the remaining syntax
elements, it becomes clear that most effort has been put into
reducing the number of contexts associated with the residual
syntax. This reduction in the number of contexts contributes
to lowering the amount of memory required by the entropy
decoder and the cost of initializing the engine. Initialization
values of the states are defined with 8 bits per context, reduced
from 16 in H.264/AVC, thereby further reducing memory
requirements.
One widely used method for determining contexts in
H.264/AVC is to use spatial neighborhood relationships. For
example, the values above and to the left of the current block may
be used to derive a context. In HEVC such spatial dependencies
have been mostly avoided such as to reduce the number of
line buffers.
Substantial effort has also been devoted to enable par-
allel context processing, where a decoder has the abil-
ity to derive multiple context indices in parallel. These
techniques apply mostly to transform coefficient coding,
which becomes the entropy decoding bottleneck at high
bit rates. One example is the modification of the signif-
icance map coding. In H.264/AVC, two interleaved flags
are used to signal whether the current coefficient has a
nonzero value (significant_coeff_flag) and whether it is the
last one in coding order (last_significant_coeff_flag). This
makes it impossible to derive the significant_coeff_flag and
last_significant_coeff_flag contexts in parallel. HEVC breaks
this dependency by explicitly signaling the horizontal and
vertical offset of the last significant coefficient in the current
block before parsing the significant_coeff_flags [9].
The burden of entropy decoding with context modeling
grows with bit rate as more bins need to be processed. There-

Fig. 3. Alignment of 8 × 8 blocks (dashed lines) to which the deblocking
filter can be applied independently. Solid lines represent CTB boundaries.
fore, the bin strings of large syntax elements are divided into
a prefix and a suffix. All prefix bins are coded in regular mode
(i.e., using context modeling), whereas all suffix bins are coded
in a bypass mode. The cost of decoding a bin in bypass mode
is lower than in regular mode. Furthermore, the ratio of bins
to bits is fixed at 1:1 for bypass mode, whereas it is generally
higher for the regular mode. In H.264/AVC, motion vector
differences and transform coefficient levels are binarized using
this method as their values might become quite large. The
boundary between prefix and suffix in H.264/AVC is quite high
for the transform coefficient levels (15 bins). At the highest bit
rates, level coding becomes the bottleneck as it consumes most
of the bits and bins. It is thus desirable to maximize the use
of bypass mode at high bit rates. Consequently, in HEVC, a
new binarization scheme using Golomb-Rice codes reduces the
theoretical worst case number of regular transform coefficient
bins from 15 to 3 [10]. When processing large coefficients, the
boundary between prefix and suffix can be lowered such that
in the worst case a maximum of approximately 1.6 regular
bins need to be processed per coefficient [11]. This average
holds for any block of 16 transform coefficients.
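
The prefix/suffix split can be illustrated with a plain Golomb-Rice binarizer as below. This is a simplified sketch: in HEVC the resulting bins for remaining level values are bypass-coded, the Rice parameter adapts to previously coded levels, and very large values escape to an exp-Golomb suffix, none of which is shown here.

```c
#include <stdint.h>
#include <stdio.h>

/* Golomb-Rice binarization of a level value into a unary prefix of
 * (value >> k) followed by the k low-order bits as suffix. In HEVC,
 * only a few flags per coefficient (significance, greater-than-1,
 * greater-than-2) use regular, context-coded bins; bins such as these
 * would be bypass-coded. */
static int rice_binarize(uint32_t value, unsigned k, char *bins, int max_bins)
{
    int n = 0;
    uint32_t prefix = value >> k;

    for (uint32_t i = 0; i < prefix && n < max_bins; i++)
        bins[n++] = '1';                               /* unary prefix */
    if (n < max_bins)
        bins[n++] = '0';                               /* prefix terminator */
    for (int b = (int)k - 1; b >= 0 && n < max_bins; b--)
        bins[n++] = (char)('0' + ((value >> b) & 1));  /* k-bit suffix */
    bins[n] = '\0';
    return n;
}

int main(void)
{
    char bins[64];
    rice_binarize(11, 2, bins, 63);  /* prefix "110", suffix "11" */
    printf("%s\n", bins);
    return 0;
}
```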
F. Deblocking Filter
The deblocking filter relies on the same principles as in
H.264/AVC and shares many design aspects with it. However,
it differs in ways that have a significant impact on complexity.
While in H.264/AVC each edge on a 4 × 4 grid may be filtered,
HEVC limits the filtering to the edges lying on an 8 × 8
grid. This immediately reduces by half the number of filter
modes that need to be computed and the number of samples
that may be filtered. The order in which edges are processed
is also modified such as to enable parallel processing. A
picture may be segmented into 8 × 8 blocks that can all be
processed in parallel, as only edges internal to these blocks
need to be filtered. The position of these blocks is depicted
in Fig. 3. Some of these blocks overlap CTB boundaries, and
slice boundaries when multiple slices are present. This feature
makes it possible to filter slice boundaries in any order without
affecting the reconstructed picture.
Note that vertical edges are filtered before horizontal edges.
Consequently, modified samples resulting from filtering verti-
cal edges are used in filtering horizontal edges. This allows for
different parallel implementations. In one, all vertical edges
are filtered in parallel, then horizontal edges are filtered in
parallel. Another implementation would enable simultaneous
parallel processing of vertical and horizontal edges, where the
horizontal edge filtering process is delayed in a way such that
the samples to be filtered have already been processed by the
vertical edge filter.
However, there are also aspects of HEVC that increase the
complexity of the filter, such as the addition of clipping in the
strong filter mode.
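
The vertical-then-horizontal pass ordering described above can be sketched as follows. The per-edge decision and filtering are caller-supplied placeholders, and boundary-strength derivation is omitted; the loop structure simply shows why all vertical edges, and subsequently all horizontal edges, can be processed in parallel on the 8 × 8 grid.

```c
#include <stdint.h>

typedef void (*edge_filter_fn)(uint8_t *rec, int stride, int x, int y);

/* Deblocking pass order on the 8x8 filter grid: all vertical edges of
 * the picture are filtered first (each iteration independent), then all
 * horizontal edges, which read the already deblocked output of the
 * vertical pass. Picture-border edges are not filtered. */
static void deblock_picture(uint8_t *rec, int stride, int width, int height,
                            edge_filter_fn filter_ver, edge_filter_fn filter_hor)
{
    /* Pass 1: vertical edges, one every 8 luma columns. */
    for (int y = 0; y < height; y += 8)
        for (int x = 8; x < width; x += 8)
            filter_ver(rec, stride, x, y);

    /* Pass 2: horizontal edges, one every 8 luma rows. */
    for (int y = 8; y < height; y += 8)
        for (int x = 0; x < width; x += 8)
            filter_hor(rec, stride, x, y);
}
```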
G. Sample-Adaptive Offset Filter
Compared to H.264/AVC, where only a deblocking filter
is applied in the decoding loop, the current draft HEVC
specification features an additional sample-adaptive offset
(SAO) filter. This filter represents an additional stage, thereby
increasing complexity.
The SAO filter simply adds offset values to certain sample
values and it can be implemented in a fairly straightforward
way, where the offset to be added to each sample may be
obtained by indexing a small lookup table. The index into
the lookup table may be computed according to one of the
two modes being used. For one of the modes, the so-called
band offset, the sample values are quantized to index the
table, so that all samples lying in one band of the value range
use the same offset. The other mode, edge offset,
requires more operations since it calculates the index based on
differences between the current and two neighboring samples.
Although the operations are simple, SAO represents an added
burden as it may require either an additional decoding pass,
or an increase in line buffers. The offsets are transmitted in
the bitstream and thus need to be derived by an encoder. If
considering all SAO modes, the search process in the encoder
can be expected to require about an order of magnitude more
computation than the SAO decoding process.
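
For the band offset mode, a minimal sketch is shown below, assuming 8-bit samples and 32 bands of width 8; parsing of the signaled offsets and the edge offset mode are not shown.

```c
#include <stdint.h>

/* SAO band-offset filtering for 8-bit samples: each sample value is
 * quantized to one of 32 bands of width 8 and the band index selects
 * an offset from a small table. Most table entries are zero; only a
 * few consecutive bands carry signaled offsets. */
static void sao_band_offset(uint8_t *rec, int stride, int w, int h,
                            const int8_t band_offset[32])
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            int v = rec[y * stride + x];
            v += band_offset[v >> 3];  /* band index = v / 8 */
            rec[y * stride + x] = (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
        }
}
```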
H. High-Level Parallelism
High-level parallelism refers to the ability to simultaneously
process multiple regions of a single picture. Support for such
parallelism may be advantageous to both encoders and de-
coders where multiple identical processing cores may be used
in parallel. HEVC includes three concepts that enable some
degree of high-level parallelism: slices, tiles, and wavefronts.
Slices follow the same concept as in H.264/AVC and allow
a picture to be partitioned into groups of consecutive CTUs
in raster scan order, each for transmission in a separate
network adaptation layer unit that may be parsed and decoded
independently, except for optional interslice filtering. Slices
break prediction dependences at their boundary, which causes
a loss in coding efficiency and can also create visible artifacts
at these borders. The design of slices is more concerned with
error resilience or maximum transmission unit size matching
than a parallel coding technique, although it has undoubtedly
been exploited for this purpose in the past.
Tiles can be used to split a picture horizontally and vertically
into multiple rectangular regions. Like slices, tiles break
prediction dependences at their boundaries. Within a picture,
consecutive tiles are represented in raster scan order. The
scan order of CTBs remains a raster scan, but is limited to
the confines of each tile boundary. When splitting a picture

Fig. 4. Example of wavefront processing. Each CTB row can be processed
in parallel. For processing the striped CTB in each row, the processing of the
shaded CTBs in the row above needs to be finished.
horizontally, tiles may be used to reduce line buffer sizes in an
encoder, as it operates on regions narrower than a full picture.
Tiles also permit the composition of a picture from multiple
rectangular sources that are encoded independently.
Wavefronts split a picture into CTU rows, where each CTU
row may be processed in a different thread. Dependences
between rows are maintained except for the CABAC context
state, which is reinitialized at the beginning of each CTU row.
To improve the compression efficiency, rather than performing
a normal CABAC reinitialization, the context state is inherited
from the second CTU of the previous row, permitting a simple
form of 2-D adaptation. Fig. 4 illustrates this process.
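
The dependency illustrated in Fig. 4 can be captured in a small readiness test, sketched below under the assumption that a per-row counter tracks completed CTUs; the thread pool, the actual CTU decoding, and the saving and restoring of the CABAC context state are omitted.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Wavefront scheduling sketch. ctus_done[r] counts how many CTUs of
 * row r have been fully decoded. The CTU at column c of row r may
 * start once the row above has finished its first c + 2 CTUs, i.e.,
 * up to and including the above-right neighbor; by then the CABAC
 * context state saved after the second CTU of the row above is also
 * available for initializing row r. */
static bool ctu_ready(atomic_int *ctus_done, int row, int col)
{
    if (row == 0)
        return true;
    return atomic_load(&ctus_done[row - 1]) >= col + 2;
}
```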
To enable a decoder to exploit parallel processing of tiles
and wavefronts, it must be possible to identify the position in
the bitstream where each tile or slice starts. This overhead is
kept to a minimum by providing a table of offsets, describing
the entry point of each tile or slice. While it may seem
excessive to signal every entry point without the option to
omit some, in the case of tiles, their presence allows decoder
designers to choose between decoding each tile individually
following the per tile raster scan, or decoding CTUs in the
picture raster scan order. As for wavefronts, requiring there to
be as many wavefront entry points as CTU rows resolves the
conflict between the optimal number of wavefronts for differ-
ent encoder and decoder architectures, especially in situations
where the encoder has no knowledge of the decoder.
The current draft HEVC standard does not permit the
simultaneous use of tiles and wavefronts when there is more
than one tile per picture. However, neither tiles nor wavefronts
prohibit the use of slices.
It is interesting to examine the implementation burden of
the tile and wavefront tools in the context of a single-core
architecture and that of a multicore architecture. In the case
of a single-core implementation for tiles, the extra overhead
comes in the form of more complicated boundary condition
checking, performing a CABAC reset for each tile, and the need
to perform the optional filtering of tile boundaries. There is
also the potential for improved data-locality and cache access
associated with operating on a subregion of the picture. In a
wavefront implementation, additional storage is required to
save the CABAC context state between CTU rows and to
perform a CABAC reset at the start of each row using this
saved state.
In the case of a multicore implementation, the additional
overhead compared to the single-core case relates to memory-
bandwidth. Since each tile is completely independent, each
processing core may decode any tile with little intercore
communication or synchronization. A complication is the man-
agement of performing in-loop filtering across the tile bound-
aries, which can either be delegated to a postprocess, or with
some loose synchronization and some data exchange, may be
performed on the fly. A multicore wavefront implementation
will require a higher degree of communication between cores
and more frequent synchronization operations than a tile-based
alternative, due to the sharing of reconstructed samples and
mode predictors between CTU rows. The maximum parallel
improvement from a wavefront implementation is limited by
the ramp-up time required for all cores to become fully
utilized and a higher susceptibility to dependency related stalls
between CTB rows.
All high-level parallelization tools become more useful
with image sizes growing beyond HD for both encoder and
decoder. At small image sizes where real-time decoding in a
single-threaded manner is possible, the overhead associated
with parallelization might be too high for there to be any
meaningful benefit. For large image sizes it might be useful to
enforce a minimum number of picture partitions to guarantee
a minimum level of parallelism for the decoder. However, the
current draft HEVC standard does not mandate the use of any
high-level parallelism tools. As such, their use in decoders is
only a benefit to architectures that can opportunistically exploit
them.
I. Miscellaneous
The total amount of memory required for HEVC decoding
can be expected to be similar to that for H.264/AVC decoding.
Most of the memory is required for the decoded picture buffer
that holds multiple pictures. The size of this buffer, as defined
by levels, may be larger in HEVC for a given maximum
picture size. Such an increase in memory requirement is not
a fundamental property of the HEVC design, but comes from
the desire to harmonize the size of the buffer in picture units
across all levels.
HEVC may also require more cache memory due to the
larger block sizes that it supports. In H.264/AVC, macroblocks
of size 16 × 16 define the buffer size required for storing
predictions and residuals. In HEVC, intra picture prediction
and transforms may be of size 32 × 32, and the size of the
associated buffers thus quadruples.
It should also be noted that HEVC lacks coding tools
specific to field coding. The absence of such tools, in particular
tools that enable switching between frame and field coding
within a frame (such as MBAFF in H.264/AVC), considerably
simplifies the design.
J. Summary
While the complexity of some key modules such as trans-
forms, intra picture prediction, and motion compensation is
likely higher in HEVC than in H.264/AVC, complexity was
reduced in others such as entropy coding and deblocking.
Complexity differences in motion compensation, entropy cod-
ing, and in-loop filtering are expected to be the most substan-
tial. The implementation cost of an HEVC decoder is thus

References
[2] Overview of the High Efficiency Video Coding (HEVC) Standard.
[8] Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard.