IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 22, NO. 12, DECEMBER 2012 1685
HEVC Complexity and Implementation Analysis
Frank Bossen, Member, IEEE, Benjamin Bross, Student Member, IEEE, Karsten Sühring, and David Flynn
Abstract—Advances in video compression technology have
been driven by ever-increasing processing power available in
software and hardware. The emerging High Efficiency Video
Coding (HEVC) standard aims to provide a doubling in coding
efficiency with respect to the H.264/AVC high profile, delivering
the same video quality at half the bit rate. In this paper,
complexity-related aspects that were considered in the standard-
ization process are described. Furthermore, profiling of reference
software and optimized software gives an indication of where
HEVC may be more complex than its predecessors and where it
may be simpler. Overall, the complexity of HEVC decoders does
not appear to be significantly different from that of H.264/AVC
decoders; this makes HEVC decoding in software very practical
on current hardware. HEVC encoders are expected to be several
times more complex than H.264/AVC encoders and will be a
subject of research in years to come.
Index Terms—High Efficiency Video Coding (HEVC), video
coding.
I. Introduction
THIS PAPER gives an overview of complexity and im-
plementation issues in the context of the emerging High
Efficiency Video Coding (HEVC) standard. The HEVC project
is conducted by the Joint Collaborative Team on Video Coding
(JCT-VC), and is a joint effort between ITU-T and ISO/IEC.
Reference software, called the HEVC test model (HM), is
being developed along with the draft standard. At the time of
writing, the current version of HM is 8.0, which corresponds
to the HEVC text specification draft 8 [1]. It is assumed that
the reader has some familiarity with the draft HEVC standard,
an overview of which can be found in [2].
Complexity assessment is a complex topic in itself, and one
aim of this paper is to highlight some aspects of the HEVC
design where some notion of complexity was considered. This
is the topic of Section II.
A second aim of this paper is to provide and discuss data
resulting from profiling existing software implementations of
HEVC. Sections III and IV present results obtained with
Manuscript received May 26, 2012; revised August 19, 2012; accepted
August 24, 2012. Date of publication October 2, 2012; date of current
version January 8, 2013. This paper was recommended by Associate Editor
H. Gharavi.
F. Bossen is with DOCOMO Innovations, Inc., Palo Alto, CA 94304 USA
(e-mail: bossen@docomoinnovations.com).
B. Bross and K. Sühring are with the Image and Video Coding
Group, Fraunhofer Institute for Telecommunications–Heinrich Hertz In-
stitute, Berlin 10587, Germany (e-mail: benjamin.bross@hhi.fraunhofer.de;
karsten.suehring@hhi.fraunhofer.de).
D. Flynn is with Research In Motion, Ltd., Waterloo, ON N2L 3W8, Canada
(e-mail: dflynn@iee.org).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSVT.2012.2221255
the HM encoder and decoder, and Section V discusses an
optimized implementation of a decoder.
II. Design Aspects
A. Quadtree-Based Block Partitioning
HEVC retains the basic hybrid coding architecture of prior
video coding standards, such as H.264/AVC [3]. A significant
difference lies in the use of a more adaptive quadtree structure
based on a coding tree unit (CTU) instead of a macroblock.
In principle, the quadtree coding structure is described by
means of blocks and units. A block defines an array of
samples and sizes thereof, whereas a unit encapsulates one
luma and corresponding chroma blocks together with syntax
needed to code these. Consequently, a CTU includes coding
tree blocks (CTB) and syntax specifying coding data and
further subdivision. This subdivision results in coding unit
(CU) leaves with coding blocks (CB). Each CU incorporates
more entities for the purpose of prediction, so-called prediction
units (PU), and of transform, so-called transform units (TU).
Similarly, each CB is split into prediction blocks (PB) and
transform blocks (TB). This variable-size, adaptive approach
is particularly suited to larger resolutions, such as 4k × 2k,
which is a target resolution for some HEVC applications. An
exemplary CB and TB quadtree structure is given in Fig. 1. All
partitioning modes specifying how to split a CB into PBs are
depicted in Fig. 2. The decoding of the quadtree structures
is not much of an additional burden because the quadtrees
can be easily traversed in a depth-first fashion using a z-
scan order. Partitioning modes for inter picture coded CUs
feature nonsquare PUs. Support for these nonsquare shapes
requires additional logic in a decoder as multiple conversions
between z-scan and raster scan orders may be required. At the
encoder side, simple tree-pruning algorithms exist to estimate
the optimal partitioning in a rate-distortion sense [4], [5].
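
To make the traversal concrete, the following minimal C sketch walks a coding quadtree in depth-first z-scan order. The split decision and the leaf handling are toy placeholders (a fixed split rule and a printout) rather than HEVC syntax parsing, and picture-boundary handling is omitted.

```c
#include <stdio.h>

/* Toy split decision standing in for the parsed split_cu_flag:
 * here every block larger than 16x16 is split. */
static int split_cu_flag(int x0, int y0, int log2_size)
{
    (void)x0; (void)y0;
    return log2_size > 4;
}

/* Depth-first traversal of the coding quadtree. Visiting the four child
 * quadrants top-left, top-right, bottom-left, bottom-right yields the
 * z-scan order in which CU leaves are decoded. */
static void traverse_cu_tree(int x0, int y0, int log2_size, int log2_min_cb)
{
    if (log2_size > log2_min_cb && split_cu_flag(x0, y0, log2_size)) {
        int half = 1 << (log2_size - 1);
        traverse_cu_tree(x0,        y0,        log2_size - 1, log2_min_cb);
        traverse_cu_tree(x0 + half, y0,        log2_size - 1, log2_min_cb);
        traverse_cu_tree(x0,        y0 + half, log2_size - 1, log2_min_cb);
        traverse_cu_tree(x0 + half, y0 + half, log2_size - 1, log2_min_cb);
    } else {
        printf("CU leaf at (%d,%d), size %dx%d\n",
               x0, y0, 1 << log2_size, 1 << log2_size);
    }
}

int main(void)
{
    traverse_cu_tree(0, 0, 6, 3);  /* one 64x64 CTB, 8x8 minimum CU */
    return 0;
}
```

Running it for a 64 × 64 CTB with an 8 × 8 minimum CU size prints the CU leaves in the same z-scan order in which a decoder would visit them.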
Sections below describe various tools of HEVC and review
complexity aspects that were considered in the development
of the HEVC specification, using H.264/AVC as a reference
where appropriate.
B. Intra Picture Prediction
Intra picture prediction in HEVC is quite similar to
H.264/AVC. Samples are predicted from reconstructed sam-
ples of neighboring blocks. The mode categories remain iden-
tical: DC, plane, horizontal/vertical, and directional; although
the nomenclature has somewhat changed with planar and
angular, respectively, corresponding to H.264/AVC’s plane and
directional modes. A significant change comes from the intro-
duction of larger block sizes, where intra picture prediction

Fig. 1. Detail of 4k × 2k Traffic sequence showing the coding block (white)
and nested transform block (red) structure resulting from recursive quadtree
partitioning.
Fig. 2. All prediction block partitioning modes. Inter picture coded CUs can
apply all modes, while intra picture coded CUs can apply only the first two.
using one of 35 modes may be performed for blocks of size
up to 32 × 32 samples. The smallest block size is unchanged
from H.264/AVC at 4 × 4 and remains a complexity bottleneck
because of the serial nature of intra picture prediction.
For the DC, horizontal, and vertical modes an additional
postprocess is defined in HEVC wherein a row and/or column
is filtered such as to maintain continuity across block bound-
aries. This addition is not expected to have an impact on the
worst case complexity since these three modes are the simplest
to begin with.
In the case of the planar mode, the generating
equations are probably not an adequate measure of complexity,
as it is possible to easily incrementally compute predicted sam-
ple values. For the H.264/AVC plane mode it is expected that
one 16-bit addition, one 16-bit shift, and one clip to the 8-bit
range are required per sample. For the HEVC planar mode,
this becomes three 16-bit additions and one 16-bit shift. These
two modes are thus expected to have similar complexities.
The angular modes of HEVC are more complex than the
directional H.264/AVC modes as multiplication is required.
Each predicted sample is computed as ((32 - w) · x_i + w · x_{i+1} + 16) >> 5, where x_i are reference samples and w is a
weighting factor. The weighting factor remains constant across
a predicted row or column, which facilitates single-instruction
multiple-data (SIMD) implementations. A single function may
be used to cover all 33 prediction angles, thereby reducing the
amount of code needed to implement this feature.
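
As an illustration, the core of the angular prediction for one row can be sketched as below. The derivation of the reference index and the weight w from the prediction angle is omitted, so the helper and its arguments are assumptions rather than the normative process; the point is that w is constant within the row, which keeps the inner loop SIMD-friendly.

```c
#include <stdint.h>

/* Weighted interpolation at the core of angular intra prediction:
 * pred = ((32 - w) * ref[i] + w * ref[i + 1] + 16) >> 5.
 * 'idx' and 'w' are assumed to have been derived from the intra mode
 * angle by the caller; only the per-row filter is shown. */
static void predict_angular_row(uint8_t *pred_row, int width,
                                const uint8_t *ref, int idx, int w)
{
    for (int x = 0; x < width; x++) {
        int a = ref[idx + x];
        int b = ref[idx + x + 1];
        pred_row[x] = (uint8_t)(((32 - w) * a + w * b + 16) >> 5);
    }
}
```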
As in H.264/AVC, reference samples may be smoothed prior
to prediction. The smoothing process is the same although it is
applied more selectively, depending upon the prediction mode.
From an encoding perspective, the increased number of
prediction modes (35 in HEVC versus 9 in H.264/AVC)
will require good mode selection heuristics to maintain a
reasonable search complexity.
C. Inter Picture Prediction
Inter picture prediction, or motion compensation, is concep-
tually very simple in HEVC, but comes with some overhead
compared to H.264/AVC. The use of a separable 8-tap filter
for luma sub-pel positions leads to an increase in memory
bandwidth and in the number of multiply-accumulate opera-
tions required for motion compensation. Filter coefficients are
limited to the 7-bit signed range to minimize hardware cost.
In software, motion compensation of an N × N block consists
of 8 + 56/N 8-bit multiply-accumulate operations per sample
and eight 16-bit multiply-accumulate operations per sample.
For chroma sub-pel positions, a separable 4-tap filter with the
same limitations as for the luma filter coefficients is applied.
This also increases the memory bandwidth and the number
of operations compared to H.264/AVC, where bilinear interpolation is used for chroma sub-pel positions.
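
The operation counts above follow from the separable structure: the horizontal pass filters N + 7 rows of 8-bit input, and the vertical pass filters N rows of the 16-bit intermediate. A simplified C sketch of this two-pass luma interpolation is given below; the coefficients are supplied by the caller, rounding and intermediate scaling assume 8-bit video, and reference-block padding is the caller's responsibility, so this is illustrative rather than the normative filter.

```c
#include <stdint.h>

#define NTAPS 8

/* Two-pass separable 8-tap luma interpolation for an n x n block
 * (n <= 64). The horizontal pass consumes (n + 7) x n 8-bit samples
 * and produces a 16-bit intermediate, i.e., 8 + 56/n 8-bit MACs per
 * output sample; the vertical pass consumes the intermediate, i.e.,
 * eight 16-bit MACs per sample. In HEVC the coefficients are 7-bit
 * signed values summing to 64, so both passes together are undone by
 * the final shift of 12. */
static void mc_luma_8tap(uint8_t *dst, int dst_stride,
                         const uint8_t *src, int src_stride, int n,
                         const int8_t coef_h[NTAPS],
                         const int8_t coef_v[NTAPS])
{
    int16_t tmp[(64 + NTAPS - 1) * 64];
    const uint8_t *s = src - 3 * src_stride - 3;  /* 3 samples above/left */

    /* Horizontal pass: n + 7 rows of 8-bit input. */
    for (int y = 0; y < n + NTAPS - 1; y++)
        for (int x = 0; x < n; x++) {
            int acc = 0;
            for (int k = 0; k < NTAPS; k++)
                acc += coef_h[k] * s[y * src_stride + x + k];
            tmp[y * n + x] = (int16_t)acc;  /* fits 16 bits for HEVC-like filters */
        }

    /* Vertical pass: 16-bit intermediate in, 8-bit prediction out. */
    for (int y = 0; y < n; y++)
        for (int x = 0; x < n; x++) {
            int acc = 0;
            for (int k = 0; k < NTAPS; k++)
                acc += coef_v[k] * tmp[(y + k) * n + x];
            int val = (acc + (1 << 11)) >> 12;  /* simplified rounding for 8-bit video */
            dst[y * dst_stride + x] =
                (uint8_t)(val < 0 ? 0 : val > 255 ? 255 : val);
        }
}
```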
Another area where the implementation cost is increased is
the intermediate storage buffers, particularly in the bipredictive
case. Indeed, two 16-bit buffers are required to hold data,
whereas in H.264/AVC, one 8-bit buffer and one 16-bit buffer
are sufficient. In an HEVC implementation these buffers do
not necessarily need to be increased to reflect the maximum
PB size of 64 × 64. Motion compensation of larger blocks
may be decomposed into, and processed in, smaller blocks to
achieve a desired trade-off between memory requirements and
the number of operations.
H.264/AVC defines restrictions on motion data that are
aimed at reducing memory bandwidth. For example, the num-
ber of motion vectors used in two consecutive macroblocks
is limited. HEVC adopts a different approach and defines
restrictions that are much simpler for an encoder to conform
to: the smallest motion compensation blocks are of luma
size 4 × 8 and 8 × 4, thereby prohibiting 4 × 4 inter picture
prediction, and are constrained to make use of only the first
reference picture list (i.e., no biprediction for 4 × 8 and 8 × 4
luma blocks).
HEVC introduces a so-called merge mode, which sets all
motion parameters of an inter picture predicted block equal
to the parameters of a merge candidate [6]. The merge mode
and the motion vector prediction process optionally allow a
picture to reuse motion vectors of prior pictures for motion
vector coding, in essence similar to the H.264/AVC temporal
direct mode. While H.264/AVC downsamples motion vectors
to the 8 × 8 level, HEVC further reduces memory requirements
by keeping a single motion vector per 16 × 16 block.
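
A minimal sketch of such a compressed temporal motion vector store is shown below; the structure layout and accessor are hypothetical, but they illustrate how a full-resolution sample position maps to a single entry per 16 × 16 luma area.

```c
#include <stdint.h>

typedef struct { int16_t mvx, mvy; int8_t ref_idx; } MvField;

/* Temporal motion data compressed to one entry per 16x16 luma area:
 * every sample position (x, y) in the reference picture maps to the
 * same stored field. Which motion vector of the 16x16 area is kept,
 * and the exact field layout, are illustrative choices here, not the
 * normative rule. */
static const MvField *fetch_temporal_mv(const MvField *mv_grid,
                                        int grid_stride, int x, int y)
{
    return &mv_grid[(y >> 4) * grid_stride + (x >> 4)];
}
```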
HEVC offers more ways to split a picture into motion-
compensated partition patterns. While this does not signifi-
cantly impact a decoder, it leaves an encoder with many more

choices. This additional freedom is expected to increase the
complexity of encoders that fully leverage the capabilities of
HEVC.
D. Transforms and Quantization
H.264/AVC features 4-point and 8-point transforms that
have a very low implementation cost. This low cost is achieved
by relying on simple sequences of shift and add operations.
This design strategy does not easily extend to larger transform
sizes, such as 16- and 32-point. HEVC thus takes a different
approach and simply defines transforms (of size 4 × 4, 8 × 8,
16 × 16, and 32 × 32) as straightforward fixed-point matrix
multiplications. The matrix multiplications for the vertical and
horizontal component of the inverse transform are shown in
(1) and (2), respectively
Y = s(C^T · T) (1)
R = Y^T · T (2)
where s() is a scaling and saturating function that guarantees
that values of Y can be represented using 16 bits. Each factor
in the transform matrix T is represented using signed 8-bit
numbers. Operations are defined such that 16-bit signed coef-
ficients C are multiplied with the factors and, hence, greater
than 16-bit accumulation is required. As the transforms are
integer approximations of a discrete cosine transform, they
retain the symmetry properties thereof, thereby enabling a
partial butterfly implementation. For the 4-point transform, an
alternative transform approximating a discrete sine transform
is also defined.
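
The following C sketch implements (1) and (2) directly as nested loops over an N × N block. The shift values are left as parameters (their normative values depend on bit depth), and the symmetry-based partial butterfly mentioned above is deliberately not exploited, so this reflects the straightforward matrix-multiplication view rather than an optimized implementation.

```c
#include <stdint.h>

/* Inverse transform as two matrix multiplies: Y = s(C^T * T) followed
 * by R = Y^T * T, where T holds signed 8-bit factors and C holds
 * 16-bit coefficients. clip16() plays the role of the scaling and
 * saturating function s(), keeping the intermediate within 16 bits. */
static int16_t clip16(int v)
{
    return (int16_t)(v < -32768 ? -32768 : v > 32767 ? 32767 : v);
}

static void inverse_transform(int16_t *res, const int16_t *coef,
                              const int8_t *t, int n,
                              int shift1, int shift2)
{
    int16_t y[32 * 32];  /* intermediate Y, 16-bit by construction */

    /* First stage, Y = s(C^T * T). */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            int acc = 0;  /* greater-than-16-bit accumulation */
            for (int k = 0; k < n; k++)
                acc += coef[k * n + i] * t[k * n + j];
            y[i * n + j] = clip16((acc + (1 << (shift1 - 1))) >> shift1);
        }

    /* Second stage, R = Y^T * T. */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            int acc = 0;
            for (int k = 0; k < n; k++)
                acc += y[k * n + i] * t[k * n + j];
            res[i * n + j] = clip16((acc + (1 << (shift2 - 1))) >> shift2);
        }
}
```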
Although there has been some concern about the implemen-
tation complexity of the 32-point transform, data given in [7]
indicates 158 cycles for an 8 × 8 inverse transform, 861 cycles
for a 16 × 16 inverse transform, and 4696 cycles for a 32 × 32
inverse transform on an Intel processor. Normalizing these values by the associated block sizes yields 2.47, 3.36, and 4.59 cycles per sample, respectively.
sample of a 32 × 32 inverse transform is thus less than twice
that of an 8 × 8 inverse transform. Furthermore, the cycle
count for larger transforms may often be reduced by taking
advantage of the fact that the most high-frequency coefficients
are typically zero. Determining which bounding subblock of
coefficients is nonzero is facilitated by using a 4 × 4 coding
structure for the entropy coding of transform coefficients. The
bounding subblock may thus be determined at a reasonable
granularity (4 × 4) without having to consider the position of
each nonzero coefficient.
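
A sketch of this bound derivation from per-sub-block coded flags is given below; the flag array layout is an assumption, but it shows how the nonzero region can be found at 4 × 4 granularity without touching individual coefficients.

```c
#include <stdint.h>

/* Given per-4x4-sub-block coded flags (1 if the sub-block contains any
 * nonzero coefficient), compute the bounding region of nonzero
 * coefficients in sub-block units. An inverse transform can then skip
 * the all-zero high-frequency rows and columns. */
static void nonzero_bounds(const uint8_t *sb_coded, int sb_w, int sb_h,
                           int *last_col, int *last_row)
{
    *last_col = -1;
    *last_row = -1;
    for (int y = 0; y < sb_h; y++)
        for (int x = 0; x < sb_w; x++)
            if (sb_coded[y * sb_w + x]) {
                if (x > *last_col) *last_col = x;
                if (y > *last_row) *last_row = y;
            }
    /* A result of -1/-1 means the block is entirely zero. */
}
```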
It should also be noted that the transform order is changed
with respect to H.264/AVC. HEVC defines a column–row
order for the inverse transform. Due to the regular uniform
structure of the matrix multiplication and partial butterfly
designs, this approach may be preferred in both hardware and
software. In software it is preferable to transform rows, as one
entire row of coefficients may easily be held in registers (a row
of thirty-two 32-bit accumulators requires eight 128-bit regis-
ters, which is implementable on several architectures without
register spilling). This property is not necessarily maintained
with more irregular but fully decomposed transform designs,
which look attractive in terms of primitive operation counts,
but require a greater number of registers and software op-
erations to implement. As can be seen from (1), applying
the transpose to the coefficients C allows implementations to
transform rows only. Note that the transpose can be integrated
in the inverse scan without adding complexity.
E. Entropy Coding
Unlike the H.264/AVC specification that features CAVLC
and CABAC [8] entropy coders, HEVC defines CABAC as
the single entropy coding method. CABAC incorporates three
stages: binarization of syntax elements, context modeling, and
binary arithmetic coding. While the acronym and the core
arithmetic coding engine remain the same as in H.264/AVC,
there are a number of differences in context modeling and
binarization as described below.
In the development of HEVC, a substantial amount of effort
has been devoted to reducing the number of contexts. While
version 1.0 of the HM featured in excess of 700 contexts,
version 8.0 has only 154. This number compares favorably to
H.264/AVC, where 299 contexts are used, assuming support
for frame coding in the 4:2:0 color format (progressive high
profile). 237 of these 299 contexts are involved in residual
signal coding whereas HEVC uses 112 of the 154 for this
purpose. When comparing the reduction of 53% in residual
coding with the reduction of 32% for the remaining syntax
elements, it becomes clear that most effort has been put into
reducing the number of contexts associated with the residual
syntax. This reduction in the number of contexts contributes
to lowering the amount of memory required by the entropy
decoder and the cost of initializing the engine. Initialization
values of the states are defined with 8 bits per context, reduced
from 16 in H.264/AVC, thereby further reducing memory
requirements.
One widely used method for determining contexts in
H.264/AVC is to use spatial neighborhood relationships. For
example, the values above and to the left of the current block may
be used to derive a context. In HEVC such spatial dependencies
have been mostly avoided such as to reduce the number of
line buffers.
Substantial effort has also been devoted to enable par-
allel context processing, where a decoder has the abil-
ity to derive multiple context indices in parallel. These
techniques apply mostly to transform coefficient coding,
which becomes the entropy decoding bottleneck at high
bit rates. One example is the modification of the signif-
icance map coding. In H.264/AVC, two interleaved flags
are used to signal whether the current coefficient has a
nonzero value (significant_coeff_flag) and whether it is the
last one in coding order (last_significant_coeff_flag). This
makes it impossible to derive the significant_coeff_flag and
last_significant_coeff_flag contexts in parallel. HEVC breaks
this dependency by explicitly signaling the horizontal and
vertical offset of the last significant coefficient in the current
block before parsing the significant_coeff_flags [9].
The burden of entropy decoding with context modeling
grows with bit rate as more bins need to be processed. There-

Fig. 3. Alignment of 8 × 8 blocks (dashed lines) to which the deblocking
filter can be applied independently. Solid lines represent CTB boundaries.
fore, the bin strings of large syntax elements are divided into
a prefix and a suffix. All prefix bins are coded in regular mode
(i.e., using context modeling), whereas all suffix bins are coded
in a bypass mode. The cost of decoding a bin in bypass mode
is lower than in regular mode. Furthermore, the ratio of bins
to bits is fixed at 1:1 for bypass mode, whereas it is generally
higher for the regular mode. In H.264/AVC, motion vector
differences and transform coefficient levels are binarized using
this method as their values might become quite large. The
boundary between prefix and suffix in H.264/AVC is quite high
for the transform coefficient levels (15 bins). At the highest bit
rates, level coding becomes the bottleneck as it consumes most
of the bits and bins. It is thus desirable to maximize the use
of bypass mode at high bit rates. Consequently, in HEVC, a
new binarization scheme using Golomb-Rice codes reduces the
theoretical worst case number of regular transform coefficient
bins from 15 to 3 [10]. When processing large coefficients, the
boundary between prefix and suffix can be lowered such that
in the worst case a maximum of approximately 1.6 regular
bins need to be processed per coefficient [11]. This average
holds for any block of 16 transform coefficients.
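
The prefix/suffix split can be illustrated with a plain Golomb-Rice binarizer as below. This is a simplified sketch: in HEVC the resulting bins for remaining level values are bypass-coded, the Rice parameter adapts to previously coded levels, and very large values escape to an exp-Golomb suffix, none of which is shown here.

```c
#include <stdint.h>
#include <stdio.h>

/* Golomb-Rice binarization of a level value into a unary prefix of
 * (value >> k) followed by the k low-order bits as suffix. In HEVC,
 * only a few flags per coefficient (significance, greater-than-1,
 * greater-than-2) use regular, context-coded bins; bins such as these
 * would be bypass-coded. */
static int rice_binarize(uint32_t value, unsigned k, char *bins, int max_bins)
{
    int n = 0;
    uint32_t prefix = value >> k;

    for (uint32_t i = 0; i < prefix && n < max_bins; i++)
        bins[n++] = '1';                               /* unary prefix */
    if (n < max_bins)
        bins[n++] = '0';                               /* prefix terminator */
    for (int b = (int)k - 1; b >= 0 && n < max_bins; b--)
        bins[n++] = (char)('0' + ((value >> b) & 1));  /* k-bit suffix */
    bins[n] = '\0';
    return n;
}

int main(void)
{
    char bins[64];
    rice_binarize(11, 2, bins, 63);  /* prefix "110", suffix "11" */
    printf("%s\n", bins);
    return 0;
}
```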
F. Deblocking Filter
The deblocking filter relies on the same principles as in
H.264/AVC and shares many design aspects with it. However,
it differs in ways that have a significant impact on complexity.
While in H.264/AVC each edge on a 4 × 4 grid may be filtered,
HEVC limits the filtering to the edges lying on an 8 × 8
grid. This immediately reduces by half the number of filter
modes that need to be computed and the number of samples
that may be filtered. The order in which edges are processed
is also modified such as to enable parallel processing. A
picture may be segmented into 8 × 8 blocks that can all be
processed in parallel, as only edges internal to these blocks
need to be filtered. The position of these blocks is depicted
in Fig. 3. Some of these blocks overlap CTB boundaries, and
slice boundaries when multiple slices are present. This feature
makes it possible to filter slice boundaries in any order without
affecting the reconstructed picture.
Note that vertical edges are filtered before horizontal edges.
Consequently, modified samples resulting from filtering verti-
cal edges are used in filtering horizontal edges. This allows for
different parallel implementations. In one, all vertical edges
are filtered in parallel, then horizontal edges are filtered in
parallel. Another implementation would enable simultaneous
parallel processing of vertical and horizontal edges, where the
horizontal edge filtering process is delayed in a way such that
the samples to be filtered have already been processed by the
vertical edge filter.
However, there are also aspects of HEVC that increase the
complexity of the filter, such as the addition of clipping in the
strong filter mode.
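
The vertical-then-horizontal pass ordering described above can be sketched as follows. The per-edge decision and filtering are caller-supplied placeholders, and boundary-strength derivation is omitted; the loop structure simply shows why all vertical edges, and subsequently all horizontal edges, can be processed in parallel on the 8 × 8 grid.

```c
#include <stdint.h>

typedef void (*edge_filter_fn)(uint8_t *rec, int stride, int x, int y);

/* Deblocking pass order on the 8x8 filter grid: all vertical edges of
 * the picture are filtered first (each iteration independent), then all
 * horizontal edges, which read the already deblocked output of the
 * vertical pass. Picture-border edges are not filtered. */
static void deblock_picture(uint8_t *rec, int stride, int width, int height,
                            edge_filter_fn filter_ver, edge_filter_fn filter_hor)
{
    /* Pass 1: vertical edges, one every 8 luma columns. */
    for (int y = 0; y < height; y += 8)
        for (int x = 8; x < width; x += 8)
            filter_ver(rec, stride, x, y);

    /* Pass 2: horizontal edges, one every 8 luma rows. */
    for (int y = 8; y < height; y += 8)
        for (int x = 0; x < width; x += 8)
            filter_hor(rec, stride, x, y);
}
```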
G. Sample-Adaptive Offset Filter
Compared to H.264/AVC, where only a deblocking filter
is applied in the decoding loop, the current draft HEVC
specification features an additional sample-adaptive offset
(SAO) filter. This filter represents an additional stage, thereby
increasing complexity.
The SAO filter simply adds offset values to certain sample
values and it can be implemented in a fairly straightforward
way, where the offset to be added to each sample may be
obtained by indexing a small lookup table. The index into
the lookup table may be computed according to one of the
two modes being used. For one of the modes, the so-called
band offset, the sample values are quantized to index the
table, so that all samples lying in one band of the value range
use the same offset. The other mode, edge offset,
requires more operations since it calculates the index based on
differences between the current and two neighboring samples.
Although the operations are simple, SAO represents an added
burden as it may require either an additional decoding pass,
or an increase in line buffers. The offsets are transmitted in
the bitstream and thus need to be derived by an encoder. If
considering all SAO modes, the search process in the encoder
can be expected to require about an order of magnitude more
computation than the SAO decoding process.
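
For the band offset mode, a minimal sketch is shown below, assuming 8-bit samples and 32 bands of width 8; parsing of the signaled offsets and the edge offset mode are not shown.

```c
#include <stdint.h>

/* SAO band-offset filtering for 8-bit samples: each sample value is
 * quantized to one of 32 bands of width 8 and the band index selects
 * an offset from a small table. Most table entries are zero; only a
 * few consecutive bands carry signaled offsets. */
static void sao_band_offset(uint8_t *rec, int stride, int w, int h,
                            const int8_t band_offset[32])
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            int v = rec[y * stride + x];
            v += band_offset[v >> 3];  /* band index = v / 8 */
            rec[y * stride + x] = (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
        }
}
```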
H. High-Level Parallelism
High-level parallelism refers to the ability to simultaneously
process multiple regions of a single picture. Support for such
parallelism may be advantageous to both encoders and de-
coders where multiple identical processing cores may be used
in parallel. HEVC includes three concepts that enable some
degree of high-level parallelism: slices, tiles, and wavefronts.
Slices follow the same concept as in H.264/AVC and allow
a picture to be partitioned into groups of consecutive CTUs
in raster scan order, each for transmission in a separate
network adaptation layer unit that may be parsed and decoded
independently, except for optional interslice filtering. Slices
break prediction dependences at their boundary, which causes
a loss in coding efficiency and can also create visible artifacts
at these borders. The design of slices is more concerned with
error resilience or maximum transmission unit size matching
than a parallel coding technique, although it has undoubtedly
been exploited for this purpose in the past.
Tiles can be used to split a picture horizontally and vertically
into multiple rectangular regions. Like slices, tiles break
prediction dependences at their boundaries. Within a picture,
consecutive tiles are represented in raster scan order. The
scan order of CTBs remains a raster scan, but is limited to
the confines of each tile boundary. When splitting a picture

Fig. 4. Example of wavefront processing. Each CTB row can be processed
in parallel. For processing the striped CTB in each row, the processing of the
shaded CTBs in the row above needs to be finished.
horizontally, tiles may be used to reduce line buffer sizes in an
encoder, as it operates on regions narrower than a full picture.
Tiles also permit the composition of a picture from multiple
rectangular sources that are encoded independently.
Wavefronts split a picture into CTU rows, where each CTU
row may be processed in a different thread. Dependences
between rows are maintained except for the CABAC context
state, which is reinitialized at the beginning of each CTU row.
To improve the compression efficiency, rather than performing
a normal CABAC reinitialization, the context state is inherited
from the second CTU of the previous row, permitting a simple
form of 2-D adaptation. Fig. 4 illustrates this process.
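
The dependency illustrated in Fig. 4 can be captured in a small readiness test, sketched below under the assumption that a per-row counter tracks completed CTUs; the thread pool, the actual CTU decoding, and the saving and restoring of the CABAC context state are omitted.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Wavefront scheduling sketch. ctus_done[r] counts how many CTUs of
 * row r have been fully decoded. The CTU at column c of row r may
 * start once the row above has finished its first c + 2 CTUs, i.e.,
 * up to and including the above-right neighbor; by then the CABAC
 * context state saved after the second CTU of the row above is also
 * available for initializing row r. */
static bool ctu_ready(atomic_int *ctus_done, int row, int col)
{
    if (row == 0)
        return true;
    return atomic_load(&ctus_done[row - 1]) >= col + 2;
}
```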
To enable a decoder to exploit parallel processing of tiles
and wavefronts, it must be possible to identify the position in
the bitstream where each tile or slice starts. This overhead is
kept to a minimum by providing a table of offsets, describing
the entry point of each tile or slice. While it may seem
excessive to signal every entry point without the option to
omit some, in the case of tiles, their presence allows decoder
designers to choose between decoding each tile individually
following the per tile raster scan, or decoding CTUs in the
picture raster scan order. As for wavefronts, requiring there to
be as many wavefront entry points as CTU rows resolves the
conflict between the optimal number of wavefronts for differ-
ent encoder and decoder architectures, especially in situations
where the encoder has no knowledge of the decoder.
The current draft HEVC standard does not permit the
simultaneous use of tiles and wavefronts when there is more
than one tile per picture. However, neither tiles nor wavefronts
prohibit the use of slices.
It is interesting to examine the implementation burden of
the tile and wavefront tools in the context of a single-core
architecture and that of a multicore architecture. In the case
of a single-core implementation for tiles, the extra overhead
comes in the form of more complicated boundary condition
checking, performing a CABAC reset for each tile, and the need
to perform the optional filtering of tile boundaries. There is
also the potential for improved data-locality and cache access
associated with operating on a subregion of the picture. In a
wavefront implementation, additional storage is required to
save the CABAC context state between CTU rows and to
perform a CABAC reset at the start of each row using this
saved state.
In the case of a multicore implementation, the additional
overhead compared to the single-core case relates to memory-
bandwidth. Since each tile is completely independent, each
processing core may decode any tile with little intercore
communication or synchronization. A complication is the man-
agement of performing in-loop filtering across the tile bound-
aries, which can either be delegated to a postprocess, or with
some loose synchronization and some data exchange, may be
performed on the fly. A multicore wavefront implementation
will require a higher degree of communication between cores
and more frequent synchronization operations than a tile-based
alternative, due to the sharing of reconstructed samples and
mode predictors between CTU rows. The maximum parallel
improvement from a wavefront implementation is limited by
the ramp-up time required for all cores to become fully
utilized and a higher susceptibility to dependency related stalls
between CTB rows.
All high-level parallelization tools become more useful
with image sizes growing beyond HD for both encoder and
decoder. At small image sizes where real-time decoding in a
single-threaded manner is possible, the overhead associated
with parallelization might be too high for there to be any
meaningful benefit. For large image sizes it might be useful to
enforce a minimum number of picture partitions to guarantee
a minimum level of parallelism for the decoder. However, the
current draft HEVC standard does not mandate the use of any
high-level parallelism tools. As such, their use in decoders is
only a benefit to architectures that can opportunistically exploit
them.
I. Miscellaneous
The total amount of memory required for HEVC decoding
can be expected to be similar to that for H.264/AVC decoding.
Most of the memory is required for the decoded picture buffer
that holds multiple pictures. The size of this buffer, as defined
by levels, may be larger in HEVC for a given maximum
picture size. Such an increase in memory requirement is not
a fundamental property of the HEVC design, but comes from
the desire to harmonize the size of the buffer in picture units
across all levels.
HEVC may also require more cache memory due to the
larger block sizes that it supports. In H.264/AVC, macroblocks
of size 16 × 16 define the buffer size required for storing
predictions and residuals. In HEVC, intra picture prediction
and transforms may be of size 32 × 32, and the size of the
associated buffers thus quadruples.
It should also be noted that HEVC lacks coding tools
specific to field coding. The absence of such tools, in particular
tools that enable switching between frame and field coding
within a frame (such as MBAFF in H.264/AVC), considerably
simplifies the design.
J. Summary
While the complexity of some key modules such as trans-
forms, intra picture prediction, and motion compensation is
likely higher in HEVC than in H.264/AVC, complexity was
reduced in others such as entropy coding and deblocking.
Complexity differences in motion compensation, entropy cod-
ing, and in-loop filtering are expected to be the most substan-
tial. The implementation cost of an HEVC decoder is thus

References
[2] Overview of the High Efficiency Video Coding (HEVC) Standard.
[8] Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard.