HEVC Complexity and Implementation Analysis
Citations
Overview of the High Efficiency Video Coding (HEVC) Standard
Efficient Parallel Framework for HEVC Motion Estimation on Many-Core Processors
A Highly Parallel Framework for HEVC Coding Unit Partitioning Tree Decision on Many-core Processors
Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks
Effective CU Size Decision for HEVC Intracoding
References
Overview of the High Efficiency Video Coding (HEVC) Standard
Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard
Frequently Asked Questions (16)
Q2. How is the entropy coding of transform coefficients facilitated?
Determining which bounding subblock of coefficients is nonzero is facilitated by using a 4 × 4 coding structure for the entropy coding of transform coefficients.
Q3. How many buffers are required to hold data?
In HEVC, two 16-bit buffers are required to hold data, whereas in H.264/AVC, one 8-bit buffer and one 16-bit buffer are sufficient.
Q4. How many regular bins are needed to process a coefficient?
When processing large coefficients, the boundary between prefix and suffix can be lowered such that in the worst case a maximum of approximately 1.6 regular bins need to be processed per coefficient [11].
Q5. What are some aspects of HEVC that increase the complexity of the filter?
There are also aspects of HEVC that increase the complexity of the filter, such as the addition of clipping in the strong filter mode.
Q6. How many edges can be processed in parallel?
A picture may be segmented into 8 × 8 blocks that can all be processed in parallel, as only edges internal to these blocks need to be filtered.
Q7. What is the purpose of the design of slices?
The design of slices is more concerned with error resilience or maximum transmission unit size matching than a parallel coding technique, although it has undoubtedly been exploited for this purpose in the past.
Q8. What does the increased number of prediction modes require of an encoder?
From an encoding perspective, the increased number of prediction modes (35 in HEVC versus 9 in H.264/AVC) will require good mode selection heuristics to maintain a reasonable search complexity.
Q9. What is the effect of reducing the number of contexts in HEVC?
This reduction in the number of contexts contributes to lowering the amount of memory required by the entropy decoder and the cost of initializing the engine.
Q10. How many regular transform coefficient bins are there in HEVC?
In HEVC, a new binarization scheme using Golomb-Rice codes reduces the theoretical worst-case number of regular transform coefficient bins from 15 to 3 [10].
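To make the term concrete, here is a minimal sketch of a textbook Golomb-Rice code: a unary prefix for the quotient followed by k fixed bits for the remainder. It is not the HEVC coefficient binarization itself, which has further details not reproduced here; the function names and the bit-string representation are assumptions chosen for illustration.

```python
def golomb_rice_encode(value, k):
    """Encode a non-negative integer as a Golomb-Rice bit string with parameter k."""
    if value < 0 or k < 0:
        raise ValueError("value and k must be non-negative")
    quotient = value >> k                 # coded in unary
    remainder = value & ((1 << k) - 1)    # coded in k fixed bits
    prefix = "1" * quotient + "0"
    suffix = format(remainder, "0{}b".format(k)) if k > 0 else ""
    return prefix + suffix


def golomb_rice_decode(bits, k):
    """Decode one Golomb-Rice codeword from the start of a bit string."""
    pos = 0
    quotient = 0
    while bits[pos] == "1":               # read the unary prefix
        quotient += 1
        pos += 1
    pos += 1                              # skip the terminating "0"
    remainder = int(bits[pos:pos + k], 2) if k > 0 else 0
    return (quotient << k) | remainder


print(golomb_rice_encode(9, 2))           # quotient 2, remainder 1
```

A larger k shortens the unary prefix for large values at the cost of more fixed bits for small ones, which is why such codes suit coefficient magnitudes of varying size.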
Q11. What are the key modules that are likely to be more complex in HEVC than in H.264/AVC?
While the complexity of some key modules such as transforms, intra picture prediction, and motion compensation is likely higher in HEVC than in H.264/AVC, complexity was reduced in others such as entropy coding and deblocking.
Q12. Why is the matrix multiplication approach preferred in software?
Due to the regular uniform structure of the matrix multiplication and partial butterfly designs, this approach may be preferred in both hardware and software.
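The two designs mentioned here can be compared on the smallest case. The sketch below applies the 4-point HEVC core transform matrix to one row of samples, once as a plain matrix multiplication and once as a partial butterfly that exploits the symmetry of the matrix. It covers a single forward stage only and omits the rounding and shift steps of a real codec; the function names are assumptions for illustration.

```python
# 4-point HEVC core transform matrix (integer approximation of the DCT).
M4 = [
    [64,  64,  64,  64],
    [83,  36, -36, -83],
    [64, -64, -64,  64],
    [36, -83,  83, -36],
]


def transform_matmul(x):
    """Direct matrix multiplication: 16 multiplications for 4 samples."""
    return [sum(c * s for c, s in zip(row, x)) for row in M4]


def transform_butterfly(x):
    """Partial butterfly: even/odd split, 6 multiplications for 4 samples."""
    e0, e1 = x[0] + x[3], x[1] + x[2]     # even part (symmetric rows)
    o0, o1 = x[0] - x[3], x[1] - x[2]     # odd part (antisymmetric rows)
    return [
        64 * (e0 + e1),
        83 * o0 + 36 * o1,
        64 * (e0 - e1),
        36 * o0 - 83 * o1,
    ]


samples = [1, 2, 3, 4]
print(transform_matmul(samples))
print(transform_butterfly(samples))
```

Both functions return the same coefficients; the butterfly trades multiplications for a few extra additions, while the matrix form has a uniform inner loop.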
Q13. Why is a memory increase required for HEVC decoding?
Such an increase in memory requirement is not a fundamental property of the HEVC design, but comes from the desire to harmonize the size of the buffer in picture units across all levels.
Q14. How should the complexity of the HEVC planar mode be assessed?
In the case of the planar mode, consider that the generating equations are probably not adequate to determine complexity, as it is possible to easily incrementally compute predicted sample values.
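The point about incremental computation can be sketched as follows. The first function evaluates the planar generating equation directly for every sample; the second produces the same values using only additions in the inner loop. The equation is written here as recalled from the HEVC specification, and the function and parameter names (`top`, `left`, `top_right`, `bottom_left`) are assumptions for illustration. The block size is assumed to be a power of two.

```python
def planar_direct(top, left, top_right, bottom_left):
    """Evaluate the planar generating equation per sample (4 multiplications each)."""
    n = len(top)
    shift = n.bit_length()                # equals log2(n) + 1 for power-of-two n
    pred = [[0] * n for _ in range(n)]    # indexed as pred[y][x]
    for y in range(n):
        for x in range(n):
            hor = (n - 1 - x) * left[y] + (x + 1) * top_right
            ver = (n - 1 - y) * top[x] + (y + 1) * bottom_left
            pred[y][x] = (hor + ver + n) >> shift
    return pred


def planar_incremental(top, left, top_right, bottom_left):
    """Same result, with the two linear terms updated by running sums."""
    n = len(top)
    shift = n.bit_length()
    ver = [n * t for t in top]            # vertical term one row above the block
    pred = []
    for y in range(n):
        for x in range(n):
            ver[x] += bottom_left - top[x]    # step the vertical term down one row
        hor = n * left[y] + n             # horizontal term left of the block, plus rounding
        row = []
        for x in range(n):
            hor += top_right - left[y]    # step the horizontal term right one sample
            row.append((hor + ver[x]) >> shift)
        pred.append(row)
    return pred


top = [100, 104, 108, 112]
left = [98, 96, 94, 92]
print(planar_direct(top, left, 116, 90) == planar_incremental(top, left, 116, 90))
```

Counting the multiplications in the generating equation would therefore overstate the cost of an implementation written in the incremental style.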
Q15. What is the effect of using a separable 8-tap filter for luma sub-pel positions?
The use of a separable 8-tap filter for luma sub-pel positions leads to an increase in memory bandwidth and in the number of multiply-accumulate operations required for motion compensation.
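The bandwidth effect can be estimated with simple arithmetic: a separable filter with T taps needs T - 1 extra reference samples in each filtered dimension. The sketch below compares the 8-tap case with a 6-tap luma filter such as the one in H.264/AVC, assuming the worst case where both the horizontal and vertical motion vector components are fractional. The function name and the 8 × 8 block size are assumptions for illustration.

```python
def reference_samples(block_w, block_h, taps):
    """Reference samples fetched for one block when both MV components are fractional."""
    return (block_w + taps - 1) * (block_h + taps - 1)


hevc = reference_samples(8, 8, taps=8)    # 15 x 15 window
avc = reference_samples(8, 8, taps=6)     # 13 x 13 window
print(hevc, avc, round(hevc / avc, 2))
```

The relative overhead shrinks as the block grows, so small prediction blocks dominate the worst-case bandwidth.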
Q16. How many cycles does a 32 × 32 inverse transform take?
Although there has been some concern about the implementation complexity of the 32-point transform, data given in [7] indicates 158 cycles for an 8 × 8 inverse transform, 861 cycles for a 16 × 16 inverse transform, and 4696 cycles for a 32 × 32 inverse transform on an Intel processor.
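One way to read these figures is to normalize them by the number of samples in each block. The short sketch below does that arithmetic on the three cycle counts quoted above; the function name is an assumption for illustration.

```python
def cycles_per_sample(size, cycles):
    """Average cycles spent per sample for a size x size inverse transform."""
    return cycles / (size * size)


quoted = {8: 158, 16: 861, 32: 4696}      # cycle counts quoted from [7]
for size, cycles in quoted.items():
    print(size, round(cycles_per_sample(size, cycles), 2))
```

On these numbers the per-sample cost roughly doubles from the 8 × 8 to the 32 × 32 transform, while the block area grows sixteenfold.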