Proceedings Article

An Area and Power Efficient 1-D $4\times 4$ Integer DCT Architecture for HEVC

01 Dec 2017
TL;DR: A one-dimensional (1-D) $4\times 4$ integer DCT architecture for the High Efficiency Video Coding (HEVC) standard that serves as a basic building block for constructing architectures for other transform lengths such as $8\times 8$, $16\times 16$, and $32\times 32$.
Abstract: In this paper, we present a one-dimensional (1-D) $4\times 4$ integer DCT architecture to be used in the High Efficiency Video Coding (HEVC) standard. This architecture serves as a basic building block to construct architectures for different transform lengths such as $8\times 8$, $16\times 16$, and $32\times 32$. Also, 2-D integer DCT architectures can be constructed using the proposed architecture and its scaled versions. The architecture detailed in this paper occupies an area of 1572 square microns and consumes 0.65 mW of power at a maximum operating frequency of 200 MHz. Compared to other such architectures, the proposed design achieves savings of 58.6% in area and 53.9% in power. Compared to the reference algorithm, it saves 66.2% in area and 78.5% in power. Moreover, the proposed architecture offers higher throughput at a lower operating frequency than other existing architectures. With a processing rate of 8 pixels/cycle and a throughput of 1.6 Gsps, the proposed architecture is therefore capable of processing 8K UHD ($7680\times 4320$) video at 30 frames per second, which is an application of HEVC.
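The throughput claim in the abstract can be checked with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the 4:2:0 chroma factor of 1.5 samples per pixel is an assumption that the abstract does not state.

```python
# Rough check of the 8K UHD throughput claim (illustrative sketch only).
# Figures taken from the abstract: 8 pixels/cycle, 200 MHz, 7680x4320 at 30 fps.
# Assumption (not in the abstract): 4:2:0 chroma subsampling, i.e. 1.5 samples/pixel.

def required_sample_rate(width, height, fps, samples_per_pixel=1.5):
    """Samples per second the transform stage must sustain for the given video."""
    return width * height * fps * samples_per_pixel

def architecture_throughput(samples_per_cycle, clock_hz):
    """Samples per second delivered by the architecture."""
    return samples_per_cycle * clock_hz

required = required_sample_rate(7680, 4320, 30)      # about 1.49 Gsps
available = architecture_throughput(8, 200_000_000)  # 1.6 Gsps

print(f"required:  {required / 1e9:.3f} Gsps")
print(f"available: {available / 1e9:.3f} Gsps")
print("sufficient" if available >= required else "insufficient")
```

Under that assumption the required rate is roughly 1.49 Gsps, which sits below the 1.6 Gsps quoted in the abstract.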
References
Journal Article
TL;DR: It is found that the proposed architecture involves nearly 14% less area-delay product (ADP) and 19% less energy per sample (EPS) compared to the direct implementation of the reference algorithm, on average, for integer DCT of lengths 4, 8, 16, and 32.
Abstract: In this paper, we present area- and power-efficient architectures for the implementation of integer discrete cosine transform (DCT) of different lengths to be used in High Efficiency Video Coding (HEVC). We show that an efficient constant matrix-multiplication scheme can be used to derive parallel architectures for 1-D integer DCT of different lengths. We also show that the proposed structure could be reusable for DCT of lengths 4, 8, 16, and 32 with a throughput of 32 DCT coefficients per cycle irrespective of the transform size. Moreover, the proposed architecture could be pruned to reduce the complexity of implementation substantially with only a marginal effect on the coding performance. We propose power-efficient structures for folded and full-parallel implementations of 2-D DCT. From the synthesis result, it is found that the proposed architecture involves nearly 14% less area-delay product (ADP) and 19% less energy per sample (EPS) compared to the direct implementation of the reference algorithm, on average, for integer DCT of lengths 4, 8, 16, and 32. Also, an additional 19% saving in ADP and 20% saving in EPS can be achieved by the proposed pruning algorithm with nearly the same throughput rate. The proposed architecture is found to support ultrahigh-definition (7680 × 4320) video at 60 frames/s, which is one of the applications of HEVC.

184 citations

Proceedings Article
09 Jul 2012
TL;DR: This work proposes a fast computational algorithm of large size integer IDCT, which can support the following video standards: MPEG-2/4, H.264, AVS, VC-1 and HEVC.
Abstract: 4- or 8-point IDCTs are widely used in traditional video coding standards. However, larger-size (16/32-point) IDCTs have been proposed in next-generation video standards such as HEVC. To fulfill this requirement, this work proposes a fast computational algorithm for large-size integer IDCT. A unified VLSI architecture for 4/8/16/32-point integer IDCT is also proposed accordingly. It can support the following video standards: MPEG-2/4, H.264, AVS, VC-1 and HEVC. Multiplierless MCM (Multiple Constant Multiplication) is used for the 4/8-point IDCT. Regular multipliers and a sharing technique are used for the 16/32-point IDCT. The transpose memory uses SRAM instead of the traditional register array to further reduce the hardware overhead. The design can support real-time decoding of a 4Kx2K (4096x2048) 30 fps video sequence at a 191 MHz working frequency, with a 93K gate count and 18944-bit SRAM. We suggest a normalized criterion called design efficiency for comparison with previous works. By this criterion, the design is 31% more efficient than previous work.

75 citations

Journal Article
TL;DR: A new large inverse transform architecture based on hardware reuse for HEVC (High Efficiency Video Coding) is proposed, which is optimized by exploiting a fully recursive and regular butterfly structure to achieve low area.
Abstract: This paper proposes a 16×16 and 32×32 inverse transform architecture for HEVC (High Efficiency Video Coding). The large HEVC transforms of 16×16 and 32×32 suffer from huge computational complexity. To resolve this problem, we propose a new large inverse transform architecture based on hardware reuse. The processing element is optimized by exploiting a fully recursive and regular butterfly structure. To achieve low area, the processing element is implemented with shifters and adders, without multipliers. An implementation of the proposed 2-D inverse transform architecture in 0.18 µm technology achieves a frequency of about 300 MHz and an area of 287 Kgates, and can process 4K (3840×2160) video at 30 fps.
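To illustrate what a butterfly structure built only from shifters and adders can look like, here is a minimal sketch of a 4-point inverse transform. It uses the standard HEVC 4-point core matrix (coefficients 64, 83, 36), but it is not the processing element of the paper above, which targets 16×16 and 32×32 transforms. All function names are hypothetical, and the rounding, scaling, and clipping stages of a real HEVC inverse transform are omitted.

```python
# Multiplierless 4-point inverse integer DCT (illustrative sketch, not the paper's design).
# Assumption: the standard HEVC 4-point core matrix; rounding and scaling are omitted.

HEVC_DCT4 = [
    [64,  64,  64,  64],
    [83,  36, -36, -83],
    [64, -64, -64,  64],
    [36, -83,  83, -36],
]

def mul64(a):
    return a << 6                                 # 64 = 2^6

def mul83(a):
    return (a << 6) + (a << 4) + (a << 1) + a     # 83 = 64 + 16 + 2 + 1

def mul36(a):
    return (a << 5) + (a << 2)                    # 36 = 32 + 4

def idct4_shift_add(y):
    """4-point inverse transform using a butterfly with shifts and adds only."""
    y0, y1, y2, y3 = y
    e0 = mul64(y0 + y2)            # even part
    e1 = mul64(y0 - y2)
    o0 = mul83(y1) + mul36(y3)     # odd part
    o1 = mul36(y1) - mul83(y3)
    return [e0 + o0, e1 + o1, e1 - o1, e0 - o0]

def idct4_matrix(y):
    """Reference: direct multiplication by the transposed core matrix."""
    return [sum(HEVC_DCT4[k][n] * y[k] for k in range(4)) for n in range(4)]

coeffs = [10, -3, 7, 2]
assert idct4_shift_add(coeffs) == idct4_matrix(coeffs)
```

The even/odd split halves the number of constant multiplications compared with the direct matrix product, and each remaining constant reduces to a few shifted additions.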

64 citations

Journal Article
TL;DR: The proposed approximation has nearly the same arithmetic complexity and hardware requirement as those of recently proposed related methods, but involves significantly less error energy and offers better peak signal-to-noise ratio than the others when DCTs of length more than 8 are used.
Abstract: An approximate kernel for the discrete cosine transform (DCT) of length 4 is derived from the 4-point DCT defined by the High Efficiency Video Coding (HEVC) standard and used for the computation of DCT and inverse DCT (IDCT) of power-of-two lengths. There are two reasons for considering the DCT of length 4 as the basic module. First, it allows computation of DCTs of lengths 4, 8, 16, and 32 prescribed by the HEVC. Second, the DCTs generated by the 4-point DCT not only involve lower complexity, but also offer better compression performance. Fully parallel and area-constrained architectures for the proposed approximate DCT are proposed to have flexible tradeoff between the area and time complexities. In addition, a reconfigurable architecture is proposed where an 8-point DCT can be used in place of a pair of 4-point DCTs. Using the same reconfiguration scheme, a 32-point DCT could be configured for parallel computation of two 16-point DCTs or four 8-point DCTs or eight 4-point DCTs. The proposed reconfigurable design can support real-time coding for high-definition video sequences in the 8k ultrahigh-definition television format ( $7680\times 4320$ at 30 frames/s). A unified forward and inverse transform architecture is also proposed where the hardware complexity is reduced by sharing hardware between the DCT and IDCT computations. The proposed approximation has nearly the same arithmetic complexity and hardware requirement as those of recently proposed related methods, but involves significantly less error energy and offers better peak signal-to-noise ratio than the others when DCTs of length more than 8 are used. A detailed comparison of the complexity, energy efficiency, and compression performance of different DCT approximation schemes for video coding is also presented. It is shown that the proposed approximation provides a better compressed-image quality than other approximate DCTs. 
The proposed method can perform HEVC-compliant video coding with marginal degradation of video quality and a slight increase in the bit rate, with a fraction of the computational complexity of the latter.

54 citations


"An Area and Power Efficient 1-D $4\..." refers background or methods in this paper

  • ...They used the algorithm Meher et al. proposed to develop the 4x4 DCT module and used this recursively to build higher length DCTs with sizes 8x8, 16x16 and 32x32....

    [...]

  • ...Algorithm Jridi & Meher [9] Proposed Power (mW)...

    [...]

  • ...9% when compared with the architecture in [9]....

    [...]

  • ...The proposed architecture has the least area when compared with the reference algorithm and the architecture proposed by Meher et al. [2] as shown in Fig....

    [...]

  • ...3 that the proposed architecture uses considerably less power compared to the reference algorithm and the one proposed by Jridi and Meher [9]....

    [...]

Proceedings Article
19 May 2013
TL;DR: A unified architecture for IDCT and DCT is devised through algorithm optimization; one proposed engine provides the throughput for 8K-UHDTV real-time decoding, and it also fully supports real-time encoding of HDTV1080p@20fps at a 311MHz clock speed.
Abstract: A great number of two-dimensional (2D) discrete cosine transforms and Hadamard transforms are executed in HEVC. Toward the goal of a real-time UHDTV codec, a fully pipelined, variable-block-size 2D transform engine with efficient hardware utilization is proposed to handle the DCT/IDCT and Hadamard transforms. The efficiency comes from two aspects. First, the hardware for small-size transforms is fully reused for larger-size transform processing. Second, we devise a unified architecture for IDCT and DCT through algorithm optimization. The maximum clock speed of our design is 311MHz under 90nm technology. Experiments demonstrate that, at a 47MHz clock frequency, one proposed engine provides the throughput for 8K-UHDTV real-time decoding, and it also fully supports real-time encoding of HDTV1080p@20fps at a 311MHz clock speed.

45 citations


"An Area and Power Efficient 1-D $4\..." refers background or result in this paper

  • ...When compared with the 5-stage reference pipeline design, [5] reports a 610% increase in throughput for the 32x32 transform....

    [...]

  • ...It is further shown that at 47 MHz the proposed engine provides the required throughput for 8K UHD video decoding and supports real-time encoding of 1080p at 20 frames per second (fps) with a 311 MHz clock speed [5]....

    [...]