Author

Tung-Chien Chen

Bio: Tung-Chien Chen is an academic researcher from National Taiwan University. The author has contributed to research on the topics Encoder and Motion estimation. The author has an h-index of 22 and has co-authored 62 publications receiving 2323 citations.


Papers
Journal Article•DOI•
TL;DR: This paper proposed two solutions for platform-based design of H.264/AVC intra frame coder with comprehensive analysis of instructions and exploration of parallelism, and proposed a system architecture with four-parallel intra prediction and mode decision to enhance the processing capability.
Abstract: Intra prediction with rate-distortion constrained mode decision is the most important technology in H.264/AVC intra frame coder, which is competitive with the latest image coding standard JPEG2000, in terms of both coding performance and computational complexity. The predictor generation engine for intra prediction and the transform engine for mode decision are critical because the operations require a lot of memory access and occupy 80% of the computation time of the entire intra compression process. A low cost general purpose processor cannot process these operations in real time. In this paper, we proposed two solutions for platform-based design of H.264/AVC intra frame coder. One solution is a software implementation targeted at low-end applications. Context-based decimation of unlikely candidates, subsampling of matching operations, bit-width truncation to reduce the computations, and interleaved full-search/partial-search strategy to stop the error propagation and to maintain the image quality, are proposed and combined as our fast algorithm. Experimental results show that our method can reduce 60% of the computation used for intra prediction and mode decision while keeping the peak signal-to-noise ratio degradation less than 0.3 dB. The other solution is a hardware accelerator targeted at high-end applications. After comprehensive analysis of instructions and exploration of parallelism, we proposed our system architecture with four-parallel intra prediction and mode decision to enhance the processing capability. Hadamard-based mode decision is modified as discrete cosine transform-based version to reduce 40% of memory access. Two-stage macroblock pipelining is also proposed to double the processing speed and hardware utilization. The other features of our design are reconfigurable predictor generator supporting all of the 13 intra prediction modes, parallel multitransform and inverse transform engine, and CAVLC bitstream engine. 
A prototype chip is fabricated with TSMC 0.25-μm CMOS 1P5M technology. Simulation results show that our implementation can process 16 mega-pixels (4096×4096) within 1 s, or namely 720×480 4:2:0 30 Hz video in real time, at the operating frequency of 54 MHz. The transistor count is 429 K, and the core size is only 1.855×1.885 mm².

331 citations
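The Hadamard-based mode decision mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a 4×4 luma block, only three of the intra prediction modes (vertical, horizontal, DC), and an unnormalized sum of absolute transformed differences (SATD) as the cost. The function names are hypothetical, and this is not the paper's fast algorithm or its DCT-based variant.

```python
# 4x4 Hadamard matrix (symmetric, so H == H^T).
H4 = [
    [1, 1, 1, 1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
    [1, -1, 1, -1],
]


def matmul4(a, b):
    """Multiply two 4x4 integer matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def satd4x4(block, pred):
    """Sum of absolute Hadamard-transformed differences (no normalization)."""
    resid = [[block[i][j] - pred[i][j] for j in range(4)] for i in range(4)]
    transformed = matmul4(matmul4(H4, resid), H4)
    return sum(abs(v) for row in transformed for v in row)


def predict(mode, top, left):
    """Build a 4x4 predictor from the row above and the column to the left."""
    if mode == "vertical":
        return [list(top) for _ in range(4)]
    if mode == "horizontal":
        return [[left[i]] * 4 for i in range(4)]
    if mode == "dc":
        dc = (sum(top) + sum(left) + 4) // 8  # rounded mean of 8 neighbors
        return [[dc] * 4 for _ in range(4)]
    raise ValueError(f"unsupported mode: {mode}")


def best_mode(block, top, left, modes=("vertical", "horizontal", "dc")):
    """Return the lowest-cost mode and the cost of every tested mode."""
    costs = {m: satd4x4(block, predict(m, top, left)) for m in modes}
    return min(costs, key=costs.get), costs


# Example: a block whose columns repeat the row above is predicted exactly
# by the vertical mode.
top = [10, 20, 30, 40]
left = [10, 10, 10, 10]
block = [list(top) for _ in range(4)]
mode, costs = best_mode(block, top, left)
```

A real encoder would add a rate term for signalling the mode; the sketch compares distortion only.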

Journal Article•DOI•
TL;DR: The four-stage macroblock pipelined system architecture is proposed with an efficient scheduling and memory hierarchy, and the prototype chip of the efficient H.264/AVC video encoder for HDTV applications is implemented.
Abstract: H.264/AVC significantly outperforms previous video coding standards with many new coding tools. However, the better performance comes at the price of the extraordinarily huge computational complexity and memory access requirement, which makes it difficult to design a hardwired encoder for real-time applications. In addition, due to the complex, sequential, and highly data-dependent characteristics of the essential algorithms in H.264/AVC, both the pipelining and the parallel processing techniques are constrained to be employed. The hardware utilization and throughput are also decreased because of the block/MB/frame-level reconstruction loops. In this paper, we describe our techniques to design the H.264/AVC video encoder for HDTV applications. On the system design level, in consideration of the characteristics of the key components and the reconstruction loops, the four-stage macroblock pipelined system architecture is first proposed with an efficient scheduling and memory hierarchy. On the module design level, the design considerations of the significant modules are addressed followed by the hardware architectures, including low-bandwidth integer motion estimation, parallel fractional motion estimation, reconfigurable intrapredictor generator, dual-buffer block-pipelined entropy coder, and deblocking filter. With these techniques, the prototype chip of the efficient H.264/AVC encoder is implemented with 922.8 K logic gates and 34.72-KB SRAM at 108-MHz operation frequency.

295 citations

Journal Article•DOI•
TL;DR: Two hardware architectures are proposed that can support traditional fixed block-size motion estimation as well as VBSME with less chip area overhead compared to previous approaches and an eight-parallel SAD tree with a shared reference buffer for H.264/AVC integer motion estimation is proposed.
Abstract: Variable block-size motion estimation (VBSME) has become an important video coding technique, but it increases the difficulty of hardware design. In this paper, we use inter-/intra-level classification and various data flows to analyze the impact of supporting VBSME in different hardware architectures. Furthermore, we propose two hardware architectures that can support traditional fixed block-size motion estimation as well as VBSME with less chip area overhead compared to previous approaches. By broadcasting reference pixel rows and propagating partial sums of absolute differences (SADs), the first design has fewer reference pixel registers and a shorter critical path. The second design utilizes a two-dimensional distortion array and one adder tree with the reference buffer that can maximize the data reuse between successive searching candidates. The first design is suitable for low resolution or a small search range, and the second design has advantages of supporting a high degree of parallelism and VBSME. Finally, we propose an eight-parallel SAD tree with a shared reference buffer for H.264/AVC integer motion estimation (IME). Its processing ability is eight times that of a single SAD tree, but the reference buffer size is only doubled. Moreover, the most critical issue of H.264 IME, which is huge memory bandwidth, is overcome. We are able to save 99.9% of off-chip memory bandwidth and 99.22% of on-chip memory bandwidth. We demonstrate a 720-p, 30-fps solution at 108 MHz with 330.2k gate count and 208k bits on-chip memory.

269 citations
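The idea behind supporting variable block sizes with little extra cost, as described in the abstract above, is that SADs of large blocks are sums of SADs of smaller ones. The following is a minimal software sketch of that reuse for one 16×16 macroblock and one search candidate. The function names are hypothetical, and it says nothing about the paper's actual data flows or adder-tree layout.

```python
def sad_4x4_grid(cur, ref):
    """Return a 4x4 grid of SADs, one per 4x4 sub-block of a 16x16 block."""
    grid = [[0] * 4 for _ in range(4)]
    for y in range(16):
        for x in range(16):
            grid[y // 4][x // 4] += abs(cur[y][x] - ref[y][x])
    return grid


def merge_sads(grid):
    """Derive larger-partition SADs by adding the 4x4 partial sums."""
    s8x8 = [[grid[2 * i][2 * j] + grid[2 * i][2 * j + 1]
             + grid[2 * i + 1][2 * j] + grid[2 * i + 1][2 * j + 1]
             for j in range(2)] for i in range(2)]
    s16x8 = [s8x8[i][0] + s8x8[i][1] for i in range(2)]  # top, bottom halves
    s8x16 = [s8x8[0][j] + s8x8[1][j] for j in range(2)]  # left, right halves
    s16x16 = s16x8[0] + s16x8[1]
    return {"8x8": s8x8, "16x8": s16x8, "8x16": s8x16, "16x16": s16x16}


# Example: every pixel differs by 2, so each 4x4 SAD is 16 * 2 = 32.
cur = [[5] * 16 for _ in range(16)]
ref = [[3] * 16 for _ in range(16)]
grid = sad_4x4_grid(cur, ref)
sads = merge_sads(grid)
```

The pixel differences are computed once per candidate; all partition sizes are then obtained with additions only.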

Proceedings Article•DOI•
29 Aug 2005
TL;DR: An H.264/AVC encoder is implemented on a 31.72 mm² die with 0.18-μm CMOS technology and the encoded video quality is competitive with reference software requiring 3.6 TOPS on a general-purpose processor-based platform.
Abstract: An H.264/AVC encoder is implemented on a 31.72 mm² die with 0.18-μm CMOS technology. A four-stage macroblock pipelined architecture encodes 720p 30 f/s HDTV videos in real time at 108 MHz. The encoded video quality is competitive with reference software requiring 3.6 TOPS on a general-purpose processor-based platform.

142 citations
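The real-time figure quoted in the abstract above implies a cycle budget per macroblock, which can be checked with a short calculation. The sketch assumes 720p means 1280×720 luma samples and 16×16 macroblocks; the function name is hypothetical.

```python
def cycles_per_macroblock(width, height, fps, clock_hz, mb_size=16):
    """Average number of clock cycles available per macroblock."""
    mbs_per_frame = (width // mb_size) * (height // mb_size)
    return clock_hz / (mbs_per_frame * fps)


# 1280x720 at 30 frames/s and 108 MHz:
# 80 * 45 = 3600 macroblocks per frame, 108,000 macroblocks per second.
budget = cycles_per_macroblock(1280, 720, 30, 108_000_000)
print(budget)  # 1000.0
```

In a macroblock pipeline the stages work on different macroblocks at the same time, so under these assumptions each stage would have roughly this many cycles for its macroblock.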

Proceedings Article•DOI•
17 May 2004
TL;DR: A new VLSI architecture for fractional motion estimation of the H.264/AVC video compression standard is contributed, characterized by a reusable feature, that can support situations in different specifications, multiple standards, fast algorithms and some cost considerations.
Abstract: We contributed a new VLSI architecture for fractional motion estimation of the H.264/AVC video compression standard. Seven inter-related loops extracted from the complex procedure are analyzed and two decomposing techniques are proposed to parallelize the algorithm for hardware with a regular schedule and full utilization. The proposed architecture, also characterized by a reusable feature, can support situations in different specifications, multiple standards, fast algorithms and some cost considerations. H.264/AVC baseline profile level 3 with complete Lagrangian mode decision can be realized with 290K gates at operating frequency of 100 MHz. It is a useful intellectual property (IP) design for platform based multimedia systems.

128 citations


Cited by
Proceedings Article•DOI•
19 Jun 2010
TL;DR: The sources of these performance and energy overheads in general-purpose processing systems are explored by quantifying the overheads of a 720p HD H.264 encoder running on a general-purpose CMP system and exploring methods to eliminate these overheads by transforming the CPU into a specialized system for H.264 encoding.
Abstract: Due to their high volume, general-purpose processors, and now chip multiprocessors (CMPs), are much more cost effective than ASICs, but lag significantly in terms of performance and energy efficiency. This paper explores the sources of these performance and energy overheads in general-purpose processing systems by quantifying the overheads of a 720p HD H.264 encoder running on a general-purpose CMP system. It then explores methods to eliminate these overheads by transforming the CPU into a specialized system for H.264 encoding. We evaluate the gains from customizations useful to broad classes of algorithms, such as SIMD units, as well as those specific to particular computation, such as customized storage and functional units. The ASIC is 500x more energy efficient than our original four-processor CMP. Broadly applicable optimizations improve performance by 10x and energy by 7x. However, the very low energy costs of actual core ops (100s fJ in 90nm) mean that over 90% of the energy used in these solutions is still "overhead". Achieving ASIC-like performance and efficiency requires algorithm-specific optimizations. For each sub-algorithm of H.264, we create a large, specialized functional unit that is capable of executing 100s of operations per instruction. This improves performance and energy by an additional 25x and the final customized CMP matches an ASIC solution's performance within 3x of its energy and within comparable area.

460 citations

Journal Article•DOI•
TL;DR: This study proves that the time complexity of the EMD/EEMD is actually equivalent to that of the Fourier Transform.
Abstract: It has been claimed that the empirical mode decomposition (EMD) and its improved version the ensemble EMD (EEMD) are computation intensive. In this study we will prove that the time complexity of the EMD/EEMD, which has never been analyzed before, is actually equivalent to that of the Fourier Transform. Numerical examples are presented to verify that EMD/EEMD is, in fact, a computationally efficient method.

324 citations

Proceedings Article•DOI•
29 Dec 2011
TL;DR: Experimental results show that the fast intra mode decision scheme provides 20% and 28% time savings on average in the intra high efficiency and low complexity cases, respectively, with almost the same coding efficiency.
Abstract: As the next generation standard of video coding, the High Efficiency Video Coding (HEVC) is intended to provide significantly better coding efficiency than all existing video coding standards. To improve the coding efficiency of intra frame coding, up to 34 intra prediction modes are defined in HEVC. The best mode among these pre-defined intra prediction modes is selected by rate-distortion optimization (RDO) for each block. If all directions are tested in the RDO process, it will be very time-consuming. To alleviate the encoder computation load, this paper proposes a new method to reduce the candidates in RDO process. In addition, the direction information of the neighboring blocks is made full use of to speed up intra mode decision. Experimental results show that the proposed scheme provides 20% and 28% time savings in intra high efficiency and low complexity cases on average compared to the default encoding scheme in HM 1.0 with almost the same coding efficiency. This algorithm has been proposed to HEVC standard and partially adopted into the HEVC test model.

311 citations
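The two ideas in the abstract above, shrinking the set of modes that go through full rate-distortion optimization and reusing the directions of neighboring blocks, can be sketched as follows. This is only an illustration under assumptions: the rough pre-selection cost, the candidate count, and all function names are hypothetical and are not taken from the paper or from the HM software.

```python
def select_rdo_candidates(rough_costs, neighbor_modes, num_candidates=3):
    """Keep the cheapest modes by rough cost, then add neighbor modes."""
    ranked = sorted(rough_costs, key=rough_costs.get)
    candidates = ranked[:num_candidates]
    for mode in neighbor_modes:
        # Neighbor directions are likely choices, so test them as well.
        if mode in rough_costs and mode not in candidates:
            candidates.append(mode)
    return candidates


def choose_mode(rough_costs, neighbor_modes, full_rd_cost, num_candidates=3):
    """Run the expensive cost function only on the reduced candidate set."""
    candidates = select_rdo_candidates(rough_costs, neighbor_modes,
                                       num_candidates)
    return min(candidates, key=full_rd_cost)


# Example with made-up costs for five modes.
rough = {0: 50, 1: 10, 2: 30, 3: 20, 4: 40}
full = {0: 45, 1: 9, 2: 31, 3: 8, 4: 7}
candidates = select_rdo_candidates(rough, neighbor_modes=[4, 1],
                                   num_candidates=2)
chosen = choose_mode(rough, [4, 1], full.__getitem__, num_candidates=2)
```

Here only three of the five modes reach the full cost evaluation, and the neighbor mode 4 wins even though its rough cost ranked it low.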

Journal Article•DOI•
Seung-Hyun Cho, Munchurl Kim
TL;DR: A fast CU splitting and pruning method is presented for HEVC intra coding, which allows for significant reduction in computational complexity with small degradations in rate-distortion (RD) performance.
Abstract: High Efficiency Video Coding (HEVC), a new video coding standard currently being established, adopts a quadtree-based Coding Unit (CU) block partitioning structure that is flexible in adapting various texture characteristics of images. However, this causes a dramatic increase in computational complexity compared to previous video coding standards due to the necessity of finding the best CU partitions. In this paper, a fast CU splitting and pruning method is presented for HEVC intra coding, which allows for significant reduction in computational complexity with small degradations in rate-distortion (RD) performance. The proposed fast splitting and pruning method is performed in two complementary steps: 1) early CU split decision and 2) early CU pruning decision. For CU blocks, the early CU splitting and pruning tests are performed at each CU depth level according to a Bayes decision rule method based on low-complexity RD costs and full RD costs, respectively. The statistical parameters for the early CU split and pruning tests are periodically updated on the fly for each CU depth level to cope with varying signal characteristics. Experimental results show that our proposed fast CU splitting and pruning method reduces the computational complexity of the current HM to about 50% in encoding time with only 0.6% increases in BD rate.

306 citations
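The early split test described in the abstract above relies on a Bayes decision rule with statistics updated on the fly. A minimal sketch of such a rule is given below. It assumes a single scalar feature (an RD cost), Gaussian class-conditional densities, and priors taken from observed class frequencies; these modelling choices and all names are assumptions for illustration, not the paper's method.

```python
import math


class ClassStats:
    """Running mean and variance of one class (Welford's update)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def var(self):
        # Floor the variance so the log-density stays finite.
        return max(self.m2 / self.n, 1e-9) if self.n > 1 else 1.0


def log_gauss(x, mean, var):
    """Log of the Gaussian density N(x; mean, var)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def early_split(cost, split, nonsplit):
    """Decide 'split' when its posterior exceeds that of 'non-split'."""
    if split.n == 0 or nonsplit.n == 0:
        return False  # not enough statistics yet; fall back to full search
    total = split.n + nonsplit.n
    log_post_split = (math.log(split.n / total)
                      + log_gauss(cost, split.mean, split.var))
    log_post_nonsplit = (math.log(nonsplit.n / total)
                         + log_gauss(cost, nonsplit.mean, nonsplit.var))
    return log_post_split > log_post_nonsplit


# Example with made-up training costs for one CU depth level.
split_stats, nonsplit_stats = ClassStats(), ClassStats()
for c in (100, 110, 90, 105):
    split_stats.update(c)
for c in (10, 12, 8, 11):
    nonsplit_stats.update(c)
```

With these statistics, `early_split(95, split_stats, nonsplit_stats)` returns `True` and `early_split(11, split_stats, nonsplit_stats)` returns `False`. Keeping one pair of `ClassStats` objects per CU depth and refreshing them periodically would mirror the per-depth, on-the-fly updates the abstract mentions.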
