Author

Tu-Chih Wang

Bio: Tu-Chih Wang is an academic researcher from National Taiwan University. The author has contributed to research on the topics of motion estimation and encoders. The author has an h-index of 15 and has co-authored 25 publications receiving 1295 citations.

Topics: Motion estimation, Encoder, Macroblock, JPEG 2000, MPEG-4 ...

Papers

Journal Article

Analysis and architecture design of variable block-size motion estimation for H.264/AVC

Ching-Yeh Chen¹, Shao-Yi Chien¹, Yu-Wen Huang¹, Tung-Chien Chen¹, Tu-Chih Wang¹, Liang-Gee Chen¹

National Taiwan University¹

27 Mar 2006 - IEEE Transactions on Circuits and Systems I: Regular Papers

TL;DR: Two hardware architectures are proposed that support traditional fixed block-size motion estimation as well as VBSME with less chip area overhead than previous approaches. An eight-parallel SAD tree with a shared reference buffer is also proposed for H.264/AVC integer motion estimation.

Abstract: Variable block-size motion estimation (VBSME) has become an important video coding technique, but it increases the difficulty of hardware design. In this paper, we use inter-/intra-level classification and various data flows to analyze the impact of supporting VBSME in different hardware architectures. Furthermore, we propose two hardware architectures that can support traditional fixed block-size motion estimation as well as VBSME with less chip area overhead than previous approaches. By broadcasting reference pixel rows and propagating partial sums of absolute differences (SADs), the first design has fewer reference pixel registers and a shorter critical path. The second design uses a two-dimensional distortion array and one adder tree with a reference buffer that maximizes data reuse between successive search candidates. The first design is suitable for low resolution or a small search range, and the second design has the advantages of supporting a high degree of parallelism and VBSME. Finally, we propose an eight-parallel SAD tree with a shared reference buffer for H.264/AVC integer motion estimation (IME). Its processing ability is eight times that of a single SAD tree, but the reference buffer size is only doubled. Moreover, the most critical issue of H.264 IME, its huge memory bandwidth, is overcome: we are able to save 99.9% of off-chip memory bandwidth and 99.22% of on-chip memory bandwidth. We demonstrate a 720p, 30-fps solution at 108 MHz with a 330.2k gate count and 208 kbits of on-chip memory.
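The idea of reusing partial SADs for VBSME can be illustrated in software. The following is a minimal sketch, not the architecture from the paper: it assumes the standard H.264 partitioning of a 16×16 macroblock into seven block sizes (41 blocks in total), and the function names `sad_4x4_grid` and `variable_block_sads` are hypothetical. The sixteen 4×4 SADs are computed once per search candidate and then summed to obtain the SADs of every larger block.

```python
def sad_4x4_grid(cur, ref):
    """Return a 4x4 grid of SADs, one per 4x4 sub-block of a 16x16 macroblock.

    cur and ref are 16x16 lists of pixel values (current block and one
    search candidate from the reference frame).
    """
    grid = [[0] * 4 for _ in range(4)]
    for y in range(16):
        for x in range(16):
            grid[y // 4][x // 4] += abs(cur[y][x] - ref[y][x])
    return grid


def variable_block_sads(grid):
    """Merge the sixteen 4x4 SADs into SADs for every H.264 block size.

    Keys are (width, height) in pixels; values list the SADs of the
    blocks of that size in raster order.
    """
    sads = {}
    for w, h in [(4, 4), (8, 4), (4, 8), (8, 8), (16, 8), (8, 16), (16, 16)]:
        bw, bh = w // 4, h // 4  # block size in units of 4x4 sub-blocks
        sads[(w, h)] = [
            sum(grid[gy + dy][gx + dx] for dy in range(bh) for dx in range(bw))
            for gy in range(0, 4, bh)
            for gx in range(0, 4, bw)
        ]
    return sads
```

Because the larger-block SADs are plain sums of the 4×4 results, supporting all block sizes adds only a few additions per candidate on top of the 256 absolute differences.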

269 citations

Proceedings Article

A 1.3TOPS H.264/AVC single-chip encoder for HDTV applications

Yu-Wen Huang¹, Tung-Chien Chen¹, Chen-Han Tsai¹, Ching-Yeh Chen¹, To-Wei Chen¹, Chi-Shi Chen, Chun-Fu Shen, Shyh-Yih Ma, Tu-Chih Wang, Bing-Yu Hsieh, Hung-Chi Fang, Liang-Gee Chen

National Taiwan University¹

29 Aug 2005

TL;DR: An H.264/AVC encoder is implemented on a 31.72 mm² die with 0.18 µm CMOS technology, and the encoded video quality is competitive with reference software requiring 3.6 TOPS on a general-purpose processor-based platform.

Abstract: An H.264/AVC encoder is implemented on a 31.72 mm² die with 0.18 µm CMOS technology. A four-stage macroblock pipelined architecture encodes 720p 30 frames/s HDTV videos in real time at 108 MHz. The encoded video quality is competitive with reference software requiring 3.6 TOPS on a general-purpose processor-based platform.
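The real-time figures in the abstract imply a per-macroblock cycle budget. The sketch below is a back-of-the-envelope calculation, assuming 16×16 macroblocks that tile the frame exactly; the helper name `macroblock_cycle_budget` is hypothetical, and the resulting budget is derived here rather than quoted from the paper.

```python
def macroblock_cycle_budget(width, height, fps, clock_hz, mb_size=16):
    """Return (macroblocks per frame, macroblocks per second, cycles per macroblock)."""
    mbs_per_frame = (width // mb_size) * (height // mb_size)
    mbs_per_second = mbs_per_frame * fps
    return mbs_per_frame, mbs_per_second, clock_hz / mbs_per_second


# 720p at 30 frames/s with a 108 MHz clock, as stated in the abstract.
print(*macroblock_cycle_budget(1280, 720, 30, 108_000_000))  # 3600 108000 1000.0
```

Under these assumptions, each pipeline stage has about 1000 clock cycles to finish its work on one macroblock.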

142 citations

Proceedings Article

Architecture design for deblocking filter in H.264/JVT/AVC

Yu-Wen Huang¹, To-Wei Chen¹, Bing-Yu Hsieh¹, Tu-Chih Wang¹, Te-Hao Chang¹, Liang-Gee Chen²

National Taiwan University¹, Institute for Infocomm Research Singapore²

06 Jul 2003

TL;DR: An efficient VLSI architecture for the deblocking filter in H.264/JVT/AVC is presented, using an array of 8×4 8-bit shift registers with a reconfigurable data path to support both horizontal filtering and vertical filtering on the same circuit.

Abstract: This paper presents an efficient VLSI architecture for the deblocking filter in H.264/JVT/AVC. We use an array of 8×4 8-bit shift registers with a reconfigurable data path to support both horizontal filtering and vertical filtering on the same circuit (a parallel-in parallel-out reconfigurable FIR filter). Two SRAM modules are carefully organized not only for the storage of current macroblock data and adjacent block data but also for the efficient access of pixels in different blocks. Simulation results show that under 0.25 µm technology, the synthesized logic gate count is only 19.1 K (not including a 96×32 SRAM and a 64×32 SRAM) when the maximum frequency is 100 MHz. Our architecture design can easily support real-time deblocking of 720p (1280×720) 30 Hz video. It is valuable for platform-based design of H.264 codecs.
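Sharing one filter between the two filtering directions can be pictured in software as transposing the data around a single edge-filter routine. The sketch below is only an analogy under stated assumptions: the smoothing rule is a placeholder and is not the H.264 deblocking filter, and the function names are hypothetical.

```python
def transpose(block):
    """Swap rows and columns of a rectangular block."""
    return [list(col) for col in zip(*block)]


def smooth_vertical_edge(block):
    """Filter across the vertical edge between columns 3 and 4 of 8-wide rows.

    Placeholder rule: the two pixels next to the edge are replaced by
    rounded weighted averages. This is NOT the H.264 deblocking filter.
    """
    out = [row[:] for row in block]
    for row in out:
        p0, q0 = row[3], row[4]
        row[3] = (3 * p0 + q0 + 2) // 4
        row[4] = (p0 + 3 * q0 + 2) // 4
    return out


def smooth_horizontal_edge(block):
    """Filter across the horizontal edge between rows 3 and 4.

    Reuses the vertical-edge routine by transposing the 8-row block,
    filtering, and transposing back.
    """
    return transpose(smooth_vertical_edge(transpose(block)))
```

Only the data routing changes between the two directions, while the arithmetic is shared, which is the property the reconfigurable data path in the abstract is designed to provide in hardware.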

138 citations

Proceedings Article

Hardware architecture design for variable block size motion estimation in MPEG-4 AVC/JVT/ITU-T H.264

Yu-Wen Huang¹, Tu-Chih Wang¹, Bing-Yu Hsieh¹, Liang-Gee Chen¹

National Taiwan University¹

25 May 2003

TL;DR: A new hardware architecture for variable block size motion estimation with full search at integer-pixel accuracy is proposed; it can run in real time at an operating frequency of 64.11 MHz for 720×480 frames at 30 Hz.

Abstract: Variable block size motion estimation is adopted in the new video coding standard, MPEG-4 AVC/JVT/ITU-T H.264, due to its superior performance compared to the advanced prediction mode in MPEG-4 and H.263+. In this paper, we modified the reference software in a hardware-friendly way. Our main idea is to convert the sequential processing of each 8×8 sub-partition of a macroblock into parallel processing without sacrificing video quality. Based on our algorithm, we proposed a new hardware architecture for variable block size motion estimation with full search at integer-pixel accuracy. The features of our design are a 2-D processing element array with 1-D data broadcasting and 1-D partial result reuse, a parallel adder tree, a memory interleaving scheme, and high utilization. Simulation shows that our chip can run in real time at an operating frequency of 64.11 MHz for 720×480 frames at 30 Hz with a search range of [-24, +23] in the horizontal direction and [-16, +15] in the vertical direction, which requires a computation power of more than 50 GOPS.
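The workload behind these figures can be estimated from the frame size and search range. The sketch below is a rough calculation, assuming 16×16 macroblocks and one absolute difference per pixel per search candidate; the helper name `full_search_workload` is hypothetical, and how the paper counts operations to reach its GOPS figure is not stated in the abstract.

```python
def full_search_workload(width, height, fps, h_range, v_range, clock_hz, mb_size=16):
    """Estimate the full-search workload for integer motion estimation.

    h_range and v_range are inclusive (min, max) displacement ranges.
    Returns (candidates per macroblock, macroblocks per second,
    absolute differences per second, clock cycles per macroblock).
    """
    candidates = (h_range[1] - h_range[0] + 1) * (v_range[1] - v_range[0] + 1)
    mbs_per_second = (width // mb_size) * (height // mb_size) * fps
    abs_diffs_per_second = candidates * mb_size * mb_size * mbs_per_second
    cycles_per_mb = clock_hz / mbs_per_second
    return candidates, mbs_per_second, abs_diffs_per_second, cycles_per_mb


candidates, mbs, abs_diffs, cycles = full_search_workload(
    720, 480, 30, (-24, 23), (-16, 15), 64_110_000
)
```

With the abstract's parameters this gives 1536 candidates per macroblock and about 15.9 billion absolute differences per second. Each absolute difference involves several primitive operations (subtract, absolute value, accumulate), so a total in the tens of GOPS is the expected order of magnitude. The budget of roughly 1583 cycles per macroblock is slightly above the candidate count, which would be consistent with evaluating about one candidate per cycle, although the abstract does not say so explicitly.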

135 citations

Proceedings Article

Parallel 4×4 2D transform and inverse transform architecture for MPEG-4 AVC/H.264

Tu-Chih Wang¹, Yu-Wen Huang¹, Hung-Chi Fang¹, Liang-Gee Chen¹

National Taiwan University¹

25 May 2003

TL;DR: A hardware architecture for accelerating transform coding operations in MPEG-4 AVC/H.264 is presented; it has been mapped into a 4×4 multiple transforms unit and synthesized in TSMC 0.35 µm technology.

Abstract: Transform coding has been widely used in video coding standards. In this paper, a hardware architecture for accelerating transform coding operations in MPEG-4 AVC/H.264 is presented. This architecture calculates 4 inputs in parallel by fast algorithms described previously. The transpose operations are implemented by a register array with directional transfers. This architecture has been mapped into a 4×4 multiple transforms unit and synthesized in TSMC 0.35 µm technology. The multiple transform processor can process 320 Mpixels/s at 80 MHz for all 4×4 transforms used in MPEG-4 AVC/H.264.
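The row-transform, transpose, row-transform structure described in the abstract can be sketched in software. The example below shows only the H.264 forward 4×4 core transform (scaling and quantization omitted), using the standard integer matrix and a butterfly form that needs only additions and shifts by one; it is an illustration under these assumptions, not the paper's circuit, and the function names are hypothetical.

```python
# H.264 forward 4x4 core transform matrix (without scaling).
CF = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]


def transpose(m):
    """Swap rows and columns of a square matrix."""
    return [list(col) for col in zip(*m)]


def forward_1d(a, b, c, d):
    """Transform four samples with butterflies instead of multiplications."""
    s0, s1 = a + d, b + c
    d0, d1 = a - d, b - c
    return [s0 + s1, 2 * d0 + d1, s0 - s1, d0 - 2 * d1]


def forward_4x4(block):
    """2-D transform: 1-D pass over rows, transpose, 1-D pass, transpose back."""
    rows = [forward_1d(*r) for r in block]
    cols = [forward_1d(*c) for c in transpose(rows)]
    return transpose(cols)


def forward_4x4_reference(block):
    """Direct matrix form CF * X * CF^T, used to cross-check the fast version."""
    def matmul(a, b):
        return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
                for i in range(4)]
    return matmul(matmul(CF, block), transpose(CF))
```

Processing four samples per pass matches the "4 inputs in parallel" in the abstract: at 80 MHz, four samples per cycle gives the quoted 320 Mpixels/s.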

125 citations

Cited by

Proceedings Article

Understanding sources of inefficiency in general-purpose chips

Rehan Hameed¹, Wajahat Qadeer¹, Megan Wachs¹, Omid Azizi¹, Alex Solomatnikov, Benjamin C. Lee¹, Stephen Richardson¹, Christos Kozyrakis¹, Mark Horowitz¹

Stanford University¹

19 Jun 2010

TL;DR: The sources of performance and energy overheads in general-purpose processing systems are explored by quantifying the overheads of a 720p HD H.264 encoder running on a general-purpose CMP system, and methods to eliminate these overheads by transforming the CPU into a specialized system for H.264 encoding are examined.

Abstract: Due to their high volume, general-purpose processors, and now chip multiprocessors (CMPs), are much more cost effective than ASICs, but lag significantly in terms of performance and energy efficiency. This paper explores the sources of these performance and energy overheads in general-purpose processing systems by quantifying the overheads of a 720p HD H.264 encoder running on a general-purpose CMP system. It then explores methods to eliminate these overheads by transforming the CPU into a specialized system for H.264 encoding. We evaluate the gains from customizations useful to broad classes of algorithms, such as SIMD units, as well as those specific to particular computation, such as customized storage and functional units. The ASIC is 500x more energy efficient than our original four-processor CMP. Broadly applicable optimizations improve performance by 10x and energy by 7x. However, the very low energy costs of actual core ops (100s fJ in 90nm) mean that over 90% of the energy used in these solutions is still "overhead". Achieving ASIC-like performance and efficiency requires algorithm-specific optimizations. For each sub-algorithm of H.264, we create a large, specialized functional unit that is capable of executing 100s of operations per instruction. This improves performance and energy by an additional 25x and the final customized CMP matches an ASIC solution's performance within 3x of its energy and within comparable area.
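The energy figures in this abstract can be checked against each other with a short calculation. The sketch below assumes the quoted gains compose multiplicatively and that the "additional 25x" applies to energy in full; the function name `remaining_energy_gap` is hypothetical.

```python
from math import prod


def remaining_energy_gap(asic_advantage, *gains):
    """Energy gap to the ASIC that is left after applying the given gains."""
    return asic_advantage / prod(gains)


# ASIC is 500x more efficient; broad optimizations give 7x, specific ones 25x.
print(round(remaining_energy_gap(500, 7, 25), 2))  # 2.86
```

Under these assumptions the customized CMP ends up about 2.9x behind the ASIC in energy, which agrees with the "within 3x" stated in the abstract.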

460 citations

Journal Article

Digital Video Transcoding

Jun Xin¹, Chia-Wen Lin², Ming-Ting Sun³

Mitsubishi¹, National Chung Cheng University², University of Washington³

27 Jun 2005

TL;DR: The technical issues and research results related to video transcoding are outlined and techniques for reducing the complexity and improving the video quality are discussed, by exploiting the information extracted from the input video bit stream.

Abstract: Video transcoding, due to its high practical value for a wide range of networked video applications, has become an active research topic. We outline the technical issues and research results related to video transcoding. We also discuss techniques for reducing the complexity, and techniques for improving the video quality, by exploiting the information extracted from the input video bit stream.

389 citations

Journal Article

Analysis, fast algorithm, and VLSI architecture design for H.264/AVC intra frame coder

Yu-Wen Huang¹, Bing-Yu Hsieh¹, Tung-Chien Chen¹, Liang-Gee Chen¹

National Taiwan University¹

01 Mar 2005 - IEEE Transactions on Circuits and Systems for Video Technology

TL;DR: This paper proposed two solutions for platform-based design of H.264/AVC intra frame coder with comprehensive analysis of instructions and exploration of parallelism, and proposed a system architecture with four-parallel intra prediction and mode decision to enhance the processing capability.

Abstract: Intra prediction with rate-distortion constrained mode decision is the most important technology in H.264/AVC intra frame coder, which is competitive with the latest image coding standard JPEG2000, in terms of both coding performance and computational complexity. The predictor generation engine for intra prediction and the transform engine for mode decision are critical because the operations require a lot of memory access and occupy 80% of the computation time of the entire intra compression process. A low cost general purpose processor cannot process these operations in real time. In this paper, we proposed two solutions for platform-based design of H.264/AVC intra frame coder. One solution is a software implementation targeted at low-end applications. Context-based decimation of unlikely candidates, subsampling of matching operations, bit-width truncation to reduce the computations, and interleaved full-search/partial-search strategy to stop the error propagation and to maintain the image quality, are proposed and combined as our fast algorithm. Experimental results show that our method can reduce 60% of the computation used for intra prediction and mode decision while keeping the peak signal-to-noise ratio degradation less than 0.3 dB. The other solution is a hardware accelerator targeted at high-end applications. After comprehensive analysis of instructions and exploration of parallelism, we proposed our system architecture with four-parallel intra prediction and mode decision to enhance the processing capability. Hadamard-based mode decision is modified as discrete cosine transform-based version to reduce 40% of memory access. Two-stage macroblock pipelining is also proposed to double the processing speed and hardware utilization. The other features of our design are reconfigurable predictor generator supporting all of the 13 intra prediction modes, parallel multitransform and inverse transform engine, and CAVLC bitstream engine. 
A prototype chip is fabricated with TSMC 0.25-µm CMOS 1P5M technology. Simulation results show that our implementation can process 16 mega-pixels (4096×4096) within 1 s, that is, 720×480 4:2:0 30 Hz video in real time, at the operating frequency of 54 MHz. The transistor count is 429 K, and the core size is only 1.855×1.885 mm².
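The equivalence between the 16 mega-pixel figure and the real-time video format can be verified with a short calculation. The sketch below assumes that the throughput figure counts luma and chroma samples together and that 4:2:0 sampling halves each chroma plane in both dimensions; the function name `samples_per_second_420` is hypothetical.

```python
def samples_per_second_420(width, height, fps):
    """Total luma plus chroma samples per second for 4:2:0 video."""
    luma = width * height
    chroma = 2 * (width // 2) * (height // 2)  # two quarter-size chroma planes
    return (luma + chroma) * fps


required = samples_per_second_420(720, 480, 30)
capacity = 4096 * 4096
print(required, capacity, required <= capacity)  # 15552000 16777216 True
```

Under these assumptions, 720×480 4:2:0 video at 30 Hz needs about 15.6 million samples per second, which fits within the stated capacity of 4096×4096 samples per second.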

331 citations

Journal Article

Analysis and architecture design of an HDTV720p 30 frames/s H.264/AVC encoder

Tung-Chien Chen¹, Shao-Yi Chien¹, Yu-Wen Huang¹, Chen-Han Tsai¹, Ching-Yeh Chen¹, To-Wei Chen¹, Liang-Gee Chen¹

National Taiwan University¹

01 Sep 2006 - IEEE Transactions on Circuits and Systems for Video Technology

TL;DR: The four-stage macroblock pipelined system architecture is proposed with an efficient scheduling and memory hierarchy, and the prototype chip of the efficient H.264/AVC video encoder for HDTV applications is implemented.

Abstract: H.264/AVC significantly outperforms previous video coding standards with many new coding tools. However, the better performance comes at the price of extraordinarily high computational complexity and memory access requirements, which make it difficult to design a hardwired encoder for real-time applications. In addition, due to the complex, sequential, and highly data-dependent characteristics of the essential algorithms in H.264/AVC, the use of both pipelining and parallel processing techniques is constrained. The hardware utilization and throughput are also decreased because of the block/MB/frame-level reconstruction loops. In this paper, we describe our techniques to design the H.264/AVC video encoder for HDTV applications. On the system design level, in consideration of the characteristics of the key components and the reconstruction loops, the four-stage macroblock pipelined system architecture is first proposed with an efficient scheduling and memory hierarchy. On the module design level, the design considerations of the significant modules are addressed, followed by the hardware architectures, including low-bandwidth integer motion estimation, parallel fractional motion estimation, reconfigurable intrapredictor generator, dual-buffer block-pipelined entropy coder, and deblocking filter. With these techniques, the prototype chip of the efficient H.264/AVC encoder is implemented with 922.8 K logic gates and 34.72-KB SRAM at 108-MHz operation frequency.

295 citations