# Overview of ITRI PAC Project – from VLIW DSP Processor to Multicore Computing Platform

Tay-Jyi Lin, Chun-Nan Liu, Shau-Yin Tseng, Yuan-Hua Chu, and An-Yeu (Andy) Wu

SoC Technology Center, Industrial Technology Research Institute, Hsinchu, Taiwan

# **ABSTRACT**

The Industrial Technology Research Institute (ITRI) PAC (Parallel Architecture Core) project was initiated in 2003. The target is to develop a low-power and high-performance programmable SoC platform for multimedia applications. In the first PAC project phase (2004~2006), a 5-way VLIW DSP processor (PACDSP) was developed with our patented distributed & ping-pong register file and variable-length VLIW encoding techniques. A dual-core PAC SoC, composed of a PACDSP core and an ARM9 core, was also designed and fabricated in TSMC 0.13µm technology to demonstrate its outstanding performance and energy efficiency for multimedia processing such as a real-time H.264 codec. This paper summarizes the technical contents of PACDSP, the DVFS (dynamic voltage and frequency scaling)-enabled PAC SoC, and the energy-aware multimedia codec. The research directions of our second-phase PAC project (PAC II), including multicore architectures, ESL (electronic system-level) technology, and a low-power multimedia framework, are also addressed in this paper.

# 1. Introduction

In consumer electronics, high product volumes increasingly go hand in hand with short product lifetimes. Moreover, driven by advances in IC technology and the needs of new applications, the system functionality realized on a single chip is growing enormously. The resulting complexity and time-to-market constraints make designer productivity a vital factor for success. Thus, more and more system functions are implemented in software running on embedded processor cores. Programmability raises designer productivity, while the flexibility of software allows late design changes and provides a high degree of reusability, thus shortening design cycles. In addition, multimedia processing is usually implemented in software rather than fixed-function hardware, because software provides flexibility that is not available in hard-wired solutions. For example, compressed audio players typically need to support a variety of algorithms such as MPEG-1/Layer 3 (MP3), Windows Media Audio (WMA), and MPEG Advanced Audio Coding (AAC). As standards evolve and new algorithms are introduced, software-based products can be upgraded rapidly. Digital signal processors (DSPs) are general-purpose programmable processors designed for digital signal processing applications, whose instruction sets and architectures are customized to perform computation-intensive tasks more efficiently. Since their inception in the late 1970s, DSPs have expanded into various domains, such as media processing, communications, and industrial control.

The PAC (Parallel Architecture Core) project was initiated in 2003 and is executed by the SoC Technology Center (STC) of the Industrial Technology Research Institute (ITRI). The target is to provide a fully programmable solution for next-generation media-rich and multi-function portable devices, such as portable media players (PMP) and smart phones. Their speech, audio, image and video processing workloads demand extremely high computing power, and the inherent parallelism should be extensively exploited to meet the stringent requirements. Very-long-instruction-word (VLIW) and single-instruction-multiple-data (SIMD) architectures are now the mainstream of high-performance designs. However, they have two serious drawbacks: register file complexity and poor code density. In the first PAC project phase, a 5-way VLIW DSP (PACDSP) core was developed with an innovative distributed & ping-pong register file and novel variable-length VLIW encoding techniques to overcome these problems. A dual-core PAC SoC composed of a PACDSP core and an ARM9 core was also designed and fabricated to demonstrate its outstanding performance and energy efficiency for multimedia processing. This paper summarizes the technical contents of PACDSP, the DVFS (dynamic voltage and frequency scaling)-enabled PAC SoC, and energy-aware H.264/AVC decoding on the PAC SoC. Moreover, the research directions of our second-phase PAC project (PAC II), including multicore architectures, ESL (electronic system-level) technology, and a low-power multimedia framework, are also addressed in this paper.

# 2. PAC Hardware

PAC hardware designs include embedded high-performance and low-power PACDSP cores and computing platforms based on PACDSP (i.e. PAC SoC). Fig. 1 shows the roadmap of PAC. PACDSP V3 is the current instruction set architecture (ISA), based on a VLIW microarchitecture for high-performance multimedia processing. PAC-plus! is our latest PACDSP core based on the V3 ISA. Our future development plans include multi-PACDSP architectures (e.g. PAC Duo & PAC Quad with 2 & 4 PACDSP cores respectively) and next-generation ISAs (PAC-lite & PAC-SIMD) for ultra low-power and energy-efficient applications.



Fig. 1. PAC roadmap

## 2.1 PACDSP & PAC-plus!

PACDSP V3 is the ISA of a 32-bit VLIW DSP developed at ITRI, which combines the high-performance and low-power signal processing capability of an ASIC with the flexibility of a microprocessor. Its microarchitecture features a scalable datapath for easy adaptation to different applications, an innovative and patented distributed register organization [1][2], a rich & optimized instruction set with 8-bit/16-bit SIMD operations, and a high-performance memory subsystem [3]. Fig. 2 illustrates the PACDSP microarchitecture, which contains a program sequencer, a scalar unit and a clustered datapath. The distributed & ping-pong register file delivers operand bandwidth to and from the parallel functional units comparable to that of the centralized register files widely used in state-of-the-art high-performance VLIW processors, but it needs only a few access ports. Compared with an equivalent centralized register file, our approach reduces silicon area by 76.8% and shortens access time by 46.9%. PACDSP has an internal power management unit for power reduction; the parallel clusters can be individually turned off through the appropriate power mode setting. PACDSP achieves high VLIW code density through variable-length operation encoding, NOP removal, and embedded code replication techniques. The program sequencer dynamically aligns VLIW packets containing different numbers of operations, each of which is itself variable-length encoded [2]. PACDSP 3.0 is AMBA2 AHB compliant with one master port and one slave port. A high-bandwidth internal memory subsystem with a 32KB direct-mapped instruction cache and 64KB data memory is incorporated in the core.



Fig. 2. PACDSP architecture
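The idea behind NOP removal can be illustrated with a generic sketch. The code below is not the PACDSP encoding, which is not specified in this paper: it assumes a hypothetical per-packet occupancy mask with one bit per issue slot, stores only the real operations, and reinserts the NOPs on expansion. The names `compress_packet` and `expand_packet` and the operation mnemonics are illustrative only.

```python
# Generic illustration of NOP removal for a 5-way VLIW packet.
# Assumption: one occupancy bit per issue slot; this is NOT the PACDSP format.
NUM_SLOTS = 5
NOP = None


def compress_packet(slots):
    """Return (mask, ops): bit i of mask is set when slot i holds a real operation."""
    if len(slots) != NUM_SLOTS:
        raise ValueError(f"a packet must have exactly {NUM_SLOTS} slots")
    mask = 0
    ops = []
    for i, op in enumerate(slots):
        if op is not NOP:
            mask |= 1 << i
            ops.append(op)
    return mask, ops


def expand_packet(mask, ops):
    """Rebuild the full packet, reinserting a NOP wherever the mask bit is clear."""
    remaining = iter(ops)
    return [next(remaining) if mask & (1 << i) else NOP for i in range(NUM_SLOTS)]


# Only two of the five slots carry work, so only two operations are stored.
packet = ["add", NOP, "ld", NOP, NOP]
mask, ops = compress_packet(packet)
restored = expand_packet(mask, ops)
```

In such a scheme the stored packet length depends on how many slots are occupied, which is why the fetch logic has to align packets of different sizes at run time.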

Table 1. Comparison of licensable DSP cores

| Vendor           | StarCore      | CEVA         | ITRI/STC     |
|------------------|---------------|--------------|--------------|
| Product          | SC1000        | CEVA-X       | PACDSP       |
| Architecture     | 6-way VLIW    | 8-way VLIW   | 5-way VLIW   |
| Frequency        | 305 MHz       | 450 MHz      | 300 MHz      |
| Power w/o memory | 0.098 mW/MIPS | 0.08 mW/MIPS | 0.08 mW/MIPS |
| Area             | -             | 1.6 mm²      | ~1.2 mm²     |

The PACDSP core has been implemented and fabricated in the TSMC 130nm generic process. The maximum operating frequency is 300MHz and the gate count is only 250K. The average power consumption is only 0.08mW/MIPS owing to its highly optimized microarchitecture and the embedded power management. Table 1 compares PACDSP with two licensable DSP cores with comparable computing resources. StarCore and CEVA use relatively higher instruction bandwidth (6-issue and 8-issue respectively) to keep the arithmetic units busy. The implementations are all based on $0.13\mu m$ CMOS technology. PACDSP is at least as energy-efficient as its competitors for similar computing workloads.



Fig. 3. PAC-plus!

PAC-plus! is the latest PACDSP core based on the V3 ISA (i.e. the PACDSP V3.3 core). Fig. 3 shows its top-level block diagram with PACDSP, a data memory unit (DMU), an instruction memory unit (IMU), a bus interface unit (BIU), a host interface unit (HIU), and an embedded in-circuit emulator (ICE). The PACDSP is identical to that in Fig. 2. The DMU and IMU are highly configurable: their memory blocks and cache lines can be individually turned off to reduce power dissipation. The BIU supports AMBA3 AXI, which enables multiple outstanding transactions, simultaneous read/write accesses and out-of-order data capability for even higher-bandwidth multimedia applications. Finally, a complete tool chain (C/C++ compiler, assembler, linker, debugger, etc. [4][5]) is available for PACDSP V3. High-performance DSP libraries are also provided for multimedia applications. With the complete hardware & software tools and rich I/O peripherals, designers can develop their own applications with customized features easily and quickly.

## 2.2 Dual-Core PAC SoC (PAC Solo)

Besides the development of PACDSP, a dual-core PAC SoC composed of an ARM9 core and a single PACDSP core (i.e. PAC Solo) has also been designed to demonstrate the outstanding performance and energy efficiency for multimedia processing such as a real-time H.264 codec. A multi-layered AHB connects the two cores with a system DMA, an on-chip SDRAM controller, vector interrupt controllers (VIC) and other components, as shown in Fig. 4. PAC Solo adopts various power optimization techniques to reduce both dynamic and leakage power dissipation. First, clock gating is extensively applied in its silicon implementation. The SoC is then divided into six independent power domains for individual control of the power supplies and operating frequencies [6]. Table 2 lists the power domains and the corresponding power modes. Because of the limited control over the ARM9 hard macro, the MPU domain has only frequency control. The DSP domain, on the other hand, has full control over both the operating frequency and the supply voltage. The power mode of each power domain can be switched according to the performance requirements of the applications. A set of control registers is implemented in the DVFS controller to record the current state, the next state and all possible states of every domain. Power consumption can be optimized by programming these control registers in software.



Fig. 4. Dual-core PAC SoC (PAC Solo)

Table 2. Power domains of PAC Solo

| Power domain | Power mode             | Supply (V) | Freq (MHz) | Power gating |
|--------------|------------------------|------------|------------|--------------|
| MPU          | Active-1               | 1.2        | 228        | N            |
|              | Active-2               | 1.2        | 152        |              |
|              | Active-3               | 1.2        | 114        |              |
|              | Inactive               | 1.2        | 0          |              |
|              | Sleep                  | 0          | 0          |              |
| DSP          | Active-1               | 1.2        | 228        | Y            |
|              | Active-2               | 1.0        | 152        |              |
|              | Active-3               | 0.9        | 114        |              |
|              | Inactive               | 0.9        | 0          |              |
|              | Pending                | 0.9        | 0          |              |
|              | Sleep                  | 0          | 0          |              |
| ME           | Same as the DSP domain |            |            | Y            |
| AHB          | Full speed             | 1.2        | 152/114    | N            |
|              | Low power              | 1.0        | 76         |              |
| SRAM & LCD   | Same as the AHB domain |            |            | Y            |
| APB & DVFS   | Fixed V                | 1.0        | 48         | N            |
| PLL          | Fixed V                | 1.2        | 456        | N            |
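As a rough illustration of how software might mirror the DVFS control registers described above, the sketch below models one power domain with its current state, next state and set of possible states. The DSP mode values are transcribed from Table 2; the class and method names (`PowerMode`, `DomainState`, `request`, `commit`) are hypothetical and do not correspond to the actual PAC Solo register interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerMode:
    name: str
    supply_v: float
    freq_mhz: int


# DSP-domain power modes, transcribed from Table 2.
DSP_MODES = {
    mode.name: mode
    for mode in [
        PowerMode("Active-1", 1.2, 228),
        PowerMode("Active-2", 1.0, 152),
        PowerMode("Active-3", 0.9, 114),
        PowerMode("Inactive", 0.9, 0),
        PowerMode("Pending", 0.9, 0),
        PowerMode("Sleep", 0.0, 0),
    ]
}


class DomainState:
    """Hypothetical software view of one domain: current, next and possible states."""

    def __init__(self, modes, initial):
        if initial not in modes:
            raise ValueError(f"unknown power mode: {initial}")
        self.modes = modes      # all possible states
        self.current = initial  # current state
        self.next = initial     # next state, applied on commit()

    def request(self, name):
        """Record the desired next state without switching yet."""
        if name not in self.modes:
            raise ValueError(f"unknown power mode: {name}")
        self.next = name

    def commit(self):
        """Make the requested state current and return its settings."""
        self.current = self.next
        return self.modes[self.current]


dsp = DomainState(DSP_MODES, "Active-1")
dsp.request("Active-3")
state_before_commit = dsp.current
applied = dsp.commit()
```

Separating `request` from `commit` mirrors the distinction between the recorded next state and the current state; how and when the hardware actually performs the transition is not modeled here.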

# 3. Energy-Aware H.264 Decoding on PAC SoC

To fully exploit the instruction-level parallelism offered by a VLIW processor, traditional algorithms usually need to be restructured. In addition, DVFS is application-dependent, and a specific control scheme must be developed to adapt dynamically to the current computation. For an H.264/AVC decoder, various resolutions and profiles exist to support different applications on different devices. For example, video can be decoded at a low DSP operating frequency for low-bitrate videoconferencing on handheld devices. The supply voltage can then be lowered to reduce the dynamic power dissipation, which scales quadratically with the supply voltage. On the other hand, the supply voltage should be raised to boost the operating frequency and meet the performance requirements of high-bitrate, high-definition television. Moreover, even within the same video sequence, the decoding complexity and thus the decoding time vary from frame to frame. Therefore, the DVFS capability of the PAC platform can be extensively utilized to implement an energy-aware H.264/AVC decoder.
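The quadratic dependence mentioned above follows from the standard first-order model of CMOS dynamic power. The symbols $\alpha$ (switching activity) and $C$ (switched capacitance) are textbook notation introduced here, not quantities reported in this paper:

$$P_{\text{dyn}} \approx \alpha \, C \, V_{DD}^{2} \, f$$

Assuming $\alpha$ and $C$ stay constant across modes, the DSP settings in Table 2 give the following expected dynamic power relative to Active-1:

$$\frac{P_{\text{Active-2}}}{P_{\text{Active-1}}} \approx \left(\frac{1.0}{1.2}\right)^{2} \cdot \frac{152}{228} \approx 0.46, \qquad \frac{P_{\text{Active-3}}}{P_{\text{Active-1}}} \approx \left(\frac{0.9}{1.2}\right)^{2} \cdot \frac{114}{228} \approx 0.28$$

These are estimates under the stated assumptions only. The measured ratios in Table 3 are $75.99/161.20 \approx 0.47$ and $48.82/161.20 \approx 0.30$, which is close to the model; leakage and any power outside the scaled domain are not captured by it.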

In our energy-aware H.264/AVC decoder, the voltage and the frequency of the PACDSP core are dynamically adjusted to meet the performance requirements while the ARM9 core runs at a fixed 114 MHz. To understand the relationship between power dissipation and operating frequency/voltage, we first characterize the power dissipation of PACDSP under its different power modes (see Table 2). The measurements on real silicon, where the same video sequence is decoded repeatedly, are summarized in Table 3. Relative to Active-1, energy reductions of 39% and 49% are observed for the Active-2 and Active-3 modes respectively.

Table 3. Power dissipation in different power modes

| Mode     | DSP condition | Power     | Time   | Energy (mJ)  |
|----------|---------------|-----------|--------|--------------|
| Active-1 | 1.2V/228MHz   | 161.20 mW | 10 sec | 1612         |
| Active-2 | 1.0V/152MHz   | 75.99 mW  | 13 sec | 987.87 (39%) |
| Active-3 | 0.9V/114MHz   | 48.82 mW  | 17 sec | 829.94 (49%) |
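The energy column in Table 3 is the product of average power and decoding time, and the percentages are reductions relative to Active-1. The short script below recomputes both from the measured power and time values in the table; the function names are illustrative.

```python
# (mode, average power in mW, decoding time in s), transcribed from Table 3.
MEASUREMENTS = [
    ("Active-1", 161.20, 10),
    ("Active-2", 75.99, 13),
    ("Active-3", 48.82, 17),
]


def energy_mj(power_mw, time_s):
    """Energy in millijoules: mW multiplied by seconds gives mJ."""
    return power_mw * time_s


def energy_table(rows):
    """Map each mode to (energy in mJ, percent reduction versus the first row)."""
    baseline = energy_mj(rows[0][1], rows[0][2])
    table = {}
    for mode, power_mw, time_s in rows:
        energy = energy_mj(power_mw, time_s)
        reduction = round(100 * (1 - energy / baseline))
        table[mode] = (round(energy, 2), reduction)
    return table


results = energy_table(MEASUREMENTS)
for mode, (energy, reduction) in results.items():
    print(f"{mode}: {energy:.2f} mJ, {reduction}% below Active-1")
```

Although the lower-voltage modes take longer to decode the sequence, the drop in power outweighs the longer run time, so total energy still falls.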

To adapt to the characteristics of the input sequence, the voltage and frequency are dynamically adjusted by the DVFS controller. The DVFS control flow is shown in Fig. 5, where Te and Ta are the execution time of the previous frame and the allowed execution time of the current frame respectively. Initially, the first 30 frames serve as a training period to characterize the input sequence and to establish the performance database. For the following frames, the power mode is decided by the mode prediction model using the timing and power information. Based on the prediction model, the occurrence rate of each power mode is analyzed. Two cases with the same sequence (cars, 1888 frames) but different frame rates are summarized in Table 4. At higher frame rates, PACDSP is more likely to be in the Active-1 mode to meet the computation requirements. Energy savings of 35% and 43% are observed for Case 1 and Case 2 respectively, relative to their non-DVFS counterparts.



Fig. 5. DVFS control flow
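A per-frame mode decision using Te and Ta could look like the sketch below. This is a simplified, hypothetical policy and not the control flow of Fig. 5 or the authors' prediction model: it assumes the next frame needs about as many DSP cycles as the previous one, scales Te by the frequency ratio, and picks the slowest active mode predicted to finish within Ta. The 30-frame training phase and the performance database are not modeled. Frequencies are the DSP active modes from Table 2.

```python
# DSP active modes from Table 2 as (name, frequency in MHz), slowest first.
ACTIVE_MODES = [("Active-3", 114), ("Active-2", 152), ("Active-1", 228)]
FREQ_MHZ = dict(ACTIVE_MODES)


def select_mode(te_ms, prev_mode, ta_ms):
    """Pick the slowest mode predicted to decode the next frame within ta_ms.

    te_ms:     execution time of the previous frame (Te), in milliseconds
    prev_mode: power mode in which the previous frame was decoded
    ta_ms:     allowed execution time of the current frame (Ta), in milliseconds

    Assumption: the cycle count of the next frame equals that of the previous one.
    """
    kilocycles = te_ms * FREQ_MHZ[prev_mode]  # ms * MHz = thousands of cycles
    for name, freq in ACTIVE_MODES:
        if kilocycles / freq <= ta_ms:
            return name
    # Even the fastest mode is predicted to miss the deadline; run flat out.
    return "Active-1"


# A frame that took 20 ms at 228 MHz is predicted to take 40 ms at 114 MHz.
relaxed = select_mode(20.0, "Active-1", 45.0)
moderate = select_mode(20.0, "Active-1", 35.0)
tight = select_mode(20.0, "Active-1", 25.0)
```

A shorter Ta, as required by a higher frame rate, pushes this policy toward the faster modes, which is consistent with the trend the paper reports for its two frame-rate cases.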

Table 4. Power dissipation with DVFS

| Case   | Frame rate (fps) | Mode     | Occurrence | Saved energy |
|--------|------------------|----------|------------|--------------|
| Case 1 | 22               | Active-1 | 359 (19%)  | 35%          |
|        |                  | Active-2 | 874 (46%)  |              |
|        |                  | Active-3 | 655 (35%)  |              |
| Case 2 | 20               | Active-1 | 82 (4%)    | 43%          |
|        |                  | Active-2 | 688 (36%)  |              |
|        |                  | Active-3 | 1118 (60%) |              |

# 4. PAC II

Building on the PACDSP core, its associated tool chain, the DVFS technology, the optimized DSP libraries and the multimedia codec developed in the first-phase PAC project, the second phase (PAC II) pursues next-generation embedded computing platforms with even lower power dissipation, higher performance, and improved energy efficiency. Compared to the first-phase project, PAC II focuses on platform technology such as a low-power and high-bandwidth on-chip network, an optimized embedded memory organization with enhanced DMA controllers, platform-optimized runtime software, and electronic system-level (ESL) design methodology. Moreover, the hardware research includes the improved ISAs (PAC-lite & PAC-SIMD) for ultra low-power & energy-efficient applications and the multi-PACDSP architectures (e.g. PAC Duo & PAC Quad with two & four PACDSP cores respectively). The hardware and software architectures for PAC II are illustrated in Fig. 6 and Fig. 7 respectively.



Fig. 6. PAC II hardware architecture



Fig. 7. PAC II software architecture

Multicore architectures with many simple processors have already been shown to be more energy-efficient than a single high-performance processor, provided that the overheads of communication and software programming are well controlled. Based on the rich software components already optimized for PACDSP and our expertise in software development on a single PACDSP, we have developed a component-based development environment as shown in Fig. 8. The ESL tool allows drag & drop binding of software tasks developed on the single-core processor for multicore simulation with cycle-accurate transaction modeling. We are now working on automatic task binding with mixed integer linear programming (MILP), which models the significant system-level overheads, such as cache misses and DMA accesses on the on-chip interconnect.



Fig. 8. PAC multicore simulator
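The task-binding problem can be illustrated on a toy scale. The sketch below is not the MILP formulation under development: it replaces the solver with an exhaustive search that is only practical for a handful of tasks, and it uses a hypothetical cost model in which each task has a fixed compute cost and each communicating pair adds an overhead to both cores when split across them. Task names and cost values are invented for the example.

```python
from itertools import product


def best_binding(task_cost, comm_cost, num_cores):
    """Exhaustively bind tasks to cores, minimizing the load of the busiest core.

    task_cost: {task: compute cost}
    comm_cost: {(task_a, task_b): overhead added to both cores involved
                when the two tasks are bound to different cores}
    Returns (makespan, {task: core index}).
    """
    tasks = sorted(task_cost)
    best = None
    for assignment in product(range(num_cores), repeat=len(tasks)):
        core_of = dict(zip(tasks, assignment))
        load = [0.0] * num_cores
        for task in tasks:
            load[core_of[task]] += task_cost[task]
        for (a, b), overhead in comm_cost.items():
            if core_of[a] != core_of[b]:
                load[core_of[a]] += overhead
                load[core_of[b]] += overhead
        makespan = max(load)
        if best is None or makespan < best[0]:
            best = (makespan, core_of)
    return best


# Hypothetical four-task workload bound onto two cores.
costs = {"parse": 2, "predict": 4, "transform": 3, "filter": 3}
links = {("parse", "predict"): 1, ("transform", "filter"): 2}
makespan, binding = best_binding(costs, links, num_cores=2)
```

The search space grows as the number of cores raised to the number of tasks, which is one reason a solver-based formulation such as MILP is attractive for realistic task graphs.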

# 5. Summary

Designing a flexible and programmable architecture is always a daunting task. From evolving mobile standards to the newest video compression techniques, algorithms are rapidly growing in complexity. For example, customers who were satisfied with standard-definition MPEG-2 video might now demand that the next product support high-definition H.264, which requires more than an order of magnitude increase in system performance. Consequently, designers should consider not only today's requirements when starting a new design, but also that the system may soon be called upon to address unforeseen challenges. In this paper, we have summarized the research and implementation results of a high-performance and low-power PACDSP core, a DVFS-enabled energy-aware PAC SoC with a PACDSP core and an ARM9 core, and energy-aware H.264/AVC decoding. In the future, we will continue to pursue an even lower-power, higher-performance and more energy-efficient computing platform for next-generation applications, with special focus on platform technology, multicore architecture, ESL techniques, and platform-dependent software optimizations.

# References

- [1] T. J. Lin, P. C. Hsiao, C. W. Liu, and C. W. Jen, "Area-efficient register organization for fully-synthesizable VLIW DSP cores," *International Journal of Electrical Engineering*, vol. 13, pp.117-127, May 2006
- [2] T. J. Lin, P. C. Hsiao, S. K. Chen, Y. T. Kuo, and C. W. Liu, "Design & implementation of a high-performance & complexity-effective VLIW DSP for multimedia applications," to appear in *Journal of VLSI Signal Processing*
- [3] C. W. Chang, et al., "PAC DSP core and application processors," in *Proc. ICME*, pp.289-292, 2006
- [4] Y. C. Lin, et al., "Compiler supports and optimizations for PAC VLIW DSP processors," in *Proc. LCPC*, pp.466-474, 2005
- [5] C. Wu, et al., "Integrating compiler and system toolkit flow for embedded VLIW DSP processors," in *Proc. RTCSA*, pp.215-222, 2006
- [6] C. Y. Lai, J. H. Lin, and Y. F. Wang, "DVFS SoC architecture and implementation," *SoC Technology Journal*, vol. 3, pp.84-91, 2005