What is the effect of memory operations on other measurements?

Note that in order to reduce the effect of memory operations on other measurements, the target machine has a 32 KB instruction cache and a 32 KB data cache, resulting in high cache hit rates.

What are the run-time characteristics of the machine?

The run-time characteristics include the available ILP, demand on various hardware components such as cache memory units, register files, and the number of function units.

What is the main idea behind the idea of programmable processors?

The arrival of production quality ILP compilers and commercial DSPs with VLIW architecture stimulated the idea of programmable processors that are aggressively tuned to specific applications.

How many machine configurations can be run on a processor?

The objective function of the optimization problem is minimization of the number of selected machine configurations, thereby, on average, maximizing the number of benchmarks that can be run on a processor as though it is optimized for each individual benchmark.

How many mm² of area is needed for the given compiler technology?

For the given compiler technology and benchmarks, there is no need to have more than 100 mm² of area, since the speed-up increase achieved by machines greater than 100 mm² is minimal.

How many applications are used in this study?

The collection is composed of 21 applications culled from available image processing, communications, cryptography, and DSP applications.

How many byte-sized instruction caches are used in the benchmark?

For each executable of a benchmark, the authors simulate 25 combinations of instruction cache and data cache ranging from (512 bytes, 512 bytes) to (8 KB, 8 KB).
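The 25 combinations are consistent with a 5 × 5 grid of cache sizes. The sketch below enumerates such a grid, assuming (this is not stated in the text) that the sizes are the five powers of two from 512 bytes to 8 KB; the names `CACHE_SIZES_BYTES` and `cache_combinations` are hypothetical.

```python
from itertools import product

# Assumed sweep: five power-of-two sizes from 512 bytes to 8 KB.
# The text gives only the two endpoints and the total of 25 combinations.
CACHE_SIZES_BYTES = [512 * 2**k for k in range(5)]  # 512, 1024, 2048, 4096, 8192


def cache_combinations(sizes=CACHE_SIZES_BYTES):
    """Return every (instruction_cache, data_cache) size pair, in bytes."""
    return list(product(sizes, repeat=2))


combos = cache_combinations()
print(len(combos), combos[0], combos[-1])  # 25 (512, 512) (8192, 8192)
```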

What is the reason why the ILP is not fully exploited?

One of the underlying reasons that causes the phenomenon is that the ILP found by the compiler and hardware scheduler is fully exploited by having a certain amount of hardware, thereby performance increase possibility is exhausted.

How many mm² of area are available for the given compiler technology?

For the given compiler technology and benchmarks, there is little need to have more than 100 mm² of area, since the speed-up increase achieved by machines greater than 100 mm² is minimal.

What are the parameters of the Lsim simulator?

Memory latency, misprediction penalty and ALU latency are specified as Lsim parameters (Fig. 2) in the system model being simulated.

What are the characteristics of the benchmarks?

Note that the combination of the instructions per cycle (IPC), bus utilization, branch issue, and ALU issue exhibit distinctive characteristics for each benchmark.

(Open Access) Exploring the diversity of multimedia systems (2001) | Johnson Kin

Q: What are the contributions mentioned in the paper "Exploring the diversity of multimedia systems"?

The authors follow the tradition of the architecture community by using comprehensive real-life benchmarks and production quality tools. This combination enables them to build a unique framework for system-level synthesis and to gain valuable insights about the design and use of application-specific programmable processors for modern applications. The authors explore the application-specific programmable processor (ASSP) design space to understand the relationship between performance and area. For example, they found that in most cases a single programmable architecture performs similarly to a set of architectures that are tuned to individual applications. They also found that the framework introduced in this paper can be very valuable in making early design decisions such as area and architectural configuration tradeoff, cache and issue width tradeoff under area constraint, and the number of branch units and issue width.

Q: What is the role of cache memory units in machine performance?

The authors incorporate the role of cache memory units in machine performance into the machine model, which is essential for producing meaningful results.

Q: What is the framework used in the design of a chip?

The authors have found that the framework introduced in this paper can be very valuable in making early design decisions such as area and architectural configuration tradeoff, cache and issue width tradeoff under area constraint, and the number of branch units and issue width.

474 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 9, NO. 3, JUNE 2001

Transactions Briefs

Exploring the Diversity of Multimedia Systems

Johnson Kin, Chunho Lee, William H. Mangione-Smith, and Miodrag Potkonjak

Abstract—We evaluate the validity of the fundamental assumption

behind application-specific programmable processors: that applications

differ from each other in key parameters which are exploitable, such as the

available instruction-level parallelism (ILP), demand on various hardware

resources, and the desired mix of function units. Following the tradition of

the CAD community, we develop an accurate chip area estimate and a set

of aggressive hardware optimization algorithms. We follow the tradition of

the architecture community by using comprehensive real-life benchmarks

and production quality tools. This combination enables us to build a

unique framework for system-level synthesis and to gain valuable insights

about design and use of application-specific programmable processors for

modern applications. We explore the application-specific programmable

processor (ASSP) design space to understand the relationship between

performance and area. The architecture model we used is the Hewlett

Packard PA-RISC [1] with single level caches. The system, including all

memory and bus latencies, is simulated and no other specialized ALU

or memory structures are being used. The experimental results reveal

a number of important characteristics of the ASSP design space. For

example, we found that in most cases a single programmable architecture

performs similarly to a set of architectures that are tuned to individual

applications. A notable exception is highly cost sensitive designs, which

we observe need a small number of specialized architectures that require

smaller areas. Also, it is clear that there is enough parallelism in the typical

media and communication applications to justify the use of a high number of

function units. We found that the framework introduced in this paper

can be very valuable in making early design decisions such as area and

architectural configuration tradeoff, cache and issue width tradeoff under

area constraint, and the number of branch units and issue width.

Index Terms—Application-specific programmable processor, instruction-level parallelism, mediabench, mediaprocessor, system-level synthesis.

I. INTRODUCTION

It has been predicted that the “micro-brain boom” (sic) will greatly increase demand for application-specific microprocessors for media applications [2]. Sales of handheld computers and personal digital assistants grew almost sixfold from 1994’s total, to 5.6 million units in 1999. The market for programmable DSP chips increased 20% in 1998 to the $3.9 billion level. The new DSP markets that are beginning to emerge, including digital cameras, satellite phones, smart antennas, voice over IP, ac motor control, and even digital TV, are forecast to grow at a 33% compound rate to the $13.4 billion level in 2002 [3].

This market growth coincides with an interesting technological advance that will change both the semiconductor business and microprocessor design. Since 1992, microprocessors account for 23% of total semiconductor sales. In 1998, these chips accounted for 30% of the total value of semiconductor production. The increasing share of microprocessors in the semiconductor market is due to a new phase of silicon integration enabled by deep submicron fabrication technology.

Manuscript received July 16, 1998; revised October 22, 1999.

J. Kin and W.-H. Mangione-Smith are with the Electrical Engineering Department, University of California, Los Angeles, CA 90095 USA (e-mail: johnsonk@icsl.ucla.edu; billms@icsl.ucla.edu).

C. Lee and M. Potkonjak are with the Computer Science Department, University of California, Los Angeles, CA 90095 USA (e-mail: leec@cs.ucla.edu; miodrag@cs.ucla.edu).

Publisher Item Identifier S 1063-8210(01)00703-X.

For example, the SA-1100 from Intel [4] incorporates many functions such as a memory controller, color LCD driver, PCMCIA interface, IrDA and USB communication channels, and extensive power management into a single chip along with its core logic, previously available only through “glue logic” chips. One implication of this technology is that almost all semiconductor manufacturers are entering the microprocessor business.

As a consequence of this trend, the market will be more crowded and competitive in spite of increasing demand. This pressure will force manufacturers to focus on microprocessors that are cheaper and more aggressively optimized for specific applications. A challenge to microprocessor designers will be to design a microprocessor that executes a targeted application very well yet can achieve economy-of-scale. For example, video-game players such as the PlayStation from Sony and the Nintendo 64 from Nintendo need to employ ever-more powerful processors for the application and yet remain cheap enough to sell for under $300.

On the technical side, recent advances in compiler technology

and microprocessor architecture for instruction-level parallelism

(ILP) have significantly increased the ability of a microprocessor to

exploit the opportunities for parallel execution that exist in various

programs. Key ILP compiler technologies, such as trace scheduling

[4], superblock scheduling [5], treegion-scheduling [6], hyperblock

scheduling [8], and software pipelining [9] are in the process of

migrating from research labs to product groups.

At the same time, a number of new microprocessor architectures have been introduced. These designs present hardware structures that are well matched to most ILP compilers. Architectural enhancements found in commercial products include predicated instruction execution, VLIW execution, and split register files. One of the best examples that has these features is the TMS320C6X from Texas Instruments [9]. Although TI considers the TMS320C6X to be a DSP, the architecture is almost a copy of the Multiflow Trace [10]. Multi-gauge arithmetic (or variable-width SIMD) is found in the family of MPACT architectures from Chromatic [11] and the designs from MicroUnity [12]. Most of the multimedia extensions of programmable processors also adopt this architectural enhancement [14].

The arrival of production quality ILP compilers and commercial DSPs with VLIW and SIMD architectures stimulated the idea of custom-fit processors [15]. The premise of such an approach is that applications differ from each other in exploitable measures, for example the available ILP, the demand on various hardware components (e.g., cache memory units, register files), and the number of function units. The presumption is that a microprocessor can be designed by adding hardware components tailored to a specific application so that it can execute the single application extremely well. Of course, an obvious drawback of this approach is that it provides no guarantee that other applications will run as well as the targeted application. While the current microprocessors for media applications (mediaprocessors) are claimed to target general applications in a domain [13], a custom-fit processor targets a single application (although it remains programmable).

We report on a method of system-level synthesis of single or multiple application programmable processors. We use a benchmark suite consisting of complete applications written in a high level language [16]. We use the IMPACT tool suite [18] to collect performance measurements of benchmarks on various machine configurations. The IMPACT C compiler is a retargetable compiler with code optimization components supporting multiple-instruction-issue processors. The target machine is described using the high-level machine description language. A high-level machine description supplied by a user is compiled by the IMPACT machine description language compiler. IMPACT provides cycle-level simulation tools.

TABLE I

MACHINE CONFIGURATION EXAMPLES AND THEIR AREA ESTIMATES (mm²): A MACHINE CONFIGURATION CONSISTS OF: ISSUE WIDTH, NUMBER OF ALUS, NUMBER OF BRANCH UNITS, NUMBER OF MEMORY UNITS, SIZE OF INSTRUCTION CACHE (KB), SIZE OF DATA CACHE (KB)

TABLE II

APPLICATIONS USED IN THE EXPERIMENT. DYNAMIC INSTRUCTION COUNTS WERE MEASURED ON A SPARC

This paper is organized as follows. The next section briefly surveys related works and summarizes the contributions of this work. Section III presents the background materials including machine model, benchmarks, experiment platform (such as tools), and an example set of results obtained using the tools. Our approach in this project is explained in Section IV in detail. Section V formulates the search problem defined in the previous section in formal terms. The solution space exploration strategy and algorithm are described in Section VI. Extensive experimental results are reported in Section VII. Finally, Section VIII draws conclusions.

II. RELATED WORKS AND OUR CONTRIBUTIONS

The work on synthesis and evaluation of application-specific programmable processors has been conducted independently in two research communities, computer-aided design and architecture. There is, however, a strong converging trend of the two areas due to recent technological advances and application trends. In this section we survey the related works in these two fields.

There have been a number of efforts related to the design of application-specific programmable processors and application-specific instruction sets. Comprehensive surveys of the works on computer-aided design of application-specific programmable processors have been conducted by Goossens [18], Paulin [19], and Marwedel [20]. In particular, a great deal of effort has been made in combining retargetable compilation technologies and design of instruction sets [22]–[26]. Several research groups have published results on the topic of selecting and designing instruction sets and processor architectures for particular application domains [27], [28].

Early work in the area of processor architecture synthesis tended to employ ad hoc methods on small code kernels, in large part due to the lack of good retargetable compiler technology. Conte and Mangione-Smith [29] presented one of the first efforts that focused on large application codes (i.e., SPEC) written in a high-level language. While they had a similar goal to ours, i.e., evaluating performance efficiency by including hardware cost, their evaluation approach was substantially different. Conte et al. [29] further refined this approach to consider power consumption. Both of these efforts were limited by available compiler technology and used a single application binary scheduled for a scalar machine for execution on superscalar implementations. Fisher et al. [15] studied the variability of application-specific VLIW processors using a highly advanced and retargetable compiler. However, their study considered small program kernels rather than complete applications. They also focused on finding the best possible architecture for a specific application or workload, rather than understanding the difference among attractive architectures across a set of applications.

TABLE III

DATA SET USED IN THE EXPERIMENT

TABLE IV

RUN-TIME CHARACTERISTICS MEASURED ON HPPA-7100 USING IMPACT SUITES

We adopt a methodology of system synthesis combining the key paradigms of both communities. Following the tradition of the CAD community, we develop an accurate area estimate and aggressive optimization algorithms. We follow the tradition of the architecture community by using comprehensive real-life benchmarks and production quality compilation and simulation tools. This combination enables us to build a unique framework of system-level synthesis and to gain valuable insights about design and use of application-specific programmable processors for modern applications.

Unlike previous works, we use a set of complete applications written

in a high-level language as benchmarks. We incorporate the role of

cache memory units in machine performance into the machine model,

which is essential for producing meaningful results.


Fig. 1. Performance measurement flow using IMPACT tools.

TABLE V

AN EXAMPLE SET OF RESULTS

We focus on the number of machine configurations that should be developed in order to maximize performance for all of the benchmarks given an area constraint. We understand that it is in the best interest of a processor designer to understand which architecture and how many functional units or what cache size is best for one particular application. However, our first goal is to develop a framework for managers to understand how big the chip portfolio should be in one particular domain. It is not intended for a single designer to find his best application-specific system. The objective function of the optimizer is minimization of the number of selected machine configurations, thereby maximizing the number of benchmarks that can be run on a processor as though it is optimized for each individual benchmark. In one extreme case, we end up with as many machine configurations as the number of individual benchmarks. On the other extreme, we need only one machine. Clearly, the most interesting solutions lie somewhere in the middle.
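To make the objective concrete, the selection can be viewed as a covering problem: pick as few configurations as possible so that every benchmark runs close to its best achievable performance under the area limit. The sketch below uses a simple greedy heuristic for this. It is an illustration only, not the algorithm of Section VI; the function name, the `tolerance` parameter, and the toy performance and area numbers are all hypothetical.

```python
def select_configurations(perf, area, max_area, tolerance=0.05):
    """Greedy covering heuristic (an assumption, not the paper's algorithm).

    perf[b][c] : performance of benchmark b on configuration c (higher is better)
    area[c]    : estimated area of configuration c
    Returns configurations such that every benchmark runs within `tolerance`
    of its best feasible performance on at least one of them.
    """
    feasible = [c for c in area if area[c] <= max_area]
    if not feasible:
        raise ValueError("no configuration satisfies the area constraint")

    # For each configuration, the benchmarks it serves "as though optimized".
    covers = {c: set() for c in feasible}
    for b, row in perf.items():
        best = max(row[c] for c in feasible)
        for c in feasible:
            if row[c] >= (1.0 - tolerance) * best:
                covers[c].add(b)

    uncovered = set(perf)
    chosen = []
    while uncovered:
        # Take the configuration covering the most uncovered benchmarks;
        # break ties in favor of the smaller area.
        c = max(feasible, key=lambda k: (len(covers[k] & uncovered), -area[k]))
        chosen.append(c)
        uncovered -= covers[c]
    return chosen


# Hypothetical toy data: three configurations, three benchmarks.
toy_area = {"small": 40, "medium": 70, "large": 120}
toy_perf = {
    "jpeg": {"small": 1.0, "medium": 1.9, "large": 2.0},
    "gsm":  {"small": 1.5, "medium": 1.55, "large": 1.6},
    "mpeg": {"small": 0.8, "medium": 2.0, "large": 3.0},
}
print(select_configurations(toy_perf, toy_area, max_area=100))
```

With a 100 mm² limit in this toy data, one configuration serves all three benchmarks, which corresponds to the "only one machine" extreme described above.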

Power consumption evaluation and optimization is very often an important aspect in multimedia processors; however, it is beyond the scope of this paper. We have published a thorough investigation of power consumption using a similar framework and tools in another paper [30].

III. PRELIMINARY DISCUSSION

In this section we discuss the experimental environment that has been adapted and developed for the investigation. First, we describe the machine model used to estimate the area of a machine configuration. The benchmark suite is introduced along with the characteristics of its components. Finally, we explain the experimental platform, including tools and their example outputs.

A. Machine Model

TABLE VI

CONFIGURATIONS OF THE MACHINES USED IN TABLE V

Fig. 2. System model being simulated.

To estimate the cost of a machine configuration, we adopt a simple model developed by Argyres [31]. Given the area of the issue unit, the cost of any scalar machine configuration is a linear function of the numbers of branch, memory, and arithmetic units. A machine may include any number of each function unit. For a superscalar machine, the issue unit area cannot be estimated using a simple linear model since it requires more complex logic for runtime code scheduling. We assume that the issue unit area will take O( ) space since the complexity of the dependency checking algorithm is O( ). When a VLIW machine is considered, the issue unit area is known to be of complexity O( ) sublinear. The cost function of arbitrarily configured superscalar machines is given by

Area

(1)

where

area of a baseline machine;

area of a baseline machine execution unit;

baseline machine cache area;

baseline machine issue unit area;

issue unit area;

number of branch units;

area of branch units;

number of memory units;

area of memory units;

number of ALU;

area of an ALU;

data cache area;

instruction cache area.
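As a hedged sketch, the term list above suggests a cost function of the following shape: start from the baseline machine, take out its execution units, caches, and issue unit, and add back the configured components. Every symbol name here ($A_{\text{base}}$, $n_b$, $A_D$, and so on) is an assumption introduced for illustration, as is the sign of each term; only the set of quantities comes from the text.

```latex
\begin{equation*}
\text{Area} = A_{\text{base}} - A_{\text{exe}} - A_{\text{cache}} - A_{\text{issue,base}}
            + A_{\text{issue}} + n_b A_b + n_m A_m + n_a A_a + A_D + A_I
\end{equation*}
% Assumed symbols, in the order of the term list:
% A_base: area of a baseline machine;  A_exe: baseline execution unit area;
% A_cache: baseline cache area;        A_issue,base: baseline issue unit area;
% A_issue: issue unit area;            n_b, A_b: number and area of branch units;
% n_m, A_m: number and area of memory units;  n_a, A_a: number of ALUs and ALU area;
% A_D: data cache area;                A_I: instruction cache area.
```

Under this reading, the function-unit terms are linear in $n_b$, $n_m$, and $n_a$, which agrees with the linear model described for scalar machines, while the issue unit and caches enter as separate area terms.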

Fig. 3. An example high-level machine description (HMDES).

The baseline architecture chosen for the analysis is the PowerPC 604 [32], a four-issue processor. The 604 has two simple integer ALUs and one complex integer ALU, one floating-point unit, one branch unit, and one memory unit. We assume that machine configurations that have an issue unit smaller than the baseline machine have at least one complex integer ALU. The area of the complex integer unit is assumed to be half of the baseline integer unit (two simple integer units and one complex integer unit). The area of the issue unit is scaled based on the area complexity O( ). We did not include floating-point units in any machine configurations because the benchmarks we used have mostly integer (or fixed-point) operations. Finally, we scaled the area for 0.35 µm technology rather than the original 0.5 µm technology used
