
Exploring the diversity of multimedia systems



474 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 9, NO. 3, JUNE 2001
Transactions Briefs
Exploring the Diversity of Multimedia Systems
Johnson Kin, Chunho Lee, William H. Mangione-Smith, and
Miodrag Potkonjak
Abstract—We evaluate the validity of the fundamental assumption
behind application-specific programmable processors: that applications
differ from each other in key parameters which are exploitable, such as the
available instruction-level parallelism (ILP), demand on various hardware
resources, and the desired mix of function units. Following the tradition of
the CAD community, we develop an accurate chip area estimate and a set
of aggressive hardware optimization algorithms. We follow the tradition of
the architecture community by using comprehensive real-life benchmarks
and production quality tools. This combination enables us to build a
unique framework for system-level synthesis and to gain valuable insights
about design and use of application-specific programmable processors for
modern applications. We explore the application-specific programmable
processor (ASSP) design space to understand the relationship between
performance and area. The architecture model we used is the Hewlett
Packard PA-RISC [1] with single level caches. The system, including all
memory and bus latencies, is simulated and no other specialized ALU
or memory structures are being used. The experimental results reveal
a number of important characteristics of the ASSP design space. For
example, we found that in most cases a single programmable architecture
performs similarly to a set of architectures that are tuned to individual
applications. A notable exception is highly cost-sensitive designs, which
we observe need a small number of specialized architectures that require
smaller areas. Also, there is enough parallelism in typical
media and communication applications to justify the use of a high number of
function units. We found that the framework introduced in this paper
can be very valuable in making early design decisions such as area and
architectural configuration tradeoff, cache and issue width tradeoff under
area constraint, and the number of branch units and issue width.
Index Terms—Application-specific programmable processor, instruction
level parallelism, mediabench, mediaprocessor, system-level synthesis.
I. INTRODUCTION
It has been predicted that the “micro-brain boom” (sic) will greatly
increase demand for application-specific microprocessors for media
applications [2]. Sales of handheld computers and personal digital as-
sistants grew almost sixfold from 1994’s total, to 5.6 million units in
1999. The market for programmable DSP chips increased 20% in 1998
to the $3.9 billion level. The new DSP markets that are beginning
to emerge, including digital cameras, satellite phones, smart antennas,
voice over IP, ac motor control, and even digital TV, are forecast to grow
at a 33% compound rate to the $13.4 billion level in 2002 [3].
This market growth coincides with an interesting technological advance
that will change both the semiconductor business and microprocessor
design. Since 1992, microprocessors account for 23% of total
semiconductor sales. In 1998, these chips accounted for 30% of the total
value of semiconductor production. The increasing share of microprocessors
in the semiconductor market is due to a new phase of silicon
integration enabled by deep submicron fabrication technology.
Manuscript received July 16, 1998; revised October 22, 1999.
J. Kin and W.-H. Mangione-Smith are with the Electrical Engineering Department,
University of California, Los Angeles, CA 90095 USA (e-mail: johnsonk@icsl.ucla.edu;
billms@icsl.ucla.edu).
C. Lee and M. Potkonjak are with the Computer Science Department, University
of California, Los Angeles, CA 90095 USA (e-mail: leec@cs.ucla.edu;
miodrag@cs.ucla.edu).
Publisher Item Identifier S 1063-8210(01)00703-X.
For example, SA-1100 from Intel [4] incorporates many functions
such as a memory controller, color LCD driver, PCMCIA interface,
IrDA and USB communication channels, and extensive power manage-
ment into a single chip along with its core logic, previously available
only through “glue logic” chips. One implication of this technology is
that almost all semiconductor manufacturers are entering the micropro-
cessor business.
As a consequence of this trend, the market will be more crowded
and competitive in spite of increasing demand. This pressure will force
manufacturers to focus on microprocessors that are cheaper and more
aggressively optimized for specific applications. A challenge to micro-
processor designers will be to design a microprocessor that executes a
targeted application very well yet can achieve economy-of-scale. For
example, video-game players such as the PlayStation from Sony and the
Nintendo 64 from Nintendo need to employ ever-more powerful processors
for the application and yet remain cheap enough to sell for under
$300.
On the technical side, recent advances in compiler technology
and microprocessor architecture for instruction-level parallelism
(ILP) have significantly increased the ability of a microprocessor to
exploit the opportunities for parallel execution that exist in various
programs. Key ILP compiler technologies, such as trace scheduling
[4], superblock scheduling [5], treegion-scheduling [6], hyperblock
scheduling [8], and software pipelining [9] are in the process of
migrating from research labs to product groups.
At the same time, a number of new microprocessor architectures
have been introduced. These designs present hardware structures that
are well matched to most ILP compilers. Architectural enhancements
found in commercial products include predicated instruction execution,
VLIW execution, and split register files. One of the best examples
with these features is the TMS320C6X from Texas Instruments [9]. Although
TI considers the TMS320C6X to be a DSP, the architecture is
almost a copy of the Multiflow Trace [10]. Multi-gauge arithmetic (or
variable-width SIMD) is found in the family of MPACT architectures
from Chromatic [11] and the designs from MicroUnity [12]. Most of
the multimedia extensions of programmable processors also adopt this
architectural enhancement [14].
The arrival of production quality ILP compilers and commercial
DSPs with VLIW and SIMD architectures stimulated the idea of
custom-fit processors [15]. The premise of such an approach is
that applications differ from each other in exploitable measures, for
example the available ILP, demand on various hardware components
(e.g., cache memory units, register files) and the number of function
units. The presumption is that a microprocessor can be designed by
adding hardware components tailored to a specific application so that
it can execute the single application extremely well. Of course, an
obvious drawback of this approach is that it provides no guarantee
that other applications will run as well as the targeted application.
While the current microprocessors for media applications (mediaprocessors)
are claimed to target general applications in a domain [13], a
custom-fit processor targets a single application (although it remains
programmable).
We report on a method of system-level synthesis of single or multiple
application programmable processors. We use a benchmark suite consisting
of complete applications written in a high-level language [16].
We use the IMPACT tool suite [18] to collect performance measurements
of benchmarks on various machine configurations. The IMPACT
C compiler is a retargetable compiler with code optimization components
supporting multiple-instruction-issue processors. The target machine
is described using the high-level machine description language.
A high-level machine description supplied by a user is compiled by the
IMPACT machine description language compiler. IMPACT provides
cycle-level simulation tools.
1063–8210/01$10.00 © 2001 IEEE
TABLE I
MACHINE CONFIGURATION EXAMPLES AND THEIR AREA ESTIMATES (mm^2): A MACHINE CONFIGURATION CONSISTS OF: ISSUE WIDTH, NUMBER OF ALUS, NUMBER OF BRANCH UNITS, NUMBER OF MEMORY UNITS, SIZE OF INSTRUCTION CACHE (KB), SIZE OF DATA CACHE (KB)
TABLE II
APPLICATIONS USED IN THE EXPERIMENT. DYNAMIC INSTRUCTION COUNTS WERE MEASURED ON A SPARC
This paper is organized as follows. The next section briefly sur-
veys related works and summarizes the contributions of this work. Sec-
tion III presents the background materials including machine model,
benchmarks, experiment platform (such as tools), and an example set
of results obtained using the tools. Our approach in this project is ex-
plained in Section IV in detail. Section V formulates the search problem
defined in the previous section in formal terms. The solution space exploration
strategy and algorithm are described in Section VI. Extensive
experimental results are reported in Section VII. Finally, Section VIII
draws conclusions.
II. RELATED WORKS AND OUR CONTRIBUTIONS
The work on synthesis and evaluation of application-specific programmable
processors has been conducted independently in two research
communities, computer-aided design and architecture. There is,
however, a strong converging trend of the two areas due to recent technological
advances and application trends. In this section we survey the
related works in these two fields.
There have been a number of efforts related to the design of application-specific
programmable processors and application-specific
instruction sets. Comprehensive surveys of the works on computer-aided
design of application-specific programmable processors
have been conducted by Goossens [18], Paulin [19], and Marwedel
[20]. In particular, a great deal of effort has been made in combining
retargetable compilation technologies and design of instruction sets
[22]–[26]. Several research groups have published results on the topic
of selecting and designing instruction sets and processor architectures
for particular application domains [27], [28].
Early work in the area of processor architecture synthesis tended to
employ ad hoc methods on small code kernels, in large part due to the
lack of good retargetable compiler technology. Conte and Mangione-Smith
[29] presented one of the first efforts that focused on large application
codes (i.e., SPEC) written in a high-level language. While they
had a similar goal to ours, i.e., evaluating performance efficiency by including
hardware cost, their evaluation approach was substantially different.
Conte et al. [29] further refined this approach to consider power
consumption. Both of these efforts were limited by available compiler
technology and used a single application binary scheduled for a scalar
machine for execution on superscalar implementations. Fisher et al. [15]
studied the variability of application-specific VLIW processors using a
highly advanced and retargetable compiler. However, their study considered
small program kernels rather than complete applications. They
also focused on finding the best possible architecture for a specific application
or workload, rather than understanding the differences among
attractive architectures across a set of applications.
TABLE III
DATA SET USED IN THE EXPERIMENT
TABLE IV
RUN-TIME CHARACTERISTICS MEASURED ON HPPA-7100 USING IMPACT SUITES
We adopt a methodology of system synthesis combining the key
paradigms of both communities. Following the tradition of the CAD
community, we develop an accurate area estimate and aggressive
optimization algorithms. We follow the tradition of the architecture
community by using comprehensive real-life benchmarks and pro-
duction quality compilation and simulation tools. This combination
enables us to build a unique framework of system-level synthesis and
to gain valuable insights about design and use of application-specific
programmable processors for modern applications.
Unlike previous works, we use a set of complete applications written
in a high-level language as benchmarks. We incorporate the role of
cache memory units in machine performance into the machine model,
which is essential for producing meaningful results.

Fig. 1. Performance measurement flow using IMPACT tools.
TABLE V
AN EXAMPLE SET OF RESULTS
We focus on the number of machine configurations that should be
developed in order to maximize performance for all of the benchmarks
given an area constraint. We understand that it is in the best interest of
a processor designer to understand which architecture, number of
functional units, and cache size are best for one particular application.
However, our first goal is to develop a framework for managers to understand
how big the chip portfolio should be in one particular domain.
It is not intended for a single designer to find the best application-specific
system. The objective function of the optimizer is minimization
of the number of selected machine configurations, thereby maximizing the number
of benchmarks that can be run on a processor as though it is optimized
for each individual benchmark. In one extreme case, we end up with as
many machine configurations as the number of individual benchmarks.
On the other extreme, we need only one machine. Clearly, the most interesting
solutions lie somewhere in the middle.
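To make the objective concrete, the following is a minimal sketch of one way such a selection could be computed. It is not the optimizer used in this work, which is formulated in later sections; it substitutes the standard greedy set-cover heuristic. The function name `select_configs`, the dictionary layout, and the tolerance `tol` (how close to its best feasible performance a benchmark must run to count as "optimized") are all illustrative assumptions.

```python
def select_configs(perf, area, budget, tol=0.05):
    """Greedily pick few machine configurations that serve every benchmark.

    perf[bench][config] is a performance score (higher is better) and
    area[config] is the estimated chip area. Both layouts are hypothetical.
    """
    # Only configurations that fit the area constraint are candidates.
    feasible = [c for c in area if area[c] <= budget]
    if not feasible:
        raise ValueError("no configuration fits the area budget")

    # A configuration "covers" a benchmark when it runs the benchmark within
    # tol of the best feasible score (an assumed notion of "as though optimized").
    covers = {c: set() for c in feasible}
    for bench, scores in perf.items():
        best = max(scores[c] for c in feasible)
        for c in feasible:
            if scores[c] >= (1.0 - tol) * best:
                covers[c].add(bench)

    # Greedy set cover: repeatedly take the configuration that covers the
    # most still-uncovered benchmarks, preferring smaller area on ties.
    uncovered = set(perf)
    chosen = []
    while uncovered:
        pick = max(feasible, key=lambda c: (len(covers[c] & uncovered), -area[c]))
        chosen.append(pick)
        uncovered -= covers[pick]
    return chosen


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    area = {"small": 40, "mid": 60, "big": 120}
    perf = {
        "jpeg": {"small": 1.0, "mid": 1.9, "big": 2.5},
        "gsm": {"small": 1.5, "mid": 1.55, "big": 1.6},
        "mpeg": {"small": 0.8, "mid": 2.0, "big": 2.6},
    }
    print(select_configs(perf, area, budget=70))
```

With a loose tolerance the selection tends toward a single machine, and with a zero tolerance it tends toward one machine per benchmark, which mirrors the two extremes described in the text. The greedy heuristic does not guarantee a minimum-size selection.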
Power consumption evaluation and optimization is very often an
important aspect in multimedia processors; however, it is beyond the
scope of this paper. We have published a thorough investigation of
power consumption using a similar framework and tools in another paper
[30].
III. PRELIMINARY DISCUSSION
In this section we discuss the experimental environment that has been
adapted and developed for the investigation. First, we describe the machine
model used to estimate the area of a machine configuration. The
benchmark suite is introduced along with the characteristics of its components.
Finally, we explain the experimental platform, including tools
and their example outputs.
A. Machine Model
To estimate the cost of a machine configuration, we adopt a simple
model developed by Argyres [31]. Given the area of the issue unit, the
cost of any scalar machine configuration is a linear function of the numbers
of branch, memory, and arithmetic units. A machine may include
any number of each function unit. For a superscalar machine, the issue
unit area cannot be estimated using a simple linear model since it requires
more complex logic for runtime code scheduling. We assume
that the issue unit area will take $O(n^2)$ space since the complexity of
dependency checking algorithm is $O(n^2)$. When a VLIW machine is
considered, the issue unit area is known to be of complexity $O(n)$ or
sublinear. The cost function of arbitrarily configured superscalar machines
is given by
$$\mathrm{Area} = (A_B - A_E - A_C - A_U) + A_u + n_b A_b + n_m A_m + n_i A_i + A_{dc} + A_{ic} \tag{1}$$
where
$A_B$ area of a baseline machine;
$A_E$ area of a baseline machine execution unit;
$A_C$ baseline machine cache area;
$A_U$ baseline machine issue unit area;
$A_u$ issue unit area;
$n_b$ number of branch units;
$A_b$ area of branch units;
$n_m$ number of memory units;
$A_m$ area of memory units;
$n_i$ number of ALU;
$A_i$ area of an ALU;
$A_{dc}$ data cache area;
$A_{ic}$ instruction cache area.
TABLE VI
CONFIGURATIONS OF THE MACHINES USED IN TABLE V
Fig. 2. System model being simulated.
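As a hedged illustration of how equation (1) might be evaluated, the sketch below computes the area of one configuration from baseline component areas. The class and function names are hypothetical, the numeric values in the usage example are placeholders rather than measurements from this work, and the ratio-squared scaling of the issue unit is an assumption consistent with the $O(n^2)$ estimate, not a formula stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Baseline:
    """Baseline machine areas in mm^2 (placeholder values, not from the paper)."""
    total: float      # A_B
    execution: float  # A_E
    cache: float      # A_C
    issue: float      # A_U
    issue_width: int  # issue width of the baseline machine


def issue_unit_area(base: Baseline, width: int) -> float:
    """A_u: baseline issue unit scaled quadratically with issue width.

    The ratio-squared form is an assumption that follows the O(n^2) estimate.
    """
    return base.issue * (width / base.issue_width) ** 2


def machine_area(base: Baseline, width: int,
                 n_b: int, a_b: float,
                 n_m: int, a_m: float,
                 n_i: int, a_i: float,
                 a_dc: float, a_ic: float) -> float:
    """Evaluate equation (1) for one machine configuration."""
    # Baseline area left after removing execution units, caches, and issue unit.
    fixed = base.total - base.execution - base.cache - base.issue
    # Linear contribution of branch, memory, and ALU function units.
    units = n_b * a_b + n_m * a_m + n_i * a_i
    return fixed + issue_unit_area(base, width) + units + a_dc + a_ic


if __name__ == "__main__":
    base = Baseline(total=100.0, execution=30.0, cache=40.0,
                    issue=16.0, issue_width=4)
    print(machine_area(base, width=2, n_b=1, a_b=2.0, n_m=1, a_m=3.0,
                       n_i=2, a_i=5.0, a_dc=6.0, a_ic=7.0))
```

Keeping the issue unit in its own function makes it straightforward to swap in a linear or sublinear rule when modeling a VLIW machine instead of a superscalar one.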
Fig. 3. An example high-level machine description (HMDES).
The baseline architecture chosen for the analysis is the PowerPC 604
[32], a four-issue processor. The 604 has two simple integer ALUs,
one complex integer ALU, one floating-point unit, one branch unit, and
one memory unit. We assume that machine configurations that have an
issue unit smaller than the baseline machine have at least one complex
integer ALU. The area of the complex integer unit is assumed
to be half of the baseline integer unit (two simple integer units and
one complex integer unit). The area of the issue unit is scaled based on
the area complexity ($O(n^2)$). We did not include floating-point units
in any machine configurations because the benchmarks we used have
mostly integer (or fixed-point) operations. Finally, we scaled the area
for 0.35 μm technology rather than the original 0.5 μm technology used