Multi-processor System Design with ESPAM
Hristo Nikolov Todor Stefanov Ed Deprettere
Leiden Institute of Advanced Computer Science
Leiden University, The Netherlands
{nikolov,stefanov,edd}@liacs.nl
ABSTRACT
For modern embedded systems, the complexity of embedded appli-
cations has reached a point where the performance requirements of
these applications can no longer be supported by embedded system
architectures based on a single processor. Thus, the emerging em-
bedded System-on-Chip platforms are increasingly becoming mul-
tiprocessor architectures. As a consequence, two major problems
emerge, i.e., how to design and how to program such multiproces-
sor platforms in a systematic and automated way in order to reduce
the design time and to satisfy the performance needs of applications
executed on these platforms. Unfortunately, most of the current de-
sign methodologies and tools are based on Register Transfer Level
(RTL) descriptions, mostly created by hand. Such methodologies
are inadequate, because creating RTL descriptions of complex mul-
tiprocessor systems is error-prone and time consuming.
As an efficient solution to these two problems, in this paper we
propose a methodology and techniques implemented in a tool called
ESPAM for automated multiprocessor system design and implemen-
tation. ESPAM moves the design specification from RTL to a higher,
so called system level of abstraction. We explain how starting from
system level platform, application, and mapping specifications, a
multiprocessor platform is synthesized and programmed in a sys-
tematic and automated way. Furthermore, we present some results
obtained by applying our methodology and ESPAM tool to auto-
matically generate multiprocessor systems that execute a real-life
application, namely a Motion-JPEG encoder.
Categories and Subject Descriptors: J.6 [Computer-aided engineer-
ing]: Computer-aided design (CAD).
General Terms: Algorithms, Design, Experimentation.
Keywords: System-Level Design, Heterogeneous MPSoCs, Kahn Pro-
cess Networks.
1. INTRODUCTION
Moore’s law predicts exponential growth over time of the number
of transistors that can be integrated in a single chip. The intrinsic com-
putational power of a chip must not only be used efficiently and effec-
tively, but also the time and effort to design a system containing both
hardware and software must remain acceptable. Unfortunately, current
system design methodologies (including platform design and applica-
tion specification) are still based on Register Transfer Level (RTL)
platform/application descriptions created by hand using, for example,
VHDL and/or C. Such methodologies were effective in the past. How-
ever, applications and platforms used in many of today’s new system
designs are so complex that traditional design practices are now inad-
equate, because creating RTL descriptions of complex multiprocessor
systems is error-prone and time-consuming. Moreover, the complex-
ity of high-end, computationally intensive applications in the realm
of high throughput multimedia, imaging, and digital signal processing
exacerbates the difficulties associated with the traditional hand-coded
RTL design. Furthermore, using traditional logic simulation to verify
a large design represented in RTL is computationally expensive and
extremely slow.
1.1 Problem Description
For all the reasons stated above, we conclude that the use of a RTL
system specification as a starting point for multiprocessor system de-
sign methodologies is a bottleneck. Although the RTL system specifi-
cation has the advantage that the state of the art synthesis tools can use
it as an input to automatically implement a system, we believe that a
system should be specified at a higher level of abstraction called sys-
tem level. This is the only way to solve the problems caused by the
low level RTL specification. However, moving up from the detailed
RTL specification to a more abstract system level specification opens
a gap which we call Implementation Gap. Indeed, on the one hand,
the RTL system specification is very detailed and close to an imple-
mentation, thereby allowing an automated system synthesis path from
RTL specification to implementation. This is obvious if we consider
the current commercial synthesis tools where the RTL-to-netlist syn-
thesis is very well developed and efficient. On the other hand, the
complexity of today’s systems forces us to move to higher levels of
abstraction when designing a system, but currently we do not have
mature methodologies, techniques, and tools to move down from the
high-level system specification to an implementation. Therefore, the
Implementation Gap has to be closed by devising a systematic and
automated way to convert effectively and efficiently a system level
specification to a RTL level specification.
1.2 Paper Contributions
In this paper we present our tool ESPAM (Embedded System-level
Platform synthesis and Application Mapping) that implements our meth-
ods and techniques for systematic and automated multiprocessor plat-
form implementation and programming. They successfully bridge the
gap between the system level specification and the RTL level specifi-
cation which we consider as the main contribution of this paper. More
specifically, ESPAM allows a system designer to specify a multipro-
cessor system at a high level of abstraction in a short amount of time.
Then ESPAM refines this specification to a real implementation in a
systematic and automated way thereby successfully closing the imple-
mentation gap mentioned earlier. This reduces the design time from
months to hours. As a consequence, a very accurate exploration of the
performance of alternative multiprocessor platforms becomes feasible
at implementation level in a few hours.
The success of our methods and techniques in closing the imple-
mentation gap is based on the underlying application model and sys-
tem level platform model. ESPAM can implement data-flow domi-
nated (streaming) applications onto multiprocessor platform instances
efficiently and in an automated way. For the latter, a crucial role is
played by the Kahn Process Network (KPN) [1] model of computa-
tion which we use as an application model. Many researchers [2] [3]
[4] [5] [6] [7] have already indicated that KPNs are suitable for effi-
cient mapping onto multiprocessor platforms. In addition to that, by
carefully exploiting and efficiently implementing the simple commu-
nication and synchronization features of a KPN, we have identified and
developed a set of generic parameterized components which we call a
platform model. We consider this an important contribution of this
paper because our set of components (platform model) allows system
designers to specify (construct) very fast and easily many alternative
multiprocessor platforms that are systematically and automatically im-
plemented and programmed by our tool ESPAM.
1.3 Related Work
Systematic and automated application-to-architecture mapping has
been widely studied in the research community. The closest to our
work is the Compaan/Laura design flow [2]. It uses KPN specifi-
cations for automated mapping of applications targeting FPGA im-
plementations. The reported results are only for processor-coprocessor
architectures, whereas our ESPAM tool allows an automated implemen-
tation of KPN specifications onto multiprocessor platforms.
The Eclipse work [3] defines a scalable architecture template for
designing stream-oriented multiprocessor SoCs using the KPN model
of computation to specify and map data-dependent applications. The
Eclipse template is slightly more general than the templates presented
in this paper. However, the Eclipse work lacks an automated design
and implementation flow. In contrast, our work provides such automa-
tion starting from a high-level system specification.
In [8] a design flow for the generation of application-specific mul-
tiprocessor architectures is presented. This work is similar to our
approach in the sense that we also generate multiprocessor systems
based on instantiation of generic parameterized architecture compo-
nents where very efficient communication controllers are generated
automatically to connect processors to communication networks. How-
ever, many steps of the design flow in [8] are performed manually. As
a consequence a full implementation of a system with 4 processors
connected point-to-point takes around 33 hours. In contrast, our de-
sign flow is fully automated and a full implementation of a system
with 8 processors connected point-to-point or via crossbar or shared
bus takes around 2 hours.
A system level semantics for a system design process formaliza-
tion is presented in [9]. It enables design automation for synthesis
and verification to achieve a required design productivity gain. Using
Specification, Multiprocessing, and Architecture models, a translation
from behavior to structural descriptions is possible at a system level of
abstraction. Our approach is similar but in addition it defines and uses
application and platform models that allow an automated translation
from the system level to the RTL level of abstraction.
Companies such as Xilinx and Altera provide design tool chains at-
tempting to generate efficient implementations starting from descrip-
tions higher than (but still related to) the RTL level of abstraction. The
required input specifications are so detailed that designing a single
processor system is still error-prone and time consuming, let alone al-
ternative multiprocessor systems. In contrast, our design methodology
raises the design focus to an even higher level of abstraction allowing
the design and the programming of multiprocessor systems in a short
amount of time. Moreover, this does not sacrifice the possibility for
automatic and systematic design implementation because our ESPAM
tool supports it.
2. ESPAM DESIGN FLOW: OVERVIEW
In this section we give an overview of our system design methodol-
ogy which is centered around our ESPAM tool developed to close the
implementation gap described in Section 1.1. This is followed by a
description of our system level platform model and platform synthe-
sis in Section 3. In Section 4 we discuss the automated programming
of multiprocessor platforms, and in Section 5 we present some results
that we have obtained using ESPAM. Section 6 concludes the paper.
Our system design methodology is depicted as a design flow in Fig-
ure 1. There are three levels of specification in the flow. They are
SYSTEM-LEVEL specification, RTL-LEVEL specification, and GATE-
LEVEL specification. The SYSTEM-LEVEL specification consists of
three parts: 1) Platform Specification describing the topology of a
platform using our system level platform model, i.e., using generic
parameterized system components; 2) Application Specification de-
scribing an application as a Kahn Process Network (KPN), i.e., net-
work of concurrent processes communicating via FIFO channels. The
KPN specification reveals the task-level parallelism available in the
application; 3) Mapping Specification describing the relation between
all processes and FIFO channels in Application Specification and all
components in Platform Specification.
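As an illustration of the third part, a minimal sketch of what a mapping
specification could look like is given below. The element and attribute
names are chosen for this example only and are not taken literally from
the ESPAM input format; the sketch simply expresses the mapping shown in
Figure 1, where processor P1 executes process A and processor P2 executes
processes B and C.

  <mapping name="myMapping">
    <processor name="P1">
      <process name="A"/>
    </processor>
    <processor name="P2">
      <process name="B"/>
      <process name="C"/>
    </processor>
  </mapping>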
[Figure 1: ESPAM System Design Flow. The SYSTEM-LEVEL specification
(Platform Specification, Application Specification given as a KPN, and
Mapping Specification) is the input of the ESPAM tool, which, together
with a Library of IP Cores, produces the RTL-LEVEL specification: a
platform topology description, HW descriptions of IP cores, program code
for the processors, and auxiliary information. A commercial synthesizer
and compiler then turn the RTL-LEVEL specification into the GATE-LEVEL
netlist that is implemented on the chip. In the depicted example a
crossbar (CB) connects processors P1 and P2; process A is mapped onto P1,
and processes B and C onto P2.]
The SYSTEM-LEVEL specification is given as input to ESPAM. First,
ESPAM constructs a platform instance following the platform specifi-
cation and runs a consistency check on that instance. The platform
instance is an abstract model of a multiprocessor platform because at
this stage no information about the target physical platform is taken
into account. The model defines only the key system components of
the platform and their attributes. Second, ESPAM refines the abstract
platform model to an elaborate (detailed) parameterized RTL model
which is ready for an implementation on a target physical platform.
We call this refinement process platform synthesis. The refined system
components are instantiated by setting their parameters based on the
target physical platform features. Finally, ESPAM generates program
code for each processor in the multiprocessor platform in accordance
with the application and mapping specifications.
The output of ESPAM, namely a RTL-LEVEL specification of a
multiprocessor system, is a model that can adequately abstract and ex-
ploit the key features of a target physical platform at the register trans-
fer level. It consists of four parts: 1) Platform topology description
defining in greater detail the multiprocessor platform; 2) Hardware
descriptions of IP cores containing predefined and custom IP cores
used in 1). ESPAM selects predefined IP cores (processors, memories,
etc.) from the Library of IP Cores, see Figure 1. Also, it generates cus-
tom IP cores needed as glue/interface logic between components in
the platform; 3) Program code for processors to execute the appli-
cation on the synthesized multiprocessor platform; ESPAM generates
program source code files for each processor in the platform. 4) Auxil-
iary information containing files which give tight control on the overall
specifications, such as defining precise timing requirements and prior-
itizing signal constraints.
With the descriptions above, a commercial synthesizer can convert
a RTL-LEVEL specification to a GATE-LEVEL specification, thereby
generating the target platform gate level netlist, see the bottom part
of Figure 1. This GATE-LEVEL specification is actually the system
implementation. The current prototype version of ESPAM facilitates
automated multiprocessor platform synthesis and programming using
Xilinx VirtexII-Pro FPGAs. ESPAM uses the Xilinx Platform Studio
(XPS) tool as a back-end to generate the final bit-stream file that con-
figures a specific FPGA. We use the FPGA platform technology for
prototyping purposes only. Our ESPAM is general and flexible enough
to be targeted to other physical platform technologies. A real-life
industrially-relevant application, namely Motion-JPEG encoder, has
been fully implemented onto several alternative multiprocessor plat-
forms by using the ESPAM and XPS design tools.
3. PLATFORM MODEL AND SYNTHESIS
In our design methodology, the platform model is a library of generic
parameterized components. In order to support systematic and auto-
mated synthesis of multiprocessor platforms we have carefully identi-
fied and developed a set of computation and communication compo-
nents. In this section we give a detailed description of our approach
to build a multiprocessor platform. The platform model contains Pro-
cessing components, Memory components, Communication compo-
nents, Communication Controllers, and Links. Memory components
are used to specify the processors’ local program and data memories
and to specify data communication storages (buffers) between proces-
sors. Further we will call the data communication storages Commu-
nication Memories. We have developed a point-to-point network, a
crossbar switch, and a shared bus component with several arbitration
schemes (Round-Robin, Fixed Priority, and TDMA). These Commu-
nication components determine the communication network topology
of a multiprocessor platform. The Communication controller imple-
ments an interface between processing, memory, and communication
components. Links are used to connect any two components in our
system level platform model.
Using the components described above, a system designer can con-
struct many alternative platforms easily, simply by connecting pro-
cessing, memory, and communication components. We have devel-
oped a general approach to connect and synchronize programmable
processors of arbitrary types via a communication component. Our
approach is explained below using an example of a multiprocessor
platform. The system level specification of the platform is depicted
in Figure 2a. This specification, written in XML format, consists of
three parts which define the processing components (four processors,
lines 2-5), the communication component (a crossbar, lines 7-12), and
the links (lines 14-29).

a) Platform Specification (XML):

 1  <platform name="myPlatform">
 2    <processor name="uP1"> <port name="IO1"/> </processor>
 3    <processor name="uP2"> <port name="IO1"/> </processor>
 4    <processor name="uP3"> <port name="IO1"/> </processor>
 5    <processor name="uP4"> <port name="IO1"/> </processor>
 6
 7    <network name="CB" type="Crossbar">
 8      <port name="IO1"/>
 9      <port name="IO2"/>
10      <port name="IO3"/>
11      <port name="IO4"/>
12    </network>
13
14    <link name="BUS1">
15      <resource name="uP1"> <port name="IO1"/> </resource>
16      <resource name="CB">  <port name="IO1"/> </resource>
17    </link>
18    <link name="BUS2">
19      <resource name="uP2"> <port name="IO1"/> </resource>
20      <resource name="CB">  <port name="IO2"/> </resource>
21    </link>
22    <link name="BUS3">
23      <resource name="uP3"> <port name="IO1"/> </resource>
24      <resource name="CB">  <port name="IO3"/> </resource>
25    </link>
26    <link name="BUS4">
27      <resource name="uP4"> <port name="IO1"/> </resource>
28      <resource name="CB">  <port name="IO4"/> </resource>
29    </link>
30  </platform>

b) The elaborate platform: each processor uP1..uP4 with its memory
controller (MC1..MC4), program and data memory (MEM1..MEM4),
communication controller (CC1..CC4), and communication memory
(CM1..CM4), all connected to the crossbar CB.

Figure 2: Example of a Multiprocessor Platform.

The links specify the connections of the processors to
the communication component. To guarantee correct-by-construction
automated platform synthesis and implementation, our ESPAM tool
runs a consistency check on each platform specified by a designer.
This includes finding impossible and/or meaningless connections be-
tween system level platform components as well as parameter values
that are out of range. Notice that in the specification a designer does
not have to take care of memory structures, interface controllers, and
communication and synchronization protocols. Our ESPAM tool takes
care of this in the platform synthesis as follows. First, the tool instan-
tiates the processing and the communication components. Second,
it automatically attaches memories and memory controllers (MCs) to
each processor. Third, the tool automatically synthesizes, instantiates,
and connects all necessary communication memories (CMs) and com-
munication controllers (CCs) to allow efficient and safe (lossless) data
communication and synchronization between the components.
The elaborate platform generated by ESPAM is shown in Figure 2b.
The processors (uPs) transfer data between each other through the
CMs. A communication controller connects a communication mem-
ory to the data bus of the processor it belongs to and to a communi-
cation component. Since every programmable processor has a data
bus, processors of different types can easily be connected into a het-
erogeneous multiprocessor platform by using our CCs. Each CC im-
plements the processor’s local bus-based access protocol to the CM
for write operations and the access to the communication component
(CB) for read operations. Each CM is organized as one or more FIFO
buffers. We have chosen such an organization because the inter-processor
synchronization in the platform can be implemented in a very sim-
ple and efficient way by blocking read/write operations on empty/full
FIFO buffers located in the communication memory. As a result,
memory contention is avoided.
KPNs assume unbounded communication buffers. Writing is al-
ways possible and thus a process blocks only on reading from an
empty FIFO. In the physical implementation, however, the communi-
cation buffers have bounded sizes and therefore a blocking write syn-
chronization mechanism is needed as well. At the same time, we want
to stay as close as possible to the KPN semantics because it guaran-
tees the highest possible communication performance. Therefore, in
our approach each processor writes only to its local communication
memory (CM) and uses the communication component only to read
data from all other communication memories. This means that a pro-
cessor can always write if there is room in its local CM (if this CM is
large enough, the processor may never block on writing). A processor
blocks when reading other processors’ CMs if data is not available or
the communication resource is currently not available.
3.1 Processing Components
In our approach we do not propose a design of processing com-
ponents. Instead, we use IP cores developed by third parties. Cur-
rently, for fast prototyping in order to validate our approach, we use
the Xilinx VirtexII-Pro FPGA technology. Therefore, our library pro-
cessing components include two programmable processors, namely
MicroBlaze (MB) and PowerPC (PPC). Our platform model is
general enough to be extended easily with additional (processing) com-
ponents. Notice that only the processing components in our plat-
form model are related to a particular technology (currently to Xilinx
VirtexII-Pro FPGAs). All other components discussed in this section
are technology independent.
3.2 Communication Memory Components
We implement the communication memories of a processor by us-
ing dual-port memories. Logically, a communication memory (CM) is
organized as one or more FIFO buffers. A FIFO buffer in a CM is seen
by a processor as two memory locations in its address space. A proces-
sor uses the first location to read/write data from/to the FIFO buffer,
thereby realizing inter-processor data transfer. The second location
is used to read the status of the FIFO. The status indicates whether a
FIFO is full (data cannot be written) or empty (data is not available).
This information is used for the inter-processor synchronization. The
multi-FIFO behavior of a communication memory is implemented by
the communication controller described below. However, if a commu-
nication memory contains only one FIFO, we use a dedicated FIFO
component which simplifies the structure of the communication con-
troller.
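To make this memory map concrete, the following C fragment sketches how a
processor-side program could access one FIFO through these two locations.
The base addresses are invented for the example (the real addresses belong
to the memory map generated by ESPAM), and the fragment is only a
single-word variant of the read/write primitives shown later in Figure 5b.

  /* Hypothetical addresses of two FIFO buffers in the processor's address
     space; the actual memory map is generated by ESPAM. */
  volatile int *out_data   = (volatile int *) 0xE0000000; /* location 1: write data */
  volatile int *out_status = (volatile int *) 0xE0000004; /* location 2: full flag  */
  volatile int *in_data    = (volatile int *) 0xE0000008; /* location 1: read data  */
  volatile int *in_status  = (volatile int *) 0xE000000C; /* location 2: empty flag */

  void send_word(int word) {
      while (*out_status) { }   /* block while the FIFO is full  */
      *out_data = word;         /* inter-processor data transfer */
  }

  int receive_word(void) {
      while (*in_status) { }    /* block while the FIFO is empty */
      return *in_data;
  }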
3.3 Communication Controller
The structure of the Communication Controller (CC) is shown in
Figure 3. It consists of two blocks, namely Interface Unit and FIFOs
Unit. The Interface Unit contains an address decoder, fifos’ control
logic, and logic to generate read requests to the communication com-
ponent.

[Figure 3: Communication Controller. The CC consists of two blocks: the
Interface Unit on the processor side, with the address decoder, the FIFO
control logic, and the logic that issues read requests to the
communication component, and the FIFOs Unit on the communication memory
side, with separate read and write logic, FIFO select, and the Empty/Full
status signals.]

When a processor has to write data to its local Communication Memory
(CM), it first checks if there is room in the corresponding
FIFO by reading its status. If the FIFO is full, the processor blocks.
Otherwise, it sends the data to the CC. The Interface Unit decodes the
FIFO address sent by the processor along with the data and generates
control signals (select FIFO, write data, or read status) to the write
logic of the FIFOs Unit. The latter implements the multi-FIFO behav-
ior. For each FIFO buffer the FIFOs Unit contains read and write coun-
ters that indicate the read and write positions into the buffer. These
counters are used as read/write address generators and their values are
used for determining the empty/full status of a FIFO. The FIFOs Unit
also includes a memory interface logic that realizes the access to the
Communication Memory (CM) connected to the CC (bottom part of
Figure 3). Notice that since we use dual-port memories and read and
write logic are separated, a FIFO in a CM can be accessed for read and
write operations simultaneously by different processors, or two FIFOs
in a CM can be accessed at the same time, one for a read operation and
one for a write operation.
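The following C fragment is a small behavioral model of this counter-based
status logic, written only to illustrate how the read and write counters
determine the empty/full flags and the buffer addresses; the buffer depth
and the free-running counter scheme are assumptions of the example, not a
description of the actual FIFOs Unit implementation.

  #define FIFO_DEPTH 64u  /* assumed buffer depth in words */

  typedef struct {
      unsigned rd_cnt;    /* read counter:  position of the next read  */
      unsigned wr_cnt;    /* write counter: position of the next write */
  } fifo_counters;

  /* Number of words currently stored in the FIFO. */
  static unsigned fifo_fill(const fifo_counters *f) {
      return f->wr_cnt - f->rd_cnt;   /* unsigned difference handles wrap-around */
  }

  static int fifo_empty(const fifo_counters *f) { return fifo_fill(f) == 0; }
  static int fifo_full (const fifo_counters *f) { return fifo_fill(f) == FIFO_DEPTH; }

  /* The low-order part of each counter is used as the address into the buffer. */
  static unsigned fifo_rd_addr(const fifo_counters *f) { return f->rd_cnt % FIFO_DEPTH; }
  static unsigned fifo_wr_addr(const fifo_counters *f) { return f->wr_cnt % FIFO_DEPTH; }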
Recall that a processor can access FIFOs located in other proces-
sors’ CMs via a communication component for read operations only.
First, the processor checks if there is any data in the FIFO the proces-
sor wants to read from. When a processor checks for data, the Interface
Unit sends a request to the communication component for granting a
connection to the CM in which the FIFO is located. A connection is
granted only if a communication line is available and there is data in
the FIFO. If a connection is not granted, the processor blocks until a
connection is granted. When a connection is granted, the CC connects
the data bus of the communication component (the upper part of the
communication component side in Figure 3) to the data bus of the pro-
cessor and the processor reads the data from the CM where the FIFO is
located. After the data is read the connection has to be released. This
allows other processors to access the same CM. When data is read
from a FIFO of a CM, the signals to the read logic of the FIFOs Unit
(FIFO Sel and Read) are generated by the communication component
(the bottom part of the communication component side in Figure 3) as
a response to a request from another CC.
The described blocking mechanism for accessing the CMs has to be
done in the processors. The blocking can be realized in hardware (usu-
ally processors have dedicated embedded hardware to stall the pro-
cessor) or in software by executing empty loops. We use the latter
approach because it is more general. Different processors are stalled
in hardware in different ways and therefore our CC would have to be
aware of many possibilities. This would result in a more complex and
less generic controller. Realizing the blocking mechanism in software
makes the controller more generic, thereby simplifying the integration
of different types of processors into a multiprocessor system.
3.4 Crossbar Communication Component
In this subsection we present the implementation of our Crossbar
communication component. Our general approach to connect proces-
sors that communicate data through communication memories (CM)
with FIFO organization allows the crossbar structure to be very sim-
ple. This results in a smaller crossbar with a reduced number of com-
munication and routing resources, which reduces the design area
and power consumption. The structure of our crossbar component
consists of two main parts, crossbar switch (CBS) and crossbar con-
troller (CBC). The CBS implements uni-directional connections be-
tween communication memories and processors; recall that a proces-
sor uses a communication component only to read data. Due to the
uni-directional communications and the FIFO organization of CMs,
the number of signals and buses that have to be switched by our cross-
bar is greatly reduced. Since the addresses for accessing CMs are gener-
ated locally by the CCs, address busses are not switched through the
crossbar. The crossbar switches 32-bit data buses in one direction and
two control signals per bus. These control signals are the Read strobe
and the Empty status flag for a FIFO.
The requests for granting a connection generated by the CCs are
processed by the crossbar controller (CBC) using a Round-Robin policy.
When a request for granting a connection arrives, the CBC checks in its request
table whether the required connection is available at the moment. The
request table contains information about the status (available or not
available) of all connections. The table is updated each time a connec-
tion is granted or released.
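A simplified behavioral model of this arbitration is sketched below in C,
only to illustrate the round-robin scan over pending requests and the role
of the request table; the real CBC is a hardware block, and details such
as the data-availability check described in Section 3.3 are omitted here.

  #define NUM_PORTS 4                   /* assumed number of crossbar ports */

  static int request[NUM_PORTS];        /* per processor: index of the CM it wants to read, or -1 */
  static int cm_busy[NUM_PORTS];        /* request table: 1 if a connection to that CM is in use  */
  static int last = NUM_PORTS - 1;      /* round-robin pointer */

  /* Grant at most one pending request per call, scanning in round-robin
     order; returns the granted port, or -1 if nothing could be granted. */
  int cbc_arbitrate(void) {
      for (int i = 1; i <= NUM_PORTS; i++) {
          int port = (last + i) % NUM_PORTS;
          int cm   = request[port];
          if (cm >= 0 && !cm_busy[cm]) {
              cm_busy[cm]   = 1;        /* update the request table on grant */
              request[port] = -1;
              last = port;
              return port;
          }
      }
      return -1;
  }

  void cbc_release(int cm) {
      cm_busy[cm] = 0;                  /* update the request table on release */
  }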
3.5 Point-to-Point Network
In this section we describe how we implement a point-to-point com-
munication in our platforms. In point-to-point networks the topology
of the platform (the number of processors and the number of direct
connections between the processors) is the same as the topology of the
process network. Since there is no communication component such as
a crossbar or a bus, there are no requests for granting connections and
there is no sharing of communication resources. Therefore, no addi-
tional communication delay is introduced in the platform. Because of
this, the highest possible communication performance can be achieved
in such multiprocessor platforms.

[Figure 4: Point-to-Point Architecture. Three processors (uP1, uP2, uP3),
each with a memory controller (MC), program and data memory (MEM), and a
communication controller (CC), are connected directly through
communication memories (CM) that hold the FIFO channels CH1, CH2, and CH3
of the KPN in Figure 5.]

Under the conditions that each com-
munication memory (CM) contains only one channel and each proces-
sor writes data only to its local CM (in compliance with our concept),
our ESPAM tool synthesizes a point-to-point network in the following
automated way. First, for each process in the KPN, ESPAM instanti-
ates a processor together with a communication controller (CC). Then,
ESPAM finds all the channels which the process writes to. For each
found channel the tool instantiates a CM and assigns the channel to
this CM. Finally, ESPAM connects the memory to the already instan-
tiated processor. In Figure 4 we give an example of a point-to-point
multiprocessor platform generated by ESPAM. Assume that the mul-
tiprocessor platform has to implement the KPN depicted in the top of
Figure 5 and each process is executed on a separate processor. There
are three channels that have to be assigned to three CMs. Follow-
ing the procedure above, ESPAM finds that CH1 and CH2 are written
by process A, see the top part of Figure 5. Process A is assigned
to be executed on processor uP1; therefore, the CMs corresponding to
CH1 and CH2 are instantiated and connected to uP1. Similarly, a CM
corresponding to CH3 is instantiated and connected to processor uP2.
Process C is assigned to processor uP3 and since process C only reads
data from CH1 and CH3 no more CMs are instantiated. Processor
uP3 is simply connected to the already instantiated CMs correspond-
ing to CH1 and CH3. Notice that in Figure 4, a CC is connected to
more than one CM. As we mentioned in Section 3.2, if a CM contains
only one FIFO, it is implemented by a dedicated FIFO component.
Therefore, to connect one or more FIFOs to a processor in the case of
point-to-point network, we use a very simplified version of our com-
munication controller (CC) described in Section 3.3. The simplified
CC only translates the processor data bus signals to FIFO input/output
signals. The CC is parameterized and it supports up to 128 FIFOs for
read and write operations.
4. AUTOMATED PROGRAMMING
Application Specification
The first step to program multiprocessor systems in our ESPAM de-
sign methodology is the partitioning of an application into concurrent
tasks where the inter-task communication and synchronization is ex-
plicitly specified in each task. The partitioning of an application into
concurrent tasks can be done by hand or automatically [2, 10] and it
allows each task or group of tasks to be compiled separately by a stan-
dard compiler in order to generate an executable code for each proces-
sor in the platform. The result of the partitioning done by the tools is
an XML description of a Kahn Process Network (KPN) as an Approx-
imated Dependence Graph (ADG) data structure [11]. It is a compact
mathematical representation of the process network in terms of poly-
hedra. This allows formal operations to be defined and applied [11] on
the KPN in order to generate an efficient code for the processors.

[Figure 5: Kahn Process Network Example. The top of the figure shows the
KPN: process A writes to channels CH1 (read by C) and CH2 (read by B),
and process B writes to channel CH3 (read by C).]

a) XML specification of the KPN (excerpt for process B and channel CH2):

 1  <process name="B">
 2    <port name="p2" direction="in">
 3      <var name="in_0" type="myType"/>
 4    </port>
 5    <port name="p1" direction="out">
 6      <var name="out_0" type="myType"/>
 7    </port>
 8    <process_code name="compute">
 9      <arg name="in_0"  type="input"/>
10      <arg name="out_0" type="output"/>
11      <loop index="k" parameter="N">
12        <loop_bounds matrix="[1, 1,0,-2;
13                              1,-1,2,-1]"/>
14        <par_bounds  matrix="[1,0,-1,384;
15                              1,0, 1, -3]"/>
16      </loop>
17    </process_code>
18  </process>
19  . . .
20  <channel name="CH2">
21    <fromProcess name="A"/>
22    <fromPort    name="p1"/>
23    <toProcess   name="B"/>
24    <toPort      name="p2"/>
25  </channel>

b) Program code generated by ESPAM for process B:

 1  void main() {
 2    for( int k=2; k<=2*N-1; k++ ) {
 3      read( p2, in_0, sizeof(myType) );
 4      compute( in_0, out_0 );
 5      write( p1, out_0, sizeof(myType) );
 6    }
 7  }
 8
 9  void read( byte *port, void *data, int length ) {
10    int *isEmpty = port + 1;
11    // reading is blocked if a FIFO is empty
12    for( int i=0; i<length; i++ ) {
13      while( *isEmpty ) { }
14      ((byte *) data)[i] = *port;  // read data from a FIFO
15    }
16  }
17
18  void write( byte *port, void *data, int length ) {
19    int *isFull = port + 1;
20    // writing is blocked if a FIFO is full
21    for( int i=0; i<length; i++ ) {
22      while( *isFull ) { }
23      *port = ((byte *) data)[i];  // write data to a FIFO
24    }
25  }

A simple example of a KPN is shown in Figure 5. Three processes (A,
B, and C) are connected through three FIFO channels (CH1, CH2, and
CH3). For the sake of clarity, in Figure 5a, we show the XML descrip-
tion only for one process (B). Process B has one input port and one
output port defined in lines 2-7. In our example, process B executes
a function called compute (line 8). The function has one input argu-
ment (line 9) and one output argument (line 10). The relation between
the function arguments and the ports of the process is given in lines 3
and 6. The function has to be executed 2N-2 times, as specified by the
polytope in lines 12-13. The value of N is between 3 and 384
(lines 14-15). Lines 20-25 show an example of how the topology of a
KPN is specified: CH2 connects processes A and B through ports p1
and p2.
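For readers unfamiliar with the matrix notation in lines 12-15, the
following reading is consistent with the numbers quoted above; it assumes
that each row [1, a_k, a_N, c] of a bounds matrix encodes the inequality
a_k*k + a_N*N + c >= 0, an assumption made only to illustrate the example:

\[ k - 2 \ge 0,\qquad -k + 2N - 1 \ge 0 \quad\Longrightarrow\quad 2 \le k \le 2N - 1 \quad (\text{loop\_bounds}), \]
\[ 384 - N \ge 0,\qquad N - 3 \ge 0 \quad\Longrightarrow\quad 3 \le N \le 384 \quad (\text{par\_bounds}), \]

so the iteration domain of compute contains exactly 2N-2 points, which
matches the execution count stated above.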
Code Generation
ESPAM takes the XML specification of an application, applies some
operations [11] on it and automatically generates software (C/C++)
code for each processor. The code contains the main behavior of a pro-
cess, together with the blocking read/write synchronization primitives
and the memory map of the system. The C code generated by ESPAM
for process B is shown in Figure 5b. In accordance with the XML ap-
plication specification, a for loop is generated in the main function of
process B (lines 2-6) to execute function compute 2N-2 times.
The C/C++ code implementing function compute has to be provided
by the designer. The function uses local variables in_0 and out_0.
For simplicity, the declaration of the local variables is not shown in
the figure. ESPAM inserts a read primitive to read from CH2, initial-
izing variable in_0, and a write primitive to send the results (the value
of variable out_0) to CH3 (Figure 5b, lines 3 and 5).
synchronization read/write primitives, shown in the same figure, is au-
tomatically generated by E
SPAM as well. Each primitive has 3 param-
eters. Parameter
port is the address of the memory location through
which a processor can access a given FIFO channel. Parameter
data
is a pointer to a local variable and leng th specifies the amount of data
(in bytes) to be moved from/to the local variable to/from the chan-
nel. The primitives implement the blocking synchronization mecha-
nism between the processors in the following way. First, the status of
a channel that has to be read/written is checked. A channel status is
accessed using the locations defined in lines 10 and 19. The block-
ing is implemented by
while loops with empty bodies in lines 13 and
22. Each empty loop iterates (does nothing) while a channel is full or
empty. Then, in lines 14 and 23 the actual data transfer is done.
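The memory map itself is not shown in Figure 5; purely as an
illustration, it could take the form of a generated header that assigns
each port an address in the processor's address space. The names and
addresses below are invented for this sketch and are not actual ESPAM
output.

  /* Hypothetical memory map for the processor executing process B;
     all names and addresses are invented for this illustration. */
  typedef unsigned char byte;
  #define p2 ((byte *) 0x80000000)   /* FIFO of channel CH2, read by process B    */
  #define p1 ((byte *) 0x80000008)   /* FIFO of channel CH3, written by process B */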
5. EXPERIMENTS AND RESULTS
In this section we present some of the results we have obtained by
implementing and executing a Motion JPEG (M-JPEG) encoder ap-
plication onto several multiprocessor platform instances using our ESPAM
system design flow presented in Section 2. The main objective
of this experiment is to show that our design flow successfully closes
the implementation gap between the System and RTL abstraction lev-
els of description, as well as to show that, using the ESPAM tool, a very
accurate exploration of the performance of alternative multiprocessor
platforms based on real implementations becomes feasible since the
design time is reduced significantly. For the implementations we used
a prototyping board with one Xilinx FPGA.
Design Time
In Table 1 we show the processing times of each step in the design
flow for the implementation of one platform instance. As described
in Section 2, the inputs to our system design flow are the Application,
Platform, and Mapping Specifications. The Application Specification
has to represent the M-JPEG application as a KPN. For a certain class
of applications the generation of KPNs is automated by the translator
tools presented in [2, 10]. We started with the M-JPEG application
given as a sequential C program. With small modifications we struc-
tured the C code in order to comply with the input requirements of
the translators. Then we derived a KPN specification automatically. It
took us about half an hour to modify the C code and just 22 seconds to
derive the KPN specification.

Table 1: Processing Times (hh:mm:ss).

                 KPN          System Level to    Physical      Manual
                 Derivation   RTL conversion     Implement.    Modific.
  Translators    00:00:22     -                  -             00:30:00
  ESPAM tool     -            00:00:24           -             00:10:00
  XPS tool       -            -                  02:09:00      -

Notice that this is a one-time effort only
because in the implementation of each new platform the same KPN
specification is used. For each platform we wrote the Platform and
Mapping Specifications by hand in approximately 10 minutes. This is
a very simple task because our specifications are at a high system level
of abstraction (not RTL level). Having all three system level speci-
fications, our ESPAM tool converts them to RTL level specifications
within half a minute. The generated specifications are close to an im-
plementation and are automatically imported to the Xilinx Platform
Studio (XPS) tool for physical implementation, i.e., mapping, place,
and route onto our prototyping FPGA. Table 1 shows that it took the
XPS tool more than 2 hours for the physical implementation. The
reported time is for a platform instance containing 8 MicroBlaze
processors. However, in the case of 2 processors, XPS needs only 20 min-
utes. All tools run on a Pentium IV machine at 1.8GHz with 1GB of
RAM.
The figures in Table 1 clearly show that a complete implementation
of a multiprocessor system starting from high abstraction system level
specifications can be obtained in about 2 hours using our ESPAM tool
together with the translators [2, 10] and the commercial XPS tool. So,
a significant reduction of design time is achieved. This allows us to
explore the performance of 16 platforms using real system implemen-
tations. We implemented and ran the M-JPEG application on 16 alter-
native platforms using different Mapping and Platform Specifications
in a very short amount of time, approximately 2 days.
References
Gilles Kahn. The Semantics of a Simple Language for Parallel Programming.
A systematic approach to exploring embedded system architectures at multiple abstraction levels.
System Design Using Kahn Process Networks: The Compaan/Laura Approach.
Automatic generation of application-specific architectures for heterogeneous multiprocessor system-on-chip.
Guaranteeing the quality of services in networks on chip.