Multi-processor System Design with ESPAM
Hristo Nikolov Todor Stefanov Ed Deprettere
Leiden Institute of Advanced Computer Science
Leiden University, The Netherlands
{nikolov,stefanov,edd}@liacs.nl
ABSTRACT
For modern embedded systems, the complexity of embedded appli-
cations has reached a point where the performance requirements of
these applications can no longer be supported by embedded system
architectures based on a single processor. Thus, the emerging em-
bedded System-on-Chip platforms are increasingly becoming mul-
tiprocessor architectures. As a consequence, two major problems
emerge, i.e., how to design and how to program such multiproces-
sor platforms in a systematic and automated way in order to reduce
the design time and to satisfy the performance needs of applications
executed on these platforms. Unfortunately, most of the current de-
sign methodologies and tools are based on Register Transfer Level
(RTL) descriptions, mostly created by hand. Such methodologies
are inadequate, because creating RTL descriptions of complex mul-
tiprocessor systems is error-prone and time consuming.
As an efficient solution to these two problems, in this paper we
propose a methodology and techniques implemented in a tool called
ESPAM for automated multiprocessor system design and implemen-
tation. ESPAM moves the design specification from RTL to a higher,
so called system level of abstraction. We explain how starting from
system level platform, application, and mapping specifications, a
multiprocessor platform is synthesized and programmed in a sys-
tematic and automated way. Furthermore, we present some results
obtained by applying our methodology and ESPAM tool to auto-
matically generate multiprocessor systems that execute a real-life
application, namely a Motion-JPEG encoder.
Categories and Subject Descriptors: J.6 [Computer-aided engineer-
ing]: Computer-aided design (CAD).
General Terms: Algorithms, Design, Experimentation.
Keywords: System-Level Design, Heterogeneous MPSoCs, Kahn Pro-
cess Networks.
1. INTRODUCTION
Moore’s law predicts exponential growth over time of the number
of transistors that can be integrated in a single chip. The intrinsic com-
putational power of a chip must not only be used efficiently and effec-
tively, but also the time and effort to design a system containing both
hardware and software must remain acceptable. Unfortunately, current
system design methodologies (including platform design and applica-
tion specification) are still based on Register Transfer Level (RTL)
platform/application descriptions created by hand using, for example,
VHDL and/or C. Such methodologies were effective in the past. How-
ever, applications and platforms used in many of today’s new system
designs are so complex that traditional design practices are now inad-
equate, because creating RTL descriptions of complex multiprocessor
systems is error-prone and time-consuming. Moreover, the complex-
ity of high-end, computationally intensive applications in the realm
of high throughput multimedia, imaging, and digital signal processing
exacerbates the difficulties associated with the traditional hand-coded
RTL design. Furthermore, using traditional logic simulation to verify
a large design represented in RTL is computationally expensive and
extremely slow.
1.1 Problem Description
For all the reasons stated above, we conclude that the use of a RTL
system specification as a starting point for multiprocessor system de-
sign methodologies is a bottleneck. Although the RTL system specifi-
cation has the advantage that the state of the art synthesis tools can use
it as an input to automatically implement a system, we believe that a
system should be specified at a higher level of abstraction called sys-
tem level. This is the only way to solve the problems caused by the
low level RTL specification. However, moving up from the detailed
RTL specification to a more abstract system level specification opens
a gap which we call Implementation Gap. Indeed, on the one hand,
the RTL system specification is very detailed and close to an imple-
mentation, thereby allowing an automated system synthesis path from
RTL specification to implementation. This is obvious if we consider
the current commercial synthesis tools where the RTL-to-netlist syn-
thesis is very well developed and efficient. On the other hand, the
complexity of today’s systems forces us to move to higher levels of
abstraction when designing a system, but currently we do not have
mature methodologies, techniques, and tools to move down from the
high-level system specification to an implementation. Therefore, the
Implementation Gap has to be closed by devising a systematic and
automated way to convert effectively and efficiently a system level
specification to a RTL level specification.
1.2 Paper Contributions
In this paper we present our tool ESPAM (Embedded System-level
Platform synthesis and Application Mapping) that implements our meth-
ods and techniques for systematic and automated multiprocessor plat-
form implementation and programming. They successfully bridge the
gap between the system level specification and the RTL level specifi-
cation which we consider as the main contribution of this paper. More
specifically, ESPAM allows a system designer to specify a multipro-
cessor system at a high level of abstraction in a short amount of time.
Then ESPAM refines this specification to a real implementation in a
systematic and automated way thereby successfully closing the imple-
mentation gap mentioned earlier. This reduces the design time from
months to hours. As a consequence, a very accurate exploration of the
performance of alternative multiprocessor platforms becomes feasible
at implementation level in a few hours.
The success of our methods and techniques in closing the imple-
mentation gap is based on the underlying application model and sys-
tem level platform model. ESPAM can implement data-flow domi-
nated (streaming) applications onto multiprocessor platform instances
efficiently and in an automated way. For the latter, a crucial role is
played by the Kahn Process Network (KPN) [1] model of computa-
tion which we use as an application model. Many researchers [2] [3]
[4] [5] [6] [7] have already indicated that KPNs are suitable for effi-
cient mapping onto multiprocessor platforms. In addition to that, by
carefully exploiting and efficiently implementing the simple commu-
nication and synchronization features of a KPN, we have identified and
developed a set of generic parameterized components which we call a
platform model. We consider this an important contribution of this
paper because our set of components (platform model) allows system
designers to specify (construct) very fast and easily many alternative
multiprocessor platforms that are systematically and automatically im-
plemented and programmed by our tool ESPAM.
1.3 Related Work
Systematic and automated application-to-architecture mapping has
been widely studied in the research community. The closest to our
work is the Compaan/Laura design flow [2]. It uses KPN specifi-
cations for automated mapping of applications targeting FPGA im-
plementations. The reported results are only for processor-coprocessor
architectures, whereas our ESPAM tool allows an automated implemen-
tation of KPN specifications onto multiprocessor platforms.
The Eclipse work [3] defines a scalable architecture template for
designing stream-oriented multiprocessor SoCs using the KPN model
of computation to specify and map data-dependent applications. The
Eclipse template is slightly more general than the templates presented
in this paper. However, the Eclipse work lacks an automated design
and implementation flow. In contrast, our work provides such automa-
tion starting from a high-level system specification.
In [8] a design flow for the generation of application-specific mul-
tiprocessor architectures is presented. This work is similar to our
approach in the sense that we also generate multiprocessor systems
based on instantiation of generic parameterized architecture compo-
nents where very efficient communication controllers are generated
automatically to connect processors to communication networks. How-
ever, many steps of the design flow in [8] are performed manually. As
a consequence a full implementation of a system with 4 processors
connected point-to-point takes around 33 hours. In contrast, our de-
sign flow is fully automated and a full implementation of a system
with 8 processors connected point-to-point or via crossbar or shared
bus takes around 2 hours.
A system level semantics for a system design process formaliza-
tion is presented in [9]. It enables design automation for synthesis
and verification to achieve a required design productivity gain. Using
Specification, Multiprocessing, and Architecture models, a translation
from behavior to structural descriptions is possible at a system level of
abstraction. Our approach is similar but in addition it defines and uses
application and platform models that allow an automated translation
from the system level to the RTL level of abstraction.
Companies such as Xilinx and Altera provide design tool chains at-
tempting to generate efficient implementations starting from descrip-
tions higher than (but still related to) the RTL level of abstraction. The
required input specifications are so detailed that designing a single
processor system is still error-prone and time consuming, let alone al-
ternative multiprocessor systems. In contrast, our design methodology
raises the design focus to an even higher level of abstraction allowing
the design and the programming of multiprocessor systems in a short
amount of time. Moreover, this does not sacrifice the possibility for
automatic and systematic design implementation because our ESPAM
tool supports it.
2. ESPAM DESIGN FLOW: OVERVIEW
In this section we give an overview of our system design methodol-
ogy which is centered around our ESPAM tool developed to close the
implementation gap described in Section 1.1. This is followed by a
description of our system level platform model and platform synthe-
sis in Section 3. In Section 4 we discuss the automated programming
of multiprocessor platforms, and in Section 5 we present some results
that we have obtained using ESPAM. Section 6 concludes the paper.
Our system design methodology is depicted as a design flow in Fig-
ure 1. There are three levels of specification in the flow. They are
SYSTEM-LEVEL specification, RTL-LEVEL specification, and GATE-
LEVEL specification. The SYSTEM-LEVEL specification consists of
three parts: 1) Platform Specification describing the topology of a
platform using our system level platform model, i.e., using generic
parameterized system components; 2) Application Specification de-
scribing an application as a Kahn Process Network (KPN), i.e., net-
work of concurrent processes communicating via FIFO channels. The
KPN specification reveals the task-level parallelism available in the
application; 3) Mapping Specification describing the relation between
all processes and FIFO channels in Application Specification and all
components in Platform Specification.
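As an illustration of the third part, a minimal sketch of what a mapping
specification could look like is given below. The element and attribute
names are chosen for this example only and are not taken literally from
the ESPAM input format; the sketch simply expresses the mapping shown in
Figure 1, where processor P1 executes process A and processor P2 executes
processes B and C.

  <mapping name="myMapping">
    <processor name="P1">
      <process name="A"/>
    </processor>
    <processor name="P2">
      <process name="B"/>
      <process name="C"/>
    </processor>
  </mapping>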
[Figure 1: ESPAM System Design Flow. The SYSTEM-LEVEL specification
(Platform Specification, Application Specification given as a KPN, and
Mapping Specification) is the input of the ESPAM tool, which, together
with a Library of IP Cores, produces the RTL-LEVEL specification: a
platform topology description, HW descriptions of IP cores, program code
for the processors, and auxiliary information. A commercial synthesizer
and compiler then turn the RTL-LEVEL specification into the GATE-LEVEL
netlist that is implemented on the chip. In the depicted example a
crossbar (CB) connects processors P1 and P2; process A is mapped onto P1,
and processes B and C onto P2.]
The SYSTEM-LEVEL specification is given as input to ESPAM. First,
ESPAM constructs a platform instance following the platform specifi-
cation and runs a consistency check on that instance. The platform
instance is an abstract model of a multiprocessor platform because at
this stage no information about the target physical platform is taken
into account. The model defines only the key system components of
the platform and their attributes. Second, ESPAM refines the abstract
platform model to an elaborate (detailed) parameterized RTL model
which is ready for an implementation on a target physical platform.
We call this refinement process platform synthesis. The refined system
components are instantiated by setting their parameters based on the
target physical platform features. Finally, ESPAM generates program
code for each processor in the multiprocessor platform in accordance
with the application and mapping specifications.
The output of ESPAM, namely a RTL-LEVEL specification of a
multiprocessor system, is a model that can adequately abstract and ex-
ploit the key features of a target physical platform at the register trans-
fer level. It consists of four parts: 1) Platform topology description
defining in greater detail the multiprocessor platform; 2) Hardware
descriptions of IP cores containing predefined and custom IP cores
used in 1). ESPAM selects predefined IP cores (processors, memories,
etc.) from the Library of IP Cores, see Figure 1. Also, it generates cus-
tom IP cores needed as glue/interface logic between components in
the platform; 3) Program code for processors to execute the appli-
cation on the synthesized multiprocessor platform; ESPAM generates
program source code files for each processor in the platform. 4) Auxil-
iary information containing files which give tight control on the overall
specifications, such as defining precise timing requirements and prior-
itizing signal constraints.
With the descriptions above, a commercial synthesizer can convert
a RTL-LEVEL specification to a GATE-LEVEL specification, thereby
generating the target platform gate level netlist, see the bottom part
of Figure 1. This GATE-LEVEL specification is actually the system
implementation. The current prototype version of ESPAM facilitates
automated multiprocessor platform synthesis and programming using
Xilinx VirtexII-Pro FPGAs. ESPAM uses the Xilinx Platform Studio
(XPS) tool as a back-end to generate the final bit-stream file that con-
figures a specific FPGA. We use the FPGA platform technology for
prototyping purposes only. Our ESPAM is general and flexible enough
to be targeted to other physical platform technologies. A real-life
industrially-relevant application, namely Motion-JPEG encoder, has
been fully implemented onto several alternative multiprocessor plat-
forms by using the ESPAM and XPS design tools.
3. PLATFORM MODEL AND SYNTHESIS
In our design methodology, the platform model is a library of generic
parameterized components. In order to support systematic and auto-
mated synthesis of multiprocessor platforms we have carefully identi-
fied and developed a set of computation and communication compo-
nents. In this section we give a detailed description of our approach
to build a multiprocessor platform. The platform model contains Pro-
cessing components, Memory components, Communication compo-
nents, Communication Controllers, and Links. Memory components
are used to specify the processors’ local program and data memories
and to specify data communication storages (buffers) between proces-
sors. Further we will call the data communication storages Commu-
nication Memories. We have developed a point-to-point network, a
crossbar switch, and a shared bus component with several arbitration
schemes (Round-Robin, Fixed Priority, and TDMA). These Commu-
nication components determine the communication network topology
of a multiprocessor platform. The Communication controller imple-
ments an interface between processing, memory, and communication
components. Links are used to connect any two components in our
system level platform model.
Using the components described above, a system designer can con-
struct many alternative platforms easily, simply by connecting pro-
cessing, memory, and communication components. We have devel-
oped a general approach to connect and synchronize programmable
processors of arbitrary types via a communication component. Our
approach is explained below using an example of a multiprocessor
platform. The system level specification of the platform is depicted
in Figure 2a. This specification, written in XML format, consists of
three parts which define the processing components (four processors,
lines 2-5), the communication component (a crossbar, lines 7-12), and
the links (lines 14-29).

a) Platform Specification (XML):

 1  <platform name="myPlatform">
 2    <processor name="uP1"> <port name="IO1"/> </processor>
 3    <processor name="uP2"> <port name="IO1"/> </processor>
 4    <processor name="uP3"> <port name="IO1"/> </processor>
 5    <processor name="uP4"> <port name="IO1"/> </processor>
 6
 7    <network name="CB" type="Crossbar">
 8      <port name="IO1"/>
 9      <port name="IO2"/>
10      <port name="IO3"/>
11      <port name="IO4"/>
12    </network>
13
14    <link name="BUS1">
15      <resource name="uP1"> <port name="IO1"/> </resource>
16      <resource name="CB">  <port name="IO1"/> </resource>
17    </link>
18    <link name="BUS2">
19      <resource name="uP2"> <port name="IO1"/> </resource>
20      <resource name="CB">  <port name="IO2"/> </resource>
21    </link>
22    <link name="BUS3">
23      <resource name="uP3"> <port name="IO1"/> </resource>
24      <resource name="CB">  <port name="IO3"/> </resource>
25    </link>
26    <link name="BUS4">
27      <resource name="uP4"> <port name="IO1"/> </resource>
28      <resource name="CB">  <port name="IO4"/> </resource>
29    </link>
30  </platform>

b) The elaborate platform: each processor uP1..uP4 with its memory
controller (MC1..MC4), program and data memory (MEM1..MEM4),
communication controller (CC1..CC4), and communication memory
(CM1..CM4), all connected to the crossbar CB.

Figure 2: Example of a Multiprocessor Platform.

The links specify the connections of the processors to
the communication component. To guarantee correct-by-construction
automated platform synthesis and implementation, our ESPAM tool
runs a consistency check on each platform specified by a designer.
This includes finding impossible and/or meaningless connections be-
tween system level platform components as well as parameter values
that are out of range. Notice that in the specification a designer does
not have to take care of memory structures, interface controllers, and
communication and synchronization protocols. Our ESPAM tool takes
care of this in the platform synthesis as follows. First, the tool instan-
tiates the processing and the communication components. Second,
it automatically attaches memories and memory controllers (MCs) to
each processor. Third, the tool automatically synthesizes, instantiates,
and connects all necessary communication memories (CMs) and com-
munication controllers (CCs) to allow efficient and safe (lossless) data
communication and synchronization between the components.
The elaborate platform generated by ESPAM is shown in Figure 2b.
The processors (uPs) transfer data between each other through the
CMs. A communication controller connects a communication mem-
ory to the data bus of the processor it belongs to and to a communi-
cation component. Since every programmable processor has a data
bus, processors of different types can easily be connected into a het-
erogeneous multiprocessor platform by using our CCs. Each CC im-
plements the processor’s local bus-based access protocol to the CM
for write operations and the access to the communication component
(CB) for read operations. Each CM is organized as one or more FIFO
buffers. We have chosen such an organization because the inter-processor
synchronization in the platform can be implemented in a very sim-
ple and efficient way by blocking read/write operations on empty/full
FIFO buffers located in the communication memory. As a result,
memory contention is avoided.
KPNs assume unbounded communication buffers. Writing is al-
ways possible and thus a process blocks only on reading from an
empty FIFO. In the physical implementation, however, the communi-
cation buffers have bounded sizes and therefore a blocking write syn-
chronization mechanism is needed as well. At the same time, we want
to stay as close as possible to the KPN semantics because it guaran-
tees the highest possible communication performance. Therefore, in
our approach each processor writes only to its local communication
memory (CM) and uses the communication component only to read
data from all other communication memories. This means that a pro-
cessor can always write if there is room in its local CM (if this CM is
large enough, the processor may never block on writing). A processor
blocks when reading other processors’ CMs if data is not available or
the communication resource is currently not available.
3.1 Processing Components
In our approach we do not propose a design of processing com-
ponents. Instead, we use IP cores developed by third parties. Cur-
rently, for fast prototyping in order to validate our approach, we use
the Xilinx VirtexII-Pro FPGA technology. Therefore, our library pro-
cessing components include two programmable processors, namely
MicroBlaze (MB) and PowerPC (PPC). Our platform model is
general enough to be extended easily with additional (processing) com-
ponents. Notice that only the processing components in our plat-
form model are related to a particular technology (currently to Xilinx
VirtexII-Pro FPGAs). All other components discussed in this section
are technology independent.
3.2 Communication Memory Components
We implement the communication memories of a processor by us-
ing dual-port memories. Logically, a communication memory (CM) is
organized as one or more FIFO buffers. A FIFO buffer in a CM is seen
by a processor as two memory locations in its address space. A proces-
sor uses the first location to read/write data from/to the FIFO buffer,
thereby realizing inter-processor data transfer. The second location
is used to read the status of the FIFO. The status indicates whether a
FIFO is full (data cannot be written) or empty (data is not available).
This information is used for the inter-processor synchronization. The
multi-FIFO behavior of a communication memory is implemented by
the communication controller described below. However, if a commu-
nication memory contains only one FIFO, we use a dedicated FIFO
component which simplifies the structure of the communication con-
troller.
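To make this memory map concrete, the following C fragment sketches how a
processor-side program could access one FIFO through these two locations.
The base addresses are invented for the example (the real addresses belong
to the memory map generated by ESPAM), and the fragment is only a
single-word variant of the read/write primitives shown later in Figure 5b.

  /* Hypothetical addresses of two FIFO buffers in the processor's address
     space; the actual memory map is generated by ESPAM. */
  volatile int *out_data   = (volatile int *) 0xE0000000; /* location 1: write data */
  volatile int *out_status = (volatile int *) 0xE0000004; /* location 2: full flag  */
  volatile int *in_data    = (volatile int *) 0xE0000008; /* location 1: read data  */
  volatile int *in_status  = (volatile int *) 0xE000000C; /* location 2: empty flag */

  void send_word(int word) {
      while (*out_status) { }   /* block while the FIFO is full  */
      *out_data = word;         /* inter-processor data transfer */
  }

  int receive_word(void) {
      while (*in_status) { }    /* block while the FIFO is empty */
      return *in_data;
  }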
3.3 Communication Controller
The structure of the Communication Controller (CC) is shown in
Figure 3. It consists of two blocks, namely Interface Unit and FIFOs
Unit. The Interface Unit contains an address decoder, fifos’ control
logic, and logic to generate read requests to the communication com-
ponent.

[Figure 3: Communication Controller. The CC consists of two blocks: the
Interface Unit on the processor side, with the address decoder, the FIFO
control logic, and the logic that issues read requests to the
communication component, and the FIFOs Unit on the communication memory
side, with separate read and write logic, FIFO select, and the Empty/Full
status signals.]

When a processor has to write data to its local Communication Memory
(CM), it first checks if there is room in the corresponding
FIFO by reading its status. If the FIFO is full, the processor blocks.
Otherwise, it sends the data to the CC. The Interface Unit decodes the
FIFO address sent by the processor along with the data and generates
control signals (select FIFO, write data, or read status) to the write
logic of the FIFOs Unit. The latter implements the multi-FIFO behav-
ior. For each FIFO buffer the FIFOs Unit contains read and write coun-
ters that indicate the read and write positions into the buffer. These
counters are used as read/write address generators and their values are
used for determining the empty/full status of a FIFO. The FIFOs Unit
also includes a memory interface logic that realizes the access to the
Communication Memory (CM) connected to the CC (bottom part of
Figure 3). Notice that since we use dual-port memories and read and
write logic are separated, a FIFO in a CM can be accessed for read and
write operations simultaneously by different processors, or two FIFOs
in a CM can be accessed at the same time, one for a read operation and
one for a write operation.
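The following C fragment is a small behavioral model of this counter-based
status logic, written only to illustrate how the read and write counters
determine the empty/full flags and the buffer addresses; the buffer depth
and the free-running counter scheme are assumptions of the example, not a
description of the actual FIFOs Unit implementation.

  #define FIFO_DEPTH 64u  /* assumed buffer depth in words */

  typedef struct {
      unsigned rd_cnt;    /* read counter:  position of the next read  */
      unsigned wr_cnt;    /* write counter: position of the next write */
  } fifo_counters;

  /* Number of words currently stored in the FIFO. */
  static unsigned fifo_fill(const fifo_counters *f) {
      return f->wr_cnt - f->rd_cnt;   /* unsigned difference handles wrap-around */
  }

  static int fifo_empty(const fifo_counters *f) { return fifo_fill(f) == 0; }
  static int fifo_full (const fifo_counters *f) { return fifo_fill(f) == FIFO_DEPTH; }

  /* The low-order part of each counter is used as the address into the buffer. */
  static unsigned fifo_rd_addr(const fifo_counters *f) { return f->rd_cnt % FIFO_DEPTH; }
  static unsigned fifo_wr_addr(const fifo_counters *f) { return f->wr_cnt % FIFO_DEPTH; }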
Recall that a processor can access FIFOs located in other proces-
sors’ CMs via a communication component for read operations only.
First, the processor checks if there is any data in the FIFO the proces-
sor wants to read from. When a processor checks for data, the Interface
Unit sends a request to the communication component for granting a
connection to the CM in which the FIFO is located. A connection is
granted only if a communication line is available and there is data in
the FIFO. If a connection is not granted, the processor blocks until a
connection is granted. When a connection is granted, the CC connects
the data bus of the communication component (the upper part of the
communication component side in Figure 3) to the data bus of the pro-
cessor and the processor reads the data from the CM where the FIFO is
located. After the data is read the connection has to be released. This
allows other processors to access the same CM. When data is read
from a FIFO of a CM, the signals to the read logic of the FIFOs Unit
(FIFO Sel and Read) are generated by the communication component
(the bottom part of the communication component side in Figure 3) as
a response to a request from another CC.
The described blocking mechanism for accessing the CMs has to be
done in the processors. The blocking can be realized in hardware (usu-
ally processors have dedicated embedded hardware to stall the pro-
cessor) or in software by executing empty loops. We use the latter
approach because it is more general. Different processors are stalled
in hardware in different ways and therefore our CC would have to be
aware of many possibilities. This would result in a more complex and
less generic controller. Realizing the blocking mechanism in software
makes the controller more generic, thereby simplifying the integration
of different types of processors into a multiprocessor system.
3.4 Crossbar Communication Component
In this subsection we present the implementation of our Crossbar
communication component. Our general approach to connect proces-
sors that communicate data through communication memories (CM)
with FIFO organization allows the crossbar structure to be very sim-
ple. This results in a smaller crossbar with a reduced number of com-
munication and routing resources, which reduces the design area
and power consumption. The structure of our crossbar component
consists of two main parts, crossbar switch (CBS) and crossbar con-
troller (CBC). The CBS implements uni-directional connections be-
tween communication memories and processors; recall that a proces-
sor uses a communication component only to read data. Due to the
uni-directional communications and the FIFO organization of CMs,
the number of signals and buses that have to be switched by our cross-
bar is greatly reduced. Since the addresses for accessing CMs are gener-
ated locally by the CCs, address busses are not switched through the
crossbar. The crossbar switches 32-bit data buses in one direction and
two control signals per bus. These control signals are the Read strobe
and the Empty status flag for a FIFO.
The requests for granting a connection generated by the CCs are
processed by the crossbar controller (CBC) using a Round-Robin policy.
When a request for granting a connection arrives, the CBC checks in its request
table whether the required connection is available at the moment. The
request table contains information about the status (available or not
available) of all connections. The table is updated each time a connec-
tion is granted or released.
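A simplified behavioral model of this arbitration is sketched below in C,
only to illustrate the round-robin scan over pending requests and the role
of the request table; the real CBC is a hardware block, and details such
as the data-availability check described in Section 3.3 are omitted here.

  #define NUM_PORTS 4                   /* assumed number of crossbar ports */

  static int request[NUM_PORTS];        /* per processor: index of the CM it wants to read, or -1 */
  static int cm_busy[NUM_PORTS];        /* request table: 1 if a connection to that CM is in use  */
  static int last = NUM_PORTS - 1;      /* round-robin pointer */

  /* Grant at most one pending request per call, scanning in round-robin
     order; returns the granted port, or -1 if nothing could be granted. */
  int cbc_arbitrate(void) {
      for (int i = 1; i <= NUM_PORTS; i++) {
          int port = (last + i) % NUM_PORTS;
          int cm   = request[port];
          if (cm >= 0 && !cm_busy[cm]) {
              cm_busy[cm]   = 1;        /* update the request table on grant */
              request[port] = -1;
              last = port;
              return port;
          }
      }
      return -1;
  }

  void cbc_release(int cm) {
      cm_busy[cm] = 0;                  /* update the request table on release */
  }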
3.5 Point-to-Point Network
In this section we describe how we implement a point-to-point com-
munication in our platforms. In point-to-point networks the topology
of the platform (the number of processors and the number of direct
connections between the processors) is the same as the topology of the
process network. Since there is no communication component such as
a crossbar or a bus, there are no requests for granting connections and
there is no sharing of communication resources. Therefore, no addi-
tional communication delay is introduced in the platform. Because of
this, the highest possible communication performance can be achieved
in such multiprocessor platforms.

[Figure 4: Point-to-Point Architecture. Three processors (uP1, uP2, uP3),
each with a memory controller (MC), program and data memory (MEM), and a
communication controller (CC), are connected directly through
communication memories (CM) that hold the FIFO channels CH1, CH2, and CH3
of the KPN in Figure 5.]

Under the conditions that each com-
munication memory (CM) contains only one channel and each proces-
sor writes data only to its local CM (in compliance with our concept),
our ESPAM tool synthesizes a point-to-point network in the following
automated way. First, for each process in the KPN, ESPAM instanti-
ates a processor together with a communication controller (CC). Then,
ESPAM finds all the channels which the process writes to. For each
found channel the tool instantiates a CM and assigns the channel to
this CM. Finally, ESPAM connects the memory to the already instan-
tiated processor. In Figure 4 we give an example of a point-to-point
multiprocessor platform generated by ESPAM. Assume that the mul-
tiprocessor platform has to implement the KPN depicted in the top of
Figure 5 and each process is executed on a separate processor. There
are three channels that have to be assigned to three CMs. Follow-
ing the procedure above, ESPAM finds that CH1 and CH2 are written
by process A, see the top part of Figure 5. Process A is assigned
to be executed on processor uP1; therefore, the CMs corresponding to
CH1 and CH2 are instantiated and connected to uP1. Similarly, a CM
corresponding to CH3 is instantiated and connected to processor uP2.
Process C is assigned to processor uP3 and since process C only reads
data from CH1 and CH3 no more CMs are instantiated. Processor
uP3 is simply connected to the already instantiated CMs correspond-
ing to CH1 and CH3. Notice that in Figure 4, a CC is connected to
more than one CM. As we mentioned in Section 3.2, if a CM contains
only one FIFO, it is implemented by a dedicated FIFO component.
Therefore, to connect one or more FIFOs to a processor in the case of
point-to-point network, we use a very simplified version of our com-
munication controller (CC) described in Section 3.3. The simplified
CC only translates the processor data bus signals to FIFO input/output
signals. The CC is parameterized and it supports up to 128 FIFOs for
read and write operations.
4. AUTOMATED PROGRAMMING
Application Specification
The first step to program multiprocessor systems in our ESPAM de-
sign methodology is the partitioning of an application into concurrent
tasks where the inter-task communication and synchronization is ex-
plicitly specified in each task. The partitioning of an application into
concurrent tasks can be done by hand or automatically [2, 10] and it
allows each task or group of tasks to be compiled separately by a stan-
dard compiler in order to generate an executable code for each proces-
sor in the platform. The result of the partitioning done by the tools is
an XML description of a Kahn Process Network (KPN) as an Approx-
imated Dependence Graph (ADG) data structure [11]. It is a compact
mathematical representation of the process network in terms of poly-
hedra. This allows formal operations to be defined and applied [11] on
the KPN in order to generate an efficient code for the processors.

[Figure 5: Kahn Process Network Example. The top of the figure shows the
KPN: process A writes to channels CH1 (read by C) and CH2 (read by B),
and process B writes to channel CH3 (read by C).]

a) XML specification of the KPN (excerpt for process B and channel CH2):

 1  <process name="B">
 2    <port name="p2" direction="in">
 3      <var name="in_0" type="myType"/>
 4    </port>
 5    <port name="p1" direction="out">
 6      <var name="out_0" type="myType"/>
 7    </port>
 8    <process_code name="compute">
 9      <arg name="in_0"  type="input"/>
10      <arg name="out_0" type="output"/>
11      <loop index="k" parameter="N">
12        <loop_bounds matrix="[1, 1,0,-2;
13                              1,-1,2,-1]"/>
14        <par_bounds  matrix="[1,0,-1,384;
15                              1,0, 1, -3]"/>
16      </loop>
17    </process_code>
18  </process>
19  . . .
20  <channel name="CH2">
21    <fromProcess name="A"/>
22    <fromPort    name="p1"/>
23    <toProcess   name="B"/>
24    <toPort      name="p2"/>
25  </channel>

b) Program code generated by ESPAM for process B:

 1  void main() {
 2    for( int k=2; k<=2*N-1; k++ ) {
 3      read( p2, in_0, sizeof(myType) );
 4      compute( in_0, out_0 );
 5      write( p1, out_0, sizeof(myType) );
 6    }
 7  }
 8
 9  void read( byte *port, void *data, int length ) {
10    int *isEmpty = port + 1;
11    // reading is blocked if a FIFO is empty
12    for( int i=0; i<length; i++ ) {
13      while( *isEmpty ) { }
14      ((byte *) data)[i] = *port;  // read data from a FIFO
15    }
16  }
17
18  void write( byte *port, void *data, int length ) {
19    int *isFull = port + 1;
20    // writing is blocked if a FIFO is full
21    for( int i=0; i<length; i++ ) {
22      while( *isFull ) { }
23      *port = ((byte *) data)[i];  // write data to a FIFO
24    }
25  }

A simple example of a KPN is shown in Figure 5. Three processes (A,
B, and C) are connected through three FIFO channels (CH1, CH2, and
CH3). For the sake of clarity, in Figure 5a, we show the XML descrip-
tion only for one process (B). Process B has one input port and one
output port defined in lines 2-7. In our example, process B executes
a function called compute (line 8). The function has one input argu-
ment (line 9) and one output argument (line 10). The relation between
the function arguments and the ports of the process is given in lines 3
and 6. The function has to be executed 2N-2 times, as specified by the
polytope in lines 12-13. The value of N is between 3 and 384
(lines 14-15). Lines 20-25 show an example of how the topology of a
KPN is specified: CH2 connects processes A and B through ports p1
and p2.
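For readers unfamiliar with the matrix notation in lines 12-15, the
following reading is consistent with the numbers quoted above; it assumes
that each row [1, a_k, a_N, c] of a bounds matrix encodes the inequality
a_k*k + a_N*N + c >= 0, an assumption made only to illustrate the example:

\[ k - 2 \ge 0,\qquad -k + 2N - 1 \ge 0 \quad\Longrightarrow\quad 2 \le k \le 2N - 1 \quad (\text{loop\_bounds}), \]
\[ 384 - N \ge 0,\qquad N - 3 \ge 0 \quad\Longrightarrow\quad 3 \le N \le 384 \quad (\text{par\_bounds}), \]

so the iteration domain of compute contains exactly 2N-2 points, which
matches the execution count stated above.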
Code Generation
ESPAM takes the XML specification of an application, applies some
operations [11] on it and automatically generates software (C/C++)
code for each processor. The code contains the main behavior of a pro-
cess, together with the blocking read/write synchronization primitives
and the memory map of the system. The C code generated by ESPAM
for process B is shown in Figure 5b. In accordance with the XML ap-
plication specification, a for loop is generated in the main function of
process B (lines 2-6) to execute function compute 2N-2 times.
The C/C++ code implementing function compute has to be provided
by the designer. The function uses local variables in_0 and out_0.
For simplicity, the declaration of the local variables is not shown in
the figure. ESPAM inserts a read primitive to read from CH2, initial-
izing variable in_0, and a write primitive to send the results (the value
of variable out_0) to CH3 (Figure 5b, lines 3 and 5).
synchronization read/write primitives, shown in the same figure, is au-
tomatically generated by E
SPAM as well. Each primitive has 3 param-
eters. Parameter
port is the address of the memory location through
which a processor can access a given FIFO channel. Parameter
data
is a pointer to a local variable and leng th specifies the amount of data
(in bytes) to be moved from/to the local variable to/from the chan-
nel. The primitives implement the blocking synchronization mecha-
nism between the processors in the following way. First, the status of
a channel that has to be read/written is checked. A channel status is
accessed using the locations defined in lines 10 and 19. The block-
ing is implemented by
while loops with empty bodies in lines 13 and
22. Each empty loop iterates (does nothing) while a channel is full or
empty. Then, in lines 14 and 23 the actual data transfer is done.
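The memory map itself is not shown in Figure 5; purely as an
illustration, it could take the form of a generated header that assigns
each port an address in the processor's address space. The names and
addresses below are invented for this sketch and are not actual ESPAM
output.

  /* Hypothetical memory map for the processor executing process B;
     all names and addresses are invented for this illustration. */
  typedef unsigned char byte;
  #define p2 ((byte *) 0x80000000)   /* FIFO of channel CH2, read by process B    */
  #define p1 ((byte *) 0x80000008)   /* FIFO of channel CH3, written by process B */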
5. EXPERIMENTS AND RESULTS
In this section we present some of the results we have obtained by
implementing and executing a Motion JPEG (M-JPEG) encoder ap-
plication onto several multiprocessor platform instances using our ESPAM
system design flow presented in Section 2. The main objective
of this experiment is to show that our design flow successfully closes
the implementation gap between the System and RTL abstraction lev-
els of description, as well as to show that, using the ESPAM tool, a very
accurate exploration of the performance of alternative multiprocessor
platforms based on real implementations becomes feasible since the
design time is reduced significantly. For the implementations we used
a prototyping board with one Xilinx FPGA.
Design Time
In Table 1 we show the processing times of each step in the design
flow for the implementation of one platform instance. As described
in Section 2, the inputs to our system design flow are the Application,
Platform, and Mapping Specifications. The Application Specification
has to represent the M-JPEG application as a KPN. For a certain class
of applications the generation of KPNs is automated by the translator
tools presented in [2, 10]. We started with the M-JPEG application
given as a sequential C program. With small modifications we struc-
tured the C code in order to comply with the input requirements of
the translators. Then we derived a KPN specification automatically. It
took us about half an hour to modify the C code and just 22 seconds to
derive the KPN specification.

Table 1: Processing Times (hh:mm:ss).

                 KPN          System Level to    Physical      Manual
                 Derivation   RTL conversion     Implement.    Modific.
  Translators    00:00:22     -                  -             00:30:00
  ESPAM tool     -            00:00:24           -             00:10:00
  XPS tool       -            -                  02:09:00      -

Notice that this is a one-time effort only
because in the implementation of each new platform the same KPN
specification is used. For each platform we wrote the Platform and
Mapping Specifications by hand in approximately 10 minutes. This is
a very simple task because our specifications are at a high system level
of abstraction (not RTL level). Having all three system level speci-
fications, our ESPAM tool converts them to RTL level specifications
within half a minute. The generated specifications are close to an im-
plementation and are automatically imported to the Xilinx Platform
Studio (XPS) tool for physical implementation, i.e., mapping, place,
and route onto our prototyping FPGA. Table 1 shows that it took the
XPS tool more than 2 hours for the physical implementation. The
reported time is for a platform instance containing 8 MicroBlaze
processors. However, in the case of 2 processors, XPS needs only 20 min-
utes. All tools run on a Pentium IV machine at 1.8GHz with 1GB of
RAM.
The figures in Table 1 clearly show that a complete implementation
of a multiprocessor system starting from high abstraction system level
specifications can be obtained in about 2 hours using our ESPAM tool
together with the translators [2, 10] and the commercial XPS tool. So,
a significant reduction of design time is achieved. This allows us to
explore the performance of 16 platforms using real system implemen-
tations. We implemented and ran the M-JPEG application on 16 alter-
native platforms using different Mapping and Platform Specifications
in a very short amount of time, approximately 2 days.
References
Gilles Kahn. The Semantics of a Simple Language for Parallel Programming.
A systematic approach to exploring embedded system architectures at multiple abstraction levels.
System Design Using Kahn Process Networks: The Compaan/Laura Approach.
Automatic generation of application-specific architectures for heterogeneous multiprocessor system-on-chip.
Guaranteeing the quality of services in networks on chip.