CA-MPSoC: An Automated Design Flow for Predictable Multi-processor Architectures
for Multiple Applications
A. Shabbir (a), A. Kumar (a,b), S. Stuijk (a), B. Mesman (a), H. Corporaal (a)
(a) Eindhoven University of Technology, Eindhoven, The Netherlands
(b) National University of Singapore, Singapore
Abstract
Future applications for embedded systems demand multi-processor designs to meet real-time deadlines. The large number of
applications in these systems generates an exponential number of use-cases. The key design automation challenges are designing
systems for these use-cases and fast exploration of software and hardware implementation alternatives with accurate performance
evaluation of these use-cases. These challenges can not be overcome by current design methodologies which are semi-automated,
time consuming and error-prone.
In this paper, we present a fully automated design flow (CA-MPSoC) to generate communication assist (CA) based multi-processor
systems. A worst-case performance model of our CA is proposed so that the performance of the CA-based platform can be analyzed
before its implementation. The design flow provides performance estimates and timing guarantees for both hard real-time and soft
real-time applications, provided the task to processor mappings are given by the user. The flow automatically generates a super-set
hardware that can be used in all use-cases of the applications. The flow also generates the software for each of these use-cases,
including the configuration of communication architecture and interfacing with application tasks.
CA-MPSoC has been implemented on Xilinx FPGAs for evaluation. It is also available on-line for the benefit of the research
community and is used for performance analysis of two real-life applications, Sobel and JPEG encoder executing concurrently. The
CA-based platform generated by our design flow records a maximum error of 3.4% between analyzed and measured periods. In a
mobile phone case study with 6 applications, the merging of use-cases results in a speed up of 18 when compared to the case where
each use-case is evaluated individually.
Key words: Multi-processor, Multiple Applications, Performance Analysis, Automated Design Flow, Communication Assist
1. Introduction
Modern multimedia embedded systems have to support a
large number of independent applications. In the area of
portable consumer systems, such as mobile phones, the num-
ber of applications doubles roughly every two years and the
introduction of new technology solutions is increasingly driven
by applications [18]. Tile-based multi-processor platforms [47,
23, 24, 12, 39] are increasingly being used in modern embedded systems to meet the tight timing and high performance
requirements of this large number of applications and their use-cases.
A use-case is a combination of concurrently executing applica-
tions. The number of such potential use-cases is exponential in
the number of applications that are present in the system.
In general, mapping applications onto tile-based platforms is
considered difficult. However, streaming applications can be
described in a data flow like manner and the computational ker-
nels of this flow can be easily mapped to suitable processing
elements. In essence, these systems trade architectural com-
plexity for communications, spreading work across a number
Email addresses: a.shabbir@tue.nl (A. Shabbir), a.kumar@tue.nl
(A. Kumar), s.stuijk@tue.nl (S. Stuijk), b.mesman@tue.nl (B. Mesman),
h.corporaal@tue.nl (H. Corporaal)
of sparsely connected small tiles rather than among richly con-
nected functional units of a monolithic, wide core. In order to
make use of tile-based platforms easier, inter-tile communica-
tion for these architectures should be predictable, fast and easy
to program.
In [9], a multi-processor platform is introduced that de-
couples the computation and communication of applications
through a hardware communication assist (CA). This decou-
pling off-loads the communication load from the processor,
thereby improving the performance significantly. Further, this
makes it easier to provide tight timing guarantees on the com-
putation and communication tasks that are performed by the
applications running on the platform. Several CA architec-
tures [33, 4, 35, 37] have been presented in the literature. How-
ever, it is very time consuming to map applications on these
platforms due to unavailability of platform generation tools.
Furthermore, it is very difficult to program them as the user
has to configure the communication infrastructure in addition
to the application functionality.
Manual design efforts are error-prone and consume a lot of time. To make matters worse, most of these devices have a very
short product life, so the shorter time-to-market for these systems poses a challenge for the designers. The designers have to
verify each use-case. For example, Bluetooth 2.5 has to meet its
Preprint submitted to Systems Architecture March 1, 2010

specification during each combination of applications. It should
perform while receiving a call or sending text messages or even
taking a picture. So there is a need for automated tools which
can reduce the design generation and verification time.
There are some multi-processor design tools [37, 44, 20, 31],
but most of them lack support for multiple applications, let alone
multiple use-cases, and require manual steps. There is a tool
described in [26] that supports platform generation for multi-
ple applications and their use-cases but it does not support CA-
based platforms. Automated platform generation reduces errors
in the design and thus saves time for design iterations.
Automatic platform generation is very helpful for the design-
ers but often they are also interested in knowing about the ex-
pected performance of the applications before the actual syn-
thesis of the platform. This allows the designers to choose the
design which meets their requirements. There are some perfor-
mance evaluation tools [46, 22, 48, 29], but most of them are
for a single application. There is a tool [28] for performance anal-
ysis for multiple applications but it does not take into account
the communication architecture details.
In this paper, we present a design flow (CA-MPSoC) that
takes models of multiple applications and their task to proces-
sor mappings as input and gives the expected performance of the
applications. We use Synchronous Data Flow graphs [30] (SDFGs) to
model the applications. These application models are refined
with the details of the communication architecture and actor-to-
processor mappings. The refined graphs are used to predict the
performance of multiple applications. If the designer is satisfied
with the performance estimates, he/she can generate a CA-based
platform using our CA-MPSoC. As far as we know, this is
the first design flow which can generate a CA-based platform.
Following are the key contributions of the paper.
Performance analysis: The flow provides the expected perfor-
mance of applications on the platform, given that the mappings of tasks to processors are already provided. The applications are specified as SDFGs and architecture details are added to these graphs. A model of the CA is introduced and used to generate architecture-aware SDFGs. The tool provides both the worst case
and average case performance results from these graphs.
Worst case results can be used for hard real-time applica-
tions whereas the average case can be used for soft real-
time applications.
Automatic CA-based multi-processor generation: An auto-
mated design flow that generates multi-processor systems,
directly from the architecture aware application graphs.
The flow also generates the communication infrastructure
so that the designer need not worry about it. It generates a
super-set hardware which can be used for all the use-cases.
The software for each use-case is generated individually.
This reduces the verification time of all the use-cases of
the applications. Designers can verify that their applications will meet the required performance in all possible
combinations of applications.
SDF Task Interface: Another contribution of this work is def-
inition of an interface for the tasks such that the semantics
of SDF behaviour are maintained during execution. So
when an application specification includes high-level lan-
guage code corresponding to tasks in the application, the
source code is automatically added to the desired proces-
sor.
Software generation: The software for all the processors is
automatically generated in the flow. Further, the required
communication APIs are also generated. This includes
configuration of communication channels, setting up con-
nections, and management of memory used for communi-
cation. The programmer need not bother with these configurations and can concentrate on the functionality of the
applications.
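To make the super-set idea concrete, the merging step can be sketched in a few lines. The sketch below is purely illustrative and not the algorithm used in our flow; it assumes (hypothetically) that each use-case's hardware requirements can be summarized as per-resource counts, in which case a platform covering every use-case takes the element-wise maximum:

```python
# Illustrative sketch only: resource names and counts are invented, and the
# real flow merges full platform descriptions, not simple count dictionaries.

def superset_hardware(use_cases):
    """Return a platform that covers every use-case by taking, for each
    resource, the maximum count any single use-case needs."""
    merged = {}
    for uc in use_cases:
        for resource, count in uc.items():
            merged[resource] = max(merged.get(resource, 0), count)
    return merged

uc1 = {"processors": 3, "ca_channels": 4, "ni_fifos": 4}  # hypothetical use-case
uc2 = {"processors": 2, "ca_channels": 6, "ni_fifos": 6}  # hypothetical use-case
platform = superset_hardware([uc1, uc2])
# -> {'processors': 3, 'ca_channels': 6, 'ni_fifos': 6}
```

A platform dimensioned this way is never smaller than any single use-case requires, which is what allows all use-cases to run on the same hardware.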
The above contributions are essential to further research in the design automation community since embedded devices are in-
creasingly becoming multi-featured. Our flow allows designers
to evaluate the performance of applications on the architecture
before actually synthesizing it. It also allows the designers to
generate the platform for either hard real-time or soft real-time
systems with given sets of actor to processor mappings. CA-
MPSoC is evaluated on two real-life applications, Sobel and
JPEG Encoder. The maximum error between estimated and
measured periods of these applications is about 3.4% for soft
real-time analysis. Furthermore, platform generation for multiple use-cases is evaluated with a mobile phone case study
consisting of 6 applications. The merging of use-cases gives a
platform which supports all the use-cases. This merging results
in a speed up of 18 as compared to the case where the use-cases
are evaluated individually. The tool is made available on-line [7]
for the benefit of the research community.
The rest of the paper is organized as follows. Section 2 re-
views the related work for existing CA architectures, perfor-
mance analysis and automatic platform generation tool flows.
In Section 3 we describe our architecture template. Section 4
introduces SDFGs. Section 5 presents the SDF model of our CA.
In Section 6, we show how the SDF model of the CA can be in-
corporated in the application model and how the performance of
applications can be predicted. Section 7 gives details of the
steps performed in our design flow to generate the platform.
Section 8 describes details of tool implementation. Section 9
presents results of the experiments performed to evaluate our
design flow. Section 10 concludes the paper and gives direc-
tions for future work.
2. Related Work
2.1. Communication Assist
The communication controller presented in [37] implements
FIFO based communication between tasks. Writes to the FI-
FOs are always local to a processor whereas reads are always
remote (from the FIFO memory of a producer). The program-
ming model is based on Kahn Process Network [21] (KPN).
Due to FIFO based communication, out-of-order access, re-reading, and skipping are only possible after storing the data
locally in the consuming task. In our CA-based platform, all the

reads/writes to the memory are local to the producer/consumer
resulting in savings in memory space.
In [32], the authors have presented a SystemC model of a CA,
but there are some key differences with our CA. They propose
separate communication and computation memories whereas
in our case, the data memory is also used as communication
memory. In [13], the authors have presented a synchronization
scheme for embedded shared memory systems. They propose
channel controllers for synchronization of data between tasks.
They have one channel controller per channel; our implementation
has one controller for all the channels, resulting in an area-efficient
implementation. Authors in [6] describe communication
between Nested Loop Programs (NLP) in multi-processor sys-
tems. The algorithm is implemented in software and can handle
out-of-order access to the buffer. Both producer and consumer
have their respective write and read windows for mutually ex-
clusive access. However, the algorithm is limited to single-assignment code. Our CA does not impose such restrictions.
A KPN is derived from NLP in [49]. In KPN communication
between the tasks is arranged via FIFO buers. When the con-
suming task has to read a location multiple times, the consumer
stores the array in an additional buer. Instead of FIFO buers,
we use circular buers and also there is no need to copy values
in an additional buer. The work by [17] is quite similar to [49]
and uses a read and write window.
The Cell BE [15] implements communication between processing elements (SPEs) and the external memory through
DMA controllers called Memory Flow Controllers (MFCs). The
key difference between the MFC and our CA is the fact that with the MFC
the synchronization between the memories has to be performed
explicitly by the SPEs. In the case of the CA, the synchronization is
taken care of by the CA itself and the processor is freed from
the synchronization overhead.
In the KPN model of computation, processes communicate
with each other by sending data to each other over edges. A
process may write to an edge whenever it wants. When it tries
to read from an edge which is empty, it blocks and must wait till
the data is available. The amount of data read from an edge may
be data-dependent. This allows modeling of any continuous
function from the inputs of the KPN to the outputs of the KPN.
It has been proved in the literature that it is not possible to analyze properties like the throughput or buffer requirements of
a KPN at design time [14]. SDF, on the other hand, is a more
restrictive model. A task can only execute if it has input data
and space available at the output. The sizes of input and output data
are also fixed, so throughput analysis and buffer capacity analysis
of SDF graphs are possible statically, which makes SDF more
attractive than KPN.
Note that others in fact impose restrictions on the KPN
graphs that are accepted by their tools. These constraints turn
these graphs into cyclo-static dataflow graphs. Such a cyclo-static
dataflow graph can always be transformed into an SDFG and
mapped using our flow. Hence it may seem that others use a
more flexible model, but in fact their restrictions imply that they use
the same model as we do.
2.2. Design Flows for Platform Generation
The problem of mapping an application to an architecture
has been widely studied in literature. One of the recent works
most related to our research is ESPAM [37]. This uses Kahn
process networks (KPNs) [21] for application specification. In
our approach, we use SDFGs for application specification in-
stead. Further, our approach supports mapping of multiple ap-
plications, while ESPAM is limited to a single application. This
difference is crucial for developing modern embedded systems
which support tens of applications on a single
MPSoC. The same difference can be seen between our approach
and the one proposed in [20], where an exploration framework
to build efficient FPGA multi-processors is proposed.
The Compaan/Laura design flow presented in [44] also uses
KPN specification for mapping applications to FPGAs. How-
ever, their approach is limited to a processor-coprocessor pair.
Our approach aims at synthesizing complete MPSoC designs
supporting multiple processors. Another approach for gen-
erating application-specific MPSoC architectures is presented
in [31]. However, most of the steps in their approach are done
manually. Exploring multiple design iterations is therefore not
feasible. In contrast, our entire flow is automated, including
the generation of the final bit-file that runs on the FPGA. Yet
another flow for generating MPSoCs for FPGAs has been pre-
sented in [27]. However, that flow focuses on generic MPSoCs
and not on application-specific architectures. There is also a
tool described in [26] that supports platform generation for mul-
tiple use-cases but it does not support CA-based platforms.
Xilinx provides a tool-chain as well to generate designs with
multiple processors and peripherals [50]. However, most of
the features are limited to designs with a bus-based processor-
coprocessor pair with shared memory. It is very time consum-
ing and error prone to generate an MPSoC architecture and
the corresponding software projects to run on the system. In
our flow, an MPSoC architecture is automatically generated to-
gether with the respective software projects for each core.
Finally, none of the above flows support a CA-based platform. In fact, our flow is the first to generate CA-based multi-processor platforms. Communication plays an important role in
the parallelization of applications: the communication-to-computation ratio determines whether splitting a task between
processors is justified. Our CA in turn exposes more parallelism
in the applications.
In [8], the authors present a design flow that generates a
multicore system for multimedia applications. Their work is
quite similar to ours. However, there are some key dierences.
Firstly, they use a mesh network for interconnection whereas we
use point-to-point networks. Secondly, they use profiling to dimension their system. We, on the other hand, use static analysis
techniques. Profiling based techniques are significantly slower
than analysis based techniques. Also their synthesis flow gener-
ates platforms for average case performance whereas our flow
can generate platforms for both worst case and average case
performance. Lastly, our flow supports multiple applications
concurrently executing on the platform, while [8] targets a single
application.

Figure 1: Proposed CA-based platform (two tiles T0 and T1, each with a PE, a CA, a data memory (DM) and NI FIFOs, connected through a network; the numbers 1-5 mark the steps of a data transaction).
2.3. Performance Analysis
In [34], the authors propose to analyze the performance of a
single application modeled as an SDFG by decomposing it into
a homogeneous SDF graph (HSDFG) [43]. The throughput is
calculated based on analysis of each cycle in the resulting HS-
DFG [10]. However, this can result in an exponential number
of vertices [38]. Thus, algorithms that have a polynomial com-
plexity for HSDFGs have an exponential complexity for SD-
FGs. This approach is not practical for multiple applications.
For multiple applications, an approach that models resource
contention by computing worst-case-response-time (WCRT)
for TDMA scheduling (requires preemption) has been analyzed
in [3]. A similar worst-case analysis approach for round-robin
is presented in [16], which also considers non-preemptive systems, but suffers from the same lack of scalability. Real-time calculus has also been used to provide worst-case
bounds for multiple applications [22, 48, 29]. That analysis is
very intensive and requires a very large design-time effort. The
worst-case-waiting-time analysis used in our tool, on the other hand,
is very fast and simple.
A common way to use probabilities for modeling dynamism
in application is using stochastic task execution times [1, 42,
41]. The probabilistic approach [25] that we adopt uses probabilities to model the resource contention and provides estimates
for the throughput of applications. This approach is orthogonal to the approach of using stochastic task execution times.
To the best of our knowledge, there is no efficient approach
for analyzing multiple applications on a non-preemptive hetero-
geneous multi-processor platform. A technique has been pre-
sented in [28] to also model and analyze contention, but the ap-
proach used in this paper is much better. The technique in [28]
looks at all possible combinations of actors blocking another
actor. Since the number of combinations is exponential in the
number of actors mapped on a resource, the analysis has an
exponential complexity. The approach used in this paper has
linear complexity in the number of actors.
3. Architecture Template
The architecture template used in our platform is depicted
in Figure 1. It consists of a processing element (PE), a communication assist (CA), a data memory (DM) and network interface FIFOs (NI FIFOs). The CA transfers data between the
DM and the NI FIFO. The NI FIFOs are connected through a
partial point-to-point network. The structure of the networks
themselves is out of the scope of this paper.
Scalability of partial point-to-point networks has been an is-
sue as they require storage to deal with bursts. FSL buses from
Xilinx are one example. However, the point-to-point networks
used in our template do not require storage. This means that
the cost of a connection is not very high. The CAs can transfer the
data directly from the data memory of sending tile to the data
memory of the receiving tile, i.e. they do not require storage in
the point-to-point network itself.
3.1. Processing Element
The processing elements used in our template are simple
RISC based processors. RISC processors are the processing
element of choice for tile-based platforms [47]. No caches are
attached to the processor, to keep its execution trace predictable.
The PE has local instruction and data memories. The instruc-
tion memory is connected to the PE through a bus whereas the
access to the data memory is through the communication assist.
Note that we chose MicroBlaze processors from Xilinx, whereas
there is work [2] where PicoBlaze processors are used. Our synthesis flow is not restricted to any one processor type, so the choice
of processor is not important.
The PE is non-preemptive and can execute only a single thread.
This simplifies the architecture of the PE. Preemption requires
extra hardware and is costly in terms of area. Furthermore, non-
preemptive scheduling algorithms are easier to implement than
their preemptive counterparts and have dramatically lower
overhead at runtime [19]. In high performance
embedded processors (like the SPEs in the Cell Broadband Engine
and graphics processors), non-preemptive systems are preferred
over preemptive systems.
3.2. Memories
We use a single port instruction memory, which is directly
connected to the PE. The data memory (DM) used in our tem-
plate is a dual ported memory as depicted in Figure 1. The
CA has exclusive access to one port of this memory. The sec-
ond port is connected to the PE through the CA. The choice of
a dual ported memory may seem expensive; however, we use it to
make memory access for both the CA and the PE as fast as possible. The alternative would be an arbiter to resolve the access
between the two, but for predictable performance we preferred a
dual ported memory over a combination of an arbiter and a single ported memory. A single ported memory can introduce stall
cycles for the processor, which in turn makes the execution time
of the task executing on the processor unpredictable. Further,
it is very difficult to model an unpredictable arbiter, so we decided to use a dual ported DM. The next subsection will clarify this
configuration.
3.3. Communication Assist
Figure 2 shows the global view of the CA (more details about the
architecture can be found in [40]). It performs the following basic
functions:

Figure 2: CA architecture (address translation unit "Addr_tr", control FSM, pointer store, and connections to the NI FIFOs, DM and PE).
1. It configures NI FIFO channels and their corresponding
buffers in the DM.
2. It accepts data transfer requests from the attached PE
and splits them into local memory requests and remote
requests (to other tiles). The address translation unit
"Addr_tr" shown in Figure 2 performs this task.
3. Local memory requests are simply bypassed to the data
memory.
4. Remote memory requests are handled through a round
robin arbiter. Every two cycles, a 32 bit word is transferred from the buffer in the memory to the NI FIFO channels
and vice versa.
5. The buffers implemented in the memory are circular
buffers. The pointers needed for circular buffer management are updated and stored in the CA. The number of NI
FIFO channels can be greater than or equal to the number of
buffers in the data memory.
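As an illustration of the pointer bookkeeping in item 5, the following sketch models one circular buffer with read and write pointers. The class and field names are invented for this example and do not come from the CA implementation:

```python
class CircularBuffer:
    """Sketch of circular-buffer pointer management; illustrative only,
    not the CA's pointer-store hardware."""
    def __init__(self, size):
        self.size = size
        self.wr = 0      # next word the producer writes
        self.rd = 0      # next word the consumer reads
        self.count = 0   # words currently buffered

    def space(self):
        # Words the producer may still claim.
        return self.size - self.count

    def data(self):
        # Words the consumer may still read.
        return self.count

    def write(self, n):
        assert n <= self.space()
        self.wr = (self.wr + n) % self.size  # wrap around at the end
        self.count += n

    def read(self, n):
        assert n <= self.data()
        self.rd = (self.rd + n) % self.size
        self.count -= n
```

Because only the two pointers and a count are needed per buffer, a single controller can keep this state for all channels in one pointer store, as the CA does.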
Our communication assist acts as an interface that provides a link
between the NoC and the subsystems (PE and memory). It also
acts as a memory management unit that helps the processor keep
track of its data structures. As a result, it decouples communication from computation and relieves the processor of data
transfer functions. Our programmable CA uses a shared data
and buffer memory. This leads to a lower memory requirement
for the overall system and to a lower communication latency.
Figure 1 shows CA-based multi-processor tiles and demonstrates the steps involved during data transactions between the
tiles. Assume tile T0 is executing a producer task and tile T1
is executing a consumer task. The primitives used for communication are known as the C-HEAP [36] protocol. The producer
task executing on tile T0 requests space. The CA returns the
pointer to the buffer in the memory (step 1 in Figure 1). The
PE processes the data as a local memory access. It then requests
the CA to release the space. The CA transfers the
data to the designated NI FIFO (step 2). The data is transported
through the network (step 3). The CA of the consumer task executing on tile T1 receives the data and places it in the memory
(step 4). The consumer task queries the CA about the availability of the data. The CA sends the pointer to this data and
the PE can access it like a local memory request (step 4). The
consumer task processes the data and releases the space so that
the CA can use this space for future data receptions (step 5).
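The five-step handshake described above can be mimicked in software to clarify the protocol. The sketch below is hypothetical: the method names are invented, and the real C-HEAP primitives and the CA perform these steps in hardware:

```python
from collections import deque

class Channel:
    """Software mimic of the claim/release handshake; illustrative only."""
    def __init__(self, slots):
        self.free = deque(range(slots))   # slots the producer may claim
        self.full = deque()               # slots holding data for the consumer
        self.mem = [None] * slots         # stands in for the buffer in the DM

    # producer side (tile T0)
    def claim_space(self):                # step 1: CA returns a pointer (slot)
        return self.free.popleft() if self.free else None

    def release_data(self, slot, value):  # step 2: data handed over for transport
        self.mem[slot] = value
        self.full.append(slot)

    # consumer side (tile T1), after network transport (steps 3 and 4)
    def claim_data(self):
        return self.full.popleft() if self.full else None

    def release_space(self, slot):        # step 5: slot recycled for future data
        self.free.append(slot)

ch = Channel(slots=2)
s = ch.claim_space()          # producer claims space
ch.release_data(s, 42)        # ...fills it and releases the data
d = ch.claim_data()           # consumer claims the data
value = ch.mem[d]             # reads it as a local memory access
ch.release_space(d)           # ...and frees the slot
```

Note that in the sketch, as on the platform, both sides only ever touch their local copy of the data; the transport between the two memories is the CA's job.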
Figure 2 depicts the hardware components of the CA. The pointers used for circular buffer management are stored in a pointer
store unit ("Pointer Store"). Every clock cycle, the CA checks
whether there is data to be transferred between the DM and
the NI FIFOs. The monitoring of the NI FIFOs is round robin,
which makes the architecture predictable. This predictability
allows us to give tight bounds on the reported performance of
the platform.
Before we can demonstrate how the communication between
the tiles and the timing behaviour of task execution can be analyzed, we first need to introduce SDFGs in
the next section.
4. SDF Graphs
Synchronous data flow graphs are often used for modeling
modern DSP applications [43] and for designing concurrent
multimedia applications implemented on multi-processor plat-
forms. Both pipelined streaming and cyclic dependencies be-
tween tasks can be easily modeled in SDFGs. Tasks are mod-
eled by the vertices of an SDFG, which are called actors. SD-
FGs allow analysis of a system in terms of throughput and
other performance properties, such as latency and buffer requirements [45].
Figure 3: Example of an SDF graph with four actors A, B, C and D.
Figure 3 shows an example of an SDFG. There are four ac-
tors in this graph. As in a typical data-flow graph, a directed
edge represents the dependency between tasks. Tasks also need
some input data (or control information) before they can start
and usually also produce some output data; such units of information are referred to as tokens. Actor execution is also called
firing. An actor is called ready when it has sufficient input tokens on all its input edges and sufficient buffer space on all its
output channels; an actor can only fire when it is ready.
The edges may also contain initial tokens, indicated by bul-
lets on the edges, as seen on the edge from actor C to actor A
in Figure 3. Buer sizes may be modeled as a back-edge with
initial tokens. In such cases, the number of tokens on this edge
indicates the buer size available. When an actor writes data
to such channels, the available size reduces; when the receiving
actor consumes this data, the available buffer space increases, mod-
eled by an increase in the number of tokens.
One of the most interesting properties of SDFGs relevant to
this paper is throughput. Throughput is defined as the inverse of
the long term period, i.e. the average time needed for one itera-
tion of the application. An iteration is defined as the minimum
non-zero execution such that the original state of the graph is
obtained. This is the performance parameter we use in this pa-
per.
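The firing rule and the notion of an iteration can be made concrete with a small sketch. The graph below is a toy example, not the graph of Figure 3: actor A produces two tokens per firing on an edge from which B consumes one, so one iteration fires A once and B twice and returns the edge to its initial state:

```python
class Edge:
    """An SDF edge with fixed production/consumption rates and a bounded
    buffer (the capacity plays the role of the back-edge with initial
    tokens described above). Illustrative sketch, not the paper's tool."""
    def __init__(self, prod_rate, cons_rate, capacity, tokens=0):
        self.prod_rate, self.cons_rate = prod_rate, cons_rate
        self.capacity, self.tokens = capacity, tokens

def ready(inputs, outputs):
    # An actor is ready when every input edge holds enough tokens and
    # every output edge has enough free buffer space.
    return (all(e.tokens >= e.cons_rate for e in inputs) and
            all(e.capacity - e.tokens >= e.prod_rate for e in outputs))

def fire(inputs, outputs):
    # Firing consumes tokens from the inputs and produces on the outputs.
    for e in inputs:
        e.tokens -= e.cons_rate
    for e in outputs:
        e.tokens += e.prod_rate

# Toy graph: A --(2/1)--> B, buffer capacity 2, no initial tokens.
ab = Edge(prod_rate=2, cons_rate=1, capacity=2)
schedule = []
if ready([], [ab]):
    fire([], [ab]); schedule.append('A')   # A fires once, producing 2 tokens
while ready([ab], []):
    fire([ab], []); schedule.append('B')   # B fires twice, consuming them
# schedule == ['A', 'B', 'B']: one iteration restores the initial state.
```

With execution times attached to the actors, the long-term period of such a self-timed execution gives the throughput used as the performance parameter in this paper.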

Citations
More filters

Exploring Trade-Offs inBuffer Requirements and Throughput Constraints forSynchronous Dataflow Graphs*

Sander Stuijk
TL;DR: This work presents exact techniques to chart the Pareto space of throughput and storage tradeoffs, which can be used to determine the minimal storage space needed to execute a graph under a given throughput constraint.
Book

Programming Heterogeneous MPSoCs: Tool Flows to Close the Software Productivity Gap

TL;DR: This book provides embedded software developers with techniques for programming heterogeneous Multi-Processor Systems-on-Chip (MPSoCs), capable of executing multiple applications simultaneously, with an in-depth description of the underlying problems and challenges of todays programming practices.
Proceedings ArticleDOI

A methodology for automated design of hard-real-time embedded streaming systems

TL;DR: This paper proposes a novel methodology for automated design of an embedded multiprocessor system, which can run multiple hard- real-time streaming applications simultaneously and enables the use of hard-real-time multiprocessionor scheduling theory to schedule the applications in a way that temporal isolation and a given throughput of each application are guaranteed.
Journal ArticleDOI

Dataflow formalisation of real-time streaming applications on a Composable and Predictable Multi-Processor SOC

TL;DR: A dataflow formalisation is described to independently model real-time applications executing on the CompSOC platform, including new models of the entire software stack, and correctly predicts trends, such as application speed-up when mapping an application to more processors.

An automated flow to map throughput constrained applications to a MPSoC

TL;DR: A design flow to map throughput constrained applications on a Multi-processor System-on-Chip (MPSoC) and is able to provide a tight, conservative bound on the worst-case throughput of the FPGA implementation.
References
Proceedings Article

The Semantics of a Simple Language for Parallel Programming.

Gilles Kahn
TL;DR: A simple language for parallel programming is described and its mathematical properties are studied to make a case for more formal languages for systems programming and the design of operating systems.
Book

Parallel Computer Architecture: A Hardware/Software Approach

TL;DR: This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures and provides comprehensive discussions of parallel programming for high performance and of workload-driven evaluation, based on understanding hardware-software interactions.
Journal ArticleDOI

Static Scheduling of Synchronous Data Flow Programs for Digital Signal Processing

TL;DR: This self-contained paper develops the theory necessary to statically schedule SDF programs on single or multiple processors, and a class of static (compile time) scheduling algorithms is proven valid, and specific algorithms are given for scheduling SDF systems onto single ormultiple processors.
Proceedings ArticleDOI

Blind Image Restoration Using a Block-Stationary Signal Model

TL;DR: A novel method for blind image restoration which is a multidimensional extension of an approach used successfully for audio restoration, and a maximum marginalised a posteriori (MMAP) blur estimate is obtained by optimising the resulting probability density function.
Frequently Asked Questions (16)
Q1. What are the contributions mentioned in the paper "Ca-mpsoc: an automated design flow for predictable multi-processor architectures for multiple applications" ?

In this paper, the authors present a fully automated design flow ( CA-MPSoC ) to generate communication assist CA-based multi-processor systems. The design flow provides performance estimates and timing guarantees for both hard real-time and soft real-time applications, provided the task to processor mappings are given by the user. In a mobile phone case study with 6 applications, the merging of use-cases results in a speed up of 18 when compared to the case where each use-case is evaluated individually. 

In the future, the authors intend to include an NoC also in their design flow. The authors also want to extend the design flow with automated mapping decisions, so that mapping of the actors to the processors can also be optimized. 

The updated actor execution times, execution probabilities and waiting probabilities are used to find the new processor level probabilities. 

Since the number of combinations is exponential in the number of actors mapped on a resource, the analysis has an exponential complexity. 

When an actor writes data to such channels, the available size reduces; when the receiving actor consumes this data, the available buffer increases, modeled by an increase in the number of tokens. 

In order to make use of tile-based platforms easier, inter-tile communication for these architectures should be predictable, fast and easy to program. 

The Xilinx tool takes about 36 minutes to generate the bit file together with the appropriate instruction and data memories for each core in the design. 

As the generated hardware supports multiple use-cases, the authors employ the use-case merging technique [26] and modify certain parts of it to incorporate CA buffers. 

In high performance embedded processors (like SPEs in Cell Broad Band Engine and graphics processors), non-preemptive systems are preferred over preemptive systems. 

One of the methods to find the throughput of an SDFG is to convert it into a homogeneous SDF (HSDF) graph and then find the throughput of the resulting graph. 

The worst-case waiting times for non-preemptive systems under FCFS, as mentioned in [16], are computed using the following formula:

t_wait = sum_{i=1}^{n} t_exec(a_i)    (5)

where the actors a_i for i = 1, 2, ..., n are mapped on the same resource (i.e. processor). 
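Eq. (5) is straightforward to compute; the following sketch illustrates it for hypothetical execution times (the actor names and times are illustrative, not from the paper):

```python
def worst_case_waiting_time(exec_times):
    # Eq. (5): under non-preemptive FCFS arbitration, in the worst case an
    # actor waits for every actor mapped on the same processor to finish,
    # so the worst-case waiting time is the sum of their execution times.
    return sum(exec_times)

# Three hypothetical actors sharing one processor, times in cycles.
t_wait = worst_case_waiting_time([300, 150, 50])   # -> 500 cycles
```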

The DCT actor sends these 6 macro-blocks one by one (64 pixels each time) to the VLC actor where each of these macro-blocks is variable length encoded. 

While the number of processors and CA buffers needed is updated with a max operation (line 10 and line 11 in Algorithm 1), the number of CA channels is added for each application (indicated by line 13 in Algorithm 1). 
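The resource-update rule described above can be sketched as follows; the dictionary layout is an assumption made for illustration, only the max-versus-add policy comes from Algorithm 1 in the paper:

```python
def merge_use_cases(use_cases):
    # Superset hardware for all use-cases: processor and CA-buffer counts
    # are combined with a max operation (lines 10-11 of Algorithm 1),
    # while the CA channels of each application are added (line 13).
    merged = {"processors": 0, "ca_buffers": 0, "ca_channels": 0}
    for uc in use_cases:
        merged["processors"] = max(merged["processors"], uc["processors"])
        merged["ca_buffers"] = max(merged["ca_buffers"], uc["ca_buffers"])
        merged["ca_channels"] += uc["ca_channels"]
    return merged

# Two hypothetical use-cases merged into one superset platform.
superset = merge_use_cases([
    {"processors": 2, "ca_buffers": 4, "ca_channels": 3},
    {"processors": 3, "ca_buffers": 2, "ca_channels": 2},
])
# -> 3 processors, 4 CA buffers, 5 CA channels
```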

As the number of design points increases, the cost of generating the hardware becomes negligible and each iteration takes only about 25 seconds. 

To avoid this, the claimreadspace and claimwritespace commands have been implemented as non-blocking, so that if either claim-space command is unsuccessful, the processor is not blocked. 
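The behaviour of such a non-blocking claim can be sketched as below; the dictionary-based channel state and function names are illustrative stand-ins, not the actual CA command interface:

```python
def claim_write_space(channel, tokens):
    # Non-blocking variant: instead of stalling the processor when the CA
    # buffer lacks free space, report failure so the caller can retry later.
    if channel["free"] >= tokens:
        channel["free"] -= tokens
        return True
    return False

def release_space(channel, tokens):
    # Consumer side: releasing consumed tokens frees buffer space again.
    channel["free"] += tokens

fifo = {"free": 4}                    # hypothetical CA buffer, 4 tokens free
ok = claim_write_space(fifo, 3)       # succeeds, 1 token of space remains
blocked = claim_write_space(fifo, 3)  # fails, but the processor keeps running
```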

Streaming applications can be described in a dataflow-like manner, and the computational kernels of this flow can be easily mapped to suitable processing elements.