CA-MPSoC: An Automated Design Flow for Predictable Multi-processor Architectures
for Multiple Applications
A. Shabbir (a), A. Kumar (a,b), S. Stuijk (a), B. Mesman (a), H. Corporaal (a)
(a) Eindhoven University of Technology, Eindhoven, The Netherlands
(b) National University of Singapore, Singapore
Abstract
Future applications for embedded systems demand multi-processor designs to meet real-time deadlines. The large number of
applications in these systems generates an exponential number of use-cases. The key design automation challenges are designing
systems for these use-cases and fast exploration of software and hardware implementation alternatives with accurate performance
evaluation of these use-cases. These challenges can not be overcome by current design methodologies which are semi-automated,
time consuming and error-prone.
In this paper, we present a fully automated design flow (CA-MPSoC) to generate communication assist (CA) based multi-processor
systems. A worst-case performance model of our CA is proposed so that the performance of the CA-based platform can be analyzed
before its implementation. The design flow provides performance estimates and timing guarantees for both hard real-time and soft
real-time applications, provided the task to processor mappings are given by the user. The flow automatically generates a super-set
hardware that can be used in all use-cases of the applications. The flow also generates the software for each of these use-cases,
including the configuration of communication architecture and interfacing with application tasks.
CA-MPSoC has been implemented on Xilinx FPGAs for evaluation. It is also available on-line for the benefit of the research
community and is used for performance analysis of two real-life applications, Sobel and JPEG encoder executing concurrently. The
CA-based platform generated by our design flow records a maximum error of 3.4% between analyzed and measured periods. In a
mobile phone case study with 6 applications, the merging of use-cases results in a speed up of 18 when compared to the case where
each use-case is evaluated individually.
Key words: Multi-processor, Multiple Applications, Performance Analysis, Automated Design Flow, Communication Assist
1. Introduction
Modern multimedia embedded systems have to support a
large number of independent applications. In the area of
portable consumer systems, such as mobile phones, the num-
ber of applications doubles roughly every two years and the
introduction of new technology solutions is increasingly driven
by applications [18]. Tile-based multi-processor platforms [47,
23, 24, 12, 39] are increasingly being used in modern embedded systems to meet the tight timing and high performance
requirements of this large number of applications and their use-cases.
A use-case is a combination of concurrently executing applica-
tions. The number of such potential use-cases is exponential in
the number of applications that are present in the system.
In general, mapping applications onto tile-based platforms is
considered difficult. However, streaming applications can be
described in a data flow like manner and the computational ker-
nels of this flow can be easily mapped to suitable processing
elements. In essence, these systems trade architectural com-
plexity for communications, spreading work across a number
Email addresses: a.shabbir@tue.nl (A. Shabbir), a.kumar@tue.nl
(A. Kumar), s.stuijk@tue.nl (S. Stuijk), b.mesman@tue.nl (B. Mesman),
h.corporaal@tue.nl (H. Corporaal)
of sparsely connected small tiles rather than among richly con-
nected functional units of a monolithic, wide core. In order to
make use of tile-based platforms easier, inter-tile communica-
tion for these architectures should be predictable, fast and easy
to program.
In [9], a multi-processor platform is introduced that de-
couples the computation and communication of applications
through a hardware communication assist (CA). This decou-
pling off-loads the communication load from the processor,
thereby improving the performance significantly. Further, this
makes it easier to provide tight timing guarantees on the com-
putation and communication tasks that are performed by the
applications running on the platform. Several CA architec-
tures [33, 4, 35, 37] have been presented in the literature. How-
ever, it is very time consuming to map applications on these
platforms due to unavailability of platform generation tools.
Furthermore, it is very difficult to program them as the user
has to configure the communication infrastructure in addition
to the application functionality.
Manual design efforts are error-prone and consume a lot of time. To make matters worse, most of these devices have a very
short product life, so the shorter time-to-market for these systems poses a challenge for the designers. The designers have to
verify each use-case. For example, Bluetooth 2.5 has to meet its
Preprint submitted to Systems Architecture March 1, 2010

specification during each combination of applications. It should
perform while receiving a call or sending text messages or even
taking a picture. So there is a need for automated tools which
can reduce the design generation and verification time.
There are some multi-processor design tools [37, 44, 20, 31],
but most of them lack support for multiple applications, let alone
multiple use-cases, and require manual steps. There is a tool
described in [26] that supports platform generation for multi-
ple applications and their use-cases but it does not support CA-
based platforms. Automated platform generation reduces errors
in the design and thus saves time for design iterations.
Automatic platform generation is very helpful for the design-
ers but often they are also interested in knowing about the ex-
pected performance of the applications before the actual syn-
thesis of the platform. This allows the designers to choose the
design which meets their requirements. There are some perfor-
mance evaluation tools [46, 22, 48, 29], but most of them are
for a single application. There is a tool [28] for performance anal-
ysis for multiple applications but it does not take into account
the communication architecture details.
In this paper, we present a design flow (CA-MPSoC) that
takes models of multiple applications and their task to proces-
sor mappings as input and gives the expected performance of the
applications. We use Synchronous Data Flow graphs [30] (SDFGs) to
model the applications. These application models are refined
with the details of the communication architecture and actor-to-
processor mappings. The refined graphs are used to predict the
performance of multiple applications. If the designer is satisfied
with the performance estimates, he/she can generate a CA-based
platform using our CA-MPSoC. As far as we know, this is
the first design flow which can generate a CA-based platform.
Following are the key contributions of the paper.
Performance analysis: The flow provides the expected perfor-
mance of applications on the platform, given that the mappings of tasks to processors are already provided. The applications are specified as SDFGs and architecture details are added to these graphs. A model of the CA is introduced and used to generate architecture-aware SDFGs. The tool provides both the worst case
and average case performance results from these graphs.
Worst case results can be used for hard real-time applica-
tions whereas the average case can be used for soft real-
time applications.
Automatic CA-based multi-processor generation: An auto-
mated design flow that generates multi-processor systems,
directly from the architecture aware application graphs.
The flow also generates the communication infrastructure
so that the designer need not worry about it. It generates a
super-set hardware which can be used for all the use-cases.
The software for each use-case is generated individually.
This reduces the verification time of all the use-cases of
the applications. Designers can verify that their applications will meet the required performance in all possible
combinations of applications.
SDF Task Interface: Another contribution of this work is def-
inition of an interface for the tasks such that the semantics
of SDF behaviour are maintained during execution. So
when an application specification includes high-level lan-
guage code corresponding to tasks in the application, the
source code is automatically added to the desired proces-
sor.
Software generation: The software for all the processors is
automatically generated in the flow. Further, the required
communication APIs are also generated. This includes
configuration of communication channels, setting up con-
nections, and management of memory used for communi-
cation. The programmer need not bother with these configurations and can concentrate on the functionality of the
applications.
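To make the super-set idea concrete, the merging step can be sketched in a few lines. The sketch below is purely illustrative and not the algorithm used in our flow; it assumes (hypothetically) that each use-case's hardware requirements can be summarized as per-resource counts, in which case a platform covering every use-case takes the element-wise maximum:

```python
# Illustrative sketch only: resource names and counts are invented, and the
# real flow merges full platform descriptions, not simple count dictionaries.

def superset_hardware(use_cases):
    """Return a platform that covers every use-case by taking, for each
    resource, the maximum count any single use-case needs."""
    merged = {}
    for uc in use_cases:
        for resource, count in uc.items():
            merged[resource] = max(merged.get(resource, 0), count)
    return merged

uc1 = {"processors": 3, "ca_channels": 4, "ni_fifos": 4}  # hypothetical use-case
uc2 = {"processors": 2, "ca_channels": 6, "ni_fifos": 6}  # hypothetical use-case
platform = superset_hardware([uc1, uc2])
# -> {'processors': 3, 'ca_channels': 6, 'ni_fifos': 6}
```

A platform dimensioned this way is never smaller than any single use-case requires, which is what allows all use-cases to run on the same hardware.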
The above contributions are essential to further research in the design automation community since embedded devices are in-
creasingly becoming multi-featured. Our flow allows designers
to evaluate the performance of applications on the architecture
before actually synthesizing it. It also allows the designers to
generate the platform for either hard real-time or soft real-time
systems with given sets of actor to processor mappings. CA-
MPSoC is evaluated on two real-life applications, Sobel and
JPEG Encoder. The maximum error between estimated and
measured periods of these applications is about 3.4% for soft
real-time analysis. Furthermore, platform generation for multiple use-cases is evaluated with a mobile phone case study
consisting of 6 applications. The merging of use-cases gives a
platform which supports all the use-cases. This merging results
in a speed up of 18 as compared to the case where the use-cases
are evaluated individually. The tool is made available on-line [7]
for the benefit of the research community.
The rest of the paper is organized as follows. Section 2 re-
views the related work for existing CA architectures, perfor-
mance analysis and automatic platform generation tool flows.
In Section 3 we describe our architecture template. Section 4
introduces SDFGs. Section 5 presents the SDF model of our CA.
In Section 6, we show how the SDF model of the CA can be in-
corporated in the application model and how the performance of
applications can be predicted. Section 7 gives details of the
steps performed in our design flow to generate the platform.
Section 8 describes details of tool implementation. Section 9
presents results of the experiments performed to evaluate our
design flow. Section 10 concludes the paper and gives direc-
tions for future work.
2. Related Work
2.1. Communication Assist
The communication controller presented in [37] implements
FIFO based communication between tasks. Writes to the FI-
FOs are always local to a processor whereas reads are always
remote (from the FIFO memory of a producer). The program-
ming model is based on Kahn Process Network [21] (KPN).
Due to FIFO based communication, out-of-order access, re-reading, and skipping are only possible after storing the data
locally in the consuming task. In our CA-based platform, all the

reads/writes to the memory are local to the producer/consumer
resulting in savings in memory space.
In [32], the authors have presented a SystemC model of a CA,
but there are some key differences with our CA. They propose
separate communication and computation memories whereas
in our case, the data memory is also used as communication
memory. In [13], the authors have presented a synchronization
scheme for embedded shared memory systems. They propose
channel controllers for synchronization of data between tasks.
They have one channel controller per channel; our implementation
has one controller for all the channels, resulting in an area-efficient
implementation. Authors in [6] describe communication
between Nested Loop Programs (NLP) in multi-processor sys-
tems. The algorithm is implemented in software and can handle
out-of-order access to the buffer. Both producer and consumer
have their respective write and read windows for mutually ex-
clusive access. However, the algorithm is limited to single-assignment code. Our CA does not impose such restrictions.
A KPN is derived from NLP in [49]. In KPN communication
between the tasks is arranged via FIFO buers. When the con-
suming task has to read a location multiple times, the consumer
stores the array in an additional buer. Instead of FIFO buers,
we use circular buers and also there is no need to copy values
in an additional buer. The work by [17] is quite similar to [49]
and uses a read and write window.
The Cell BE [15] implements communication between processing elements (SPEs) and the external memory through
DMA controllers called Memory Flow Controllers (MFCs). The
key difference between the MFC and our CA is the fact that with the MFC
the synchronization between the memories has to be performed
explicitly by the SPEs. In the case of the CA, the synchronization is
taken care of by the CA itself and the processor is freed from
the synchronization overhead.
In the KPN model of computation, processes communicate
with each other by sending data to each other over edges. A
process may write to an edge whenever it wants. When it tries
to read from an edge which is empty, it blocks and must wait till
the data is available. The amount of data read from an edge may
be data-dependent. This allows modeling of any continuous
function from the inputs of the KPN to the outputs of the KPN.
It has been proved in the literature that it is not possible to analyze properties like the throughput or buffer requirements of
a KPN at design time [14]. SDF, on the other hand, is a more
restrictive model. A task can only execute if it has input data
and space available at the output. The sizes of input and output data
are also fixed, so throughput analysis and buffer capacity analysis
of SDF graphs are possible statically, which makes SDF more
attractive than KPN.
Note that others in fact impose restrictions on the KPN
graphs that are accepted by their tools. These constraints turn
these graphs into cyclo-static dataflow graphs. Such a cyclo-static
dataflow graph can always be transformed into an SDFG and
mapped using our flow. Hence it may seem that others use a
more flexible model, but in fact their restrictions imply that they use
the same model as we do.
2.2. Design Flows for Platform Generation
The problem of mapping an application to an architecture
has been widely studied in literature. One of the recent works
most related to our research is ESPAM [37]. This uses Kahn
process networks (KPNs) [21] for application specification. In
our approach, we use SDFGs for application specification in-
stead. Further, our approach supports mapping of multiple ap-
plications, while ESPAM is limited to a single application. This
difference is crucial for developing modern embedded systems
which support tens of applications on a single
MPSoC. The same difference can be seen between our approach
and the one proposed in [20], where an exploration framework
to build efficient FPGA multi-processors is proposed.
The Compaan/Laura design flow presented in [44] also uses
KPN specification for mapping applications to FPGAs. How-
ever, their approach is limited to a processor-coprocessor pair.
Our approach aims at synthesizing complete MPSoC designs
supporting multiple processors. Another approach for gen-
erating application-specific MPSoC architectures is presented
in [31]. However, most of the steps in their approach are done
manually. Exploring multiple design iterations is therefore not
feasible. In contrast, our entire flow is automated, including
the generation of the final bit-file that runs on the FPGA. Yet
another flow for generating MPSoCs for FPGAs has been pre-
sented in [27]. However, that flow focuses on generic MPSoCs
and not on application-specific architectures. There is also a
tool described in [26] that supports platform generation for mul-
tiple use-cases but it does not support CA-based platforms.
Xilinx provides a tool-chain as well to generate designs with
multiple processors and peripherals [50]. However, most of
the features are limited to designs with a bus-based processor-
coprocessor pair with shared memory. It is very time consum-
ing and error prone to generate an MPSoC architecture and
the corresponding software projects to run on the system. In
our flow, an MPSoC architecture is automatically generated to-
gether with the respective software projects for each core.
Finally, none of the above flows support a CA-based platform. In fact, our flow is the first to generate CA-based multi-processor platforms. Communication plays an important role in
the parallelization of applications: the communication-to-computation ratio determines whether splitting a task between
processors is justified. Our CA in turn exposes more parallelism
in the applications.
In [8], the authors present a design flow that generates a
multicore system for multimedia applications. Their work is
quite similar to ours. However, there are some key dierences.
Firstly, they use a mesh network for interconnection whereas we
use point-to-point networks. Secondly, they use profiling to dimension their system. We, on the other hand, use static analysis
techniques. Profiling based techniques are significantly slower
than analysis based techniques. Also their synthesis flow gener-
ates platforms for average case performance whereas our flow
can generate platforms for both worst case and average case
performance. Lastly, our flow supports multiple applications
concurrently executing on the platform, while [8] targets a single
application.

Figure 1: Proposed CA-based platform (two tiles T0 and T1, each with a PE, a CA, a data memory (DM) and NI FIFOs, connected through a network; the numbers 1-5 mark the steps of a data transaction).
2.3. Performance Analysis
In [34], the authors propose to analyze the performance of a
single application modeled as an SDFG by decomposing it into
a homogeneous SDF graph (HSDFG) [43]. The throughput is
calculated based on analysis of each cycle in the resulting HS-
DFG [10]. However, this can result in an exponential number
of vertices [38]. Thus, algorithms that have a polynomial com-
plexity for HSDFGs have an exponential complexity for SD-
FGs. This approach is not practical for multiple applications.
For multiple applications, an approach that models resource
contention by computing worst-case-response-time (WCRT)
for TDMA scheduling (requires preemption) has been analyzed
in [3]. A similar worst-case analysis approach for round-robin
is presented in [16], which also considers non-preemptive systems, but suffers from the same lack of scalability. Real-time calculus has also been used to provide worst-case
bounds for multiple applications [22, 48, 29]. That analysis is
very intensive and requires a very large design-time effort. The
worst-case-waiting-time analysis used in our tool, on the other hand,
is very fast and simple.
A common way to use probabilities for modeling dynamism
in application is using stochastic task execution times [1, 42,
41]. The probabilistic approach [25] that we adopt uses probabilities to model the resource contention and provides estimates
for the throughput of applications. This approach is orthogonal to the approach of using stochastic task execution times.
To the best of our knowledge, there is no efficient approach
for analyzing multiple applications on a non-preemptive hetero-
geneous multi-processor platform. A technique has been pre-
sented in [28] to also model and analyze contention, but the ap-
proach used in this paper is much better. The technique in [28]
looks at all possible combinations of actors blocking another
actor. Since the number of combinations is exponential in the
number of actors mapped on a resource, the analysis has an
exponential complexity. The approach used in this paper has
linear complexity in the number of actors.
3. Architecture Template
The architecture template used in our platform is depicted
in Figure 1. It consists of a processing element (PE), a communication assist (CA), a data memory (DM) and network interface FIFOs (NI FIFOs). The CA transfers data between the
DM and the NI FIFO. The NI FIFOs are connected through a
partial point-to-point network. The structure of the networks
themselves is out of the scope of this paper.
Scalability of partial point-to-point networks has been an is-
sue as they require storage to deal with bursts. FSL buses from
Xilinx are one example. However, the point-to-point networks
used in our template do not require storage. This means that
the cost of a connection is not very high. The CAs can transfer the
data directly from the data memory of sending tile to the data
memory of the receiving tile, i.e. they do not require storage in
the point-to-point network itself.
3.1. Processing Element
The processing elements used in our template are simple
RISC based processors. RISC processors are the processing
element of choice for tile-based platforms [47]. No caches are
attached to the processor, to keep its execution trace predictable.
The PE has local instruction and data memories. The instruc-
tion memory is connected to the PE through a bus whereas the
access to the data memory is through the communication assist.
Note that we chose MicroBlaze processors from Xilinx, whereas
there is work [2] where PicoBlaze processors are used. Our synthesis flow is not restricted to any one processor type, so the choice
of processor is not important.
The PE is non-preemptive and can execute only a single thread.
This simplifies the architecture of the PE. Preemption requires
extra hardware and is costly in terms of area. Furthermore, non-
preemptive scheduling algorithms are easier to implement than
their preemptive counterparts and have dramatically lower
overhead at runtime [19]. In high performance
embedded processors (like the SPEs in the Cell Broadband Engine
and graphics processors), non-preemptive systems are preferred
over preemptive systems.
3.2. Memories
We use a single port instruction memory, which is directly
connected to the PE. The data memory (DM) used in our tem-
plate is a dual ported memory as depicted in Figure 1. The
CA has exclusive access to one port of this memory. The sec-
ond port is connected to the PE through the CA. The choice of
a dual ported memory may seem expensive; however, we use it to
make memory access for both the CA and the PE as fast as possible. The alternative would be an arbiter to resolve the access
between the two, but for predictable performance we preferred a
dual ported memory over a combination of an arbiter and a single ported memory. A single ported memory can introduce stall
cycles for the processor, which in turn makes the execution time
of the task executing on the processor unpredictable. Further,
it is very difficult to model an unpredictable arbiter, so we decided to use a dual ported DM. The next subsection will clarify this
configuration.
3.3. Communication Assist
Figure 2 shows the global view of the CA (more details about the
architecture can be found in [40]). It performs the following basic
functions:

Figure 2: CA architecture (address translation unit "Addr_tr", control FSM, pointer store, and connections to the NI FIFOs, DM and PE).
1. It configures NI FIFO channels and their corresponding
buffers in the DM.
2. It accepts data transfer requests from the attached PE
and splits them into local memory requests and remote
requests (to other tiles). The address translation unit
"Addr_tr" shown in Figure 2 performs this task.
3. Local memory requests are simply bypassed to the data
memory.
4. Remote memory requests are handled through a round
robin arbiter. Every two cycles, a 32 bit word is transferred from the buffer in the memory to the NI FIFO channels
and vice versa.
5. The buffers implemented in the memory are circular
buffers. The pointers needed for circular buffer management are updated and stored in the CA. The number of NI
FIFO channels can be greater than or equal to the number of
buffers in the data memory.
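As an illustration of the pointer bookkeeping in item 5, the following sketch models one circular buffer with read and write pointers. The class and field names are invented for this example and do not come from the CA implementation:

```python
class CircularBuffer:
    """Sketch of circular-buffer pointer management; illustrative only,
    not the CA's pointer-store hardware."""
    def __init__(self, size):
        self.size = size
        self.wr = 0      # next word the producer writes
        self.rd = 0      # next word the consumer reads
        self.count = 0   # words currently buffered

    def space(self):
        # Words the producer may still claim.
        return self.size - self.count

    def data(self):
        # Words the consumer may still read.
        return self.count

    def write(self, n):
        assert n <= self.space()
        self.wr = (self.wr + n) % self.size  # wrap around at the end
        self.count += n

    def read(self, n):
        assert n <= self.data()
        self.rd = (self.rd + n) % self.size
        self.count -= n
```

Because only the two pointers and a count are needed per buffer, a single controller can keep this state for all channels in one pointer store, as the CA does.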
Our communication assist acts as an interface that provides a link
between the NoC and the subsystems (PE and memory). It also
acts as a memory management unit that helps the processor keep
track of its data structures. As a result, it decouples communication from computation and relieves the processor of data
transfer functions. Our programmable CA uses a shared data
and buffer memory. This leads to a lower memory requirement
for the overall system and to a lower communication latency.
Figure 1 shows CA-based multi-processor tiles and demonstrates the steps involved during data transactions between the
tiles. Assume tile T0 is executing a producer task and tile T1
is executing a consumer task. The primitives used for communication are known as the C-HEAP [36] protocol. The producer
task executing on tile T0 requests space. The CA returns the
pointer to the buffer in the memory (step 1 in Figure 1). The
PE processes the data as a local memory access. It then requests
the CA to release the space. The CA transfers the
data to the designated NI FIFO (step 2). The data is transported
through the network (step 3). The CA of the consumer task executing on tile T1 receives the data and places it in the memory
(step 4). The consumer task queries the CA about the availability of the data. The CA sends the pointer to this data and
the PE can access it like a local memory request (step 4). The
consumer task processes the data and releases the space so that
the CA can use this space for future data receptions (step 5).
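The five-step handshake described above can be mimicked in software to clarify the protocol. The sketch below is hypothetical: the method names are invented, and the real C-HEAP primitives and the CA perform these steps in hardware:

```python
from collections import deque

class Channel:
    """Software mimic of the claim/release handshake; illustrative only."""
    def __init__(self, slots):
        self.free = deque(range(slots))   # slots the producer may claim
        self.full = deque()               # slots holding data for the consumer
        self.mem = [None] * slots         # stands in for the buffer in the DM

    # producer side (tile T0)
    def claim_space(self):                # step 1: CA returns a pointer (slot)
        return self.free.popleft() if self.free else None

    def release_data(self, slot, value):  # step 2: data handed over for transport
        self.mem[slot] = value
        self.full.append(slot)

    # consumer side (tile T1), after network transport (steps 3 and 4)
    def claim_data(self):
        return self.full.popleft() if self.full else None

    def release_space(self, slot):        # step 5: slot recycled for future data
        self.free.append(slot)

ch = Channel(slots=2)
s = ch.claim_space()          # producer claims space
ch.release_data(s, 42)        # ...fills it and releases the data
d = ch.claim_data()           # consumer claims the data
value = ch.mem[d]             # reads it as a local memory access
ch.release_space(d)           # ...and frees the slot
```

Note that in the sketch, as on the platform, both sides only ever touch their local copy of the data; the transport between the two memories is the CA's job.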
Figure 2 depicts the hardware components of the CA. The pointers used for circular buffer management are stored in a pointer
store unit ("Pointer Store"). Every clock cycle, the CA checks
whether there is data to be transferred between the DM and
the NI FIFOs. The monitoring of the NI FIFOs is round robin,
which makes the architecture predictable. This predictability
allows us to give tight bounds on the reported performance of
the platform.
Before we can demonstrate how the communication between
the tiles and the timing behaviour of task execution can be analyzed, we first need to introduce SDFGs in
the next section.
4. SDF Graphs
Synchronous data flow graphs are often used for modeling
modern DSP applications [43] and for designing concurrent
multimedia applications implemented on multi-processor plat-
forms. Both pipelined streaming and cyclic dependencies be-
tween tasks can be easily modeled in SDFGs. Tasks are mod-
eled by the vertices of an SDFG, which are called actors. SD-
FGs allow analysis of a system in terms of throughput and
other performance properties, such as latency and buffer requirements [45].
Figure 3: Example of an SDF graph with four actors A, B, C and D.
Figure 3 shows an example of an SDFG. There are four ac-
tors in this graph. As in a typical data-flow graph, a directed
edge represents the dependency between tasks. Tasks also need
some input data (or control information) before they can start
and usually also produce some output data; such units of information are referred to as tokens. Actor execution is also called
firing. An actor is called ready when it has sufficient input tokens on all its input edges and sufficient buffer space on all its
output channels; an actor can only fire when it is ready.
The edges may also contain initial tokens, indicated by bul-
lets on the edges, as seen on the edge from actor C to actor A
in Figure 3. Buer sizes may be modeled as a back-edge with
initial tokens. In such cases, the number of tokens on this edge
indicates the buer size available. When an actor writes data
to such channels, the available size reduces; when the receiving
actor consumes this data, the available buffer space increases, mod-
eled by an increase in the number of tokens.
One of the most interesting properties of SDFGs relevant to
this paper is throughput. Throughput is defined as the inverse of
the long term period, i.e. the average time needed for one itera-
tion of the application. An iteration is defined as the minimum
non-zero execution such that the original state of the graph is
obtained. This is the performance parameter we use in this pa-
per.
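The firing rule and the notion of an iteration can be made concrete with a small sketch. The graph below is a toy example, not the graph of Figure 3: actor A produces two tokens per firing on an edge from which B consumes one, so one iteration fires A once and B twice and returns the edge to its initial state:

```python
class Edge:
    """An SDF edge with fixed production/consumption rates and a bounded
    buffer (the capacity plays the role of the back-edge with initial
    tokens described above). Illustrative sketch, not the paper's tool."""
    def __init__(self, prod_rate, cons_rate, capacity, tokens=0):
        self.prod_rate, self.cons_rate = prod_rate, cons_rate
        self.capacity, self.tokens = capacity, tokens

def ready(inputs, outputs):
    # An actor is ready when every input edge holds enough tokens and
    # every output edge has enough free buffer space.
    return (all(e.tokens >= e.cons_rate for e in inputs) and
            all(e.capacity - e.tokens >= e.prod_rate for e in outputs))

def fire(inputs, outputs):
    # Firing consumes tokens from the inputs and produces on the outputs.
    for e in inputs:
        e.tokens -= e.cons_rate
    for e in outputs:
        e.tokens += e.prod_rate

# Toy graph: A --(2/1)--> B, buffer capacity 2, no initial tokens.
ab = Edge(prod_rate=2, cons_rate=1, capacity=2)
schedule = []
if ready([], [ab]):
    fire([], [ab]); schedule.append('A')   # A fires once, producing 2 tokens
while ready([ab], []):
    fire([ab], []); schedule.append('B')   # B fires twice, consuming them
# schedule == ['A', 'B', 'B']: one iteration restores the initial state.
```

With execution times attached to the actors, the long-term period of such a self-timed execution gives the throughput used as the performance parameter in this paper.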

Citations
More filters

Exploring Trade-Offs inBuffer Requirements and Throughput Constraints forSynchronous Dataflow Graphs*

Sander Stuijk
TL;DR: This work presents exact techniques to chart the Pareto space of throughput and storage tradeoffs, which can be used to determine the minimal storage space needed to execute a graph under a given throughput constraint.
Book

Programming Heterogeneous MPSoCs: Tool Flows to Close the Software Productivity Gap

TL;DR: This book provides embedded software developers with techniques for programming heterogeneous Multi-Processor Systems-on-Chip (MPSoCs), capable of executing multiple applications simultaneously, with an in-depth description of the underlying problems and challenges of todays programming practices.
Proceedings ArticleDOI

A methodology for automated design of hard-real-time embedded streaming systems

TL;DR: This paper proposes a novel methodology for automated design of an embedded multiprocessor system, which can run multiple hard- real-time streaming applications simultaneously and enables the use of hard-real-time multiprocessionor scheduling theory to schedule the applications in a way that temporal isolation and a given throughput of each application are guaranteed.
Journal ArticleDOI

Dataflow formalisation of real-time streaming applications on a Composable and Predictable Multi-Processor SOC

TL;DR: A dataflow formalisation is described to independently model real-time applications executing on the CompSOC platform, including new models of the entire software stack, and correctly predicts trends, such as application speed-up when mapping an application to more processors.

An automated flow to map throughput constrained applications to a MPSoC

TL;DR: A design flow to map throughput constrained applications on a Multi-processor System-on-Chip (MPSoC) and is able to provide a tight, conservative bound on the worst-case throughput of the FPGA implementation.
References
Proceedings Article

The Semantics of a Simple Language for Parallel Programming.

Gilles Kahn
TL;DR: A simple language for parallel programming is described and its mathematical properties are studied to make a case for more formal languages for systems programming and the design of operating systems.
Book

Parallel Computer Architecture: A Hardware/Software Approach

TL;DR: This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures and provides comprehensive discussions of parallel programming for high performance and of workload-driven evaluation, based on understanding hardware-software interactions.
Journal ArticleDOI

Static Scheduling of Synchronous Data Flow Programs for Digital Signal Processing

TL;DR: This self-contained paper develops the theory necessary to statically schedule SDF programs on single or multiple processors, and a class of static (compile time) scheduling algorithms is proven valid, and specific algorithms are given for scheduling SDF systems onto single ormultiple processors.
Proceedings ArticleDOI

Blind Image Restoration Using a Block-Stationary Signal Model

TL;DR: A novel method for blind image restoration which is a multidimensional extension of an approach used successfully for audio restoration, and a maximum marginalised a posteriori (MMAP) blur estimate is obtained by optimising the resulting probability density function.
Frequently Asked Questions (16)
Q1. What are the contributions mentioned in the paper "Ca-mpsoc: an automated design flow for predictable multi-processor architectures for multiple applications" ?

In this paper, the authors present a fully automated design flow ( CA-MPSoC ) to generate communication assist CA-based multi-processor systems. The design flow provides performance estimates and timing guarantees for both hard real-time and soft real-time applications, provided the task to processor mappings are given by the user. In a mobile phone case study with 6 applications, the merging of use-cases results in a speed up of 18 when compared to the case where each use-case is evaluated individually. 

In the future, the authors intend to include an NoC also in their design flow. The authors also want to extend the design flow with automated mapping decisions, so that mapping of the actors to the processors can also be optimized. 

The updated actor execution times, execution probabilities and waiting probabilities are used to find the new processor level probabilities. 

Since the number of combinations is exponential in the number of actors mapped on a resource, the analysis has an exponential complexity. 

When an actor writes data to such channels, the available size reduces; when the receiving actor consumes this data, the available buffer increases, modeled by an increase in the number of tokens. 

In order to make use of tile-based platforms easier, inter-tile communication for these architectures should be predictable, fast and easy to program. 

The Xilinx tool takes about 36 minutes to generate the bit file together with the appropriate instruction and data memories for each core in the design. 

As the generated hardware supports multiple use-cases, the authors employ the use-case merging technique [26] and modify certain parts of it to incorporate CA buffers. 

In high performance embedded processors (like SPEs in Cell Broad Band Engine and graphics processors), non-preemptive systems are preferred over preemptive systems. 

One of the methods to find the throughput of an SDFG is to convert it into a homogeneous SDF (HSDF) graph and then find the throughput of the resulting graph. 

The worst-case waiting times for non-preemptive systems under FCFS, as mentioned in [16], are computed using the following formula:

t_wait = sum_{i=1}^{n} t_exec(a_i)    (5)

where the actors a_i for i = 1, 2, ..., n are mapped on the same resource (i.e. processor). 
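Eq. (5) is straightforward to compute; the following sketch illustrates it for hypothetical execution times (the actor names and times are illustrative, not from the paper):

```python
def worst_case_waiting_time(exec_times):
    # Eq. (5): under non-preemptive FCFS arbitration, in the worst case an
    # actor waits for every actor mapped on the same processor to finish,
    # so the worst-case waiting time is the sum of their execution times.
    return sum(exec_times)

# Three hypothetical actors sharing one processor, times in cycles.
t_wait = worst_case_waiting_time([300, 150, 50])   # -> 500 cycles
```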

The DCT actor sends these 6 macro-blocks one by one (64 pixels each time) to the VLC actor where each of these macro-blocks is variable length encoded. 

While the number of processors and CA buffers needed is updated with a max operation (line 10 and line 11 in Algorithm 1), the number of CA channels is added for each application (indicated by line 13 in Algorithm 1). 
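The resource-update rule described above can be sketched as follows; the dictionary layout is an assumption made for illustration, only the max-versus-add policy comes from Algorithm 1 in the paper:

```python
def merge_use_cases(use_cases):
    # Superset hardware for all use-cases: processor and CA-buffer counts
    # are combined with a max operation (lines 10-11 of Algorithm 1),
    # while the CA channels of each application are added (line 13).
    merged = {"processors": 0, "ca_buffers": 0, "ca_channels": 0}
    for uc in use_cases:
        merged["processors"] = max(merged["processors"], uc["processors"])
        merged["ca_buffers"] = max(merged["ca_buffers"], uc["ca_buffers"])
        merged["ca_channels"] += uc["ca_channels"]
    return merged

# Two hypothetical use-cases merged into one superset platform.
superset = merge_use_cases([
    {"processors": 2, "ca_buffers": 4, "ca_channels": 3},
    {"processors": 3, "ca_buffers": 2, "ca_channels": 2},
])
# -> 3 processors, 4 CA buffers, 5 CA channels
```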

As the number of design points increases, the cost of generating the hardware becomes negligible and each iteration takes only about 25 seconds. 

To avoid this, the claimreadspace and claimwritespace commands have been implemented as non-blocking, so that if either claim-space command is unsuccessful, the processor is not blocked. 
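The behaviour of such a non-blocking claim can be sketched as below; the dictionary-based channel state and function names are illustrative stand-ins, not the actual CA command interface:

```python
def claim_write_space(channel, tokens):
    # Non-blocking variant: instead of stalling the processor when the CA
    # buffer lacks free space, report failure so the caller can retry later.
    if channel["free"] >= tokens:
        channel["free"] -= tokens
        return True
    return False

def release_space(channel, tokens):
    # Consumer side: releasing consumed tokens frees buffer space again.
    channel["free"] += tokens

fifo = {"free": 4}                    # hypothetical CA buffer, 4 tokens free
ok = claim_write_space(fifo, 3)       # succeeds, 1 token of space remains
blocked = claim_write_space(fifo, 3)  # fails, but the processor keeps running
```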

Streaming applications can be described in a dataflow-like manner, and the computational kernels of this flow can be easily mapped to suitable processing elements.