scispace - formally typeset
Search or ask a question
Proceedings ArticleDOI

Decoupling of Computation and Communication with a Communication Assist

TL;DR: The worst-case execution time of tasks does not depend on communication bandwidth if a Communication Assist is applied, despite that memory ports are shared, and it is shown that adding a CA increases the processor utilization and reduces the required communication bandwidth.
Abstract: In an embedded multiprocessor system the minimum throughput and maximum latency of real-time applications are usually derived given the worst-case execution time of the software tasks. Derivation of the worst-case execution time becomes easier if it is independent of the available communication bandwidth. In this paper we show that the worst-case execution time of tasks does not depend on communication bandwidth if a Communication Assist (CA) is applied, despite that memory ports are shared. Furthermore we show that adding a CA increases the processor utilization and reduces the required communication bandwidth. Finally we show that the difference between the measured and computed worst-case processor utilization is less than 6%, for our MP3 playback application.
Citations
More filters
Journal ArticleDOI
01 Jul 2010
TL;DR: A worst-case performance model of the authors' CA is proposed so that the performance of the CA-based platform can be analyzed before its implementation, and a fully automated design flow to generate communication assist (CA) based multi-processor systems (CA-MPSoC) is presented.
Abstract: Future embedded systems demand multi-processor designs to meet real-time deadlines. The large number of applications in these systems generates an exponential number of use-cases. The key design automation challenges are designing systems for these use-cases and fast exploration of software and hardware implementation alternatives with accurate performance evaluation of these use-cases. These challenges cannot be overcome by current design methodologies which are semi-automated, time consuming and error prone. In this paper, we present a fully automated design flow to generate communication assist (CA) based multi-processor systems (CA-MPSoC). A worst-case performance model of our CA is proposed so that the performance of the CA-based platform can be analyzed before its implementation. The design flow provides performance estimates and timing guarantees for both hard real-time and soft real-time applications, provided the task to processor mappings are given by the user. The flow automatically generates a super-set hardware that can be used in all use-cases of the applications. The software for each of these use-cases is also generated including the configuration of communication architecture and interfacing with application tasks. CA-MPSoC has been implemented on Xilinx FPGAs for evaluation. Further, it is made available on-line for the benefit of the research community and in this paper, it is used for performance analysis of two real life applications, Sobel and JPEG encoder executing concurrently. The CA-based platform generated by our design flow records a maximum error of 3.4% between analyzed and measured periods. Our tool can also merge use-cases to generate a super-set hardware which accelerates the evaluation of these use-cases. In a case study with six applications, the use-case merging results in a speed up of 18 when compared to the case where each use-case is evaluated individually.

42 citations

Proceedings ArticleDOI
10 Mar 2008
TL;DR: A novel cache optimisation technique to reduce instruction and data cache misses for streaming applications and introduces a dataflow model that is used to trade-off the number of cache misses against end-to-end latency and memory usage.
Abstract: Efficient use of the memory hierarchy is critical for achieving high performance in a multiprocessor system- on-chip. An external memory that is shared between processors is a bottleneck in current and future systems. Cache misses and a large cache miss penalty contribute to a low processor utilisation. In this paper, we describe a novel cache optimisation technique to reduce instruction and data cache misses for streaming applications. The instruction and data locality are improved by executing a task multiple times before moving to the next task. Furthermore, we introduce a dataflow model that is used to trade-off the number of cache misses against end-to-end latency and memory usage. For our industrial application, which is a Digital Radio Mondiale receiver, the number of cache misses is reduced with a factor 4.2.

14 citations

Proceedings ArticleDOI
02 Dec 2013
TL;DR: An algorithm is presented to solve the mapping problem jointly for computation and communication using this cost model and it is shown that the obtained mapping is able to reduce the execution time.
Abstract: Automated mapping of dataflow applications to state-of-the-art, heterogeneous Multiprocessor Systems on Chip (MPSoCs) with complex interconnects and communication means is an ongoing research endeavor. We implement, measure and analyze three different communication libraries for a representative, off-the-shelf platform of this kind. The results of the analysis are used to show the need of a new cost model to properly characterize inter-task communication. Afterwards, this paper presents an algorithm to solve the mapping problem jointly for computation and communication using this cost model. A case study with four real streaming applications shows that the obtained mapping is able to reduce the execution time. Compared to a mapping decision where all channels are mapped to shared memory, the makespan fell down up to 10% due to an automated selection of a more appropriate communication library.

11 citations

DOI
01 Jan 2009
TL;DR: This thesis introduces a network-based multiprocessor system that is predictable by starting with an architecture where processors have private local memories and execute tasks in a static order, so that the uncertainty in the temporal behaviour is minimised.
Abstract: The focus of this thesis is on embedded media systems that execute applications from the application domain car infotainment. These applications, which we refer to as jobs, typically fall in the class of streaming, i.e. they process on a stream of data. The jobs are executed on heterogeneous multiprocessor platforms, for performance and power efficiency reasons. Most of these jobs have firm real-time requirements, like throughput and end-to-end latency. Car-infotainment systems become increasingly more complex, due to an increase in the supported number of jobs and an increase of resource sharing. Therefore, it is hard to verify, for each job, that the realtime requirements are satisfied. To reduce the verification effort, we elaborate on an architecture for a predictable system from which we can verify, at design time, that the job’s throughput and end-to-end latency requirements are satisfied. This thesis introduces a network-based multiprocessor system that is predictable. This is achieved by starting with an architecture where processors have private local memories and execute tasks in a static order, so that the uncertainty in the temporal behaviour is minimised. As an interconnect, we use a network that supports guaranteed communication services so that it is guaranteed that data is delivered in time. The architecture is extended with shared local memories, run-time scheduling of tasks, and a memory hierarchy. Dataflow modelling and analysis techniques are used for verification, because they allow cyclic data dependencies that influence the job’s performance. Shown is how to construct a dataflow model from a job that is mapped onto our predictable multiprocessor platforms. This dataflow model takes into account: computation of tasks, communication between tasks, buffer capacities, and scheduling of shared resources. The job’s throughput and end-to-end latency bounds are derived from a self-timed execution of the dataflow graph, by making use of existing dataflow-analysis techniques. It is shown that the derived bounds are tight, e.g. for our channel equaliser job, the accuracy of the derived throughput bound is within 10.1%. Furthermore, it is shown that the dataflow modelling and analysis techniques can be used despite the use of shared memories, run-time scheduling of tasks, and caches.

10 citations

Journal ArticleDOI
01 Nov 2013
TL;DR: A predictable high-performance communication assist (CA) that helps to tackle design challenges in integrating IP cores into heterogeneous Multi-Processor System-on-Chips (MPSoCs), and a predictable heterogeneous multi-processor platform template for streaming applications is presented.
Abstract: Streaming applications are an important class of applications in emerging embedded systems such as smart camera network, unmanned vehicles, and industrial printing. These applications are usually very computationally intensive and have real-time constraints. To meet the increasing demand for performance and efficiency in these applications, the use of application specific IP cores in heterogeneous Multi-Processor System-on-Chips (MPSoCs) becomes inevitable. However, two of the key challenges in integrating these IP cores into MPSoCs are (i) how to properly handle inter-core communication; (ii) how to map streaming applications in an efficient and predictable way. In this paper, we first present a predictable high-performance communication assist (CA) that helps to tackle these design challenges. The proposed CA has zero throughput overhead, negligible latency overhead, and significantly less resource usage compared to existing CA designs. The proposed CA also provides a unified abstract interface for both processors and accelerator IP cores with flexible data access support. Based on the proposed CA design, we present a predictable heterogeneous multi-processor platform template for streaming applications. The template is used in a predictable design flow that uses Synchronous Data Flow (SDF) graphs for design time analysis. An accurate SDF model of our CA is introduced, enabling the mapping of applications onto heterogeneous MPSoCs in an efficient and predictable way. As a case study, we map the complete high-speed vision processing pipeline of an industrial application, Organic Light Emitting Diode (OLED) screen printing, onto one instance of the proposed platform. The result demonstrates that system design and analysis effort is greatly reduced with the proposed CA-based design flow.

5 citations

References
More filters
Book
22 Apr 2011
TL;DR: Real-Time Systems offers a splendid example for the balanced, integrated treatment of systems and software engineering, helping readers tackle the hardest problems of advanced real-time system design, such as determinism, compositionality, timing and fault management.
Abstract: "This book is a comprehensive text for the design of safety critical, hard real-time embedded systems. It offers a splendid example for the balanced, integrated treatment of systems and software engineering, helping readers tackle the hardest problems of advanced real-time system design, such as determinism, compositionality, timing and fault management. This book is an essential reading for advanced undergraduates and graduate students in a wide range of disciplines impacted by embedded computing and software. Its conceptual clarity, the style of explanations and the examples make the abstract conceptsaccessible for a wide audience."Janos Sztipanovits, DirectorE. Bronson Ingram Distinguished Professor of EngineeringInstitute for Software Integrated SystemsVanderbilt UniversityReal-Time Systems focuses on hard real-time systems, which are computing systems that must meet their temporal specification in all anticipated load and fault scenarios. The book stresses the system aspects of distributed real-time applications, treating the issues of real-time, distribution and fault-tolerance from an integral point of view. A unique cross-fertilization of ideas and concepts between the academic and industrial worlds has led to the inclusion of many insightful examples from industry to explain the fundamental scientific concepts in a real-world setting. Compared to the first edition, new developments incomplexity management,energy and power management, dependability, security, andthe internet of things, are addressed. The book is written as a standard textbook for a high-level undergraduate or graduate course on real-time embedded systems or cyber-physical systems. Its practical approach to solving real-time problems, along with numerous summary exercises, makes it an excellent choice for researchers and practitioners alike.

2,147 citations


"Decoupling of Computation and Commu..." refers methods in this paper

  • ...This schedule is constructed given the worst-case execution times of the tasks [ 6 , 15]....

    [...]

Book
15 Aug 1998
TL;DR: This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures and provides comprehensive discussions of parallel programming for high performance and of workload-driven evaluation, based on understanding hardware-software interactions.
Abstract: The most exciting development in parallel computer architecture is the convergence of traditionally disparate approaches on a common machine structure. This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures. It then examines the design issues that are critical to all parallel architecture across the full range of modern design, covering data access, communication performance, coordination of cooperative work, and correct implementation of useful semantics. It not only describes the hardware and software techniques for addressing each of these issues but also explores how these techniques interact in the same system. Examining architecture from an application-driven perspective, it provides comprehensive discussions of parallel programming for high performance and of workload-driven evaluation, based on understanding hardware-software interactions. * synthesizes a decade of research and development for practicing engineers, graduate students, and researchers in parallel computer architecture, system software, and applications development * presents in-depth application case studies from computer graphics, computational science and engineering, and data mining to demonstrate sound quantitative evaluation of design trade-offs * describes the process of programming for performance, including both the architecture-independent and architecture-dependent aspects, with examples and case-studies * illustrates bus-based and network-based parallel systems with case studies of more than a dozen important commercial designs Table of Contents 1 Introduction 2 Parallel Programs 3 Programming for Performance 4 Workload-Driven Evaluation 5 Shared Memory Multiprocessors 6 Snoop-based Multiprocessor Design 7 Scalable Multiprocessors 8 Directory-based Cache Coherence 9 Hardware-Software Tradeoffs 10 Interconnection Network Design 11 Latency Tolerance 12 Future Directions APPENDIX A Parallel Benchmark Suites

1,571 citations


"Decoupling of Computation and Commu..." refers background or methods in this paper

  • ...In [ 3 , 10, 1, 12] such a CA has already been introduced and in [14] an implementation of a CA is described for a CA that supports four communication streams....

    [...]

  • ...This autonomous DMA controller is called a Communication Assist (CA) [ 3 ]....

    [...]

Proceedings ArticleDOI
22 Jun 2001
TL;DR: The "network-on-chip" approach partitions the communication into layers to maximize reuse and provide a programmer with an abstraction of the underlying communication framework.
Abstract: Communication-based design represents a formal method approach to of system-on-a-chip design that considers communication between components as important as the computations they perform. “Our network-on-chipr approach partitions the communication into layers to maximize reuse and provide a programmer with an abstraction of the underlying communication framework. This layered approach is cast in the structure advocated by the OSI Reference network model and is demonstrated with a reconfigurable DSP example. The Metropolis methodology of deriving layers through a sequence of successive adaptation steps between incompatible behaviors refinement of communication is illustrated through the Intercom a design example. In another approach, MESCAL provides a designer with tools for a correct-by-construction protocol stack.

404 citations


"Decoupling of Computation and Commu..." refers background in this paper

  • ...In [3, 10 , 1, 12] such a CA has already been introduced and in [14] an implementation of a CA is described for a CA that supports four communication streams....

    [...]

Proceedings ArticleDOI
06 Mar 2006
TL;DR: A complete allocation and scheduling framework, where an MPSoC virtual platform is used to accurately derive input parameters, validate abstract models of system components and assess constraint satisfaction and objective function optimization is proposed.
Abstract: This paper proposes a complete allocation and scheduling framework, where an MPSoC virtual platform is used to accurately derive input parameters, validate abstract models of system components and assess constraint satisfaction and objective function optimization. The optimizer implements an efficient and exact approach to allocation and scheduling based on problem decomposition. The allocation subproblem is solved through integer programming while the scheduling one through constraint programming. The two solvers can interact by means of no-good generation, thus building an iterative procedure which has been proven to converge to the optimal solution. Experimental results show significant speedups w.r.t. pure IP and CP exact solution strategies as well as high accuracy with respect to cycle accurate functional simulation. A case study further demonstrates the practical viability of our framework for real-life systems and applications.

106 citations


"Decoupling of Computation and Commu..." refers background in this paper

  • ...In [9, 8 ] all effects that have an inuence on the transaction are taken into account in deriving the worst-case execution time of a task....

    [...]

Proceedings ArticleDOI
22 Oct 2006
TL;DR: In this article, the authors present an algorithm, with linear computational complexity, that does not require the required conversion from a multi-rate dataflow graph to a single-rate graph and that determines close to minimal buffer capacities.
Abstract: A key step in the design of multi-rate real-time systems is the determination of buffer capacities. In our multi-processor system, we apply back-pressure as caused by bounded buffers in order to control jitter. This requires the derivation of buffer capacities that both satisfy the temporal constraints as well as constraints on the buffer capacity. Existing exact solutions suffer from the computational complexity associated with the required conversion from a multi-rate dataflow graph to a single-rate dataflow graph. In this paper we present an algorithm, with linear computational complexity, that does not require this conversion and that determines close to minimal buffer capacities. The algorithm is applied to an MP3 play-back application that is mapped on our network based multi-processor system.

84 citations