
Showing papers by "Kees Goossens published in 2009"


Journal ArticleDOI
TL;DR: A Composable and Predictable Multi-Processor System on Chip (CoMPSoC) platform template is proposed, which enables a divide-and-conquer design strategy, where all applications, potentially using different programming models and communication paradigms, are developed and verified independently of one another.
Abstract: A growing number of applications, often with firm or soft real-time requirements, are integrated on the same System on Chip, in the form of either hardware or software intellectual property. The applications are started and stopped at run time, creating different use-cases. Resources, such as interconnects and memories, are shared between different applications, both within and between use-cases, to reduce silicon cost and power consumption. The functional and temporal behaviour of the applications is verified by simulation and formal methods. Traditionally, designers resort to monolithic verification of the system as a whole, since the applications interfere in shared resources, and thus affect each other's behaviour. Due to this interference between applications, the integration and verification complexity grows exponentially in the number of applications, and the task of verifying the correct behaviour of concurrent applications falls on the system designer rather than the application designers. In this work, we propose a Composable and Predictable Multi-Processor System on Chip (CoMPSoC) platform template. This scalable hardware and software template removes all interference between applications through resource reservations. We demonstrate how this enables a divide-and-conquer design strategy, where all applications, potentially using different programming models and communication paradigms, are developed and verified independently of one another. Performance is analyzed per application, using state-of-the-art dataflow techniques or simulation, depending on the requirements of the application. These results still apply when the applications are integrated onto the platform, thus separating system-level design and application design.

204 citations


Proceedings ArticleDOI
20 Apr 2009
TL;DR: The composable and predictable aelite NoC architecture is presented, which offers only Guaranteed Services based on flit-synchronous Time Division Multiplexing (TDM); it delivers the requested service to hundreds of simultaneous connections with 5 times less area than a state-of-the-art NoC.
Abstract: To accommodate the growing number of applications integrated on a single chip, Networks on Chip (NoC) must offer scalability not only on the architectural level, but also on the physical and functional levels. In addition, real-time applications require Guaranteed Services (GS), with latency and throughput bounds. Traditionally, NoC architectures only deliver scalability on two of the aforementioned three levels, or do not offer GS. In this paper we present the composable and predictable aelite NoC architecture, which offers only GS, based on flit-synchronous Time Division Multiplexing (TDM). In contrast to other TDM-based NoCs, scalability on the physical level is achieved by using mesochronous or asynchronous links. Functional scalability is accomplished by completely isolating applications, and by having a router architecture that does not limit the number of service levels or connections. We demonstrate how aelite delivers the requested service to hundreds of simultaneous connections, and does so with 5 times less area compared to a state-of-the-art NoC.
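The guarantees of a flit-synchronous TDM NoC such as aelite follow directly from the slot table: a connection owning k of T slots on a link of bandwidth B is guaranteed a throughput of kB/T, and its worst-case wait for injection is the largest gap between the slots it owns. A minimal sketch of this arithmetic, with illustrative function and parameter names not taken from the paper:

```python
def tdm_bounds(allocated_slots, table_size, link_bw):
    """Throughput guarantee and worst-case slot wait for a TDM connection.
    allocated_slots: sorted slot indices owned by the connection."""
    throughput = len(allocated_slots) * link_bw / table_size
    # Worst-case wait before the next owned slot: the largest (wrapping)
    # gap between consecutive owned slots in the table.
    n = len(allocated_slots)
    gaps = [(allocated_slots[(i + 1) % n] - s) % table_size
            for i, s in enumerate(allocated_slots)]
    return throughput, max(gaps)
```

Note that evenly spacing the same number of slots lowers the latency bound without changing the throughput guarantee.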

102 citations


Proceedings ArticleDOI
28 Apr 2009
TL;DR: This paper presents a monitoring infrastructure for multi-processor SOCs with a Network on Chip (NOC), explains its application to performance analysis and debug, and describes how its monitors aid in the performance analysis and debug of the interactions of the embedded processors.
Abstract: Problems in a new System on Chip (SOC) consisting of hardware and embedded software often only show up when a silicon prototype of the chip is placed in its intended target environment and the application is executed. Traditionally, the debugging of embedded systems is difficult and time consuming because of the intrinsic lack of internal system observability and controllability in the target environment. Design for Debug (DfD) is the act of adding debug support to the design of a chip, in the realization that not every SOC is correct the first time. DfD provides debug engineers with increased observability and controllability of the internal operation of an embedded system. In this paper, we present a monitoring infrastructure for multi-processor SOCs with a Network on Chip (NOC), and explain its application to performance analysis and debug. We describe how our monitors aid in the performance analysis and debug of the interactions of the embedded processors. We present a generic template for bus and router monitors, and show how they are instantiated at design time in our NOC design flow. We conclude this paper with details of their hardware cost.

48 citations


Proceedings ArticleDOI
27 Aug 2009
TL;DR: An approach to composable resource sharing based on latency-rate servers is presented that supports any arbiter belonging to this class, providing a larger solution space for a given set of requirements, together with an architecture for a resource front end that implements the concepts and provides composable service for any resource with bounded service time.
Abstract: Verification of application requirements is becoming a bottleneck in system-on-chip design, as the number of applications grows. Traditionally, the verification complexity increases exponentially with the number of applications and must be repeated if an application is added, removed, or modified. Predictable systems offering lower bounds on performance have been proposed to manage the increasing verification complexity, although this approach is only applicable to a restricted set of applications and systems. Composable systems, on the other hand, completely isolate applications in both the value and time domains, allowing them to be independently verified. However, existing approaches to composable system design are either restricted to applications that can be statically scheduled, or share resources using time-division multiplexing, which cannot efficiently satisfy tight latency requirements. In this paper, we present an approach to composable resource sharing based on latency-rate servers that supports any arbiter belonging to the class, providing a larger solution space for a given set of requirements. The approach can be combined with formal performance analysis using a variety of well-known modeling frameworks. We furthermore propose an architecture for a resource front end that implements our concepts and provides composable service for any resource with bounded service time. The architecture supports both systems with buffers dimensioned to prevent overflow and systems with smaller buffers, where overflow is prevented with flow control. Finally, we experimentally demonstrate the usefulness of our approach with a simple use case sharing an SRAM memory.
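A latency-rate (LR) server characterizes any arbiter in the class by two parameters, a service latency Θ and an allocated rate ρ, guaranteeing at least ρ·(t − Θ) cumulative service t time units into a busy period. The resulting worst-case bounds can be sketched as follows; the names are illustrative, and the paper's resource front-end architecture is not modeled here:

```python
def lr_service_bound(theta, rho, t):
    """Minimum cumulative service an LR server with latency `theta` and
    allocated rate `rho` guarantees by time t into a busy period."""
    return max(0.0, rho * (t - theta))

def worst_case_finish(theta, rho, size):
    """Latest finishing time of a request of `size` service units that
    starts a busy period at t = 0: the latency plus size at rate rho."""
    return theta + size / rho
```

Because these bounds hold for every arbiter in the LR class, an application verified against (Θ, ρ) stays verified whichever conforming arbiter is chosen.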

47 citations


Proceedings ArticleDOI
20 Apr 2009
TL;DR: This paper raises the debug abstraction level further, by utilising structural and temporal abstraction techniques, combined with debug data interpretation and logical communication views, and presents a generic debug API, which can be used to visualise an SOC's state at the logical communication level.
Abstract: A large part of a modern SOC's debug complexity resides in the interaction between the main system components. Transaction-level debug moves the abstraction level of the debug process up from the bit and cycle level to the transactions between IP blocks. In this paper we raise the debug abstraction level further, by utilising structural and temporal abstraction techniques, combined with debug data interpretation and logical communication views. The combination of these techniques and views allows us, among other things, to single-step and observe the operation of the network on a per-connection basis. As an example, we show how these higher-level abstractions have been implemented in the debug environment for the AEthereal NOC architecture, and we present a generic debug API, which can be used to visualise an SOC's state at the logical communication level.

32 citations


Proceedings ArticleDOI
11 Oct 2009
TL;DR: In this paper, the authors introduce an on-chip interconnect and protocol stack that combines streaming and distributed shared memory communication, and quantify the cost, both on the block level and for a complete SoC.
Abstract: A growing number of applications, with diverse requirements, are integrated on the same System on Chip (SoC) in the form of hardware and software Intellectual Property (IP). The diverse requirements, coupled with the IPs being developed by unrelated design teams, lead to multiple communication paradigms, programming models, and interface protocols that the on-chip interconnect must accommodate. Traditionally, on-chip buses offer distributed shared memory communication with established memory-consistency models, but are tightly coupled to a specific interface protocol. On-chip networks, on the other hand, offer layering and interface abstraction, but are centred around point-to-point streaming communication, and do not address issues at the higher layers in the protocol stack, such as memory-consistency models and message-dependent deadlock. In this work we introduce an on-chip interconnect and protocol stack that combines streaming and distributed shared memory communication. The proposed interconnect offers an established memory-consistency model and does not restrict any higher-level protocol dependencies. We present the protocol stack and the architectural blocks and quantify the cost, both on the block level and for a complete SoC. For a multi-processor multi-application SoC with multiple communication paradigms and programming models, our proposed interconnect occupies only 4% of the chip area.

28 citations


Proceedings ArticleDOI
27 Aug 2009
TL;DR: This paper proposes to use a multi-hop network on chip (NOC) as the crossbar fabric, with FIFO-queued line cards, and prototypes a 32×32 NOC-based crossbar fabric in a 65nm CMOS technology.
Abstract: The scalability and performance of the Internet depend critically on the performance of its packet switches. Current packet switches are based on single-hop crossbar fabrics, with line cards that use virtual output-queueing to reduce head-of-line blocking. In this paper we propose to use a multi-hop network on chip (NOC) as the crossbar fabric, with FIFO-queued line cards. The use of a multi-hop crossbar fabric has several advantages. 1) Speed-up, i.e. the crossbar fabric can operate faster because NOC inter-router wires are shorter than those in a single-hop crossbar, and because arbitration is distributed instead of centralised. 2) Load balancing, because paths from different input-output port pairs share the same router buffers, unlike the internal buffers of a buffered crossbar fabric, which are dedicated to a single input-output pair. 3) Path diversity, which allows traffic from an input port to follow different paths to its destination output port. This results in further load balancing, especially for non-uniform traffic patterns. 4) Simpler line-card design: the use of FIFOs on the line cards simplifies both the line cards and the (inter-chip) flow control between the crossbar fabric and line cards, reducing the number of (expensive) chip pins required for flow control. 5) Scalability, in the sense that the crossbar speed is independent of the number of ports, which is not the case for single-hop crossbar fabrics. We analyzed the performance of our architecture both analytically and by simulation, and show that it performs well for a wide range of traffic conditions and switch sizes. Additionally, we prototyped a 32×32 NOC-based crossbar fabric in a 65nm CMOS technology. The unoptimised implementation operates at 413 MHz, achieving an aggregate throughput in excess of 10^10 ATM cells per second.

28 citations


Proceedings ArticleDOI
27 Aug 2009
TL;DR: SlackOS is proposed, a dynamic, dependency-aware task scheduler that conservatively scales the voltage and frequency of each processor to respect RT deadlines, delivering 22% to 33% energy reduction compared to dynamic RT scheduling that is not energy aware.
Abstract: Voltage-frequency scaling (VFS) trades a linear processor slowdown for a potentially quadratic reduction in energy consumption. Complex dependencies may exist between different tasks of an application. The impact of VFS on the end-to-end application performance is difficult to predict, especially when these tasks are mapped on multiple processors that are scaled independently. This is a problem for real-time (RT) applications that require guaranteed end-to-end performance. In this paper we first classify the slack existing in RT applications consisting of multiple dependent tasks mapped on multiple processors that are scaled independently using VFS, resulting in static, work, and share slack. We then concentrate on work and share slack, as they can only be detected at run time, which makes their conservative use challenging. We propose SlackOS, a dynamic, dependency-aware task scheduler that conservatively scales the voltage and frequency of each processor to respect RT deadlines. When applied to an H.264 application, our method delivers 22% to 33% energy reduction compared to dynamic RT scheduling that is not energy aware.
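The quadratic trade-off that such a scheduler exploits can be sketched as follows: running at the lowest frequency that still meets the deadline reduces dynamic energy roughly with the square of the slowdown, assuming supply voltage scales linearly with frequency (a common first-order model; the function and parameter names are illustrative, not the paper's):

```python
def scaled_frequency(work_cycles, deadline_s, f_min, f_nom):
    """Lowest frequency (Hz) that still finishes `work_cycles` cycles
    within `deadline_s` seconds, clamped to the range [f_min, f_nom]."""
    f_needed = work_cycles / deadline_s
    return min(f_nom, max(f_min, f_needed))

def dynamic_energy_ratio(f, f_nom):
    """Approximate dynamic-energy ratio versus running at f_nom,
    assuming V scales linearly with f, so energy/cycle scales as f**2."""
    return (f / f_nom) ** 2
```

With this model, using slack to halve the frequency cuts dynamic energy to roughly a quarter, which is why conservatively detecting work and share slack at run time pays off.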

26 citations


Proceedings ArticleDOI
24 Aug 2009
TL;DR: It is concluded that having a fine allocation granularity that is decoupled from latency is essential to manage highly loaded resources in real-time systems.
Abstract: Resources in contemporary systems-on-chip (SoC) are shared between applications to reduce cost. Access to shared resources is provided by arbiters that require a small hardware implementation and must run at high speed. To manage heavily loaded resources, such as memory channels, it is also important that the arbiter minimizes over-allocation. A Credit-Controlled Static-Priority (CCSP) arbiter, comprising a rate regulator and a static-priority scheduler, has been proposed for scheduling access to SoC resources. The proposed rate regulator, however, is not straightforward to implement in hardware, and assumes that service is allocated with infinite precision. In this paper, we introduce a fast and small hardware implementation of the CCSP rate regulator and formally prove its correctness. We also show an efficient way of representing the allocated service in hardware with finite precision. Based on this representation, we define and evaluate two allocation strategies, and derive tight bounds on their respective over-allocations. We show that increasing the precision of the implementation results in an exponential reduction in maximum over-allocation at the cost of a linear increase in area. We demonstrate that the allocation strategy has a large impact on the allocation success rate for use cases with high load. Finally, we compare CCSP to traditional frame-based approaches and conclude that a fine allocation granularity that is decoupled from latency is essential to manage highly loaded resources in real-time systems.
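The precision/over-allocation trade-off can be illustrated by conservatively quantizing a requested rate to a fraction n/2^b: the over-allocation is then at most 2^-b, so each extra bit of precision halves the maximum over-allocation, matching the exponential reduction for a linear area increase. A hypothetical sketch, not the paper's actual allocation strategies:

```python
import math

def allocate_rate(rho, bits):
    """Conservatively quantize a requested rate rho (0 < rho <= 1) to a
    fraction n / 2**bits, never under-allocating. Returns the allocated
    rate and the resulting over-allocation (at most 2**-bits)."""
    denom = 1 << bits
    n = math.ceil(rho * denom)
    allocated = n / denom
    return allocated, allocated - rho
```

Summing the over-allocations of all requestors on a heavily loaded resource shows why coarse precision can make an otherwise feasible use case fail to allocate.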

21 citations


Patent
12 May 2009
TL;DR: In this article, a power manager and a method for managing the power supplied to an electronic device is provided, where the power is controlled by a hardware monitor and a power controller.
Abstract: A power manager (106) and a method for managing the power supplied to an electronic device are provided. Furthermore, a system wherein the power supplied to an electronic device is managed is provided. The power manager (106) is operative to monitor a hardware monitor (104) during a monitoring time period. The hardware monitor (104) is coupled to an electronic device (102). The electronic device (102) has a workload during operational use. The hardware monitor is operative to indicate the workload of the electronic device (102). The power manager is operative to control power supplied to the electronic device (102) in dependence on the monitoring.

16 citations


Proceedings ArticleDOI
30 Nov 2009
TL;DR: Simulation results show that the proposal outperforms the CICQ architecture and offers a viable architectural alternative; the effect of various parameters, such as the depth of the NoC and the speedup requirement for high-bandwidth multicast switching, is also studied.
Abstract: The Internet's growth, coupled with the variety of its services, is creating an increasing need for multicast traffic support in backbone routers and packet switches. Recently, buffered crossbar (CICQ) switches have shown high potential in efficiently handling multicast traffic. However, they have been unable to deliver optimal performance despite their expensive and complex crossbar fabric. This paper proposes an enhanced CICQ switching architecture suitable for multicast traffic. Instead of a dedicated internal crosspoint buffer for every input-output pair of ports, the crossbar is designed as a multi-hop Network on Chip (NoC). Designing the crossbar as a NoC offers several advantages, such as low latency, internal fabric load balancing, and path diversity. It also obviates the need for virtual output queueing by allowing a simple FIFO structure without performance degradation. We designed appropriate routing for the NoC as well as on-chip router scheduling, and tested the architecture's performance under a wide range of input multicast traffic. Simulation results show that our proposal outperforms the CICQ architecture and offers a viable architectural alternative. We also studied the effect of various parameters, such as the depth of the NoC and the speedup requirement for high-bandwidth multicast switching.

Proceedings ArticleDOI
23 May 2009
TL;DR: This paper proposes that FPGAs use a hardwired network on chip (HWNOC) as a unified interconnect for functional communications (data and control) as well as configuration (bitstreams for soft IP).
Abstract: We propose that FPGAs use a hardwired network on chip (HWNOC) as a unified interconnect for functional communications (data and control) as well as configuration (bitstreams for soft IP). In this paper we model such a platform. Using the HWNOC, applications mapped on hard or soft IPs are set up and removed using memory-mapped communications. Peer-to-peer streaming data is used to communicate data between IPs, and also to transport configuration bitstreams. The composable nature of the HWNOC ensures that applications can be dynamically configured, programmed, and operated without affecting other running (real-time) applications. We describe this platform and the steps required for dynamic reconfiguration of IPs. We then model the hardware, i.e. the HWNOC and the hard and soft IPs, in cycle-accurate transaction-level SystemC. Next, we model its dynamic behavior, including bitstream loading, HWNOC programming, dynamic (re)configuration, clocking, reset, and computation.

Proceedings ArticleDOI
01 Dec 2009
TL;DR: A 3-tier reconfiguration model is presented that uses the HWNoC as the underlying platform to realize dynamic loading, starting, and stopping of applications, ensuring that applications are guaranteed their required resources.
Abstract: We envision that future Field-Programmable Gate Arrays (FPGAs) will use a Hardwired Network on Chip (HWNoC) as a unified interconnect for functional communications (data and control) as well as configuration (bitstreams for soft IPs). In this paper we present a 3-tier reconfiguration model that uses the HWNoC as the underlying platform to realize dynamic loading, starting, and stopping of applications. The model ensures that applications are guaranteed their required resources (LUTs, communication, memory). Resource allocation is performed globally at design time. Applications are started and stopped dynamically at run time, yet are composable, i.e. they do not affect each other when they do so. Our model comprises three layers: system manager, application manager, and application. The system manager instantiates (configures) and enforces the resource allocation (LUTs, NoC connections, memories) at run time. Each application is independent, and is accompanied by an application manager that programs (starts and stops) the application within its allocated resources (a virtual platform). We model our system in cycle-accurate transaction-level SystemC, including bitstream loading, HWNoC and IP programming, clocking, reset, and computation.

27 Nov 2009
TL;DR: The scope of the research is power management, including adaptive body biasing and dynamic voltage and frequency scaling, on an MPSoC executing streaming applications such as audio/video codecs, telecom services (protocols), or any other firm and soft real-time applications.
Abstract: Power is an important design constraint for all nomadic and tethered devices, such as today's mobile phones or media boxes, mainly because it limits their operational time or because of the required operational thermal conditions. In order to keep pace with the increasing number of use-cases while increasing the lifetime, power reduction is enforced in all parts of a device, and thus also in the embedded chipset. For this and other reasons, such as cost and size, the whole chipset has been integrated into a multiprocessor system-on-chip (MPSoC). As a complex and often heterogeneous system that executes different mixtures of applications with variable workloads, not all of its parts are utilized all the time. This introduces spare time in the system, denoted as slack, which power management (PM) can exploit for lower power and energy consumption. The most common techniques are adaptive body biasing and dynamic voltage and frequency scaling of a part of the system or the system as a whole. The scope of our research is power management, including these techniques, on an MPSoC executing streaming applications, such as audio/video codecs, telecom services (protocols), or any other firm and soft real-time applications. A lot of previous research has been done on this topic, mostly focusing on isolated parts of the system; however, focus has recently moved to a system-wide approach. This paper is an overview of the commercial and academic solutions published to date. Special attention is given to the state-of-the-art infrastructure for PM and its dynamic possibilities to react and save power. We favour conservative approaches that do not disturb regular execution and do not introduce any additional delay or deadline misses compared to execution without power management. An overview of advanced PM is presented. Additionally, we elaborate on the trade-off between race-to-idle and performance-on-demand approaches, reflecting the difference in static and dynamic power consumption.

Proceedings ArticleDOI
01 Oct 2009
TL;DR: In this paper, the authors focus on a multi-path slot allocation method in networks with static resource reservations, in particular networks on chip (NoC) employing time-division multiplexing (TDM).
Abstract: The exponential increase in transistor count due to technological progress has resulted in an increase in the complexity and processing power of on-chip elements. Recently a stage has been reached where it is no longer practical to increase the core size, and as a consequence the number of cores, processing elements, or peripherals is being increased instead. In this study we focus on improving the efficiency of the network between those processing elements using alternative routing strategies. We focus on a multi-path slot allocation method in networks with static resource reservations, in particular networks on chip (NoC) employing time-division multiplexing (TDM). The simplicity of these networks makes it possible to implement this routing scheme without significant hardware overhead. Our proposed method, although displaying large variations between test cases, provides significant overall gains in terms of allocated bandwidth, with an average gain across all tests of 29% against an exhaustive search of single-path routes, and gains of 47% when compared to other single-path routing algorithms.
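In a TDM NoC with static reservations, a slot claimed at the source occupies successive slots on successive links as the flit is forwarded, so allocating a path means finding source slots that are free on every hop; a multi-path allocator repeats this over several paths until the bandwidth demand is met. A greedy single-path sketch under these assumptions (all names are illustrative, not from the paper):

```python
def allocate_path(occupied, path, table_size, slots_needed):
    """Greedily claim TDM slots along `path` (a list of link ids).
    A slot s injected at the source uses slot (s + hop) % table_size on
    the link at position `hop` (flit-synchronous forwarding). Mutates
    `occupied` (link id -> set of taken slots) and returns the claimed
    source slots, or None if not enough conflict-free slots exist."""
    claimed = []
    for s in range(table_size):
        if all((s + hop) % table_size not in occupied[link]
               for hop, link in enumerate(path)):
            claimed.append(s)
            if len(claimed) == slots_needed:
                for hop, link in enumerate(path):
                    for c in claimed:
                        occupied[link].add((c + hop) % table_size)
                return claimed
    return None
```

A multi-path variant would call this for each candidate path of a connection, accumulating bandwidth across paths instead of failing when a single path cannot supply all the requested slots.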

Proceedings ArticleDOI
09 Dec 2009
TL;DR: This paper models the dynamic application-swapping behavior in cycle-accurate transaction-level SystemC, including bitstream loading, HWNoC programming, clocking, reset, and computation, and describes the approach and steps required to achieve the stated objectives.
Abstract: We envision that future FPGAs will use a hardwired network on chip (HWNoC) [Goossens08NoCS] as a unified interconnect for functional communications (data and control) as well as configuration (bitstreams for soft IPs). In this paper we present a reconfiguration methodology that makes use of such a platform to realize composable inter-application communication and persistent intra-application state when run-time partial reconfiguration is performed. The proposed methodology also ensures that the required performance constraints of the dynamically swapped-in application are fulfilled. We describe the approach and the steps required to achieve these objectives. We model the dynamic application-swapping behavior in cycle-accurate transaction-level SystemC, including bitstream loading, HWNoC programming, clocking, reset, and computation.


Proceedings ArticleDOI
05 Oct 2009
TL;DR: In this paper, a specific workload decomposition method is presented for the work required by (streaming) applications processing data tokens (e.g. video frames), whose work behaviour is a mix of periodic and aperiodic patterns.
Abstract: This paper presents an analytical study of dynamism and of the possibilities for slack exploitation by dynamic power management. We introduce a specific workload decomposition method for the work required by (streaming) applications processing data tokens (e.g. video frames), whose work behaviour is a mix of periodic and aperiodic patterns. It offers an efficient and computationally light method for speculating on considerable work variations and exploiting them in energy-saving techniques. It is used by a dynamic power management policy that has low overhead and reduces both buffering-space requirements and deadline misses (increasing QoS). We evaluate our policy in experiments on MPEG-4 decoding of several different input sequences and present results.
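One simple way to realize a periodic/aperiodic split of the kind described here is to average the per-token work at each phase of an assumed period, yielding a periodic template plus an aperiodic residue that a power manager can speculate on. A hypothetical sketch, not the paper's exact method (the period and names are assumptions):

```python
def decompose_workload(work, period):
    """Split a per-token work trace into a periodic template (average
    work at each phase of `period`) and an aperiodic residue.
    Assumes the trace covers every phase at least once."""
    template = [0.0] * period
    counts = [0] * period
    for i, w in enumerate(work):
        template[i % period] += w
        counts[i % period] += 1
    template = [t / c for t, c in zip(template, counts)]
    residue = [w - template[i % period] for i, w in enumerate(work)]
    return template, residue
```

The periodic template can be scheduled conservatively ahead of time, while only the (smaller) residue needs run-time speculation, which is what keeps the policy's overhead low.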