
Showing papers on "Multi-core processor published in 2003"


Proceedings ArticleDOI
03 Dec 2003
TL;DR: This paper proposes and evaluates single-ISA heterogeneous multi-core architectures as a mechanism to reduce processor power dissipation, and results indicate a 39% average energy reduction while sacrificing only 3% in performance.
Abstract: This paper proposes and evaluates single-ISA heterogeneous multi-core architectures as a mechanism to reduce processor power dissipation. Our design incorporates heterogeneous cores representing different points in the power/performance design space; during an application's execution, system software dynamically chooses the most appropriate core to meet specific performance and power requirements. Our evaluation of this architecture shows significant energy benefits. For an objective function that optimizes for energy efficiency with a tight performance threshold, for 14 SPEC benchmarks, our results indicate a 39% average energy reduction while sacrificing only 3% in performance. An objective function that optimizes for energy-delay with looser performance bounds achieves, on average, nearly a factor-of-three improvement in energy-delay product while sacrificing only 22% in performance. These energy savings are substantially greater than those achievable with chip-wide voltage/frequency scaling.
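
To make the core-switching idea concrete, here is a minimal sketch of how system software might pick the lowest-power core whose estimated slowdown stays within a performance threshold; the core descriptor and the pick_core() routine are illustrative assumptions, not the paper's actual mechanism.

```c
#include <stddef.h>

/* Illustrative core descriptor; the fields and values are assumptions,
 * not data from the paper. */
struct core {
    const char *name;
    double rel_perf;   /* performance relative to the fastest core, in (0, 1] */
    double rel_power;  /* power relative to the fastest core, in (0, 1]       */
};

/* Pick the lowest-power core whose slowdown stays within max_slowdown
 * (e.g. 0.03 for a 3% performance-loss budget). Returns an index into
 * cores[], or -1 if no core satisfies the bound. */
int pick_core(const struct core *cores, size_t n, double max_slowdown)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        double slowdown = 1.0 - cores[i].rel_perf;
        if (slowdown > max_slowdown)
            continue;
        if (best < 0 || cores[i].rel_power < cores[best].rel_power)
            best = (int)i;
    }
    return best;
}
```

In the paper's setting, such a decision would be driven by sampled performance counters rather than static per-core ratings.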

809 citations


Proceedings ArticleDOI
01 May 2003
TL;DR: Results show that high performance can be obtained in each of the three modes (ILP, TLP, and DLP), demonstrating the viability of the polymorphous coarse-grained approach for future microprocessors.
Abstract: This paper describes the polymorphous TRIPS architecture which can be configured for different granularities and types of parallelism. TRIPS contains mechanisms that enable the processing cores and the on-chip memory system to be configured and combined in different modes for instruction, data, or thread-level parallelism. To adapt to small and large-grain concurrency, the TRIPS architecture contains four out-of-order, 16-wide-issue Grid Processor cores, which can be partitioned when easily extractable fine-grained parallelism exists. This approach to polymorphism provides better performance across a wide range of application types than an approach in which many small processors are aggregated to run workloads with irregular parallelism. Our results show that high performance can be obtained in each of the three modes (ILP, TLP, and DLP), demonstrating the viability of the polymorphous coarse-grained approach for future microprocessors.

512 citations


Patent
25 Apr 2003
TL;DR: In this paper, a workload transfer mechanism transfers the executing application software to a second computer hardware processor core in a search for reduced operating power, and a transfer delay mechanism is connected to delay a subsequent transfer of the executing application software if the system operating power may be conserved by such delay.
Abstract: A computer system for conserving operating power includes a number of computer hardware processor cores that differ amongst themselves at least in their respective operating power requirements and processing capabilities. A monitor gathers performance metric information from each of the computer hardware processor cores that is specific to a particular run of application software then executing. A workload transfer mechanism transfers the executing application software to a second computer hardware processor core in a search for reduced operating power. A transfer delay mechanism is connected to delay a subsequent transfer of the executing application software if the system operating power may be conserved by such delay.
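
The transfer-delay idea can be pictured as a simple hysteresis check: a proposed migration is skipped if the previous one happened too recently for its overhead to have paid off. The interval constant and timer interface below are placeholders, not the patent's mechanism.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical minimum interval between migrations, in timer ticks. */
#define MIN_MIGRATION_INTERVAL 1000u

static uint32_t last_migration_tick;

/* Returns true if a requested migration may proceed now; otherwise the
 * transfer is delayed because the previous one was too recent for its
 * overhead to have paid off. now_tick comes from a monotonic timer. */
bool may_migrate_now(uint32_t now_tick)
{
    if (now_tick - last_migration_tick < MIN_MIGRATION_INTERVAL)
        return false;                    /* delay the transfer */
    last_migration_tick = now_tick;
    return true;
}
```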

317 citations


Journal ArticleDOI
TL;DR: The article presents a technology that uses event model interfaces and a novel event flow mechanism that extends formal analysis approaches from real-time system design into the multiprocessor system on chip domain.
Abstract: Multiprocessor system-on-chip designs use complex on-chip networks to integrate different programmable processor cores, specialized memories, and other components on a single chip. MPSoCs have become the architecture of choice in many industries. Their heterogeneity inevitably increases with intellectual-property integration and component specialization, and system integration is becoming a major challenge in their design. Simulation is the state of the art in MPSoC performance verification, but it has conceptual disadvantages that become disabling as complexity increases. Formal approaches offer a systematic alternative. The article presents a technology that uses event model interfaces and a novel event flow mechanism that extends formal analysis approaches from real-time system design into the multiprocessor system-on-chip domain.

178 citations


Book ChapterDOI
TL;DR: In this article, the authors focus on optimization techniques for enhancing cache performance by hiding both the low main memory bandwidth and the latency of main memory accesses which is slow in contrast to the floating-point performance of the CPUs.
Abstract: In order to mitigate the impact of the growing gap between CPU speed and main memory performance, today's computer architectures implement hierarchical memory structures. The idea behind this approach is to hide both the low main memory bandwidth and the high latency of main memory accesses, which are slow in contrast to the floating-point performance of the CPUs. Usually, a small and expensive high-speed memory sits at the top of the hierarchy; it is typically integrated within the processor chip to provide data with low latency and high bandwidth: the CPU registers. Moving further away from the CPU, the layers of memory successively become larger and slower. The memory components located between the processor core and main memory are called cache memories or caches. They are intended to contain copies of main memory blocks to speed up accesses to frequently needed data [378], [392]. The next lower level of the memory hierarchy is the main memory, which is large but also comparatively slow. While external memory such as hard disk drives or remote memory components in a distributed computing environment represent the lower end of any common hierarchical memory design, this paper focuses on optimization techniques for enhancing cache performance.
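
One classic technique in this family is loop blocking (tiling), which restructures loops so that each block of data is reused while it still resides in cache. The sketch below tiles a matrix multiplication; the matrix size and block size are illustrative and would be tuned to the target cache.

```c
#define N  512   /* matrix dimension, illustrative                      */
#define BS  64   /* block size; tuned to the cache level being targeted */

/* C += A * B for N x N row-major matrices, with square blocking so that
 * each BS x BS block of the operands is reused while it is cache-resident. */
void matmul_blocked(const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int jj = 0; jj < N; jj += BS)
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++) {
                        double a = A[i * N + k];
                        for (int j = jj; j < jj + BS; j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```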

157 citations


Book ChapterDOI
01 Jan 2003
TL;DR: The chapter presents a framework for the design space exploration of embedded systems and focuses on a high level of abstraction, where the goal is to quickly identify interesting architectures that can be further evaluated by taking lower-level details into account.
Abstract: It is noted that network processors (NPs) generally consist of multiple processing units such as CPU cores, microengines, and dedicated hardware for computing-intensive tasks, memory units, caches, interconnections, and I/O interfaces. Following a system-on-a-chip (SoC) design method, these resources are put on a single chip, where they must interoperate in order to perform packet processing tasks at line speed. The process of determining the optimal hardware and software architecture for such processors includes issues involving resource allocation and partitioning. The chapter presents a framework for the design space exploration of embedded systems. It is observed that the architecture exploration and evaluation of network processors involve many tradeoffs and a complex interplay between hardware and software. The chapter focuses on a high level of abstraction, where the goal is to quickly identify interesting architectures that can be further evaluated by taking lower-level details into account. Task models, task scheduling, operating system issues, and packet processor architectures collectively play a role in different phases of the design space exploration of packet processor devices.

144 citations


Bishop Brock1, Karthick Rajamani1
01 Jan 2003
TL;DR: This paper discusses several of the SOC design issues pertaining to dynamic voltage and frequency scalable systems, and how these issues were resolved in the IBM PowerPC 405LP processor, and introduces DPM, a novel architecture for policy-guided dynamic power management.
Abstract: This paper discusses several of the SOC design issues pertaining to dynamic voltage and frequency scalable systems, and how these issues were resolved in the IBM PowerPC 405LP processor. We also introduce DPM, a novel architecture for policy-guided dynamic power management. We illustrate the utility of DPM by its ability to implement several classes of power management strategies and demonstrate practical results for a 405LP embedded system. I. INTRODUCTION Advances in low-power components and system design have brought general purpose computation into watches, wireless telephones, PDAs and tablet computers. Power management of these systems has traditionally focused on sleep modes and device power management (1). Embedded processors for these applications are highly integrated system-on-a-chip (SOC) devices that also support aggressive power management through techniques such as programmable clock gating and dynamic voltage and frequency scaling (DVFS). This paper describes one of these processors, and the development of a software architecture for policy-guided dynamic power management. II. 405LP DESIGN AND POWER MANAGEMENT FEATURES The IBM PowerPC 405LP is a dynamic voltage and frequency scalable embedded processor targeted at high-performance battery-operated devices. The 405LP is an SOC ASIC design in a 0.18 µm bulk CMOS process, integrating a PowerPC 405 CPU core modified for operation over a 1.0 V to 1.8 V range with off-the-shelf IP cores. The chip includes a flexible clock generation subsystem, new hardware accelerators for speech recognition and security, as well as a novel standby power management controller (2). In a system we normally operate the CPU/SDRAM at 266/133 MHz above 1.65 V and at 66/33 MHz above 0.9 V, typically providing a 13:1 SOC core power range over the 4:1 performance range. From a system design and active power management perspective, the most interesting facets of the 405LP SOC design concern the way the clocks are generated and controlled. These features of the processor are described in the remainder of this section.
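
As a rough illustration (and not IBM's DPM architecture itself), a utilization-driven policy could map load onto the two operating points quoted in the abstract; the 80% threshold below is an arbitrary placeholder.

```c
/* One voltage/frequency operating point. */
struct op_point {
    unsigned cpu_mhz;
    unsigned sdram_mhz;
    unsigned millivolts;
};

/* The two operating points quoted in the abstract for the 405LP. */
static const struct op_point slow_op = {  66,  33,  900 };
static const struct op_point fast_op = { 266, 133, 1650 };

/* Map recent CPU utilization (0..100) onto an operating point.
 * The 80% threshold is an arbitrary placeholder policy parameter. */
const struct op_point *dvfs_select(unsigned utilization_pct)
{
    return (utilization_pct > 80) ? &fast_op : &slow_op;
}
```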

129 citations


Journal ArticleDOI
TL;DR: Using reconfigurable devices to implement emulated digital architectures provides more flexibility compared to the custom very large-scale integration designs because different Falcon architectures can be used on the same FPGA device.
Abstract: A new emulated digital multilayer cellular neural network universal machine (CNN-UM) chip architecture called Falcon has been developed. In this brief, the main steps of the field-programmable gate array (FPGA) implementation are introduced. The main results are as follows. The CNN-UM architecture is emulated on Xilinx Virtex series FPGAs, and three-dimensional nonlinear spatio-temporal dynamics can be implemented on this architecture. The critical parameters of the implementation in a single-layer configuration are 55 million cell updates/s per processor core, or, equivalently, 1 giga-operation per second (GOPS) of computing performance. Despite this high performance, the power requirements of the architecture are relatively low, only ~3 W per processor core. Using reconfigurable devices to implement emulated digital architectures provides more flexibility compared to custom very large-scale integration designs because different Falcon architectures can be used on the same FPGA device.

123 citations


Journal ArticleDOI
TL;DR: This paper proposes a single-ISA heterogeneous multi-core architecture as a mechanism to reduce processor power dissipation and demonstrates a five-fold reduction in energy at a cost of only 25% performance.
Abstract: This paper proposes a single-ISA heterogeneous multi-core architecture as a mechanism to reduce processor power dissipation. It assumes a single chip containing a diverse set of cores that target different performance levels and consume different levels of power. During an application's execution, system software dynamically chooses the most appropriate core to meet specific performance and power requirements. It describes an example architecture with five cores of varying performance and complexity. Initial results demonstrate a five-fold reduction in energy at a cost of only 25% performance.

111 citations


Patent
16 Jul 2003
TL;DR: In this paper, a workload assignment mechanism assigns jobs to processor cores in order to maximize overall system throughput and the throughput of individual jobs, based on performance metric information from each of the computer hardware processor cores that is specific to a particular run of application software.
Abstract: A computer system for maximizing system and individual job throughput includes a number of computer hardware processor cores that differ amongst themselves at least in their respective resource requirements and processing capabilities. A monitor gathers performance metric information from each of the computer hardware processor cores that is specific to a particular run of application software then executing. Based on these metrics, a workload assignment mechanism assigns jobs to processor cores in order to maximize overall system throughput and the throughput of individual jobs.

99 citations


Patent
16 Dec 2003
TL;DR: In this article, the authors present a method, apparatus and system may optimize context switching between virtual machines (VMs) according to an embodiment of the present invention, a first processor core may execute a first VM while a second processor core can concurrently retrieve information pertaining to the state of a second VM into a processor cache.
Abstract: A method, apparatus and system may optimize context switching between virtual machines (“VMs”). According to an embodiment of the present invention, a first processor core may execute a first VM while a second processor core may concurrently retrieve information pertaining to the state of a second VM into a processor cache. When the virtual machine manager (“VMM”) performs a context switch between the first and the second VMs, the second processor may immediately begin executing the second VM, while the first processor may save the state information for the first VM. In yet another embodiment, different threads on a processor may be utilized to execute different VMs on a host.

Journal ArticleDOI
TL;DR: SimBed, an execution-driven simulation testbed that measures the execution behavior and power consumption of embedded applications and RTOSs by executing them on an accurate architectural model of a microcontroller with simulated real-time stimuli, shows that there is no clear winner in timing accuracy between preemptive systems and cooperative systems.
Abstract: We present the modelling of embedded systems with SimBed, an execution-driven simulation testbed that measures the execution behavior and power consumption of embedded applications and RTOSs by executing them on an accurate architectural model of a microcontroller with simulated real-time stimuli. We briefly describe the simulation environment and present a study that compares three RTOSs: µC/OS-II, a popular public-domain embedded real-time operating system; Echidna, a sophisticated, industrial-strength (commercial) RTOS; and NOS, a bare-bones multirate task scheduler reminiscent of typical "roll-your-own" RTOSs found in many commercial embedded systems. The microcontroller simulated in this study is the Motorola M-CORE processor: a low-power, 32-bit CPU core with 16-bit instructions, running at 20 MHz. Our simulations show what happens when RTOSs are pushed beyond their limits, and they depict situations in which unexpected interrupts or unaccounted-for task invocations disrupt timing, even when the CPU is lightly loaded. In general, there appears to be no clear winner in timing accuracy between preemptive systems and cooperative systems. The power-consumption measurements show that RTOS overhead is a factor of two to four higher than it needs to be, compared to the energy consumption of the minimal scheduler. In addition, poorly designed idle loops can cause the system to double its energy consumption; this energy could be saved by a simple hardware sleep mechanism.
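
The final observation about idle loops is easy to make concrete: a busy-wait idle loop keeps the CPU burning power, whereas parking the core until the next interrupt does not. In the sketch below, cpu_sleep(), runnable_tasks(), and schedule() are hypothetical hooks standing in for whatever the RTOS and hardware actually provide.

```c
extern int  runnable_tasks(void);   /* hypothetical scheduler query         */
extern void schedule(void);         /* hypothetical dispatch routine        */
extern void cpu_sleep(void);        /* hypothetical wait-for-interrupt hook */

/* Idle loop that parks the core instead of spinning; the paper notes that
 * a busy idle loop can roughly double system energy consumption. */
void idle_loop(void)
{
    for (;;) {
        if (runnable_tasks())
            schedule();
        else
            cpu_sleep();            /* wake again on the next interrupt */
    }
}
```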

Patent
04 Aug 2003
TL;DR: In this paper, the authors measure the degree of parallelism achieved in executing program instructions and use this to dynamically control the clock speeds and supply voltage levels applied to different processor cores 4, 6 so as to reduce the overall amount of energy consumed by matching the processing performance achieved to the clock speed and voltage levels used.
Abstract: A multi-processing system 2 measures the degree of parallelism achieved in executing program instructions and uses this to dynamically control the clock speeds and supply voltage levels applied to different processor cores 4, 6 so as to reduce the overall amount of energy consumed by matching the processing performance achieved to the clock speeds and voltage levels used.

Journal ArticleDOI
TL;DR: A system chip targeting image and voice processing and recognition application domains is implemented as a representative of the potential of using programmable logic in system design.
Abstract: A system chip targeting image and voice processing and recognition application domains is implemented as a representative of the potential of using programmable logic in system design. It features an embedded reconfigurable processor built by joining a configurable and extensible processor core and an SRAM-based embedded field-programmable gate array (FPGA). Application-specific bus-mapped coprocessors and flexible input/output peripherals and interfaces can also be added and dynamically modified by reconfiguring the embedded FPGA. The architecture of the system is discussed as well as the design flows for pre- and post-silicon design and customization. The silicon area required by the system is 20 mm² in a 0.18-µm CMOS technology. The embedded FPGA accounts for about 40% of the system area.

Book
01 Jan 2003
TL;DR: This text begins with a general introduction to parallel computing, then progresses to the specifics of parallel computing with heterogeneous networks, proving a superior reference for researchers and graduate students in computer science.
Abstract: From the Publisher: Traditional software for parallel computing typically spreads computations evenly over a set of linked processors. This, however, may not always be the best way of maximizing the performance of a given network or cluster of computers. By taking account of the actual performance of individual processors and the links between them, parallel computing on heterogeneous networks offers significant improvements in parallel computation. Alexey Lastovetsky's Parallel Computing on Heterogeneous Networks provides a resource on his innovative technology. The text begins with a general introduction to parallel computing, then progresses to the specifics of parallel computing with heterogeneous networks. Practically oriented, the book includes illustrative algorithms in the mpC programming language, a unique high-level software tool designed by the author specifically for programming heterogeneous parallel algorithms. All concepts and algorithms are illustrated with working programs that can be compiled or executed on any cluster. All of the contents are also illustrated by carefully tested source code, allowing readers to play with the presented software tools and algorithms - particularly with the mpC programming language - while reading the book. Appendices provide both the complete source code and user's guide for the principal applications used to illustrate the book's material. Parallel Computing on Heterogeneous Networks proves a superior reference for researchers and graduate students in computer science.
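
The book's central idea, dividing work in proportion to measured processor speeds rather than evenly, can be sketched in a few lines (plain C here rather than the book's mpC language; the speed values are assumed to come from prior benchmarking).

```c
#include <stddef.h>

/* Split `total` work items across n processors in proportion to their
 * measured relative speeds; any rounding leftover goes to processor 0. */
void partition_by_speed(const double *speed, size_t n, long total, long *share)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += speed[i];

    long assigned = 0;
    for (size_t i = 0; i < n; i++) {
        share[i] = (long)(total * (speed[i] / sum));
        assigned += share[i];
    }
    share[0] += total - assigned;   /* keep the total exact */
}
```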

Proceedings ArticleDOI
12 May 2003
TL;DR: A Sensor-Network Asynchronous Processor (SNAP) is presented, which is designed to be both a processor core for a sensor- network node and a component of a chip multiprocessor, the Network on a Chip (NoC), which will execute a novel sensor-network simulator.
Abstract: We present a Sensor-Network Asynchronous Processor (SNAP), which we have designed to be both a processor core for a sensor-network node and a component of a chip multiprocessor, the Network on a Chip (NoC), which will execute a novel sensor-network simulator. We discuss the advantages of using the same processor for nodes in physical and simulated sensor networks. We describe the attributes that a processor must possess to function well in both roles, and we then describe the way we designed SNAP to have these attributes.

Patent
28 Feb 2003
TL;DR: In this article, the authors consider a data processing system that includes a processor core and system circuitry coupled to the processor core, and propose a method for conserving power by granting bus access to a requesting device.
Abstract: Power is conserved in a data processing system that includes a processor core and system circuitry coupled to the processor core. A first method for conserving power includes entering a low power state by the processor and the system circuitry and enabling bus arbitration by the processor while the processor core remains in the low power state. One embodiment further contemplates a method of conserving power by granting bus access to a requesting device and entering a power conservation mode by the processor core in response thereto. Bus operations are then performed while the processor core remains in the power conservation mode. Another embodiment contemplates a method of debugging a data processing system in which a debug state is entered by the processor and the system circuitry and, thereafter, bus arbitration is enabled by the processor while the processor core remains in the debug state.

Patent
02 Jul 2003
TL;DR: In this paper, the authors propose a hierarchical testing architecture compliant with IEEE 1149.1 Joint Test Action Group (JTAG) standard that leverages existing standard testing architectures within each processor core to allow for chip level access to schedule built-in self test (BIST) operations for the cores.
Abstract: A multi-core chip (MCC) having a plurality of processor cores includes a hierarchical testing architecture compliant with the IEEE 1149.1 Joint Test Action Group (JTAG) standard that leverages existing standard testing architectures within each processor core to allow for chip level access to schedule built-in self test (BIST) operations for the cores. The MCC includes boundary scan logic, a chip-level JTAG-compliant test access port (TAP) controller, a chip-level master BIST controller, and a test pin interface. Each processor core includes a JTAG-compliant TAP controller and one or more BIST enabled memory arrays. The chip TAP controller includes one or more user defined registers, including a core select register and a test mode register. The core select register stores a plurality of core select bits that select corresponding processor cores for BIST operations.
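
As a rough illustration of how driver software might use such a core-select register, the snippet below composes a bit mask of cores to enable for BIST; the register width, bit assignment, and the jtag_write_core_select() routine are hypothetical and not taken from the patent.

```c
#include <stdint.h>

#define MAX_CORES 8   /* hypothetical width of the core-select register */

/* Assumed low-level routine that shifts a value into the chip-level TAP
 * controller's core-select register; not taken from the patent. */
extern void jtag_write_core_select(uint32_t mask);

/* Enable BIST scheduling for the processor cores listed in cores[]. */
void select_cores_for_bist(const unsigned *cores, unsigned count)
{
    uint32_t mask = 0;
    for (unsigned i = 0; i < count; i++)
        if (cores[i] < MAX_CORES)
            mask |= 1u << cores[i];
    jtag_write_core_select(mask);
}
```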

Patent
09 Jan 2003
TL;DR: A reconfigurable digital processing system for space includes the utilization of field programmable gate arrays utilizing a hardware centric approach to reconfigure software processors in a space vehicle through the reprogramming of multiple FPGAs such that one obtains a power/performance characteristic for signal processing tasks that cannot be achieved simply through the use of off-the-shelf processors as mentioned in this paper.
Abstract: A reconfigurable digital processing system for space includes the utilization of field programmable gate arrays (FPGAs), using a hardware-centric approach to reconfigure software processors in a space vehicle through the reprogramming of multiple FPGAs, such that one obtains a power/performance characteristic for signal processing tasks that cannot be achieved simply through the use of off-the-shelf processors. In one embodiment, for damaged or otherwise inoperable signal processors located on a spacecraft, the remaining processors which are undamaged can be reconfigured, through changing the machine language and binary of the field programmable gate arrays, to change the core processor while at the same time maintaining undamaged components, so that the signal processing functions can be restored utilizing a RAM-based FPGA as a signal processor. In one embodiment, multiple FPGAs are connected together by a data bus and are also provided with data pipes which interconnect selected FPGAs together to provide the necessary processing function. Flexibility in reconfiguration includes the utilization of a timing and synchronization block as well as a common configuration block which, when coupled to an interconnect block, permits reconfiguration of a customizable application core, depending on the particular signal processing function desired. The result is that damaged or inoperable signal processing components can be repaired in space, without having to physically attend to the hardware, by transmitting to the spacecraft commands which reconfigure the particular FPGAs and thus alter their signal processing function. Mission changes can also be accomplished by reprogramming the FPGAs.

Journal ArticleDOI
09 Feb 2003
TL;DR: This single-chip multiprocessor for embedded systems consists of two M32R 32-bit CPU cores and a 512-kB shared SRAM, and operates at 600 MHz with 800 mW peak power dissipation.
Abstract: A 600-MHz single-chip multiprocessor, which includes two M32R 32-bit CPU cores, a 512-kB shared SRAM and an internal shared pipelined bus, was fabricated using a 0.15-µm CMOS process for embedded systems. This multiprocessor is based on symmetric multiprocessing (SMP), and supports the modified-exclusive-shared-invalid (MESI) cache coherency protocol. The multiprocessor inherits the advantages of previously reported single-chip multiprocessors, while its multiprocessor architecture is optimized for use as an embedded processor. The internal shared pipelined bus has a low latency and large bandwidth (4.8 GB/s). These features enhance the performance of the multiprocessor. In addition, the multiprocessor employs various low-power techniques. The multiprocessor dissipates 800 mW in a 1.5-V 600-MHz multiprocessor mode. Standby power dissipation is less than 1.5 mW at 1.5 V. Hence, the multiprocessor achieves higher performance and lower power consumption. This paper presents a single-chip multiprocessor architecture optimized for use as an embedded processor and its various low-power techniques.
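
For readers unfamiliar with the MESI protocol mentioned above, the sketch below shows the state a cache line ends up in after the local processor's own reads and writes; bus-side transitions and invalidation traffic are omitted, so this is a simplification rather than the chip's actual coherency controller.

```c
/* MESI states for a single cache line. */
enum mesi { MODIFIED, EXCLUSIVE, SHARED, INVALID };

/* State after the local processor reads the line. On a miss, the line is
 * fetched and enters Shared or Exclusive depending on other caches. */
enum mesi on_local_read(enum mesi s, int other_caches_have_copy)
{
    if (s == INVALID)
        return other_caches_have_copy ? SHARED : EXCLUSIVE;
    return s;   /* M, E and S satisfy the read locally */
}

/* State after the local processor writes the line. From Shared or Invalid
 * the bus must first invalidate other copies; that traffic is omitted. */
enum mesi on_local_write(enum mesi s)
{
    (void)s;
    return MODIFIED;
}
```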

Proceedings ArticleDOI
13 Oct 2003
TL;DR: An interface synthesis approach that forms part of the hardware-software codesign methodology for such an FPGA-based platform based on a novel memory mapping algorithm that maps data used by both the hardware and the software to shared memories on the reconfigurable fabric.
Abstract: Several system-on-chip (SoC) platforms have recently emerged that use reconfigurable logic (FPGAs) as a programmable coprocessor to reduce the computational load on the main processor core. We present an interface synthesis approach that forms part of our hardware-software codesign methodology for such an FPGA-based platform. The approach is based on a novel memory mapping algorithm that maps data used by both the hardware and the software to shared memories on the reconfigurable fabric. The memory mapping algorithm couples with a high-level synthesis tool and uses scheduling information to map variables, arrays and complex data structures to the shared memories in a way that minimizes the number of registers and multiplexers used in the hardware interface. We also present three software schemes that enable the application software to communicate with this hardware interface. We demonstrate the utility of our approach and study the trade-offs involved using a case study of the codesign of a computationally expensive portion of the MPEG-1 multimedia application on to the Altera Nios platform.

01 Jan 2003
TL;DR: Single-ISA heterogeneous multicore architectures as a mechanism to reduce processor power dissipation are proposed and initial results show a more than three-fold reduction in energy at a cost of only 18% performance.
Abstract: This paper proposes single-ISA heterogeneous multicore architectures as a mechanism to reduce processor power dissipation. It assumes a single chip containing a diverse set of cores that target different performance levels and consume different levels of power. During an application’s execution, system software evaluates the resources required by an application for good performance and dynamically chooses the core that can best meet these requirements while minimizing energy consumption. It describes an example architecture with five cores of varying performance and complexity. Initial results show a more than three-fold reduction in energy at a cost of only 18% performance.

Proceedings ArticleDOI
09 Nov 2003
TL;DR: The INSIDE system is presented, which combines a methodology to determine which code segments are most suited for implementation as a set of extensible instructions, a heuristic algorithm to select pre-configured extensible processors, and an estimation tool which rapidly estimates the performance of an application on a generated extensible processor.
Abstract: This paper presents the INSIDE system that rapidly searches the design space for extensible processors, given area and performance constraints of an embedded application, while minimizing the design turn-around-time. Our system consists of a) a methodology to determine which code segments are most suited for implementation as a set of extensible instructions, b) a heuristic algorithm to select pre-configured extensible processors as well as extensible instructions (library), and c) an estimation tool which rapidly estimates the performance of an application on a generated extensible processor. By selecting the right combination of a processor core plus extensible instructions, we achieve a performance increase on average of 2.03x (up to 7x) compared to the base processor core at a minimum hardware overhead of 25% on average.
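
The selection of extensible instructions under an area constraint resembles a knapsack problem; the following greedy benefit-per-area sketch is a simple stand-in for the paper's heuristic (which also selects among pre-configured processor cores), with all structure names and fields invented for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* One candidate extensible instruction; fields are invented for illustration. */
struct ext_insn {
    double gain;     /* estimated cycles saved by adding this instruction */
    double area;     /* extra hardware area it requires                   */
    bool   chosen;
};

/* Greedily pick instructions by gain/area ratio until the area budget is
 * spent. A simple stand-in for the paper's heuristic, which also chooses
 * among pre-configured processor cores. */
void select_instructions(struct ext_insn *cand, size_t n, double area_budget)
{
    for (;;) {
        size_t best = n;
        double best_ratio = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (cand[i].chosen || cand[i].area > area_budget)
                continue;
            double ratio = cand[i].gain / cand[i].area;
            if (ratio > best_ratio) {
                best_ratio = ratio;
                best = i;
            }
        }
        if (best == n)
            break;
        cand[best].chosen = true;
        area_budget -= cand[best].area;
    }
}
```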

Journal ArticleDOI
14 Oct 2003
TL;DR: The HiBRID-SoC multi-core system-on-chip architecture targets a wide range of multimedia applications with particularly high processing demands, including general signal processing applications, video encoding/decoding, image processing, or a combination of these tasks.
Abstract: The HiBRID-SoC multi-core system-on-chip architecture targets a wide range of multimedia applications with particularly high processing demands, including general signal processing applications, video encoding/decoding, image processing, or a combination of these tasks. For this purpose, the HiBRID-SoC integrates three fully programmable processor cores and various interfaces onto a single chip, all tied to a 64-bit AMBA AHB bus. The processor cores are individually optimized to the particular computational characteristics of different application fields, complementing each other to deliver high performance levels with high flexibility at reduced system cost. The HiBRID-SoC is fabricated in a 0.18-µm 6LM standard-cell technology, occupies about 82 mm², and operates at 145 MHz. An MPEG-4 Advanced Simple Profile decoder in full TV resolution requires about 120 MHz for real-time performance on the HiBRID-SoC, utilizing only two of the three cores.

Patent
22 Oct 2003
TL;DR: In this article, a method of executing program code on a target microprocessor with multiple CPU cores thereon is described, where one of the CPU cores is selected for testing, and inter-core context switching is performed.
Abstract: One embodiment disclosed relates to a method of executing program code on a target microprocessor with multiple CPU cores thereon. One of the CPU cores is selected for testing, and inter-core context switching is performed. Parallel execution occurs of diagnostic code on the selected CPU core and the program code on remaining CPU cores. Another embodiment disclosed relates to a microprocessor having a plurality of CPU cores integrated on the microprocessor chip. Inter-core communications circuitry is coupled to each of the CPU cores and configured to perform context switching between the CPU cores.

Proceedings ArticleDOI
Claudio Mucci1, Carlo Chiesa1, Andrea Lodi1, Mario Toma1, Fabio Campi1 
19 Nov 2003
TL;DR: A C-based algorithm development flow for XiRisc, a reconfigurable processor architecture targeted at embedded systems, that couples a VLIW RISC core with a custom-designed programmable hardware unit optimized for being programmed starting from data flow graph (DFG) descriptions is presented.
Abstract: Reconfigurable processors are an appealing option to achieve high performance and low energy consumption in digital signal processing, but their utilization often involves hardware issues not usual for algorithm developers proficient in high-level languages. This paper presents a C-based algorithm development flow for XiRisc, a reconfigurable processor architecture targeted at embedded systems, that couples a VLIW RISC core with a custom-designed programmable hardware unit optimized for being programmed starting from data flow graph (DFG) descriptions. Starting from C language, the flow produces both executable code for the processor core and configuration bits for the embedded programmable unit. The proposed flow was utilized for implementing a set of DSP algorithms on a prototype 0.18-µm XiRisc test chip, obtaining performance speed-ups of up to 10x and energy consumption reductions of up to 75%.

Patent
Jay Nejedlo1
20 Mar 2003
TL;DR: In this article, a methodology for testing a computer system using multiple test units, each test unit being associated with its respective core function circuitry, is presented, where the core circuitry and its respective test unit are located in a primary integrated circuit component of the computer system such as a processor, memory, or chipset.
Abstract: A methodology for testing a computer system using multiple test units, each test unit being associated with its respective core function circuitry. The core circuitry and its respective test unit are located in a primary integrated circuit component of the computer system, such as a processor, memory, or chipset. The on-chip test units communicate with one another and with other parts of the system, to determine whether a specification of the computer system is satisfied, without requiring a processor core of the computer system to execute an operating system program for the computer system.

Proceedings ArticleDOI
03 Mar 2003
TL;DR: This paper presents a low-cost software-based self-testing methodology for processor cores with the aim of producing compact test code sequences developed with a limited engineering effort and achieving a high fault coverage for the processor core.
Abstract: Software self-testing of embedded processor cores, which effectively partitions the testing effort between low-speed external equipment and internal processor resources, has recently been proposed as an alternative to classical hardware built-in self-test techniques, over which it provides significant advantages. In this paper we present a low-cost software-based self-testing methodology for processor cores with the aim of producing compact test code sequences developed with a limited engineering effort and achieving a high fault coverage for the processor core. The objective of small test code sequences is directly related to the utilization of low-speed external testers, since test time is primarily determined by the time required to download the test code to the processor memory at the tester's low frequency. Successful application of the methodology to a RISC processor core architecture with a 3-stage pipeline is demonstrated.
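
To give a flavor of the approach, the routine below is a toy self-test kernel: it exercises arithmetic and logic operations with a few known patterns, folds the results into a signature, and compares it against a value computed offline on a golden model. The patterns and the signature-passing interface are illustrative, not the paper's methodology.

```c
#include <stdint.h>

/* Toy self-test kernel: exercise ALU operations with known patterns, fold
 * the results into a signature and compare against a value computed offline
 * on a golden model. Patterns and interface are illustrative only. */
int alu_self_test(uint32_t expected_sig)
{
    static const uint32_t pattern[4] = {
        0x00000000u, 0xFFFFFFFFu, 0xAAAAAAAAu, 0x55555555u
    };
    uint32_t sig = 0;

    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            sig += pattern[i] + pattern[j];
            sig ^= pattern[i] & pattern[j];
            sig ^= (pattern[i] | pattern[j]) << 1;
        }

    return sig == expected_sig;   /* 1 = pass, 0 = fault detected */
}
```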

Patent
David A. Luick1
18 Sep 2003
TL;DR: In this article, multiple processor cores are implemented on a single integrated circuit chip, each having its own respective shareable functional units, which are preferably floating point units, and a failure of a shareable unit in one processor causes that processor to share the corresponding unit in another processor on the same chip.
Abstract: Multiple processor cores are implemented on a single integrated circuit chip, each having its own respective shareable functional units, which are preferably floating point units. A failure of a shareable unit in one processor causes that processor to share the corresponding unit in another processor on the same chip. Preferably, a functional unit is shared on a cycle interleaved basis.

Patent
09 Jan 2003
TL;DR: In this article, the authors describe a processor core that executes instructions, an interconnect interface, coupled to the processor core, that supports communication between the processor and a system interconnect external to the integrated circuit, and at least a portion of an external communication adapter coupled to processor core to support input and output communication via an input/output communication link.
Abstract: An integrated circuit, such as a processing unit, includes a substrate and integrated circuitry formed in the substrate. The integrated circuitry includes a processor core that executes instructions, an interconnect interface, coupled to the processor core, that supports communication between the processor core and a system interconnect external to the integrated circuit, and at least a portion of an external communication adapter, coupled to the processor core, that supports input/output communication via an input/output communication link.