
Showing papers on "Multi-core processor published in 2003"


Proceedings ArticleDOI
03 Dec 2003
TL;DR: This paper proposes and evaluates single-ISA heterogeneous multi-core architectures as a mechanism to reduce processor power dissipation, and results indicate a 39% average energy reduction while sacrificing only 3% in performance.
Abstract: This paper proposes and evaluates single-ISA heterogeneous multi-core architectures as a mechanism to reduce processor power dissipation. Our design incorporates heterogeneous cores representing different points in the power/performance design space; during an application's execution, system software dynamically chooses the most appropriate core to meet specific performance and power requirements. Our evaluation of this architecture shows significant energy benefits. For an objective function that optimizes for energy efficiency with a tight performance threshold, for 14 SPEC benchmarks, our results indicate a 39% average energy reduction while sacrificing only 3% in performance. An objective function that optimizes for energy-delay with looser performance bounds achieves, on average, nearly a factor-of-three improvement in energy-delay product while sacrificing only 22% in performance. These energy savings are substantially greater than those achievable with chip-wide voltage/frequency scaling.
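
To make the core-switching idea concrete, here is a minimal sketch of how system software might pick the lowest-power core whose estimated slowdown stays within a performance threshold; the core descriptor and the pick_core() routine are illustrative assumptions, not the paper's actual mechanism.

```c
#include <stddef.h>

/* Illustrative core descriptor; the fields and values are assumptions,
 * not data from the paper. */
struct core {
    const char *name;
    double rel_perf;   /* performance relative to the fastest core, in (0, 1] */
    double rel_power;  /* power relative to the fastest core, in (0, 1]       */
};

/* Pick the lowest-power core whose slowdown stays within max_slowdown
 * (e.g. 0.03 for a 3% performance-loss budget). Returns an index into
 * cores[], or -1 if no core satisfies the bound. */
int pick_core(const struct core *cores, size_t n, double max_slowdown)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        double slowdown = 1.0 - cores[i].rel_perf;
        if (slowdown > max_slowdown)
            continue;
        if (best < 0 || cores[i].rel_power < cores[best].rel_power)
            best = (int)i;
    }
    return best;
}
```

In the paper's setting, such a decision would be driven by sampled performance counters rather than static per-core ratings.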

809 citations


Proceedings ArticleDOI
01 May 2003
TL;DR: Results show that high performance can be obtained in each of the three modes (ILP, TLP, and DLP), demonstrating the viability of the polymorphous coarse-grained approach for future microprocessors.
Abstract: This paper describes the polymorphous TRIPS architecture which can be configured for different granularities and types of parallelism. TRIPS contains mechanisms that enable the processing cores and the on-chip memory system to be configured and combined in different modes for instruction, data, or thread-level parallelism. To adapt to small and large-grain concurrency, the TRIPS architecture contains four out-of-order, 16-wide-issue Grid Processor cores, which can be partitioned when easily extractable fine-grained parallelism exists. This approach to polymorphism provides better performance across a wide range of application types than an approach in which many small processors are aggregated to run workloads with irregular parallelism. Our results show that high performance can be obtained in each of the three modes (ILP, TLP, and DLP), demonstrating the viability of the polymorphous coarse-grained approach for future microprocessors.

512 citations


Patent
25 Apr 2003
TL;DR: In this paper, a workload transfer mechanism transfers the executing application software to a second computer hardware processor core in a search for reduced operating power, and a transfer delay mechanism is connected to delay a subsequent transfer of the executing application software if the system operating power may be conserved by such delay.
Abstract: A computer system for conserving operating power includes a number of computer hardware processor cores that differ amongst themselves at least in their respective operating power requirements and processing capabilities. A monitor gathers performance metric information from each of the computer hardware processor cores that is specific to a particular run of application software then executing. A workload transfer mechanism transfers the executing application software to a second computer hardware processor core in a search for reduced operating power. A transfer delay mechanism is connected to delay a subsequent transfer of the executing application software if the system operating power may be conserved by such delay.
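
The transfer-delay idea can be pictured as a simple hysteresis check: a proposed migration is skipped if the previous one happened too recently for its overhead to have paid off. The interval constant and timer interface below are placeholders, not the patent's mechanism.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical minimum interval between migrations, in timer ticks. */
#define MIN_MIGRATION_INTERVAL 1000u

static uint32_t last_migration_tick;

/* Returns true if a requested migration may proceed now; otherwise the
 * transfer is delayed because the previous one was too recent for its
 * overhead to have paid off. now_tick comes from a monotonic timer. */
bool may_migrate_now(uint32_t now_tick)
{
    if (now_tick - last_migration_tick < MIN_MIGRATION_INTERVAL)
        return false;                    /* delay the transfer */
    last_migration_tick = now_tick;
    return true;
}
```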

317 citations


Journal ArticleDOI
TL;DR: The article presents a technology that uses event model interfaces and a novel event flow mechanism that extends formal analysis approaches from real-time system design into the multiprocessor system on chip domain.
Abstract: Multiprocessor system-on-chip designs use complex on-chip networks to integrate different programmable processor cores, specialized memories, and other components on a single chip. MPSoCs have become the architecture of choice in many industries. Their heterogeneity inevitably increases with intellectual-property integration and component specialization, and system integration is becoming a major challenge in their design. Simulation is the state of the art in MPSoC performance verification, but it has conceptual disadvantages that become disabling as complexity increases. Formal approaches offer a systematic alternative. The article presents a technology that uses event model interfaces and a novel event flow mechanism that extends formal analysis approaches from real-time system design into the multiprocessor system-on-chip domain.

178 citations


Book ChapterDOI
TL;DR: In this article, the authors focus on optimization techniques for enhancing cache performance by hiding both the low main memory bandwidth and the latency of main memory accesses which is slow in contrast to the floating-point performance of the CPUs.
Abstract: In order to mitigate the impact of the growing gap between CPU speed and main memory performance, today's computer architectures implement hierarchical memory structures. The idea behind this approach is to hide both the low main memory bandwidth and the high latency of main memory accesses, which are slow in contrast to the floating-point performance of the CPUs. Usually, a small and expensive high-speed memory sits at the top of the hierarchy; it is typically integrated within the processor chip to provide data with low latency and high bandwidth: the CPU registers. Moving further away from the CPU, the layers of memory successively become larger and slower. The memory components located between the processor core and main memory are called cache memories or caches. They are intended to contain copies of main memory blocks to speed up accesses to frequently needed data [378], [392]. The next lower level of the memory hierarchy is the main memory, which is large but also comparatively slow. While external memory such as hard disk drives or remote memory components in a distributed computing environment represent the lower end of any common hierarchical memory design, this paper focuses on optimization techniques for enhancing cache performance.
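
One classic technique in this family is loop blocking (tiling), which restructures loops so that each block of data is reused while it still resides in cache. The sketch below tiles a matrix multiplication; the matrix size and block size are illustrative and would be tuned to the target cache.

```c
#define N  512   /* matrix dimension, illustrative                      */
#define BS  64   /* block size; tuned to the cache level being targeted */

/* C += A * B for N x N row-major matrices, with square blocking so that
 * each BS x BS block of the operands is reused while it is cache-resident. */
void matmul_blocked(const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int jj = 0; jj < N; jj += BS)
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++) {
                        double a = A[i * N + k];
                        for (int j = jj; j < jj + BS; j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```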

157 citations


Book ChapterDOI
01 Jan 2003
TL;DR: The chapter presents a framework for the design space exploration of embedded systems and focuses on a high level of abstraction, where the goal is to quickly identify interesting architectures that can be further evaluated by taking lower-level details into account.
Abstract: It is noted that network processors (NPs) generally consist of multiple processing units such as CPU cores, microengines, and dedicated hardware for computing-intensive tasks, memory units, caches, interconnections, and I/O interfaces. Following a system-on-a-chip (SoC) design method, these resources are put on a single chip, where they must interoperate in order to perform packet processing tasks at line speed. The process of determining the optimal hardware and software architecture for such processors includes issues involving resource allocation and partitioning. The chapter presents a framework for the design space exploration of embedded systems. It is observed that the architecture exploration and evaluation of network processors involve many tradeoffs and a complex interplay between hardware and software. The chapter focuses on a high level of abstraction, where the goal is to quickly identify interesting architectures that can be further evaluated by taking lower-level details into account. Task models, task scheduling, operating system issues, and packet processor architectures collectively play a role in different phases of the design space exploration of packet processor devices.

144 citations


Bishop Brock1, Karthick Rajamani1
01 Jan 2003
TL;DR: This paper discusses several of the SOC design issues pertaining to dynamic voltage and frequency scalable systems, and how these issues were resolved in the IBM PowerPC 405LP processor, and introduces DPM, a novel architecture for policy-guided dynamic power management.
Abstract: This paper discusses several of the SOC design issues pertaining to dynamic voltage and frequency scalable systems, and how these issues were resolved in the IBM PowerPC 405LP processor. We also introduce DPM, a novel architecture for policy-guided dynamic power management. We illustrate the utility of DPM by its ability to implement several classes of power management strategies and demonstrate practical results for a 405LP embedded system. I. INTRODUCTION Advances in low-power components and system design have brought general purpose computation into watches, wireless telephones, PDAs and tablet computers. Power management of these systems has traditionally focused on sleep modes and device power management (1). Embedded processors for these applications are highly integrated system-on-a-chip (SOC) devices that also support aggressive power management through techniques such as programmable clock gating and dynamic voltage and frequency scaling (DVFS). This paper describes one of these processors, and the development of a software architecture for policy-guided dynamic power management. II. 405LP DESIGN AND POWER MANAGEMENT FEATURES The IBM PowerPC 405LP is a dynamic voltage and frequency scalable embedded processor targeted at high-performance battery-operated devices. The 405LP is an SOC ASIC design in a 0.18 µm bulk CMOS process, integrating a PowerPC 405 CPU core modified for operation over a 1.0 V to 1.8 V range with off-the-shelf IP cores. The chip includes a flexible clock generation subsystem, new hardware accelerators for speech recognition and security, as well as a novel standby power management controller (2). In a system we normally operate the CPU/SDRAM at 266/133 MHz above 1.65 V and at 66/33 MHz above 0.9 V, typically providing a 13:1 SOC core power range over the 4:1 performance range. From a system design and active power management perspective, the most interesting facets of the 405LP SOC design concern the way the clocks are generated and controlled. These features of the processor are described in the remainder of this section.
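
As a rough illustration (and not IBM's DPM architecture itself), a utilization-driven policy could map load onto the two operating points quoted in the abstract; the 80% threshold below is an arbitrary placeholder.

```c
/* One voltage/frequency operating point. */
struct op_point {
    unsigned cpu_mhz;
    unsigned sdram_mhz;
    unsigned millivolts;
};

/* The two operating points quoted in the abstract for the 405LP. */
static const struct op_point slow_op = {  66,  33,  900 };
static const struct op_point fast_op = { 266, 133, 1650 };

/* Map recent CPU utilization (0..100) onto an operating point.
 * The 80% threshold is an arbitrary placeholder policy parameter. */
const struct op_point *dvfs_select(unsigned utilization_pct)
{
    return (utilization_pct > 80) ? &fast_op : &slow_op;
}
```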

129 citations


Journal ArticleDOI
TL;DR: Using reconfigurable devices to implement emulated digital architectures provides more flexibility compared to the custom very large-scale integration designs because different Falcon architectures can be used on the same FPGA device.
Abstract: A new emulated digital multilayer cellular neural network universal machine (CNN-UM) chip architecture called Falcon has been developed. In this brief, the main steps of the field-programmable gate array (FPGA) implementation are introduced. The main results are as follows. The CNN-UM architecture is emulated on Xilinx Virtex series FPGAs, and three-dimensional nonlinear spatio-temporal dynamics can be implemented on this architecture. The critical parameters of the implementation in a single-layer configuration are 55 million cell updates/s per processor core, or, equivalently, 1 giga-operation per second (GOPS) of computing performance. Despite this high performance, the power requirements of the architecture are relatively low, only ~3 W per processor core. Using reconfigurable devices to implement emulated digital architectures provides more flexibility compared to custom very large-scale integration designs because different Falcon architectures can be used on the same FPGA device.

123 citations


Journal ArticleDOI
TL;DR: This paper proposes a single-ISA heterogeneous multi-core architecture as a mechanism to reduce processor power dissipation and demonstrates a five-fold reduction in energy at a cost of only 25% performance.
Abstract: This paper proposes a single-ISA heterogeneous multi-core architecture as a mechanism to reduce processor power dissipation. It assumes a single chip containing a diverse set of cores that target different performance levels and consume different levels of power. During an application's execution, system software dynamically chooses the most appropriate core to meet specific performance and power requirements. It describes an example architecture with five cores of varying performance and complexity. Initial results demonstrate a five-fold reduction in energy at a cost of only 25% performance.

111 citations


Patent
16 Jul 2003
TL;DR: In this paper, a workload assignment mechanism assigns jobs to processor cores in order to maximize overall system throughput and the throughput of individual jobs, based on performance metric information from each of the computer hardware processor cores that is specific to a particular run of application software.
Abstract: A computer system for maximizing system and individual job throughput includes a number of computer hardware processor cores that differ amongst themselves at least in their respective resource requirements and processing capabilities. A monitor gathers performance metric information from each of the computer hardware processor cores that is specific to a particular run of application software then executing. Based on these metrics, a workload assignment mechanism assigns jobs to processor cores in order to maximize overall system throughput and the throughput of individual jobs.

99 citations


Patent
16 Dec 2003
TL;DR: In this article, the authors present a method, apparatus and system may optimize context switching between virtual machines (VMs) according to an embodiment of the present invention, a first processor core may execute a first VM while a second processor core can concurrently retrieve information pertaining to the state of a second VM into a processor cache.
Abstract: A method, apparatus and system may optimize context switching between virtual machines (“VMs”). According to an embodiment of the present invention, a first processor core may execute a first VM while a second processor core may concurrently retrieve information pertaining to the state of a second VM into a processor cache. When the virtual machine manager (“VMM”) performs a context switch between the first and the second VMs, the second processor may immediately begin executing the second VM, while the first processor may save the state information for the first VM. In yet another embodiment, different threads on a processor may be utilized to execute different VMs on a host.

Journal ArticleDOI
TL;DR: SimBed, an execution-driven simulation testbed that measures the execution behavior and power consumption of embedded applications and RTOSs by executing them on an accurate architectural model of a microcontroller with simulated real-time stimuli, shows that there is no clear winner in timing accuracy between preemptive systems and cooperative systems.
Abstract: We present the modelling of embedded systems with SimBed, an execution-driven simulation testbed that measures the execution behavior and power consumption of embedded applications and RTOSs by executing them on an accurate architectural model of a microcontroller with simulated real-time stimuli. We briefly describe the simulation environment and present a study that compares three RTOSs: µC/OS-II, a popular public-domain embedded real-time operating system; Echidna, a sophisticated, industrial-strength (commercial) RTOS; and NOS, a bare-bones multirate task scheduler reminiscent of typical "roll-your-own" RTOSs found in many commercial embedded systems. The microcontroller simulated in this study is the Motorola M-CORE processor: a low-power, 32-bit CPU core with 16-bit instructions, running at 20 MHz. Our simulations show what happens when RTOSs are pushed beyond their limits, and they depict situations in which unexpected interrupts or unaccounted-for task invocations disrupt timing, even when the CPU is lightly loaded. In general, there appears to be no clear winner in timing accuracy between preemptive systems and cooperative systems. The power-consumption measurements show that RTOS overhead is a factor of two to four higher than it needs to be, compared to the energy consumption of the minimal scheduler. In addition, poorly designed idle loops can cause the system to double its energy consumption; this energy could be saved by a simple hardware sleep mechanism.
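
The final observation about idle loops is easy to make concrete: a busy-wait idle loop keeps the CPU burning power, whereas parking the core until the next interrupt does not. In the sketch below, cpu_sleep(), runnable_tasks(), and schedule() are hypothetical hooks standing in for whatever the RTOS and hardware actually provide.

```c
extern int  runnable_tasks(void);   /* hypothetical scheduler query         */
extern void schedule(void);         /* hypothetical dispatch routine        */
extern void cpu_sleep(void);        /* hypothetical wait-for-interrupt hook */

/* Idle loop that parks the core instead of spinning; the paper notes that
 * a busy idle loop can roughly double system energy consumption. */
void idle_loop(void)
{
    for (;;) {
        if (runnable_tasks())
            schedule();
        else
            cpu_sleep();            /* wake again on the next interrupt */
    }
}
```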

Patent
04 Aug 2003
TL;DR: In this paper, the authors measure the degree of parallelism achieved in executing program instructions and use this to dynamically control the clock speeds and supply voltage levels applied to different processor cores 4, 6 so as to reduce the overall amount of energy consumed by matching the processing performance achieved to the clock speed and voltage levels used.
Abstract: A multi-processing system 2 measures the degree of parallelism achieved in executing program instructions and uses this to dynamically control the clock speeds and supply voltage levels applied to different processor cores 4, 6 so as to reduce the overall amount of energy consumed by matching the processing performance achieved to the clock speeds and voltage levels used.

Journal ArticleDOI
TL;DR: A system chip targeting image and voice processing and recognition application domains is implemented as a representative of the potential of using programmable logic in system design.
Abstract: A system chip targeting image and voice processing and recognition application domains is implemented as a representative of the potential of using programmable logic in system design. It features an embedded reconfigurable processor built by joining a configurable and extensible processor core and an SRAM-based embedded field-programmable gate array (FPGA). Application-specific bus-mapped coprocessors and flexible input/output peripherals and interfaces can also be added and dynamically modified by reconfiguring the embedded FPGA. The architecture of the system is discussed as well as the design flows for pre- and post-silicon design and customization. The silicon area required by the system is 20 mm² in a 0.18-µm CMOS technology. The embedded FPGA accounts for about 40% of the system area.

Book
01 Jan 2003
TL;DR: This text begins with a general introduction to parallel computing, then progresses to the specifics of parallel computing with heterogeneous networks, proving a superior reference for researchers and graduate students in computer science.
Abstract: From the Publisher: Traditional software for parallel computing typically spreads computations evenly over a set of linked processors. This, however, may not always be the best way of maximizing the performance of a given network or cluster of computers. By taking account of the actual performance of individual processors and the links between them, parallel computing on heterogeneous networks offers significant improvements in parallel computation. Alexey Lastovetsky's Parallel Computing on Heterogeneous Networks provides a resource on his innovative technology. The text begins with a general introduction to parallel computing, then progresses to the specifics of parallel computing with heterogeneous networks. Practically oriented, the book includes illustrative algorithms in the mpC programming language, a unique high-level software tool designed by the author specifically for programming heterogeneous parallel algorithms. All concepts and algorithms are illustrated with working programs that can be compiled or executed on any cluster. All of the contents are also illustrated by carefully tested source code, allowing readers to play with the presented software tools and algorithms - particularly with the mpC programming language - while reading the book. Appendices provide both the complete source code and user's guide for the principal applications used to illustrate the book's material. Parallel Computing on Heterogeneous Networks proves a superior reference for researchers and graduate students in computer science.
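
The book's central idea, dividing work in proportion to measured processor speeds rather than evenly, can be sketched in a few lines (plain C here rather than the book's mpC language; the speed values are assumed to come from prior benchmarking).

```c
#include <stddef.h>

/* Split `total` work items across n processors in proportion to their
 * measured relative speeds; any rounding leftover goes to processor 0. */
void partition_by_speed(const double *speed, size_t n, long total, long *share)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += speed[i];

    long assigned = 0;
    for (size_t i = 0; i < n; i++) {
        share[i] = (long)(total * (speed[i] / sum));
        assigned += share[i];
    }
    share[0] += total - assigned;   /* keep the total exact */
}
```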

Proceedings ArticleDOI
12 May 2003
TL;DR: A Sensor-Network Asynchronous Processor (SNAP) is presented, which is designed to be both a processor core for a sensor- network node and a component of a chip multiprocessor, the Network on a Chip (NoC), which will execute a novel sensor-network simulator.
Abstract: We present a Sensor-Network Asynchronous Processor (SNAP), which we have designed to be both a processor core for a sensor-network node and a component of a chip multiprocessor, the Network on a Chip (NoC), which will execute a novel sensor-network simulator. We discuss the advantages of using the same processor for nodes in physical and simulated sensor networks. We describe the attributes that a processor must possess to function well in both roles, and we then describe the way we designed SNAP to have these attributes.

Patent
28 Feb 2003
TL;DR: In this article, the authors consider a data processing system that includes a processor core and system circuitry coupled to the processor core, and propose a method for conserving power by granting bus access to a requesting device.
Abstract: Power is conserved in a data processing system that includes a processor core and system circuitry coupled to the processor core. A first method for conserving power includes entering a low power state by the processor and the system circuitry and enabling bus arbitration by the processor while the processor core remains in the low power state. One embodiment further contemplates a method of conserving power by granting bus access to a requesting device and entering a power conservation mode by the processor core in response thereto. Bus operations are then performed while the processor core remains in the power conservation mode. Another embodiment contemplates a method of debugging a data processing system in which a debug state is entered by the processor and the system circuitry and, thereafter, bus arbitration is enabled by the processor while the processor core remains in the debug state.

Patent
02 Jul 2003
TL;DR: In this paper, the authors propose a hierarchical testing architecture compliant with IEEE 1149.1 Joint Test Action Group (JTAG) standard that leverages existing standard testing architectures within each processor core to allow for chip level access to schedule built-in self test (BIST) operations for the cores.
Abstract: A multi-core chip (MCC) having a plurality of processor cores includes a hierarchical testing architecture compliant with the IEEE 1149.1 Joint Test Action Group (JTAG) standard that leverages existing standard testing architectures within each processor core to allow for chip level access to schedule built-in self test (BIST) operations for the cores. The MCC includes boundary scan logic, a chip-level JTAG-compliant test access port (TAP) controller, a chip-level master BIST controller, and a test pin interface. Each processor core includes a JTAG-compliant TAP controller and one or more BIST enabled memory arrays. The chip TAP controller includes one or more user defined registers, including a core select register and a test mode register. The core select register stores a plurality of core select bits that select corresponding processor cores for BIST operations.
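
As a rough illustration of how driver software might use such a core-select register, the snippet below composes a bit mask of cores to enable for BIST; the register width, bit assignment, and the jtag_write_core_select() routine are hypothetical and not taken from the patent.

```c
#include <stdint.h>

#define MAX_CORES 8   /* hypothetical width of the core-select register */

/* Assumed low-level routine that shifts a value into the chip-level TAP
 * controller's core-select register; not taken from the patent. */
extern void jtag_write_core_select(uint32_t mask);

/* Enable BIST scheduling for the processor cores listed in cores[]. */
void select_cores_for_bist(const unsigned *cores, unsigned count)
{
    uint32_t mask = 0;
    for (unsigned i = 0; i < count; i++)
        if (cores[i] < MAX_CORES)
            mask |= 1u << cores[i];
    jtag_write_core_select(mask);
}
```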

Patent
09 Jan 2003
TL;DR: A reconfigurable digital processing system for space includes the utilization of field programmable gate arrays utilizing a hardware centric approach to reconfigure software processors in a space vehicle through the reprogramming of multiple FPGAs such that one obtains a power/performance characteristic for signal processing tasks that cannot be achieved simply through the use of off-the-shelf processors as mentioned in this paper.
Abstract: A reconfigurable digital processing system for space includes the utilization of field programmable gate arrays (FPGAs), using a hardware-centric approach to reconfigure software processors in a space vehicle through the reprogramming of multiple FPGAs, such that one obtains a power/performance characteristic for signal processing tasks that cannot be achieved simply through the use of off-the-shelf processors. In one embodiment, for damaged or otherwise inoperable signal processors located on a spacecraft, the remaining processors which are undamaged can be reconfigured, through changing the machine language and binary of the field programmable gate arrays, to change the core processor while at the same time maintaining undamaged components, so that the signal processing functions can be restored utilizing a RAM-based FPGA as a signal processor. In one embodiment, multiple FPGAs are connected together by a data bus and are also provided with data pipes which interconnect selected FPGAs together to provide the necessary processing function. Flexibility in reconfiguration includes the utilization of a timing and synchronization block as well as a common configuration block which, when coupled to an interconnect block, permits reconfiguration of a customizable application core, depending on the particular signal processing function desired. The result is that damaged or inoperable signal processing components can be repaired in space, without having to physically attend to the hardware, by transmitting to the spacecraft commands which reconfigure the particular FPGAs and thus alter their signal processing function. Mission changes can also be accomplished by reprogramming the FPGAs.

Journal ArticleDOI
09 Feb 2003
TL;DR: This single-chip multiprocessor for embedded systems consists of two M32R 32-bit CPU cores and a 512-kB shared SRAM, and operates at 600 MHz with 800 mW peak power dissipation.
Abstract: A 600-MHz single-chip multiprocessor, which includes two M32R 32-bit CPU cores, a 512-kB shared SRAM and an internal shared pipelined bus, was fabricated using a 0.15-µm CMOS process for embedded systems. This multiprocessor is based on symmetric multiprocessing (SMP), and supports the modified-exclusive-shared-invalid (MESI) cache coherency protocol. The multiprocessor inherits the advantages of previously reported single-chip multiprocessors, while its multiprocessor architecture is optimized for use as an embedded processor. The internal shared pipelined bus has a low latency and large bandwidth (4.8 GB/s). These features enhance the performance of the multiprocessor. In addition, the multiprocessor employs various low-power techniques. The multiprocessor dissipates 800 mW in a 1.5-V 600-MHz multiprocessor mode. Standby power dissipation is less than 1.5 mW at 1.5 V. Hence, the multiprocessor achieves higher performance and lower power consumption. This paper presents a single-chip multiprocessor architecture optimized for use as an embedded processor and its various low-power techniques.
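
For readers unfamiliar with the MESI protocol mentioned above, the sketch below shows the state a cache line ends up in after the local processor's own reads and writes; bus-side transitions and invalidation traffic are omitted, so this is a simplification rather than the chip's actual coherency controller.

```c
/* MESI states for a single cache line. */
enum mesi { MODIFIED, EXCLUSIVE, SHARED, INVALID };

/* State after the local processor reads the line. On a miss, the line is
 * fetched and enters Shared or Exclusive depending on other caches. */
enum mesi on_local_read(enum mesi s, int other_caches_have_copy)
{
    if (s == INVALID)
        return other_caches_have_copy ? SHARED : EXCLUSIVE;
    return s;   /* M, E and S satisfy the read locally */
}

/* State after the local processor writes the line. From Shared or Invalid
 * the bus must first invalidate other copies; that traffic is omitted. */
enum mesi on_local_write(enum mesi s)
{
    (void)s;
    return MODIFIED;
}
```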

Proceedings ArticleDOI
13 Oct 2003
TL;DR: An interface synthesis approach that forms part of the hardware-software codesign methodology for such an FPGA-based platform based on a novel memory mapping algorithm that maps data used by both the hardware and the software to shared memories on the reconfigurable fabric.
Abstract: Several system-on-chip (SoC) platforms have recently emerged that use reconfigurable logic (FPGAs) as a programmable coprocessor to reduce the computational load on the main processor core. We present an interface synthesis approach that forms part of our hardware-software codesign methodology for such an FPGA-based platform. The approach is based on a novel memory mapping algorithm that maps data used by both the hardware and the software to shared memories on the reconfigurable fabric. The memory mapping algorithm couples with a high-level synthesis tool and uses scheduling information to map variables, arrays and complex data structures to the shared memories in a way that minimizes the number of registers and multiplexers used in the hardware interface. We also present three software schemes that enable the application software to communicate with this hardware interface. We demonstrate the utility of our approach and study the trade-offs involved using a case study of the codesign of a computationally expensive portion of the MPEG-1 multimedia application on to the Altera Nios platform.

01 Jan 2003
TL;DR: Single-ISA heterogeneous multicore architectures as a mechanism to reduce processor power dissipation are proposed and initial results show a more than three-fold reduction in energy at a cost of only 18% performance.
Abstract: This paper proposes single-ISA heterogeneous multicore architectures as a mechanism to reduce processor power dissipation. It assumes a single chip containing a diverse set of cores that target different performance levels and consume different levels of power. During an application’s execution, system software evaluates the resources required by an application for good performance and dynamically chooses the core that can best meet these requirements while minimizing energy consumption. It describes an example architecture with five cores of varying performance and complexity. Initial results show a more than three-fold reduction in energy at a cost of only 18% performance.

Proceedings ArticleDOI
09 Nov 2003
TL;DR: The INSIDE system is presented, which combines a methodology to determine which code segments are most suited for implementation as a set of extensible instructions, a heuristic algorithm to select pre-configured extensible processors, and an estimation tool which rapidly estimates the performance of an application on a generated extensible processor.
Abstract: This paper presents the INSIDE system that rapidly searches the design space for extensible processors, given area and performance constraints of an embedded application, while minimizing the design turn-around-time. Our system consists of a) a methodology to determine which code segments are most suited for implementation as a set of extensible instructions, b) a heuristic algorithm to select pre-configured extensible processors as well as extensible instructions (library), and c) an estimation tool which rapidly estimates the performance of an application on a generated extensible processor. By selecting the right combination of a processor core plus extensible instructions, we achieve a performance increase on average of 2.03x (up to 7x) compared to the base processor core at a minimum hardware overhead of 25% on average.
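
The selection of extensible instructions under an area constraint resembles a knapsack problem; the following greedy benefit-per-area sketch is a simple stand-in for the paper's heuristic (which also selects among pre-configured processor cores), with all structure names and fields invented for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* One candidate extensible instruction; fields are invented for illustration. */
struct ext_insn {
    double gain;     /* estimated cycles saved by adding this instruction */
    double area;     /* extra hardware area it requires                   */
    bool   chosen;
};

/* Greedily pick instructions by gain/area ratio until the area budget is
 * spent. A simple stand-in for the paper's heuristic, which also chooses
 * among pre-configured processor cores. */
void select_instructions(struct ext_insn *cand, size_t n, double area_budget)
{
    for (;;) {
        size_t best = n;
        double best_ratio = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (cand[i].chosen || cand[i].area > area_budget)
                continue;
            double ratio = cand[i].gain / cand[i].area;
            if (ratio > best_ratio) {
                best_ratio = ratio;
                best = i;
            }
        }
        if (best == n)
            break;
        cand[best].chosen = true;
        area_budget -= cand[best].area;
    }
}
```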

Journal ArticleDOI
14 Oct 2003
TL;DR: The HiBRID-SoC multi-core system-on-chip architecture targets a wide range of multimedia applications with particularly high processing demands, including general signal processing applications, video encoding/decoding, image processing, or a combination of these tasks.
Abstract: The HiBRID-SoC multi-core system-on-chip architecture targets a wide range of multimedia applications with particularly high processing demands, including general signal processing applications, video encoding/decoding, image processing, or a combination of these tasks. For this purpose, the HiBRID-SoC integrates three fully programmable processor cores and various interfaces onto a single chip, all tied to a 64-bit AMBA AHB bus. The processor cores are individually optimized to the particular computational characteristics of different application fields, complementing each other to deliver high performance levels with high flexibility at reduced system cost. The HiBRID-SoC is fabricated in a 0.18-µm 6LM standard-cell technology, occupies about 82 mm², and operates at 145 MHz. An MPEG-4 Advanced Simple Profile decoder in full TV resolution requires about 120 MHz for real-time performance on the HiBRID-SoC, utilizing only two of the three cores.

Patent
22 Oct 2003
TL;DR: In this article, a method of executing program code on a target microprocessor with multiple CPU cores thereon is described, where one of the CPU cores is selected for testing, and inter-core context switching is performed.
Abstract: One embodiment disclosed relates to a method of executing program code on a target microprocessor with multiple CPU cores thereon. One of the CPU cores is selected for testing, and inter-core context switching is performed. Parallel execution occurs of diagnostic code on the selected CPU core and the program code on remaining CPU cores. Another embodiment disclosed relates to a microprocessor having a plurality of CPU cores integrated on the microprocessor chip. Inter-core communications circuitry is coupled to each of the CPU cores and configured to perform context switching between the CPU cores.

Proceedings ArticleDOI
Claudio Mucci1, Carlo Chiesa1, Andrea Lodi1, Mario Toma1, Fabio Campi1 
19 Nov 2003
TL;DR: A C-based algorithm development flow for XiRisc, a reconfigurable processor architecture targeted at embedded systems, that couples a VLIW RISC core with a custom-designed programmable hardware unit optimized for being programmed starting from data flow graph (DFG) descriptions is presented.
Abstract: Reconfigurable processors are an appealing option to achieve high performance and low energy consumption in digital signal processing, but their utilization often involves hardware issues not usual for algorithm developers proficient in high-level languages. This paper presents a C-based algorithm development flow for XiRisc, a reconfigurable processor architecture targeted at embedded systems, that couples a VLIW RISC core with a custom-designed programmable hardware unit optimized for being programmed starting from data flow graph (DFG) descriptions. Starting from C language, the flow produces both executable code for the processor core and configuration bits for the embedded programmable unit. The proposed flow was utilized for implementing a set of DSP algorithms on a prototype 0.18-µm XiRisc test chip, obtaining performance speed-ups of up to 10x and energy consumption reductions of up to 75%.

Patent
Jay Nejedlo1
20 Mar 2003
TL;DR: In this article, a methodology for testing a computer system using multiple test units, each test unit being associated with its respective core function circuitry, is presented, where the core circuitry and its respective test unit are located in a primary integrated circuit component of the computer system such as a processor, memory, or chipset.
Abstract: A methodology for testing a computer system using multiple test units, each test unit being associated with its respective core function circuitry. The core circuitry and its respective test unit are located in a primary integrated circuit component of the computer system, such as a processor, memory, or chipset. The on-chip test units communicate with one another and with other parts of the system, to determine whether a specification of the computer system is satisfied, without requiring a processor core of the computer system to execute an operating system program for the computer system.

Proceedings ArticleDOI
03 Mar 2003
TL;DR: This paper presents a low-cost software-based self-testing methodology for processor cores with the aim of producing compact test code sequences developed with a limited engineering effort and achieving a high fault coverage for the processor core.
Abstract: Software self-testing of embedded processor cores, which effectively partitions the testing effort between low-speed external equipment and internal processor resources, has recently been proposed as an alternative to classical hardware built-in self-test techniques, over which it provides significant advantages. In this paper we present a low-cost software-based self-testing methodology for processor cores with the aim of producing compact test code sequences developed with a limited engineering effort and achieving a high fault coverage for the processor core. The objective of small test code sequences is directly related to the utilization of low-speed external testers, since test time is primarily determined by the time required to download the test code to the processor memory at the tester's low frequency. Successful application of the methodology to a RISC processor core architecture with a 3-stage pipeline is demonstrated.
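
To give a flavor of the approach, the routine below is a toy self-test kernel: it exercises arithmetic and logic operations with a few known patterns, folds the results into a signature, and compares it against a value computed offline on a golden model. The patterns and the signature-passing interface are illustrative, not the paper's methodology.

```c
#include <stdint.h>

/* Toy self-test kernel: exercise ALU operations with known patterns, fold
 * the results into a signature and compare against a value computed offline
 * on a golden model. Patterns and interface are illustrative only. */
int alu_self_test(uint32_t expected_sig)
{
    static const uint32_t pattern[4] = {
        0x00000000u, 0xFFFFFFFFu, 0xAAAAAAAAu, 0x55555555u
    };
    uint32_t sig = 0;

    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            sig += pattern[i] + pattern[j];
            sig ^= pattern[i] & pattern[j];
            sig ^= (pattern[i] | pattern[j]) << 1;
        }

    return sig == expected_sig;   /* 1 = pass, 0 = fault detected */
}
```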

Patent
David A. Luick1
18 Sep 2003
TL;DR: In this article, multiple processor cores are implemented on a single integrated circuit chip, each having its own respective shareable functional units, which are preferably floating point units, and a failure of a shareable unit in one processor causes that processor to share the corresponding unit in another processor on the same chip.
Abstract: Multiple processor cores are implemented on a single integrated circuit chip, each having its own respective shareable functional units, which are preferably floating point units. A failure of a shareable unit in one processor causes that processor to share the corresponding unit in another processor on the same chip. Preferably, a functional unit is shared on a cycle interleaved basis.

Patent
09 Jan 2003
TL;DR: In this article, the authors describe a processor core that executes instructions, an interconnect interface, coupled to the processor core, that supports communication between the processor and a system interconnect external to the integrated circuit, and at least a portion of an external communication adapter coupled to processor core to support input and output communication via an input/output communication link.
Abstract: An integrated circuit, such as a processing unit, includes a substrate and integrated circuitry formed in the substrate. The integrated circuitry includes a processor core that executes instructions, an interconnect interface, coupled to the processor core, that supports communication between the processor core and a system interconnect external to the integrated circuit, and at least a portion of an external communication adapter, coupled to the processor core, that supports input/output communication via an input/output communication link.