
Showing papers on "Multi-core processor published in 2001"


Patent
01 Jun 2001
TL;DR: An integrated design environment (IDE) for forming virtual embedded systems is disclosed, which includes a design language for forming finite state machine models of hardware components that are coupled to simulators of processor cores, preferably instruction set accurate simulators.
Abstract: An integrated design environment (IDE) is disclosed for forming virtual embedded systems. The IDE includes a design language for forming finite state machine models of hardware components that are coupled to simulators of processor cores, preferably instruction set accurate simulators. A software debugger interface permits a software application to be loaded and executed on the virtual embedded system. A virtual test bench may be coupled to the simulation to serve as a human-machine interface. In one embodiment, the IDE is provided as a web-based service for the evaluation, development and procurement phases of an embedded system project. IP components, such as processor cores, may be evaluated using a virtual embedded system. In one embodiment, a virtual embedded system is used as an executable specification for the procurement of a good or service related to an embedded system.

231 citations


Proceedings ArticleDOI
01 Jul 2001
TL;DR: This work proposes a fault-tolerant approach to reliable microprocessor design that provides significant resistance to core processor design errors and operational faults such as supply voltage noise and energetic particle strikes, and shows through cycle-accurate simulation and timing analysis of a physical checker design that it preserves system performance while keeping area overheads and power demands low.
Abstract: We propose a fault-tolerant approach to reliable microprocessor design. Our approach, based on the use of an online checker component in the processor pipeline, provides significant resistance to core processor design errors and operational faults such as supply voltage noise and energetic particle strikes. We show through cycle-accurate simulation and timing analysis of a physical checker design that our approach preserves system performance while keeping area overheads and power demands low. Furthermore, analyses suggest that the checker is a fairly simple state machine that can be formally verified, scaled in performance, and reused. Further simulation analyses show virtually no performance impacts when our simple checker design is coupled with a high-performance microprocessor model. Timing analyses indicate that a fully synthesized unpipelined 4-wide checker component in 0.25 μm technology is capable of checking Alpha instructions at 288 MHz. Physical analyses also confirm that costs are quite modest; our prototype checker requires less than 6% of the area and 1.5% of the power of an Alpha 21264 processor in the same technology. Additional improvements to the checker component are described which allow for improved detection of design, fabrication and operational faults.
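The online-checker scheme summarized above can be sketched at a very high level. The following Python fragment is a hedged illustration only, not the paper's checker design: the three-operation golden model and all names are assumptions. It captures just the compare-and-recover control flow, in which a small trusted model re-executes each retired instruction and any mismatch with the core's result triggers a recovery.

```python
# Illustrative sketch of an online checker: a small, trusted functional
# model re-executes each instruction the core wants to commit.
# Assumption-laden toy, not the paper's actual checker.

def checker_execute(op, a, b):
    """Tiny golden model covering a few example 32-bit ALU ops."""
    if op == "add":
        return (a + b) & 0xFFFFFFFF
    if op == "sub":
        return (a - b) & 0xFFFFFFFF
    if op == "and":
        return a & b
    raise ValueError(f"unknown op: {op}")

def commit_with_check(retired):
    """Re-check retired instructions before commit.

    `retired` yields (op, a, b, core_result) tuples from the
    (possibly faulty) core pipeline.  Returns the committed values
    and the number of recoveries triggered by mismatches.
    """
    committed, recoveries = [], 0
    for op, a, b, core_result in retired:
        golden = checker_execute(op, a, b)
        if golden != core_result:
            recoveries += 1        # mismatch: flush and restart the core
        committed.append(golden)   # the checker's result is committed
    return committed, recoveries
```

In the paper's design the checker is a simple state machine placed near commit that can be formally verified; the sketch above only conveys the idea that correctness rests on the checker, not on the complex core.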

154 citations


Patent
02 Nov 2001
TL;DR: In this article, a multi-processor network processing environment is provided in which parallel processing may occur, while still maintaining ordered serialization between the input and the output of the network processor.
Abstract: A multi-processor network processing environment is provided in which parallel processing may occur. In one embodiment, a network processor having multiple processor cores may be utilized. Parallel processing at the front end of the network processor is encouraged while still maintaining ordered serialization between the input and the output of the network processor. The disclosed order serialization techniques obtain the benefits of parallel processing at the front end of the system while minimizing blocking times at the output.

137 citations


Patent
28 Dec 2001
TL;DR: In this article, a reconfigurable channel CODEC (encoder and decoder) processor for a wireless communication system is presented, which includes processor cores (210, 250) and algorithm-specific kernels (212, 214, 216, 252, 254, 256).
Abstract: A reconfigurable channel CODEC (encoder and decoder) processor for a wireless communication system is disclosed. A high degree of user programmability and reconfigurability is provided by the channel CODEC processor (200). In particular, the reconfigurable channel CODEC processor includes processor cores (210, 250) and algorithm-specific kernels (212, 214, 216, 252, 254, 256) that contain logic circuits tailored for carrying out predetermined but user-configurable decoding and encoding algorithms. The interconnects (230, 270) between the processor cores and the algorithm-specific kernels are also user-configurable. Thus, the same hardware can be reconfigured for many different wireless communication standards.

96 citations


Proceedings ArticleDOI
25 Apr 2001
TL;DR: OCAPI-xl is developed, a methodology in which the HW/SW partitioning decision can be made anywhere in the design flow, even just prior to doing code-generation for both HW and SW, made possible thanks to a refinable, implementable, architecture independent system description.
Abstract: The implementation of embedded networked appliances requires a mix of processor cores and HW accelerators on a single chip. When designing such complex and heterogeneous SoCs, the HW/SW partitioning decision traditionally needs to be made prior to refining the system description. With OCAPI-xl, we developed a methodology in which the partitioning decision can be made anywhere in the design flow, even just prior to code generation for both HW and SW. This is made possible by a refinable, implementable, architecture-independent system description. The OCAPI-xl model was used to develop a stand-alone, networked camera with an on-board GIF engine and network layer.

74 citations


Journal ArticleDOI
TL;DR: This work describes the use of a reconfigurable processor core based on a RISC architecture as a starting point for application-specific processor design and shows how hardware emulation based on programmable logic can be integrated into the hardware/software codesign flow.
Abstract: Application-specific processors offer an attractive option in the design of embedded systems by providing high performance for a specific application domain. In this work, we describe the use of a reconfigurable processor core based on a RISC architecture as a starting point for application-specific processor design. By using a common base instruction set, development cost can be reduced and design space exploration is focused on the application-specific aspects of performance. An important aspect of deploying any new architecture is verification, which usually requires lengthy software simulation of a design model. We show how hardware emulation based on programmable logic can be integrated into the hardware/software codesign flow. While hardware emulation previously required massive investment in design effort and special-purpose emulators, an emulation approach based on high-density field-programmable gate array (FPGA) devices now makes hardware emulation practical and cost effective for embedded processor designs. To reduce development cost and avoid duplication of design effort, FPGA prototypes and ASIC implementations are derived from a common source: we show how to perform targeted optimizations to fully exploit the capabilities of the target technology while maintaining a common source base.

70 citations


Proceedings ArticleDOI
29 Mar 2001
TL;DR: A novel test methodology for testing IP cores in SoCs with embedded processor cores that supports at-speed testing for delay faults and stuck-at testing of IP cores implementing full-scan is presented.
Abstract: We present a novel test methodology for testing IP cores in SoCs with embedded processor cores. A test program is run on the processor core that generates and delivers test patterns to the target IP cores in the SoC and analyzes the test responses. This provides tremendous flexibility in the type of patterns that can be applied to the IP cores without incurring significant hardware overhead. We use a bus based SoC simulation model to validate our test methodology. The test methodology involves addition of a test wrapper that can be configured for specific test needs. The methodology supports at-speed testing for delay faults and stuck-at testing of IP cores implementing full-scan.

64 citations


Patent
10 Dec 2001
TL;DR: In this article, the address depth of the global block RAM and the number of wait states of the local block RAM are selected by a user, and they can be set either prior to configuration of the FPGA or programmed using instructions of the processor core.
Abstract: A data processing system having a user configurable memory controller, local block RAMs, global block RAMs and a processor core can be configured in a single field programmable gate array (FPGA). The address depth of the global block RAMs and the number of wait states can be selected by a user, and they can be set either prior to configuration of the FPGA or programmed using instructions of the processor core. The number of wait states of the local block RAM is also user selectable. An algorithm that can optimize the address depth and the number of wait states to achieve a performance level is also disclosed. The present invention can be applied to designs having separate instruction and data sides.

62 citations


Patent
22 May 2001
TL;DR: In this article, a network interface unit is presented including a microcontroller having multiple blocks of programmable logic that are variably configurable to perform selected functions, such that it may be configured to assemble, transmit, and receive data units (i.e., frames) of one communication protocol, then later reconfigured to assemble, transmit, and receive frames of another protocol.
Abstract: A network interface unit is presented including a microcontroller having multiple blocks of programmable logic that are variably configurable to perform selected functions. The network interface unit may be configured to assemble, transmit, and receive data units (i.e., frames) of one communication protocol, then later reconfigured to assemble, transmit, and receive frames of another protocol. The microcontroller includes several components formed upon a single monolithic semiconductor substrate, among them an execution unit. The execution unit includes a processor core and multiple configurable logic blocks (CLBs) coupled to the processor core. The processor core is configured to execute instructions, for example x86 instructions. Each of the multiple CLBs includes programmable logic which may be, for example, PLA circuitry, PAL circuitry, or FPGA circuitry. The programmable logic includes programmable switching elements such as, for example, EPROM elements, EEPROM elements, or SRAM elements. During instruction execution, the processor core produces output signals. During a programming operation, the output signals include programming signals which configure the programmable logic within one or more of the multiple CLBs to perform selected functions. Once programmed, each CLB performs the selected function in response to output signals produced by the processor core. The network interface unit also includes one or more memory devices and an electrical interface unit. The one or more memory devices store instructions and data used by the processor core. The electrical interface unit is adapted for coupling to the network transmission medium and performs as an interface between the microcontroller and the network transmission medium.

59 citations


Proceedings ArticleDOI
13 Mar 2001
TL;DR: A deterministic software-based self-testing methodology for processor cores is introduced that efficiently tests the processor datapath modules without any modification of the processor structure to provide high fault coverage without repetitive fault simulation experiments.
Abstract: A deterministic software-based self-testing methodology for processor cores is introduced that efficiently tests the processor datapath modules without any modification of the processor structure. It provides a guaranteed high fault coverage without the repetitive fault simulation experiments that are necessary in pseudorandom software-based processor self-testing approaches. Test generation and output analysis are performed by utilizing the processor functional modules, like accumulators (the arithmetic part of the ALU) and shifters (if they exist), through processor instructions. No extra hardware is required and there is no performance degradation.
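The flavor of instruction-based self-test can be conveyed with a small sketch. Everything below is a hypothetical illustration, not the paper's method: the 16-bit ALU stand-in, the pattern set, and the rotate-and-XOR compaction are all assumptions. The point is that a test program applies deterministic patterns through ordinary instructions and compacts the responses into a signature using only the ALU and shifter themselves.

```python
# Hypothetical software-based self-test sketch: deterministic patterns
# are applied to a datapath module via normal instructions, and the
# responses are compacted into a signature using only ALU/shifter ops.

def alu_under_test(a, b):
    # Stand-in for the hardware adder exercised by ADD instructions.
    return (a + b) & 0xFFFF

def run_self_test(patterns, golden_signature):
    """Return True iff the compacted response signature matches."""
    signature = 0
    for a, b in patterns:                   # deterministic, precomputed
        response = alu_under_test(a, b)
        # Rotate-left-by-1 then XOR: compaction doable with a shifter
        # and the ALU, in the spirit of instruction-based testing.
        signature = ((signature << 1) | (signature >> 15)) & 0xFFFF
        signature ^= response
    return signature == golden_signature

# Example deterministic pattern set (illustrative only).
patterns = [(0x0000, 0x0000), (0xFFFF, 0x0001), (0xAAAA, 0x5555)]
```

Because the patterns and the expected signature are computed deterministically ahead of time, no fault simulation loop is needed at test-application time, which matches the methodology's stated advantage over pseudorandom approaches.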

55 citations


Proceedings ArticleDOI
17 Jun 2001
TL;DR: Detailed analysis indicates that the dominant overheads in an implicitly-threaded CMP are speculation state overflow due to limited L1 cache capacity, and load imbalance and data dependences in fine-grain threads.
Abstract: Recent proposals for Chip Multiprocessors (CMPs) advocate speculative, or implicit, threading in which the hardware employs prediction to peel off instruction sequences (i.e., implicit threads) from the sequential execution stream and speculatively executes them in parallel on multiple processor cores. These proposals augment a conventional multiprocessor, which employs explicit threading, with the ability to handle implicit threads. Current proposals focus on only implicitly-threaded code sections. This paper identifies, for the first time, the issues in combining explicit and implicit threading. We present the Multiplex architecture to combine the two threading models. Multiplex exploits the similarities between implicit and explicit threading, and provides unified support for the two threading models without additional hardware. Multiplex groups a subset of protocol states in an implicitly-threaded CMP to provide a write-invalidate protocol for explicit threads. Using a fully-integrated compiler infrastructure for automatic generation of Multiplex code, this paper presents a detailed performance analysis for entire benchmarks, instead of just implicitly-threaded sections, as done in previous papers. We show that neither threading model alone performs consistently better than the other across the benchmarks. A CMP with four dual-issue CPUs achieves a speedup of 1.48 and 2.17 over one dual-issue CPU, using implicit-only and explicit-only threading, respectively. Multiplex matches or outperforms the better of the two threading models for every benchmark, and a four-CPU Multiplex achieves a speedup of 2.63. Our detailed analysis indicates that the dominant overheads in an implicitly-threaded CMP are speculation state overflow due to limited L1 cache capacity, and load imbalance and data dependences in fine-grain threads.

Patent
Hara Katsuhiko
12 Jul 2001
TL;DR: System Bus Bridge (SBB) as mentioned in this paper is a multichannel bidirectional bus bridge and provides a mutual connection between a B bus (I/O bus), a G bus (graphic bus), an SC bus (processor bus), and an MC bus (local bus) by using a crossbar switch.
Abstract: The invention relates to an image processing apparatus that has an image-data coding function and inputs and outputs images, and to a method for such an apparatus; its object is to perform coding of the image data simultaneously in parallel in both a CPU and a coder/decoder. A system bus bridge (SBB) is a multichannel bidirectional bus bridge and provides a mutual connection between a B bus (I/O bus), a G bus (graphic bus), an SC bus (processor bus), and an MC bus (local bus) by using a crossbar switch. The connections of two systems can be established simultaneously by the crossbar switch. A high-speed data transfer of high parallel performance can be realized among a CPU core, a CODEC, and a DRAM.

Proceedings ArticleDOI
16 Nov 2001
TL;DR: SimBed, an execution-driven simulation testbed that measures the execution behavior and power consumption of embedded applications and RTOSs by executing them on an accurate architectural model of a microcontroller with simulated real-time stimuli is presented.
Abstract: This paper presents the modeling of embedded systems with SimBed, an execution-driven simulation testbed that measures the execution behavior and power consumption of embedded applications and RTOSs by executing them on an accurate architectural model of a microcontroller with simulated real-time stimuli. We briefly describe the simulation environment and present a study that compares three RTOSs: μC/OS-II, a popular public-domain embedded real-time operating system; Echidna, a sophisticated, industrial-strength (commercial) RTOS; and NOS, a bare-bones multi-rate task scheduler reminiscent of typical "roll-your-own" RTOSs found in many commercial embedded systems. The microcontroller simulated in this study is the Motorola M-CORE processor: a low-power, 32-bit CPU core with 16-bit instructions, running at 20 MHz.

Journal ArticleDOI
TL;DR: Data cache and direct memory access address the challenge of transferring data between off- and on-chip memories without slowing down the core processor's performance.
Abstract: Mediaprocessors provide high performance by using both instruction- and data-level parallelism. Because of the increased computing power, transferring data between off- and on-chip memories without slowing down the core processor's performance is challenging. Two methods, data cache and direct memory access, address this problem in different ways.

Patent
22 Jun 2001
TL;DR: In this article, a system and method are presented for enabling multithreading in an embedded processor, invoking zero-time context switching in a multithreading environment, scheduling multiple threads to permit numerous hard real-time and non-real-time priority levels, fetching data and instructions from multiple memory blocks in a multithreading environment, and enabling a particular thread to modify the multiple states of the multiple threads in the processor core.

Abstract: A system and method for enabling multithreading in an embedded processor, invoking zero-time context switching in a multithreading environment, scheduling multiple threads to permit numerous hard real-time and non-real-time priority levels, fetching data and instructions from multiple memory blocks in a multithreading environment, and enabling a particular thread to modify the multiple states of the multiple threads in the processor core.

Patent
05 Dec 2001
TL;DR: In this article, a semiconductor integrated circuit is provided with an instruction memory, an instruction storage portion that stores reserved instructions as F instructions and stores processing contents substantially equivalent to the F instructions as substitute instructions for processing by the CPU, a pre-fetch portion, a history storage portion, a diagnosing portion for diagnosing the types of instructions, and a reprogramming control portion for reprogramming the instructions.
Abstract: A semiconductor integrated circuit can send and receive signals to and from a configuration memory. The semiconductor integrated circuit is provided therein with an instruction memory, an instruction storage portion that stores reserved instructions as F instructions and stores processing contents substantially equivalent to the F instructions as substitute instructions for processing by the CPU, a pre-fetch portion, a history storage portion, a diagnosing portion for diagnosing the types of instructions, a reprogramming control portion for reprogramming the instructions, a CPU, an FPGA, a configuration data memory, a built-in memory, and a configuration data tag. When the configuration data of an F instruction does not exist in the FPGA, processing substantially equivalent to that of the FPGA is executed by the CPU by making use of the substitute instructions.

Patent
27 Jul 2001
TL;DR: In this paper, the address depth of global block RAMs and the number of wait states can be selected by a user, and they can be set either prior to configuration of the FPGA or programmed using instructions of the processor core.
Abstract: A data processing system having a user configurable memory controller, one or more local block RAMs, one or more global block RAMs and a processor core can be configured in a single field programmable gate array (FPGA). The address depth of the global block RAMs and the number of wait states can be selected by a user, and they can be set either prior to configuration of the FPGA or programmed using instructions of the processor core. The number of wait states of the local block RAM is also user selectable. An algorithm that can optimize the address depth and the number of wait states to achieve a performance level is also disclosed. The present invention can be applied to designs having separate instruction and data sides.

Patent
12 Nov 2001
TL;DR: In this article, a multi-core digital signal processor with a shared program memory (132), an emulation logic module (141), and multiple processor cores (11, 21) are disclosed.
Abstract: A multi-core digital signal processor is disclosed having a shared program memory (132) with conditional write protection. In one embodiment, the digital signal processor includes a shared program memory (132), an emulation logic module (141), and multiple processor cores (11, 21) each coupled to the shared program memory (132) by corresponding instruction buses (P1, P2). The emulation logic module (141) preferably determines the operating modes of each of the processors, e.g., whether they are operating in a normal mode or an emulation mode. In the emulation mode, the emulation logic can alter the states of various processor hardware and the contents of various registers and memory. The instruction buses (P1, P2) each include a read/write signal that, while their corresponding processor cores (11, 21) are in normal mode, is maintained in a read state. On the other hand, when the processor cores (11, 21) are in the emulation mode, the processor cores (11, 21) are allowed to determine the state of the instruction bus read/write signals. Each instruction bus read/write signal is preferably generated by a logic gate that prevents the processor core (11, 21) from affecting the read/write signal value in normal mode, but allows the processor core to determine the read/write signal value in emulation mode. In this manner, the logic gate prevents write operations to the shared program memory (132) when the emulation logic (141) de-asserts a signal indicative of emulation mode, and allows write operations to the shared program memory (132) when the emulation logic (141) asserts the signal indicative of emulation mode. The logic gate is preferably included in a bus interface module (31) in each processor core (11, 21).

Patent
28 Mar 2001
TL;DR: In this article, a performance monitor system includes a core processor (115), a core processor associated device such as a cache (123), and first logic (127) coupled to the device (123) that monitors a first signal (CACHE_PERF) in response to a second signal (WPT0,1).
Abstract: A performance monitor system includes a core processor (115), a core processor associated device, such as a cache (123), and first logic, such as performance logic (127). The core processor (115) is operable to execute information. The core processor associated device provides a first signal (CACHE_PERF), which defines performance of the core processor associated device (123) during operation of the core processor (115). The first logic (127) is coupled to the core processor associated device (123) and monitors the first signal (CACHE_PERF) in response to a second signal (WPT0,1), which defines a match of user-settable attributes associated with the operation of the core processor (115).

Patent
25 Jun 2001
TL;DR: In this paper, a data processing system that supports execution of both native instructions using a processor core and non-native instructions that are interpreted using either a hardware translator or a software interpreter is presented.
Abstract: A data processing system 118 is provided that supports execution of both native instructions using a processor core and non-native instructions that are interpreted using either a hardware translator 122 or a software interpreter. Separate explicit return to non-native instructions and return to native instructions are provided for terminating subroutines whereby intercalling between native and non-native code may be achieved with reduced processing overhead. Veneer non-native subroutines may be used between native code and non-native main subroutines. The veneer non-native subroutines may be dynamically created within the stack memory region of the native mode system.

Proceedings ArticleDOI
30 Sep 2001
TL;DR: An application specific multiprocessor system for SAT, utilizing the most recent results such as the development of highly efficient sequential SAT algorithms, the emergence of commercial configurable processor cores and the rapid progress in IC manufacturing techniques is presented.
Abstract: This paper presents our work in developing an application specific multiprocessor system for SAT, utilizing the most recent results such as the development of highly efficient sequential SAT algorithms, the emergence of commercial configurable processor cores and the rapid progress in IC manufacturing techniques. Based on an analysis of the basic SAT search algorithm, we propose a new parallel SAT algorithm that utilizes fine grain parallelism. This is then used to design a multiprocessor architecture in which each processing node consists of a processor and a communication assist node that deals with message processing. Each processor is an application specific processor built from a commercial configurable processor core. All the system configurations are determined based on the characteristics of SAT algorithms, and are supported by simulation results. While this hardware accelerator system does not change the inherent intractability of the SAT problems, it achieves a 30-60x speedup over and above the fastest known SAT solver - Chaff. We believe that this system can be used to expand the practical applicability of SAT in all its application areas.

Proceedings ArticleDOI
17 Dec 2001
TL;DR: An integrity checking architecture for superscalar processors that can achieve fault tolerance capability of a duplex system at much less cost than the traditional duplication approach is proposed.
Abstract: The paper proposes an integrity checking architecture for superscalar processors that can achieve the fault-tolerance capability of a duplex system at much less cost than the traditional duplication approach. The pipeline of the CPU core (P-pipeline) is combined in series with another pipeline (V-pipeline), which re-executes instructions processed in the P-pipeline. Operations in the two pipelines are compared and any mismatch triggers the recovery process. The V-pipeline design is based on replication of the P-pipeline, and is minimized in size and functionality by taking advantage of control flow and data dependency resolved in the P-pipeline. Idle cycles propagated from the P-pipeline become extra time for the V-pipeline to keep up with program re-execution. For a large-scale superscalar processor, the proposed architecture can bring up to a 61.4% reduction in die area while the average execution-time increase is 0.3%.

Patent
21 Mar 2001
TL;DR: A low power reconfigurable processor core includes one or more processing units, each unit having a clock input that controls the performance of the unit; a high-density memory array core coupled to the processing units.
Abstract: A low power reconfigurable processor core includes one or more processing units, each unit having a clock input that controls the performance of the unit; one or more clock controllers having clock outputs coupled to the clock inputs of the processing units, the controllers operating to vary the clock frequency of each processing unit to optimize speed and processing power for a task; and a high-density memory array core coupled to the processing units.

Journal ArticleDOI
TL;DR: A dynamic instruction steering logic for clustered architectures that decides at decode time the cluster where each instruction is executed, achieving an average speed-up of 35% over a conventional 8-way issue (4 int + 4 fp) machine and outperforming other previous proposals, either static or dynamic.

Abstract: Recent works show that delays introduced in the issue and bypass logic will become critical for wide-issue superscalar processors. One of the proposed solutions is clustering the processor core. Clustered architectures benefit from a less complex, partitioned processor core and thus incur less critical delays. In this paper, we propose a dynamic instruction steering logic for these clustered architectures that decides at decode time the cluster where each instruction is executed. The performance of clustered architectures depends on the inter-cluster communication overhead and the workload balance. We present a scheme that uses runtime information to optimize the trade-off between these figures. The evaluation shows that this scheme can achieve an average speed-up of 35% over a conventional 8-way issue (4 int + 4 fp) machine and that it outperforms other previous proposals, either static or dynamic.
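The steering trade-off described above (inter-cluster communication overhead versus workload balance) can be sketched as a simple decode-time policy. The sketch below is a hedged illustration in the general spirit of such schemes, not the paper's actual logic; the imbalance threshold, the voting rule, and all data structures are invented for illustration.

```python
# Illustrative dynamic steering policy: prefer the cluster that
# produces most of an instruction's source operands (fewer
# inter-cluster communications), unless the clusters' workloads are
# too imbalanced, in which case rebalance.  Purely a sketch.

def steer(sources, producer_cluster, load, imbalance_limit=4):
    """Pick a cluster for one decoded instruction.

    sources          : registers read by the instruction
    producer_cluster : dict mapping register -> producing cluster
    load             : pending-instruction count per cluster
    """
    if max(load) - min(load) > imbalance_limit:
        return load.index(min(load))       # rebalance the workload
    votes = [0] * len(load)
    for reg in sources:
        cluster = producer_cluster.get(reg)
        if cluster is not None:
            votes[cluster] += 1            # avoid inter-cluster bypasses
    # Most operand votes wins; ties broken toward the lighter cluster.
    return max(range(len(load)), key=lambda c: (votes[c], -load[c]))
```

Using runtime information in this way lets the steering logic favor locality when the clusters are balanced, and fall back to load balancing when they drift apart, which is the trade-off the abstract highlights.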

Proceedings ArticleDOI
04 Nov 2001
TL;DR: The heart of the proposed design exploration framework is a two-level simulation engine that combines detailed simulation for critical portions of the code with fast profiling for the rest, which is completely general and applicable to any microarchitectural power/performance simulation engine.
Abstract: This paper presents an efficient design exploration environment for high-end core processors. The heart of the proposed design exploration framework is a two-level simulation engine that combines detailed simulation for critical portions of the code with fast profiling for the rest. Our two-level simulation methodology relies on the inherent clustered structure of application programs and is completely general and applicable to any microarchitectural power/performance simulation engine. The proposed simulation methodology is 3-17× faster, while being sufficiently accurate (within 5%) when compared to the fully detailed simulator. The design exploration environment is able to vary different microarchitectural configurations and find the optimal one as far as the energy×delay product is concerned in a matter of minutes. The parameters that are found to affect the core processor power/performance metrics most drastically are issue width, instruction window size, and pipeline depth, along with correlated clock frequency. For very high-end configurations for which balanced pipelining may not be possible, opportunities for running faster stages at lower voltage exist. In such cases, by using up to 3 voltage levels, the energy×delay product is reduced by 23-30% when compared to the single-voltage implementation.
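The two-level idea above (detailed simulation for critical code, fast profiling for the rest) can be rendered as a toy estimator. This is an assumed, simplified sketch: the phase labels, the per-phase CPI model, and the function names are invented for illustration and are not the paper's engine.

```python
# Sketch of two-level simulation: simulate one representative interval
# per program phase in detail to learn its CPI, then fast-profile all
# intervals and extrapolate cycles from the learned per-phase CPI.

def two_level_estimate(intervals, detailed_sim, fast_profile):
    """Estimate total cycles for a run split into (phase, data) intervals.

    detailed_sim(data) -> CPI (slow; called once per distinct phase)
    fast_profile(data) -> instruction count (cheap; called everywhere)
    """
    phase_cpi = {}
    total_cycles = 0
    for phase, data in intervals:
        if phase not in phase_cpi:
            phase_cpi[phase] = detailed_sim(data)   # slow path, once
        total_cycles += fast_profile(data) * phase_cpi[phase]
    return total_cycles
```

Because the slow simulator runs only once per distinct phase while every interval is merely profiled, the speedup grows with how strongly the program's behavior clusters into repeating phases, which is the property the methodology relies on.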

Patent
01 Mar 2001
TL;DR: In this article, a method of deallocating multiple processor cores sharing a failing bank of memory is proposed, which allows new multiple-processor integrated circuits with on-chip shared memory to be de-allocated using existing technology designed for use with single-processor integrated circuit technology.
Abstract: A method of de-allocating multiple processor cores sharing a failing bank of memory is disclosed. The method allows new multiple-processor integrated circuits with on-chip shared memory to be de-allocated using existing technology designed for use with single-processor integrated circuit technology.

Journal ArticleDOI
01 Sep 2001
TL;DR: The mAgic-FPU core architecture satisfies the requirement of portability among silicon foundries and fits the requirements of 'Smart Antenna for Adaptive Beam-Forming processing' and 'Physical Sound Synthesis'.
Abstract: mAgic-FPU is the architecture of a family of VLIW cores for configurable system-level integration of floating- and fixed-point computing power. mAgic customization permits the designer to tune basic parameters, such as the computing power/memory access ratio of the core processor, the number of arithmetic operations available per cycle, and the register file size and number of ports, as well as the number of arithmetic operators. The reconfiguration (e.g., of register file size and number of ports, as well as of the number of arithmetic operators) is supported by the software environment MADE (Modular VLIW processor Architecture and Assembler Description Environment). MADE reads an architecture description file and produces a customized assembler-scheduler for the target VLIW architecture, configuring a general-purpose VLIW optimizer-scheduler engine. The mAgic-FPU core architecture satisfies the requirement of portability among silicon foundries. The first members of the mAgic-FPU core family fit the requirements of 'Smart Antenna for Adaptive Beam-Forming processing' and 'Physical Sound Synthesis'. The first 1 GigaFlops mAgic core will run at 100 MHz within an area of 40 mm² in 0.25 μm ATMEL CMOS technology in the first half of 2002.

Patent
22 Aug 2001
TL;DR: In this article, the virtual machine interpreter identifies an initial virtual machine instruction from a body of virtual machine instructions, where the body is expected to be executed repeatedly and writes native instructions for the body into the memory from said memory location.
Abstract: A data processing system has a processor core, memory and a virtual machine interpreter. The virtual machine interpreter receives virtual machine instructions selected dependent on program flow during execution of a virtual machine program. The virtual machine interpreter generates native machine instructions that implement the virtual machine instructions for execution by the processor core. The virtual machine interpreter identifies an initial virtual machine instruction from a body of virtual machine instructions, where the body is expected to be executed repeatedly. The virtual machine interpreter records a correspondence between the initial virtual machine instruction in the body and a memory location in the memory, and writes native instructions for the body into the memory from said memory location. The processor core executes the native instructions for the body and, on subsequent iterations, repeats execution by running the written native machine instructions from memory starting at said memory location.
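The mechanism resembles a simple translation cache: interpret a loop body once, record a translation keyed to the body's start, and dispatch to the cached form on later iterations. As an analogy only (the bytecode, the Translation record, and the use of a summed delta standing in for real generated native code are all invented here), the dispatch structure might look like:

```c
enum { OP_ADD, OP_END };
typedef struct { int op; int arg; } VmInsn;

/* Stand-in for recorded native code: just the loop body's net effect. */
typedef struct { int valid; long delta; } Translation;

/* Interpret the body on the first iteration, record a "translation",
 * then reuse it on every later iteration instead of re-interpreting. */
static long run_loop(const VmInsn *body, int reps) {
    Translation t = {0, 0};
    long acc = 0;
    for (int r = 0; r < reps; r++) {
        if (t.valid) { acc += t.delta; continue; }  /* cached fast path */
        long delta = 0;
        for (const VmInsn *p = body; p->op != OP_END; p++)
            delta += p->arg;                         /* interpret once */
        t.valid = 1;
        t.delta = delta;                             /* record correspondence */
        acc += delta;
    }
    return acc;
}
```

In the patent the cached artifact is real native machine code written to a recorded memory location and executed directly by the processor core; the cached delta above only mimics the interpret-once / reuse-thereafter control flow.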

Patent
23 Feb 2001
TL;DR: In this article, an application specific signal processor (ASSP) performs vectorized and nonvectorized operations using a saturated multiplication and accumulation operation, which is used in telecommunication interface devices such as a gateway.
Abstract: An application specific signal processor (ASSP) performs vectorized and nonvectorized operations. Nonvectorized operations may be performed using a saturated multiplication and accumulation operation. The ASSP includes a serial interface, a buffer memory, and a core processor for performing digital signal processing, which comprises a reduced instruction set computer (RISC) processor and four signal processing units. The four signal processing units execute the digital signal processing algorithms in parallel, including execution of the saturated multiplication and accumulation operation. The ASSP is utilized in telecommunication interface devices such as a gateway, and is well suited to handling voice and data compression/decompression in telecommunication systems where a packetized network is used to transceive packetized data and voice.
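A saturated multiply-accumulate clamps its result at the accumulator's range instead of wrapping on overflow, which avoids the gross distortion that wraparound causes in voice processing. The exact ASSP operand widths are not given in the abstract; the 16 × 16 → 32-bit sketch below is a common DSP convention, assumed here for illustration.

```c
#include <stdint.h>

/* Saturated MAC: acc + a*b, clamped to the int32_t range.
 * Operand widths are assumed, not taken from the patent. */
static int32_t sat_mac(int32_t acc, int16_t a, int16_t b) {
    int64_t r = (int64_t)acc + (int64_t)a * (int64_t)b;  /* widen first */
    if (r > INT32_MAX) return INT32_MAX;  /* clamp positive overflow */
    if (r < INT32_MIN) return INT32_MIN;  /* clamp negative overflow */
    return (int32_t)r;
}
```

Computing in a widened 64-bit intermediate before clamping is the usual way to make the saturation test itself overflow-free.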

Proceedings ArticleDOI
29 Mar 2001
TL;DR: Experimental results show that, for testing interconnects between a processor core and any other on-chip core, a 3 K-byte program is sufficient to achieve the complete coverage for crosstalk-induced logical and delay faults.
Abstract: Crosstalk effects on long interconnects are becoming significant for high-speed circuits. This paper addresses the problem of testing crosstalk-induced faults on on-chip buses in system-on-a-chip (SOC) designs. We propose a method to self-test on-chip buses at-speed, by executing an automatically synthesized program using on-chip processor cores. The test program, executed at system operational speed, can activate and capture the worst-case crosstalk effects on buses and achieve complete coverage of crosstalk-induced logical and delay faults. This paper discusses the method and the framework for synthesizing such a test program. Based on the bus protocol, the instruction set architecture of an on-chip processor core, and the system specification, the method generates deterministic tests in the form of instruction sequences. The synthesized test program is highly modularized and compact. The experimental results show that, for testing interconnects between a processor core and any other on-chip core, a 3-Kbyte program is sufficient to achieve complete coverage of crosstalk-induced logical and delay faults.
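For a single victim line, a worst-case coupling stimulus is commonly two consecutive bus words in which the victim transitions one way while every aggressor transitions the other. The helper below builds such a pair for an assumed 32-bit bus; it only illustrates the shape of the test vectors, not the paper's instruction-sequence synthesis that actually places them on the bus.

```c
#include <stdint.h>

/* Two consecutive 32-bit bus words: the victim bit falls 1->0 while all
 * aggressor bits rise 0->1, maximizing opposing-transition coupling on
 * the victim (the symmetric rising-victim case just swaps the words). */
static void worst_case_pair(int victim, uint32_t *first, uint32_t *second) {
    uint32_t vmask = 1u << victim;
    *first  = vmask;    /* victim high, all aggressors low */
    *second = ~vmask;   /* victim falls, all aggressors rise */
}
```

Covering every line then takes one such pair (plus its mirror) per bus bit, which is consistent with the compact, modular test programs the paper reports.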