
Showing papers on "Multi-core processor published in 1996"


Proceedings ArticleDOI
Chow
17 Apr 1996
TL;DR: A processor architecture called OneChip, which combines a fixed-logic processor core with reconfigurable logic resources tightly integrated into a MIPS-like processor, eliminating the shortcomings of other custom compute machines.
Abstract: This paper describes a processor architecture called OneChip, which combines a fixed-logic processor core with reconfigurable logic resources. Using the programmable components of this architecture, the performance of speed-critical applications can be improved by customizing OneChip's execution units, or flexibility can be added to the glue logic interfaces of embedded controller applications. OneChip eliminates the shortcomings of other custom compute machines by tightly integrating its reconfigurable resources into a MIPS-like processor. Speedups of close to 50 over strict software implementations on a MIPS R4400 are achievable for computing the DCT.

306 citations


Proceedings Article
01 Jan 1996
TL;DR: OneChip as discussed by the authors is a processor architecture that combines a fixed-logic processor core and reconfigurable logic resources, which can be used to improve the performance of speed-critical applications by customizing OneChip's execution units, or to add flexibility to the glue logic interfaces of embedded controller type applications.
Abstract: This thesis describes a processor architecture called OneChip, which combines a fixed logic processor core and reconfigurable logic resources. Using the variable components of this architecture, the performance of speed-critical applications can be improved by customizing OneChip’s execution units or flexibility can be added to the glue logic interfaces of embedded controller type applications. This work eliminates the shortcomings of other custom compute machines by tightly integrating the reconfigurable resources into a MIPS-like processor. The details of the core processor, the fixed to reconfigurable logic interface and the actual reconfigurable structures are described. To study OneChip’s feasibility, a 32-bit processor as well as several performance enhancement and embedded controller type applications are implemented on the Transmogrifier-1 field programmable system. It is shown that application speedups of over 40 are achievable. However, the design flexibility introduced with the use of less dense, reconfigurable structures carries an area penalty of no less than 3.5 times the size of the custom silicon design implementation.

298 citations


Patent
25 Oct 1996
TL;DR: In this article, a debug buffer is used as a video FIFO for buffering pixels for display on a monitor and a dedicated bus is connected to an external DAC rather than to the external ICE when debugging is not being performed.
Abstract: A microprocessor die contains several processor cores and a shared cache. Trigger conditions for one or more of the processor cores are programmed into debug registers. When a trigger is detected, a trace record is generated and loaded into a debug queue on the microprocessor die. Several trace records from different processor cores can be rapidly generated and loaded into the debug queue. The external interface cannot transfer these trace records to an external in-circuit emulator (ICE) at the rate generated. The debug queue transfers trace records to the external ICE using a dedicated bus to the ICE so that bandwidth is not taken from the memory bus. The memory bus is not slowed for debugging, providing a more realistic debugging session. The debug buffer is also used as a video FIFO for buffering pixels for display on a monitor. The dedicated bus is connected to an external DAC rather than to the external ICE when debugging is not being performed.

143 citations
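The central mechanism in this patent is a small on-die FIFO that decouples the rate at which trace records are produced by the cores from the rate at which the dedicated debug bus can drain them. The following is a minimal C sketch of that queue discipline; the record layout, queue depth, and two-core scenario are assumptions for illustration, not details taken from the patent.

```c
/* Minimal sketch of an on-die debug trace queue (ring-buffer FIFO).
 * Record layout and sizes are illustrative, not from the patent. */
#include <stdint.h>
#include <stdio.h>

#define QUEUE_DEPTH 16          /* assumed depth of the on-die debug queue */

typedef struct {
    uint8_t  core_id;           /* which processor core hit the trigger */
    uint32_t pc;                /* program counter at the trigger point */
    uint32_t data;              /* data value captured with the record  */
} trace_record;

static trace_record queue[QUEUE_DEPTH];
static unsigned head, tail;     /* head: next drain slot, tail: next fill slot */

/* Core side: push a record when a debug trigger fires (drops when full). */
static int trace_push(trace_record r)
{
    unsigned next = (tail + 1) % QUEUE_DEPTH;
    if (next == head)
        return -1;              /* queue full: record lost (or core stalled) */
    queue[tail] = r;
    tail = next;
    return 0;
}

/* ICE side: drain one record per cycle of the dedicated debug bus. */
static int trace_drain(trace_record *out)
{
    if (head == tail)
        return -1;              /* nothing buffered */
    *out = queue[head];
    head = (head + 1) % QUEUE_DEPTH;
    return 0;
}

int main(void)
{
    /* Two cores generate records faster than the debug bus drains them. */
    for (unsigned cycle = 0; cycle < 8; cycle++) {
        trace_push((trace_record){ .core_id = 0, .pc = 0x1000 + cycle * 4 });
        trace_push((trace_record){ .core_id = 1, .pc = 0x2000 + cycle * 4 });

        trace_record r;          /* only one record leaves the die per cycle */
        if (trace_drain(&r) == 0)
            printf("cycle %u: core %u pc=0x%X\n", cycle, r.core_id, (unsigned)r.pc);
    }
    return 0;
}
```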


Dissertation
01 Jan 1996
TL;DR: This thesis presents techniques for code generation and optimization targeting embedded digital signal processors; these techniques have proven effective in improving the performance and reducing the size of compiled software.
Abstract: The advent of deep submicron processing technology has made it possible and desirable to integrate a processor core, a program ROM, and application-specific circuitry all on a single IC. As the complexity of embedded software grows, high-level languages such as C and C++ are increasingly employed in writing embedded software. Consequently, high-level language compilers have become an essential tool in the development of embedded systems. Fixed-point digital signal processors are among the most commonly embedded cores, due to their favorable performance–cost characteristics. However, these architectures are usually designed and optimized for their application domain, and pose challenges for compiler technology. Traditional compiler optimizations, though necessary, are insufficient for generating efficient and compact code. Therefore, new optimizations are required to produce code of the highest quality in a reasonable amount of time. In this thesis, the author presents techniques for code generation and optimization that target embedded digital signal processors. These techniques have proven to be effective in improving the performance and reducing the size of compiled software. This thesis emphasizes optimization techniques; only by gaining a deeper understanding of the problems involved can we then apply them to a wider class of architectures. Keywords—compiler optimizations, digital signal processors, embedded systems. Thesis Supervisor: Srinivas Devadas Title: Associate Professor of Electrical Engineering and Computer Science

93 citations


Patent
David R. Evoy
27 Nov 1996
TL;DR: In this article, a translating circuit coupled to a processor and memory of a computer system translates platform-independent instructions such as Java bytecodes into corresponding native instructions for execution by the processor.
Abstract: A translating circuit coupled to a processor and memory of a computer system translates platform-independent instructions such as Java bytecodes into corresponding native instructions for execution by the processor. In one embodiment, the translating circuit is incorporated into the same integrated circuit device as the processor. In another embodiment, the translating circuit is provided within one or more external integrated circuit devices. One or more look-up tables map platform-independent instructions into one or more native instructions for the processor, thereby minimizing software-based interpretation of platform-independent program code. Moreover, platform-independent instructions are mapped to native instructions on-the-fly, or alternatively, in blocks prior to execution using a state machine.

84 citations
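The look-up-table mapping described in this patent amounts to indexing a table of short native instruction sequences by the platform-independent opcode. Below is a minimal, hedged C sketch of that idea; the opcodes, "native" encodings, and table contents are invented for illustration and are not the patent's actual mappings.

```c
/* Hedged sketch of table-driven bytecode-to-native translation.
 * Opcode values and native encodings are invented for illustration. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t native[2];   /* up to two native instruction words */
    uint8_t  count;       /* how many of them are valid          */
} native_seq;

/* One table entry per bytecode value (only a few shown). */
static const native_seq xlat[256] = {
    [0x60] = { { 0xE0800001 }, 1 },               /* hypothetical "iadd"     */
    [0x04] = { { 0xE3A00001 }, 1 },               /* hypothetical "iconst_1" */
    [0x3C] = { { 0xE58D0004, 0xE1A00000 }, 2 },   /* hypothetical "istore_1" */
};

/* Translate a bytecode stream on the fly into native instruction words. */
static size_t translate(const uint8_t *bc, size_t n, uint32_t *out)
{
    size_t emitted = 0;
    for (size_t i = 0; i < n; i++) {
        const native_seq *s = &xlat[bc[i]];
        for (uint8_t j = 0; j < s->count; j++)
            out[emitted++] = s->native[j];
        /* Unmapped opcodes would trap to a software interpreter here. */
    }
    return emitted;
}

int main(void)
{
    const uint8_t bytecode[] = { 0x04, 0x04, 0x60, 0x3C };
    uint32_t native[16];
    size_t n = translate(bytecode, sizeof bytecode, native);
    for (size_t i = 0; i < n; i++)
        printf("0x%08X\n", (unsigned)native[i]);
    return 0;
}
```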


Proceedings ArticleDOI
17 Apr 1996
TL;DR: Analysis shows that integrated FPGA arrays are suitable as coprocessor platforms for algorithms that require only limited numbers of multiplication instructions, and that inherent FPGA characteristics limit the data-path widths that can be supported efficiently for these applications.
Abstract: The paper examines the viability of using integrated programmable logic as a coprocessor to support a host CPU core. This adaptive coprocessor is compared to a VLIW machine in terms of both die area occupied and performance. The parametric bounds necessary to justify the adoption of an FPGA-based coprocessor are established. An abstract field programmable gate array model is used to investigate the area and delay characteristics of arithmetic circuits implemented on FPGA architectures to determine the potential speedup of FPGA-based coprocessors. Analysis shows that integrated FPGA arrays are suitable as coprocessor platforms for realising algorithms that require only limited numbers of multiplication instructions. Inherent FPGA characteristics limit the data-path widths that can be supported efficiently for these applications. An FPGA-based adaptive coprocessor requires a large minimum die area before any advantage over a VLIW machine of a comparable size can be realised.

72 citations


Book ChapterDOI
01 Jan 1996
TL;DR: The advent of 0.5μ processing that allows for the integration of 5 million transistors on a single integrated circuit has brought forth new challenges and opportunities in embedded-system design.
Abstract: The advent of 0.5μ processing that allows for the integration of 5 million transistors on a single integrated circuit has brought forth new challenges and opportunities in embedded-system design. This high level of integration makes it possible and desirable to integrate a processor core, a program ROM, and an ASIC together on a single IC. To justify the design costs of such an IC, these embedded-system designs must be sold in large volumes and, as a result, they are very cost-sensitive. The cost of an IC is most closely linked to its size, which is derived from the final circuit area. It is not unusual for the ROM that stores the program code to be the largest contributor to the area of such ICs. Thus the incremental value of using logic optimization to reduce the size of the ASIC is smaller because the ASIC takes up a relatively smaller percentage of the final circuit area. On the other hand, the potential for cost reduction through diminishing the size of the program ROM is great. There are also often strong real-time performance requirements on the final code; hence, there is a necessity for producing high-performance code as well.

58 citations


Patent
29 Jul 1996
TL;DR: In this article, a monolithic digital signal processor includes a core processor for performing digital signal computations, an I/O processor for controlling external access to and from the signal processor through an external port, first and second memory banks for storing instructions and data for the digital signal computation, and first and two buses interconnecting the core processor, the I /O processor and the memory banks.
Abstract: A monolithic digital signal processor includes a core processor for performing digital signal computations, an I/O processor for controlling external access to and from the digital signal processor through an external port, first and second memory banks for storing instructions and data for the digital signal computations, and first and second buses interconnecting the core processor, the I/O processor and the memory banks. The core processor and the I/O processor access the memory banks on the first bus without interference on different clock phases of a clock cycle. The internal memory and the I/O processor of the digital signal processor are assigned to a region of a global memory space, which facilitates multiprocessing configurations. In a multiprocessor system, each digital signal processor is assigned a processor ID. The digital signal processor includes a bus arbitration circuit for controlling access to an external bus through the external port. The digital signal processor may include one or more serial ports and one or more link ports for point-to-point communication with external devices. A DMA controller controls DMA transfers through the external port, the serial ports and the link ports.

51 citations


Patent
16 May 1996
TL;DR: In this paper, a distributed bus access and control arbitration is proposed for In-Circuit Emulation (ICE) in an integrated circuit (IC), which includes multiple circuits and functions which share multiple internal signal buses, three physical and five logical.
Abstract: An integrated circuit (IC) includes multiple circuits and functions which share multiple internal signal buses, three physical and five logical, according to distributed bus access and control arbitration. The multiple internal signal buses are shared among three tiers of internal circuit functions: a central processing unit and a DMA controller; a DRAM controller and a bus interface unit; and peripheral interface circuits, such as PCMCIA and display controllers. Two of the physical buses correspond to two of the logical buses and are used for communications within the IC. The third physical bus corresponds to three of the logical buses and is used for communications between the IC and circuits external to the IC. Arbitration for accessing and controlling the various signal buses is distributed both within and among the three tiers of internal circuit functions. Maximum performance is thereby achieved from the circuit functions accessed most frequently, while still achieving high performance from those circuit functions accessed less frequently. The IC may be provided with a processor core with features that support In-Circuit Emulation (ICE).

38 citations


Patent
03 May 1996
TL;DR: In this paper, the authors present a method and apparatus for designing re-useable interfacing logic hardware shells which provide interface functions between a hardware core and one or more busses.
Abstract: Disclosed is a method and apparatus for designing re-useable interfacing logic hardware shells which provide interface functions between a hardware core and one or more busses. An interface logic hardware shell provides previously characterized, tested and implemented interface logic designs for use in future applications with little or no redesign. The hardware circuitry (cells) of which such shells are comprised includes circuitry for bus interface units, memory interface units, buffers, and bus protocol logic. The cores for which the shells provide interface functions include CPU cores, memory cores, digital video decoding cores, digital audio decoding cores, ATM cores, Ethernet cores, JPEG cores and other data processing cores.

33 citations


Proceedings ArticleDOI
TL;DR: This paper outlines an FPGA-based reconfigurable processor architecture targeted to embedded DSP applications that consists of a high-gate-count FPGA multichip module supplemented with four dedicated floating point multipliers.
Abstract: Many DSP applications require dedicated hardware to achieve acceptable levels of performance. This is particularly true of real-time applications that have strict timeline requirements on processing throughput and latency. This paper outlines an FPGA-based reconfigurable processor architecture targeted to embedded DSP applications. The processor core consists of a high gate count FPGA multichip module (MCM) supplemented with four dedicated floating point multipliers. A dual port data memory provides a 480 Mbyte/sec channel to the processor and a 240 Mbyte/sec channel to the external interface. Coefficient memories are also included for static look-up table storage. A configuration bit stream loaded from non-volatile memory or an external source is used to program the FPGA.

Proceedings ArticleDOI
20 Sep 1996
TL;DR: A system to map hardware-software systems specified with statechart models on an ASIP architecture based on FPGAs that supports extended statecharts and assists designers during space/time tradeoff optimizations is described.
Abstract: In this paper, we describe a system to map hardware-software systems specified with statechart models on an ASIP architecture based on FPGAs. The architecture consists of a reusable CPU core with enhancements to execute the behavior of statecharts correctly. Our codesign system generates an application-specific hardware control block, an application-specific set of registers, and an instruction stream. The instruction stream consists of a static set of core instructions, and a set of custom instructions for performance enhancements. In contrast to previous approaches, the presented method supports extended statecharts. The system also assists designers during space/time tradeoff optimizations. The benefits of the approach are demonstrated with an industrial control application comparing two different timing schemes.

Patent
12 Dec 1996
TL;DR: In this paper, a dual processor computer system with error checking that stops immediately when a discrepancy is detected between the two processors is presented, where the I/O bus is independent of the processor type, clock rate and peripherals chosen in the construction of the computing system.
Abstract: A dual processor computer system with error checking that stops immediately when a discrepancy is detected between the two processors. The system includes a first processing system (20) for executing a series of instructions including input/output instructions. A second processor (30) executes the same instructions independently of and in synchronization with the first processing system. All significant processor address, data, and control signals are connected to all peripheral devices by a processor independent I/O bus (10). A comparison circuit (9) which detects discrepancies in the operation of the two lock step processing systems is connected between the processors and the processor independent I/O bus. The comparison circuit provides a signal that immediately stops operation of the processors when an error is detected. The I/O bus is independent of the processor type, clock rate, and peripherals chosen in the construction of the computing system. This independence allows the computing system to be upgraded with faster processors and newer peripherals, without having to redesign the computing platform and the error checking circuit.

Patent
05 Aug 1996
TL;DR: In this paper, a data processing system for decoding instructions in parallel in a superscalar, complex instruction set computing (CISC) computer is presented, where instruction information is fetched and decrypted in decrypter 30.
Abstract: A data processing system for decoding instructions in parallel in a superscalar, complex instruction set computing (CISC) computer. In a training mode of operation, an encrypter 29 encrypts preprocessed instructions retrieved from an instruction cache 26. In a processing mode of operation, instruction information is fetched and decrypted in decrypter 30. A prefetcher 21 separates the fetched instruction according to the decrypted boundary information. An instruction length verifier 25 verifies that the instructions were separated correctly and controls decoders 22a-c according to the verification. If the verification is correct for a given set of instructions, the system processes the instructions in parallel through the decoders to a dispatch logic circuit 23 and then to functional units 24. If the verification is incorrect, the related instructions may need to be decoded serially.

Patent
27 Mar 1996
TL;DR: In this paper, a general purpose (GP) central processing unit (CPU) is connected to the shared internal bus for retrieving GP instructions and the GP CPU includes an execution unit for executing GP instructions.
Abstract: An integrated data processing system includes a shared internal bus for transferring both instructions and data. A shared bus interface unit is connected to the shared internal bus and connectable via a shared external bus to a shared external memory array such that instructions and data held in the shared external memory array are transferrable to the shared internal bus via the shared bus interface unit. A general purpose (GP) central processing unit (CPU) is connected to the shared internal bus for retrieving GP instructions. The GP CPU includes an execution unit for executing GP instructions to process data retrieved by the GP CPU from the shared internal bus. A digital signal processor (DSP) module is connected to the shared internal bus and includes a signal processor for processing an externally-provided digital signal received by the DSP module by executing DSP command-list instructions. Execution of DSP command-list code instructions by the DSP module is independent of and in parallel with execution of GP instructions by the GP CPU. A shared internal memory holds command-list code instructions and is connected for access by the DSP module for retrieval of command-list code instructions for execution, and for access by the GP CPU for storage and retrieval of instructions and data.

01 Apr 1996
TL;DR: The UltraSPARC-I processor as mentioned in this paper implements a set of new instructions that accelerate image and video processing -the visual instruction set, or VIS, which addresses a number of areas in which traditional instructions perform poorly for these highly parallel tasks.
Abstract: The UltraSPARC-I processor implements, in addition to the SPARC v9 instruction set, a set of new instructions that accelerate image and video processing - the visual instruction set, or VIS. These instructions address a number of areas in which traditional instructions perform poorly for these highly parallel tasks. Although these instructions support a wide variety of functions, they represent far less implementation effort than that needed to design dedicated imaging hardware because they leverage the design efforts of the CPU and memory system, and will continue to provide performance improvements as the processor speed is increased. Unlike traditional CPU features, the performance benefits of such instructions have not been quantified. We attempt to demonstrate the performance effects of the VIS instructions in the context of typical image processing loops. For the greatest benefit, these instructions must be used with an eye to maximizing various forms of parallelism, including superscalar instruction issue, loop vectorization, and pipelining in both hardware and software. Currently much of this work must be done by hand. We propose some ways to automate portions of this process and describe some of the existing tools.
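As a concrete illustration of the kind of loop such instructions target, the following plain-C sketch emulates a partitioned 4x8-bit saturating add inside an image-brightening loop. It shows the style of data parallelism VIS exposes; it does not use the actual VIS intrinsics, and the emulation is an assumption for illustration.

```c
/* Plain-C emulation of a partitioned (SIMD-style) saturating add,
 * the kind of operation VIS provides; not the real VIS intrinsics. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Partitioned 4x8-bit saturating add across the lanes of a 32-bit word. */
static uint32_t padd8_sat(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        unsigned s = ((a >> (lane * 8)) & 0xFF) + ((b >> (lane * 8)) & 0xFF);
        if (s > 0xFF) s = 0xFF;                  /* saturate instead of wrapping */
        r |= s << (lane * 8);
    }
    return r;
}

/* Brighten an image: one packed operation covers four pixels per iteration. */
static void brighten(uint8_t *pix, size_t n, uint8_t delta)
{
    uint32_t d4 = delta * 0x01010101u;           /* replicate delta into lanes */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        uint32_t v;
        memcpy(&v, pix + i, 4);                  /* alignment-safe load  */
        v = padd8_sat(v, d4);
        memcpy(pix + i, &v, 4);                  /* alignment-safe store */
    }
    for (; i < n; i++)                           /* scalar tail */
        pix[i] = (pix[i] + delta > 0xFF) ? 0xFF : pix[i] + delta;
}

int main(void)
{
    uint8_t img[8] = { 10, 250, 100, 200, 1, 2, 3, 4 };
    brighten(img, 8, 20);
    for (int i = 0; i < 8; i++) printf("%u ", img[i]);
    printf("\n");
    return 0;
}
```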

01 Jan 1996
TL;DR: Using the programmable components of this architecture, the performance of speed-critical applications can be improved by customizing OneChip’s execution units, or flexibility can be added to the glue logic interfaces of embedded controller applications.
Abstract: This paper describes a processor architecture called OneChip, which combines a fixed-logic processor core with reconfigurable logic resources. Using the programmable components of this architecture, the performance of speed-critical applications can be improved by customizing OneChip's execution units, or flexibility can be added to the glue logic interfaces of embedded controller applications. OneChip eliminates the shortcomings of other custom compute machines by tightly integrating its reconfigurable resources into a MIPS-like processor. Speedups of close to 50 over strict software implementations on a MIPS R4400 are achievable for computing the DCT.

Patent
05 Jan 1996
TL;DR: In this article, the authors present an integrated data processing system that includes a general purpose (GP) CPU core for processing data in accordance with a GP instruction set and a digital signal processor (DSP) module.
Abstract: The present invention is directed to various features of an integrated data processing system that includes a general purpose (GP) CPU core for processing data in accordance with a GP instruction set and a digital signal processor (DSP) module for processing data in accordance with command-list code. The DSP module is operable to execute the command-list code independent of and in parallel with execution of the GP instruction set by the CPU core. The system also includes test hook functions for facilitating production testing of the system.

Proceedings ArticleDOI
21 Oct 1996
TL;DR: A low-power processor architecture dedicated to embedded application programs is described, based on an object code compression approach that unifies duplicated instructions in the embedded program and assigns a simple number to each distinct instruction.
Abstract: A low-power processor architecture dedicated to embedded application programs is described, based on an object code compression approach. This approach unifies duplicated instructions existing in the embedded program and assigns a simple number to each distinct instruction. An instruction decompressor is constructed in the embedded processor to regenerate the object code from the compressed object code (pseudo code) input. A single-chip implementation of this decompressor together with a processor core can effectively reduce the bandwidth required for the I/O interface. Experiments are applied to an embedded processor ARMG10 to demonstrate the practicability of the proposed approach.
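The compression scheme is essentially a dictionary of distinct instruction words plus a stream of small indices, with an on-chip decompressor expanding indices back into full words. The sketch below illustrates that idea in C under the assumptions of a byte-wide index and a toy instruction stream; the encodings are invented and are not the paper's.

```c
/* Minimal sketch of dictionary-style object code compression:
 * distinct instruction words get small indices; a "decompressor"
 * expands indices back to full words.  Values are illustrative. */
#include <stdint.h>
#include <stdio.h>

#define MAX_DISTINCT 256        /* assumed: indices fit in one byte */

static uint32_t dict[MAX_DISTINCT];
static unsigned dict_len;

/* Return the index of an instruction word, adding it if not yet seen. */
static uint8_t dict_index(uint32_t insn)
{
    for (unsigned i = 0; i < dict_len; i++)
        if (dict[i] == insn)
            return (uint8_t)i;
    dict[dict_len] = insn;
    return (uint8_t)dict_len++;
}

int main(void)
{
    /* Toy "program" with heavy repetition, as embedded code often has. */
    uint32_t program[] = { 0xE3A00000, 0xE2800001, 0xE3A00000,
                           0xE2800001, 0xE3A00000, 0xE12FFF1E };
    size_t n = sizeof program / sizeof program[0];

    /* Compress: one index byte per instruction instead of four bytes. */
    uint8_t compressed[16];
    for (size_t i = 0; i < n; i++)
        compressed[i] = dict_index(program[i]);

    printf("%zu words -> %u dictionary entries + %zu index bytes\n",
           n, dict_len, n);

    /* Decompressor (on-chip in the paper): index back into full words. */
    for (size_t i = 0; i < n; i++)
        printf("0x%08X\n", (unsigned)dict[compressed[i]]);
    return 0;
}
```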

Proceedings ArticleDOI
18 Nov 1996
TL;DR: A systematic method is proposed which synthesizes the data path and control path of a CPU core from the instruction sequence compiled and translated from a C language description of the target algorithm in a hardware-software codesign environment; the method supports extended design space exploration and assists designers during space/time tradeoff optimizations.
Abstract: We propose a systematic method which synthesizes the data path and control path of a CPU core from the instruction sequence compiled and translated from a C language description of the target algorithm in a hardware-software codesign environment. We use a graphical representation method to describe instructions at the register transfer level. To explore the design space more broadly, we apply synthesis parameters selectively, which change the architecture of the data path. The number of data transfer paths is reduced by replacing rarely used paths with their bypass routes. To select the best among the candidate CPU cores, the data path cost and control path cost are evaluated together.

Patent
04 Jun 1996
TL;DR: In this paper, the cache controller stores, with each line of instructions in the cache memory, an indication of whether the line contains non-sequential instructions; if it does not, the following line is prefetched when one of the instructions in the line is requested by the processor core.
Abstract: An apparatus and method for reducing the time required to supply a processor core with instructions uses a cache memory, a cache controller, and an instruction predecoding unit. When a line of instructions is retrieved into the cache memory, the instruction predecoding unit inspects the instructions in the line to determine if the line contains any non-sequential instructions. The cache controller stores an indication of whether the line contains non-sequential instructions with the line of instructions in the cache memory. If a given line of instructions does not contain any non-sequential instructions, the line of instructions following the given line is retrieved into the cache memory when one of the instructions in the given line is requested by the processor core.
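In effect, the cache controller predecodes each refilled line once, remembers whether it is free of non-sequential instructions, and uses that bit to decide whether the next sequential line is worth prefetching. The following C sketch models that decision; the toy ISA encoding (a single branch-class opcode), the line size, and the prefetch hook are assumptions for illustration, not the patent's design.

```c
/* Sketch of predecode-and-prefetch: a line is marked "sequential only"
 * on refill, and that flag gates next-line prefetch.  The ISA encoding
 * (top byte 0xB0 means branch) and line size are assumptions. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define LINE_WORDS 8

typedef struct {
    uint32_t words[LINE_WORDS];
    bool     sequential_only;   /* stored with the line by the cache controller */
} cache_line;

/* Predecode on refill: does this line contain any non-sequential instruction? */
static bool line_is_sequential(const uint32_t *words)
{
    for (int i = 0; i < LINE_WORDS; i++)
        if ((words[i] >> 24) == 0xB0)    /* assumed branch-class opcode */
            return false;
    return true;
}

/* Called when the core requests an instruction from a resident line. */
static void on_line_access(const cache_line *line, uint32_t next_line_addr)
{
    if (line->sequential_only)
        printf("prefetching next line at 0x%08X\n", (unsigned)next_line_addr);
    else
        printf("line has a branch; waiting for the actual fetch target\n");
}

int main(void)
{
    cache_line l = { .words = { 0xE3A00000, 0xE2800001, 0xB0000010 } };
    l.sequential_only = line_is_sequential(l.words);
    on_line_access(&l, 0x00001020);
    return 0;
}
```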

Patent
11 Oct 1996
TL;DR: In this paper, the authors present an integrated processor whose single monolithic circuit addresses the data-intensive, video-intensive and voice-intensive requirements of a personal information device.
Abstract: PROBLEM TO BE SOLVED: To build an integrated processor on a single monolithic circuit that addresses the data-intensive, video-intensive and voice-intensive requirements of a personal information device. SOLUTION: The integrated processor 10 is provided with a CPU core 14, a memory controller 16 and various peripheral devices, making it versatile and high-performance. A clock controller 26 with plural phase-locked loops generates clock signals of different frequencies, so the various sub-systems are appropriately clocked and the power consumption of the processor 10 is kept small. The clock signals supplied to the various sub-systems by the clock controller 26 are derived from one crystal oscillator input signal. A power management device 24 is incorporated inside the processor; it controls the frequency and/or application of the clock signals to the various sub-systems and controls other power management functions. Since certain external pins are selectively multiplexed according to the desired functionality of the processor 10, the pin count of the processor 10 is minimized.

Patent
08 Jul 1996
TL;DR: In this paper, a cache register file, indexed via the offset field of the load instruction, is used for retaining cache lines from previously executed load instructions, which is then used by subsequent instructions (e.g. load instructions) requiring the data previously loaded therein.
Abstract: A method and apparatus for reducing the number of cycles required to implement load instructions in a data processing system having a Central Processing Unit (CPU). The CPU includes a cache register file, indexed via the offset field of the load instruction, for retaining cache lines from previously executed load instructions. The cache register file is then used by subsequent instructions (e.g. load instructions) requiring the data previously loaded therein, thus reducing the cycles normally associated with retrieving the data from the cache for those subsequent instructions.
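The idea can be pictured as a tiny, offset-indexed buffer in front of the data cache: a load whose effective address matches the entry selected by its offset field returns data without a cache access. The C model below is a hedged sketch of that behavior; the entry count, tag scheme, and cycle costs are assumptions, not the patent's exact design.

```c
/* Sketch of a "cache register file" indexed by a load's offset field.
 * Entry count, tagging, and cycle costs are illustrative assumptions. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define CRF_ENTRIES 16

typedef struct {
    bool     valid;
    uint32_t addr;     /* full effective address, used as the tag */
    uint32_t data;
} crf_entry;

static crf_entry crf[CRF_ENTRIES];

/* Simulated data memory standing in for the data cache. */
static uint32_t memory[1024];

static uint32_t do_load(uint32_t base, uint16_t offset, unsigned *cycles)
{
    uint32_t ea  = base + offset;
    unsigned idx = offset % CRF_ENTRIES;     /* indexed by the offset field */

    if (crf[idx].valid && crf[idx].addr == ea) {
        *cycles += 1;                        /* hit: register-file speed     */
        return crf[idx].data;
    }
    *cycles += 3;                            /* miss: go to the data cache   */
    crf[idx] = (crf_entry){ true, ea, memory[ea / 4] };
    return crf[idx].data;
}

int main(void)
{
    unsigned cycles = 0;
    memory[0x40 / 4] = 1234;
    do_load(0x00, 0x40, &cycles);            /* first load fills the entry   */
    do_load(0x00, 0x40, &cycles);            /* repeated load hits it        */
    printf("loaded %u twice in %u cycles\n", (unsigned)memory[0x40 / 4], cycles);
    return 0;
}
```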

Book ChapterDOI
23 Sep 1996
TL;DR: The electronics industry is moving toward the design and realization of "systems-on-silicon", integrated circuits generally built around a processor core to which logic functions are added in order to realize an Application Specific Integrated Processor (ASIP).
Abstract: The electronics industry is moving toward the design and realization of “systems-on-silicon”. This kind of integrated circuit is generally built around a processor core. Logic functions are added to this core in order to realize an Application Specific Integrated Processor (ASIP).

Proceedings ArticleDOI
13 Oct 1996
TL;DR: This paper presents an object-oriented machine, currently under development, which incorporates (at the machine-code level) some mechanisms needed for manipulating objects and methods.
Abstract: Microprocessor design and manufacturing have seen great improvements in recent years. However, object-oriented concepts, in spite of their widespread diffusion as a programming principle, have not been given great attention in hardware design. This paper presents an object-oriented machine, currently under development, which incorporates (at the machine-code level) some mechanisms needed for manipulating objects and methods. The processor, oriented to control applications, is composed of a commercial, full-32-bit RISC processor acting as the computing core, and additional circuitry. The additional elements constitute a shell, providing dedicated registers and functions for dealing with class instances and related methods. A mechanism for tracking called methods, by hardware support of the Virtual Method Table, is provided in parallel to the normal calling operation of the processor. The overhead associated with this mechanism, normally handled by the core processor, is therefore left to the additional circuitry.

Patent
05 Nov 1996
TL;DR: In this article, an integrated processor is presented whose single monolithic circuit addresses the data-intensive, video-intensive and voice-intensive requirements of a personal information device.
Abstract: PROBLEM TO BE SOLVED: To build an integrated processor on a single monolithic circuit that addresses the data-intensive, video-intensive and voice-intensive requirements of a personal information device. SOLUTION: An integrated processor 10 includes a CPU core 14, a memory controller 16, and various peripheral equipment, and offers versatile, high-performance operation. It is provided with a clock controller 26 including plural phase-locked loops for generating clock signals of different frequencies, and the various sub-systems are appropriately clocked so that the power consumption of the processor is reduced. The clock signals applied by the clock controller to the various sub-systems are derived from one crystal oscillator input signal. A power management device 24 is integrated in the processor; it controls the frequency and/or application of the clock signals to the various sub-systems and controls other power management functions. The pin count of the processor is minimized by selectively multiplexing certain external pins according to the desired functionality.


Proceedings ArticleDOI
02 Sep 1996
TL;DR: This work investigates the effect on performance caused by the way instructions are distributed among the functional units of superscalar processors, and shows that a performance gain of up to 38% can be obtained when the instructions are evenly distributed amongThe functional units.
Abstract: New techniques are increasing the degree of instruction-level parallelism exploited by processors. Recent superscalar implementations include multiple functional units, allowing the parallel execution of several instructions from the same application program. The trend towards an expansion of the number of hardware resources is likely to continue in future superscalar designs, and in order to maximize the processor throughput, the computational load must be balanced among these resources by the dynamic instruction-issuing algorithm. We investigate the effect on performance caused by the way instructions are distributed among the functional units of superscalar processors. Our results show that a performance gain of up to 38% can be obtained when the instructions are evenly distributed among the functional units.
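The balancing policy studied can be approximated by an issue stage that sends each ready instruction to the least-occupied of several identical functional units rather than always picking the first free one. The short C sketch below illustrates that policy; the unit count and the notion of a per-unit queue are assumptions for illustration, not the paper's simulated machine.

```c
/* Toy sketch of balanced instruction distribution: each instruction is
 * issued to the least-occupied functional unit.  Counts are illustrative. */
#include <stdio.h>

#define NUM_UNITS 4

static unsigned pending[NUM_UNITS];   /* operations queued at each unit */

/* Pick the functional unit with the shortest queue. */
static unsigned pick_unit(void)
{
    unsigned best = 0;
    for (unsigned u = 1; u < NUM_UNITS; u++)
        if (pending[u] < pending[best])
            best = u;
    return best;
}

int main(void)
{
    /* Issue 13 instructions, one per slot, balancing across the units. */
    for (int i = 0; i < 13; i++)
        pending[pick_unit()]++;

    for (unsigned u = 0; u < NUM_UNITS; u++)
        printf("unit %u received %u instructions\n", u, pending[u]);
    return 0;
}
```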

01 Jan 1996
TL;DR: This work investigates the effect on performance caused by the way instructions are distributed among the functional units of superscalar processors, and shows that performance gains of up to 38% can be obtained when the instructions are evenly distributed among the functional units.
Abstract: New techniques are consistently increasing the degree of instruction-level parallelism exploited by processors. Recent superscalar implementations include multiple functional units, allowing the parallel execution of several instructions from the same application program. The trend toward the expansion in the number of hardware resources is likely to continue in future superscalar designs, and in order to maximize the processor throughput, the computational load must be balanced among these resources by the dynamic instruction-issuing algorithm. In this work we investigate the effect on performance caused by the way instructions are distributed among the functional units of superscalar processors. Our results show that performance gains of up to 38% can be obtained when the instructions are evenly distributed among the functional units.

Proceedings Article
01 Jan 1996
TL;DR: In this article, the authors examined the feasibility of using integrated programmable logic as a coprocessor to support a host CPU core and compared it to a VLIW machine in terms of both die area occupied and performance.
Abstract: This paper examines the viability of using integrated programmable logic as a coprocessor to support a host CPU core. This adaptive coprocessor is compared to a VLIW machine in terms of both die area occupied and performance. The parametric bounds necessary to justify the adoption of an FPGA-based coprocessor are established. An abstract Field Programmable Gate Array model is used to investigate the area and delay characteristics of arithmetic circuits implemented on FPGA architectures to determine the potential speedup of FPGA-based coprocessors. Our analysis shows that integrated FPGA arrays are suitable as coprocessor platforms for realising algorithms that require only limited numbers of multiplication instructions. Inherent FPGA characteristics limit the data-path widths that can be supported efficiently for these applications. An FPGA-based adaptive coprocessor requires a large minimum die area before any advantage over a VLIW machine of a comparable size can be realised.