
Showing papers on "Multi-core processor published in 1993"


Journal ArticleDOI
TL;DR: The processor reconfiguration through instruction-set metamorphosis (PRISM) general-purpose architecture, which speeds up computationally intensive tasks by augmenting the core processor's functionality with new operations, is described.
Abstract: The processor reconfiguration through instruction-set metamorphosis (PRISM) general-purpose architecture, which speeds up computationally intensive tasks by augmenting the core processor's functionality with new operations, is described. The PRISM approach adapts the configuration and fundamental operations of a core processing system to the computationally intensive portions of a targeted application. PRISM-1, an initial prototype system, is described, and experimental results that demonstrate the benefits of the PRISM concept are presented.

415 citations
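As a rough software analogy of the PRISM idea (not code from the paper; every name and operation below is invented for illustration), one can picture a core "instruction set" as a dispatch table that gets augmented with a synthesized fused operation replacing a computationally intensive instruction sequence:

```python
# Toy software analogy of instruction-set metamorphosis: a base
# dispatch table stands in for the core processor's instruction set,
# and "metamorphosis" extends it with a fused custom operation.
# All names here are illustrative, not from the paper.

# Base instruction set of the "core processor".
BASE_OPS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
}

def run(program, ops):
    """Execute a list of (op, a, b) tuples, returning the results."""
    return [ops[op](a, b) for op, a, b in program]

def metamorphose(ops, name, fused_fn):
    """Return a new instruction set extended with a fused operation."""
    extended = dict(ops)
    extended[name] = fused_fn
    return extended

# A hot multiply-accumulate pattern is fused into one new "instruction".
PRISM_OPS = metamorphose(BASE_OPS, "mac",
                         lambda acc, pair: acc + pair[0] * pair[1])

program = [("mac", 10, (3, 4)), ("add", 1, 2)]
out = run(program, PRISM_OPS)
print(out)  # [22, 3]
```

In the actual PRISM system the extension is realized in reconfigurable hardware rather than a software table; the sketch only illustrates the augment-the-core-ISA concept.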


Proceedings ArticleDOI
05 Apr 1993
TL;DR: The architecture and compiler for a general-purpose metamorphic computing platform called PRISM-II are described; the platform improves the performance of many computationally-intensive tasks by augmenting the functionality of the core processor with new instructions that match the characteristics of targeted applications.
Abstract: This paper discusses the architecture and compiler for a general-purpose metamorphic computing platform called PRISM-II. PRISM-II improves the performance of many computationally-intensive tasks by augmenting the functionality of the core processor with new instructions that match the characteristics of targeted applications. In essence, PRISM (processor reconfiguration through instruction set metamorphosis) is a general purpose hardware platform that behaves like an application-specific platform. Two methods for hardware synthesis, one using VHDL Designer and the other using X-BLOX, are presented and synthesis results are compared.

164 citations


Proceedings ArticleDOI
01 Dec 1993
TL;DR: The authors present two hardware components for high performance parallel computing: a superscalar RISC microprocessor with an integrated 400 Mb/s user-level network interface (the 88110MP), and a companion 8 × 8 low-latency packet router chip (ARTIC).
Abstract: The authors present two hardware components for high performance parallel computing: a superscalar RISC microprocessor with an integrated 400 Mb/s user-level network interface (the 88110MP), and a companion 8 × 8 low-latency packet router chip (ARTIC). The design point combines very low message overhead and high delivered communications bandwidth with a commercially competitive sequential processor core. The network interface is directly programmed in user mode as an instruction set extension to the Motorola 88110. Importantly, naming and protection mechanisms are provided to support robust multi-user space and time sharing. Thus, fine-grain messaging and synchronization can be supported efficiently, without compromising per-processor performance or system integrity. Preliminary performance modeling results are presented.

63 citations


Patent
29 Jul 1993
TL;DR: In this article, a digital computer system capable of processing two or more computer instructions in parallel and having a main memory unit for storing information blocks including the computer instructions includes an instruction compounding unit for analyzing the instructions and adding to each instruction a tag field which indicates whether or not that instruction may be processed in parallel with another neighboring instruction.
Abstract: A digital computer system capable of processing two or more computer instructions in parallel and having a main memory unit for storing information blocks including the computer instructions includes an instruction compounding unit for analyzing the instructions and adding to each instruction a tag field which indicates whether or not that instruction may be processed in parallel with another neighboring instruction. Tagged instructions are stored in the main memory. The computer system further includes a plurality of functional instruction processing units which operate in parallel with one another. The instructions supplied to the functional units are obtained from the memory by way of a cache storage unit. At instruction issue time, the tag fields of the instructions are examined and those tagged for parallel processing are sent to different ones of the functional units in accordance with the codings of their operation code fields.

62 citations
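The patent's compounding rule can be caricatured in a few lines (a sketch under invented assumptions: instructions are `(dest, src1, src2)` triples, and the only hazard checked is a register dependency between adjacent instructions):

```python
# Sketch of compounding analysis in the spirit of the patent: each
# instruction receives a tag bit saying whether it may issue in
# parallel with its neighbor. Real compounding would also consider
# functional-unit availability and opcode classes; this checks only
# read-after-write / write-after-write hazards with the next instruction.

def compound(instrs):
    """instrs: list of (dest, src1, src2) register triples.
    Returns (tag, instr) pairs; tag=1 means 'may execute in parallel
    with the following instruction'."""
    tagged = []
    for i, ins in enumerate(instrs):
        tag = 0
        if i + 1 < len(instrs):
            dest = ins[0]
            nxt = instrs[i + 1]
            # Next instruction neither reads nor rewrites our dest
            # register -> the pair is independent and may be paired.
            if dest not in (nxt[0], nxt[1], nxt[2]):
                tag = 1
        tagged.append((tag, ins))
    return tagged

prog = [("r1", "r2", "r3"),   # r1 = r2 op r3
        ("r4", "r5", "r6"),   # independent of the first -> tag 1 above
        ("r7", "r4", "r8")]   # reads r4 -> the pair above gets tag 0
tags = compound(prog)
print(tags)
```

As in the patent, the tags would be computed once, stored alongside the instructions in memory, and merely examined at issue time.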


Proceedings ArticleDOI
20 Sep 1993
TL;DR: A novel method that formulates the design of an optimal instruction set using an integer programming approach is described and a tool that enables the designer to predict the chip area and performance of the design before the detailed design is completed is discussed.
Abstract: The current implementation and experimental results of the PEAS-I (practical environment for application-specific integrated processor (ASIP) development, Version I) system are described. The PEAS-I system is a hardware/software co-design system for ASIP development. The input to the system is a set of application programs written in the C language, an associated data set, and design constraints such as chip area and power consumption. The system generates an optimized CPU core design in the form of an HDL, as well as a set of application program development tools, such as a C compiler, assembler, and simulator. A novel method that formulates the design of an optimal instruction set using an integer programming approach is described. A tool that enables the designer to predict the chip area and performance of the design before the detailed design is completed is discussed. Application program development tools are generated in addition to the ASIP hardware design.

46 citations
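The instruction-set selection problem the abstract mentions can be viewed as a 0/1 knapsack-style integer program: pick the optional instructions that maximize estimated performance gain subject to a chip-area budget. The sketch below is not the paper's formulation; the candidate instructions, area costs, and gain figures are made up for illustration:

```python
# 0/1 knapsack dynamic program as a stand-in for instruction-set
# selection under an area constraint. Candidate areas and gains are
# invented numbers, not results from PEAS-I.

def select_instructions(candidates, area_budget):
    """candidates: list of (name, area, gain) with integer areas.
    Returns (best_gain, chosen_names)."""
    # best[a] = (gain, chosen) achievable within area a
    best = [(0, [])] * (area_budget + 1)
    for name, area, gain in candidates:
        new = list(best)
        for a in range(area, area_budget + 1):
            g, chosen = best[a - area]   # read old table: item used once
            if g + gain > new[a][0]:
                new[a] = (g + gain, chosen + [name])
        best = new
    return best[area_budget]

cands = [("mul", 4, 9), ("mac", 5, 10), ("barrel_shift", 3, 5)]
best = select_instructions(cands, 8)
print(best)  # (15, ['mac', 'barrel_shift'])
```

A real ASIP formulation would add coupled constraints (encoding space, power, shared datapath resources), which is why the paper reaches for general integer programming rather than plain knapsack.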


Proceedings ArticleDOI
23 Mar 1993
TL;DR: A method is presented which improves the performance of many computationally intensive tasks by utilizing information extracted at compile-time to synthesize new operations which augment the functionality of a core processor.
Abstract: Many computationally intensive tasks spend nearly all of their execution time within a small fraction of the executable code. Substantial gains can be achieved by allowing the configuration and fundamental operations of a processor to adapt to these frequently accessed portions. A method is presented which improves the performance of many computationally intensive tasks by utilizing information extracted at compile-time to synthesize new operations which augment the functionality of a core processor. The newly synthesized operations are targeted for RAM-based field-programmable gate array (FPGA) devices which provide a mechanism for fast processor reconfiguration. A proof-of-concept system called PRISM, consisting of a specialized C configuration compiler and a reconfigurable hardware platform, is presented. Compilation and performance results are provided which confirm the concept viability, and demonstrate significant speed-up over conventional general-purpose architectures.

37 citations
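The abstract's premise, that nearly all execution time sits in a small code fraction, is exactly the territory of Amdahl's law. A quick way to estimate the overall gain when a fraction f of runtime is accelerated by a factor s (the 90%/20x figures below are illustrative, not from the paper):

```python
# Amdahl's law: overall speedup when fraction f of the original
# runtime is accelerated by factor s, and the rest is unchanged.

def overall_speedup(f, s):
    """f: fraction of runtime accelerated (0..1); s: local speedup."""
    return 1.0 / ((1.0 - f) + f / s)

# If 90% of the time is in the accelerated kernel and the kernel runs
# 20x faster, the whole program speeds up by roughly 6.9x.
print(round(overall_speedup(0.9, 20.0), 2))  # 6.9
```

The same formula also shows why the hot fraction must be large: accelerating half the runtime infinitely fast caps the overall gain at 2x.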


Patent
19 May 1993
TL;DR: A data processing system that can return correctly from exception handling, using the same procedure as when instructions are executed one at a time and without special control, even if an exception occurs partway through instruction processing, is defined in this paper.
Abstract: A data processing system capable of returning correctly from exception handling by the same procedure as when executing instructions one by one, without special control, even if an exception occurs partway through instruction processing, and capable of selecting a mode that executes instructions one at a time for debugging or testing, so that a plurality of instructions can otherwise be executed in parallel with simple control.

15 citations


Proceedings ArticleDOI
Joseph Dao, Nobu Matsumoto, Tsuneo Hamai, Chusei Ogawa, S. Mori
01 Jul 1993
TL;DR: In this article, an algorithm independent layout compaction method for full chip layouts is proposed, which cuts up a large layout, compacts each block independently and then merges them to give the final compacted layout.
Abstract: An algorithm independent layout compaction method for full chip layouts is proposed. The partitioning compaction method cuts up a large layout, compacts each block independently and then merges them to give the final compacted layout. A 16-bit CPU core (28.8K transistors) layout was compacted on a standard workstation using this method. Both the computer memory usage and processing time were reduced. Parallel processing is possible to further speed up the computation.

15 citations
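A one-dimensional caricature of the partitioning compaction method (the geometry, spacing rule, and block size are invented; real layout compaction is two-dimensional and constraint-graph based):

```python
# Partitioning compaction in one dimension: split a row of cells into
# blocks, compact each block independently (pack cells leftward with a
# minimum spacing), then merge the compacted blocks end to end.
# Because blocks are independent, they could be compacted in parallel,
# echoing the paper's note on parallel processing.

MIN_SPACE = 1  # invented design-rule spacing between cells

def compact_block(widths):
    """Pack cell widths leftward; return (positions, total_extent)."""
    pos, x = [], 0
    for w in widths:
        pos.append(x)
        x += w + MIN_SPACE
    return pos, x - MIN_SPACE  # drop the trailing space

def partition_compact(widths, block_size):
    """Compact blocks independently, then merge with running offsets."""
    placed, offset = [], 0
    for i in range(0, len(widths), block_size):
        pos, extent = compact_block(widths[i:i + block_size])
        placed.extend(p + offset for p in pos)
        offset += extent + MIN_SPACE
    return placed

placed = partition_compact([2, 3, 1, 4], block_size=2)
print(placed)  # [0, 3, 7, 9]
```

The trade-off the paper exploits is visible even here: each block only needs memory proportional to its own size, at the cost of a possibly slightly less tight result across block boundaries.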


Proceedings ArticleDOI
13 Apr 1993
TL;DR: Of the many forms of parallel processor system, the distributed heterogeneous parallel system is a possibly attractive approach, yet despite continuing research efforts in parallel processing, persistent difficulties and challenges still exist.
Abstract: In many defense-related applications, very complex hardware and software systems exist. They are characterized by difficult computation and real-time requirements. These systems typically have an assorted collection of heterogeneous analog or digital processors. The programs that run the embedded system are typically on the order of hundreds of thousands of lines of source code. The system is generally very complex, hard to design, and hard to maintain. Given the recent substantial government investment in, and possible payoff from, high-performance computation, it is only natural to examine the possibility of implementing this kind of very complex system on high-performance parallel processors. Of the many forms of parallel processor system, the distributed heterogeneous parallel system is a possibly attractive approach. Despite the continuing research efforts in parallel processing, persistent difficulties and challenges still exist. (1) Scalability problem: measured speed-up efficiencies (from benchmark experiments) with large (thousands of processing elements) MIMD parallel architectures are still in the single-digit percent range. For vector-processor supercomputers the performance is a little better, in the tens of percent range. (2) Software for parallel processors is still a problem: programming a parallel processor system can be done with two different approaches. The first is to take a regular sequential program and compile it for a parallel processor system; this is generally referred to as the parallelizing-compiler approach. The second is to recode the program in a parallel language such as LINDA, Fortran 90, or a functional (applicative) language. The first approach does not require a large program-rewrite effort, but a parallelizing compiler that can deal with thousands of lines of code does not yet exist, and those available for small programs still suffer performance problems. The second approach can achieve better performance. However, there is no automatic mapping technology for partitioning and scheduling, and good performance in programming parallel processors still relies on slow and tedious manual mapping.

3 citations


Proceedings ArticleDOI
F. Terayama1, J. Korematsu, F. Kitamura, J. Hinata, T. Enomoto1 
09 May 1993
TL;DR: The architecture and implementation of an application-specific processor, the G100FTS, designed for fault-tolerant systems are described, which drastically reduces the component count of a system and provides high system reliability.
Abstract: The architecture and implementation of an application-specific processor, the G100FTS, designed for fault-tolerant systems are described. The G100FTS integrates the core processor and the support module for fault-tolerant operation. The core processor is the 32-b microprocessor Gmicro/100 based on the TRON specifications. The support module provides a mechanism of error detection, a rollback operation that recovers a processor from a transient fault, diagnosis of faulty processors, and reconfiguration of a single processor system to make it operate continuously. The G100FTS drastically reduces the component count of a system and provides high system reliability. The G100FTS performs all the fault-tolerant operations in hardware and requires no dedicated programs for fault tolerance. The chip contains 805K transistors within a chip size of 13 mm × 14 mm. A 1.0-µm, double-metal CMOS technology was used.

2 citations
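A software sketch of the checkpoint/rollback idea that the G100FTS implements in hardware (the fault model, names, and retry policy below are invented for illustration): snapshot state before each step, and on a transient fault restore the snapshot and retry.

```python
# Checkpoint/rollback recovery from transient faults, in the spirit of
# the G100FTS support module (which does this in hardware, with no
# dedicated fault-tolerance software). Everything here is a toy model.

import copy

def run_with_rollback(state, steps, max_retries=3):
    """Apply each step to state; on a raised fault, roll back to the
    last checkpoint and retry, up to max_retries times per step."""
    for step in steps:
        checkpoint = copy.deepcopy(state)  # snapshot before the step
        for _ in range(max_retries + 1):
            try:
                step(state)
                break
            except RuntimeError:           # transient fault detected
                state.clear()
                state.update(checkpoint)   # roll back to the snapshot
        else:
            raise RuntimeError("permanent fault: retries exhausted")
    return state

flaky_calls = iter([True, False])          # fault once, then succeed

def increment(state):
    if next(flaky_calls, False):
        state["x"] += 999                  # corrupt state, then fault
        raise RuntimeError("transient fault")
    state["x"] += 1

result = run_with_rollback({"x": 0}, [increment, increment])
print(result)  # {'x': 2}
```

Note the key property the rollback guarantees: the corrupted `+999` update never survives, because the step's effects are discarded wholesale and the step is re-executed from the checkpoint.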


Patent
26 Nov 1993
TL;DR: When an OR circuit detects an address within the area selectable by an address decoder circuit 13, a switch circuit 14 selects the signal decoded by the address decoder circuit and outputs it to an external bus B22.
Abstract: PROBLEM TO BE SOLVED: To provide a single-chip computer that can make do with one compact CPU core block, with no wasted versions regardless of the number of peripheral circuit blocks, and whose cost can be reduced. SOLUTION: When an OR circuit 17 detects an address within the area selectable by the address decoder circuit 13, a switch circuit 14 selects the signal decoded by the address decoder circuit 13 and outputs the decoded signal to an external bus B22. When the OR circuit 17 detects an address outside the area selectable by the address decoder circuit 13, the switch circuit 14 selects the address signal of an internal bus B12B and outputs the undecoded address signal to the external bus B22.

Patent
26 Nov 1993
TL;DR: In this paper, the authors propose a structure and method for generating forcing-in instructions to a CPU core by using a microprocessor provided with an on-chip cache memory, where the TLB unit provides mapping between a virtual address and a physical address.
Abstract: PURPOSE: To provide a structure and method for generating a forcing-in instruction to a CPU core by using a microprocessor provided with an on-chip cache memory. CONSTITUTION: A CPU core 103 has two co-processors controlled by a master pipeline control unit 103c: an integer CPU and a system control co-processor 103b. The integer CPU executes an instruction set known as the instruction set architecture. The co-processor 103b has a translation look-aside buffer (TLB) 103b-3 and provides the mapping between a virtual address and a physical address. The cache system of the microprocessor has two cache memories: an instruction cache memory 102a and a data cache memory 102b. The TLB unit 103b-3 receives a virtual address on a bus 109 and supplies the corresponding physical address to either one of the cache memories 102a and 102b over a bus 107.
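A minimal model of the TLB translation step described above (the page size, table entries, and addresses are invented; a real TLB is an associative hardware structure with protection bits and a page-table refill path):

```python
# Toy TLB lookup: split a virtual address into a virtual page number
# (VPN) and an offset, translate the VPN to a physical frame number
# (PFN) through a small table, and rebuild the physical address that
# would be fed to the instruction or data cache.

PAGE_BITS = 12                      # assume 4 KB pages
PAGE_MASK = (1 << PAGE_BITS) - 1

tlb = {0x00400: 0x12345, 0x00401: 0x00777}   # VPN -> PFN (invented)

def translate(vaddr):
    """Return the physical address for vaddr, or raise on a TLB miss."""
    vpn, offset = vaddr >> PAGE_BITS, vaddr & PAGE_MASK
    if vpn not in tlb:
        # In hardware a miss would trigger a page-table walk or a
        # refill exception; here we just signal it.
        raise KeyError("TLB miss: refill from page table")
    return (tlb[vpn] << PAGE_BITS) | offset

print(hex(translate(0x00400ABC)))  # 0x12345abc
```

The offset bits pass through untranslated, which is what lets a virtually-indexed cache begin its lookup in parallel with the TLB access.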

Patent
13 Jul 1993
TL;DR: In this article, the authors propose an emulator that does not require a new evaluation chip for each variant when a microcomputer family is extended by changing the peripheral circuitry as desired around one central processing unit.
Abstract: PURPOSE: To provide an emulator that does not require a new evaluation chip for each variant when a microcomputer family is extended by changing the peripheral circuitry as desired around one central processing unit. CONSTITUTION: The evaluation chip corresponding to a specific actual chip 22 of the extended variant is replaced by an evaluation module 10, in which the actual chip 22 and an evaluation chip 21 containing the same CPU core 23 as the CPU core 24 of the chip 22 are mounted on a wiring board 20. The actual chip 22 and the evaluation chip 21 are connected on the wiring board, and the actual chip 22 has an architecture that disconnects the built-in CPU core 24 from a peripheral module 25, either operationally or physically, at emulation time; the peripheral module 25 is then access-controlled by the evaluation chip 21.

Patent
Ando Hideki, Ikenaga Chikako
01 Apr 1993
TL;DR: In this paper, a parallel processing system contains a decoder circuit for detecting instructions which can be performed simultaneously from simultaneously applied instructions, and each of a number of identical parallel functional units receives and performs an instruction from the decoder.
Abstract: The parallel processing system contains a decoder circuit for detecting instructions which can be performed simultaneously from simultaneously applied instructions. Each of a number of identical parallel functional units receives and performs an instruction from the decoder. Data are stored in a memory (6). A further functional unit, which does not receive instructions from the decoder, is only used to perform a control instruction when there is an access to the data memory. The functional units have an associated processing unit for performing arithmetic and logical operations on the received data. USE/ADVANTAGE - Faster processing speed.