
Showing papers on "Multi-core processor published in 1999"


Proceedings ArticleDOI
16 Nov 1999
TL;DR: It is argued that the DIVA checker should lend itself to functional and electrical verification better than a complex core processor, and overall design cost can be dramatically reduced because designers need only verify the correctness of the checker unit.
Abstract: Building a high-performance microprocessor presents many reliability challenges. Designers must verify the correctness of large complex systems and construct implementations that work reliably in varied (and occasionally adverse) operating conditions. To further complicate this task, deep submicron fabrication technologies present new reliability challenges in the form of degraded signal quality and logic failures caused by natural radiation interference. In this paper, we introduce dynamic verification, a novel microarchitectural technique that can significantly reduce the burden of correctness in microprocessor designs. The approach works by augmenting the commit phase of the processor pipeline with a functional checker unit. The functional checker verifies the correctness of the core processor's computation, only permitting correct results to commit. Overall design cost can be dramatically reduced because designers need only verify the correctness of the checker unit. We detail the DIVA checker architecture, a design optimized for simplicity and low cost. Using detailed timing simulation, we show that even resource-frugal DIVA checkers have little impact on core processor performance. To make the case for reduced verification costs, we argue that the DIVA checker should lend itself to functional and electrical verification better than a complex core processor. Finally, future applications that leverage dynamic verification to increase processor performance and availability are suggested.
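The commit-stage checking described above can be illustrated with a toy model (a hypothetical sketch, not the actual DIVA hardware): a simple, independently verifiable checker recomputes each completed instruction's result, and only verified results reach commit.

```python
# Toy model of commit-time dynamic verification (hypothetical sketch,
# not the DIVA microarchitecture itself). The complex "core" may produce
# faulty results; the simple checker recomputes each one before commit.

OPS = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b}

def checker_commit(inflight):
    """Re-execute each completed instruction; commit only verified results.

    `inflight` is a list of (op, a, b, core_result) tuples, where
    core_result is what the (possibly faulty) core computed.
    """
    committed, flushes = [], 0
    for op, a, b, core_result in inflight:
        golden = OPS[op](a, b)        # simple, verifiable recomputation
        if core_result != golden:
            flushes += 1              # in DIVA: fix up and restart the core
        committed.append(golden)      # only checked results reach commit
    return committed, flushes

# A transient fault corrupts the second result; the checker repairs it.
results, flushes = checker_commit([("add", 2, 3, 5), ("sub", 10, 4, 7)])
```

In the real design, a mismatch also flushes and restarts the core pipeline; here the flush count simply records how often the checker had to intervene.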

680 citations


Journal ArticleDOI
TL;DR: This work develops a design methodology for low-power, core-based real-time SOCs built on dynamically variable voltage hardware and proposes a nonpreemptive scheduling heuristic that yields solutions very close to optimal for many test cases.
Abstract: The growing class of portable systems, such as personal computing and communication devices, has resulted in a new set of system design requirements, mainly characterized by dominant importance of power minimization and design reuse. The energy efficiency of systems-on-a-chip (SOC) could be much improved if one were to vary the supply voltage dynamically at run time. We developed the design methodology for the low-power core-based real-time SOC based on dynamically variable voltage hardware. The key challenge is to develop effective scheduling techniques that treat voltage as a variable to be determined, in addition to the conventional task scheduling and allocation. Our synthesis technique also addresses the selection of the processor core and the determination of the instruction and data cache size and configuration so as to fully exploit dynamically variable voltage hardware, which results in significantly lower power consumption for a set of target applications than existing techniques. The highlight of the proposed approach is the nonpreemptive scheduling heuristic, which results in solutions very close to optimal ones for many test cases. The effectiveness of the approach is demonstrated on a variety of modern industrial strength multimedia and communication applications.
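The energy benefit of voltage scaling can be sketched with a first-order model (illustrative only, not the paper's scheduling heuristic): energy per cycle scales roughly with V^2 and clock frequency roughly with V, so running a task set at the lowest speed that still meets its deadline minimizes energy.

```python
# Illustrative model (not the paper's heuristic): with energy-per-cycle
# proportional to the square of the normalized speed s (since E ~ V^2 and
# f ~ V to first order), running as slowly as the deadline allows saves
# energy quadratically.

def min_energy_speed(task_cycles, deadline, f_max):
    """Pick one uniform normalized speed s in (0, 1] for a nonpreemptive
    task set sharing a deadline; return s and the energy relative to s=1."""
    total = sum(task_cycles)
    s = min(1.0, total / (deadline * f_max))   # slowest feasible speed
    if total / (s * f_max) > deadline + 1e-9:
        raise ValueError("infeasible even at full speed")
    energy_ratio = s ** 2                      # energy vs. full-speed run
    return s, energy_ratio

# 5M cycles of work, 0.1 s deadline, 100 MHz core: half speed suffices,
# cutting energy per cycle to a quarter of the full-voltage figure.
s, e = min_energy_speed([2e6, 3e6], 0.1, 100e6)
```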

270 citations


Patent
07 May 1999
TL;DR: In this paper, an integrated circuit for processing streams of data generally and streams of packets in particular is described. The integrated circuit includes a number of packet processors ( 307, 313, 303 ), a table look up engine ( 301 ), a queue management engine ( 305 ) and a buffer management engine( 315 ).
Abstract: An integrated circuit ( 203 ) for use in processing streams of data generally and streams of packets in particular. The integrated circuit ( 203 ) includes a number of packet processors ( 307, 313, 303 ), a table look up engine ( 301 ), a queue management engine ( 305 ) and a buffer management engine ( 315 ). The packet processors ( 307, 313, 303 ) include a receive processor ( 421 ), a transmit processor ( 427 ) and a risc core processor ( 401 ), all of which are programmable. The receive processor ( 421 ) and the core processor ( 401 ) cooperate to receive and route packets being received and the core processor ( 401 ) and the transmit processor ( 427 ) cooperate to transmit packets. Routing is done by using information from the table look up engine ( 301 ) to determine a queue ( 215 ) in the queue management engine ( 305 ) which is to receive a descriptor ( 217 ) describing the received packet's payload.

167 citations


Proceedings ArticleDOI
26 Apr 1999
TL;DR: Instruction Randomization Self Test (IRST) achieves stuck-at-fault coverage for an embedded processor core without the need for scan insertion or mux isolation for application of test patterns.
Abstract: Access to embedded processor cores for application of test has greatly complicated the testability of large systems on silicon. Scan based testing methods cannot be applied to processor cores which cannot be modified to meet the design requirements for scan insertion. Instruction Randomization Self Test (IRST) achieves stuck-at-fault coverage for an embedded processor core without the need for scan insertion or mux isolation for application of test patterns. This is a new built-in self test method which combines the execution of microprocessor instructions with a small amount of on-chip test hardware which is used to randomize those instructions. IRST is well suited for meeting the challenges of testing ASIC systems which contain embedded processor cores.

121 citations


Proceedings ArticleDOI
21 Apr 1999
TL;DR: A smart compilation chain in which the compiler is no longer limited by a pre-defined instruction set, but can generate application-specific custom instructions and synthesise them in Field-Programmable Logic to reduce the reconfiguration overhead and optimise the utilisation of resources is proposed.
Abstract: We propose a smart compilation chain in which the compiler is no longer limited by a pre-defined instruction set, but can generate application-specific custom instructions and synthesise them in Field-Programmable Logic. We also present a RISC micro-architecture enhanced by a CPLD-based Reconfigurable Functional Unit (RFU) which supports our compiler approach. The main difference between our smart compiler and similar methods is the ability to encode multiple custom instructions in a single RFU configuration, cross-minimising the logic among them. The objective is to reduce (or eliminate) the reconfiguration overhead and optimise the utilisation of resources. The CPLD core that implements the RFU is based on the Philips XPLA2 architecture. We discuss the advantages of using the XPLA2 instead of conventional FPGAs. Application examples are also presented, which show that our RFU-extended CPU can achieve speed-ups of more than 40% for encryption algorithms, when compared to the standard CPU core alone.

111 citations


Proceedings ArticleDOI
21 Apr 1999
TL;DR: A protection architecture is proposed for the Morph/AMRM reconfigurable processor which enables nearly the full power of reconfigurability in the processor core while requiring only a small number of fixed logic features to ensure safe, protected multiprocess execution.
Abstract: Technology scaling of CMOS processes brings relatively faster transistors (gates) and slower interconnects (wires), making viable the addition of reconfigurability to increase performance. In the Morph/AMRM system we are exploring the addition of reconfigurable logic, deeply integrated with the processor core, employing the reconfigurability to manage the cache, datapath, and pipeline resources more effectively. However, integration of reconfigurable logic introduces significant protection and safety challenges for multiprocess execution. We analyze the protection structures in a state-of-the-art microprocessor core (R10000), identifying the few critical logic blocks and demonstrating that the majority of the logic in the processor core can be safely reconfigured. Subsequently, we propose a protection architecture for the Morph/AMRM reconfigurable processor which enables nearly the full power of reconfigurability in the processor core while requiring only a small number of fixed logic features to ensure safe, protected multiprocess execution.

89 citations


Journal ArticleDOI
01 Feb 1999
TL;DR: The embedded system chip AMULET2e demonstrates competitive performance, power efficiency, and ease of system design, and includes innovative features that exploit its asynchronous operation to advantage in applications that require low standby power and/or freedom from the electromagnetic interference generated by system clocks.
Abstract: AMULET2e is an embedded system chip incorporating a 32-bit ARM-compatible asynchronous processor core, a 4-Kb pipelined cache, a flexible memory interface with dynamic bus sizing, and assorted programmable control functions. Many on-chip performance-enhancing and power-saving features are switchable, enabling detailed experimental analysis of their effectiveness. AMULET2e silicon demonstrates competitive performance, power efficiency, and ease of system design, and includes innovative features that exploit its asynchronous operation to advantage in applications that require low standby power and/or freedom from the electromagnetic interference generated by system clocks.

80 citations


Proceedings ArticleDOI
Michael K. Gschwind1
01 Mar 1999
TL;DR: The presented approach uses the processor core to allow early evaluation of ASIP design options using rapid prototyping techniques, and describes a hardware/software co-design methodology which can be used with this design approach.
Abstract: We describe an approach for application-specific processor design based on an extendible microprocessor core. Core-based design makes it possible to derive application-specific instruction processors from a common base architecture at low non-recurring engineering cost. The result of this application-specific customization of a common base architecture is a family of related and largely compatible processors, whose members can share support tools and even binary-compatible code written for the common base architecture. Critical code portions are customized using the application-specific instruction set extensions. We describe a hardware/software co-design methodology which can be used with this design approach. The presented approach uses the processor core to allow early evaluation of ASIP design options using rapid prototyping techniques. We demonstrate this approach with two case studies, based on the implementation and evaluation of application-specific processor extensions for Prolog program execution and memory prefetching for vector and matrix operations.

67 citations


Journal ArticleDOI
01 Nov 1999
TL;DR: A new emulated digital CNN Universal Machine chip architecture is introduced and the main steps of the design process are shown and its variable precision capability allows the user to trade off precision for speed.
Abstract: A new emulated digital CNN Universal Machine chip architecture is introduced and the main steps of the design process are shown in this paper. One core processor can be implemented on a 2 × 2 mm^2 silicon area with a 0.35 μm CMOS technology. Assuming an array of 24 processors on a chip, its speed is 1 ns/virtual cell/CNN iteration with 12-bit precision. This enables the execution of over five hundred 3 × 3 convolution operations on each frame of a 240 × 320-pixel, 25 fps digital image flow. Another new feature of the design is its variable precision capability. This allows the user to trade off precision for speed. The architecture supports some non-linear filter implementations as well.

50 citations


Patent
Douglas Garde1
08 Jan 1999
TL;DR: In this paper, a high performance digital signal processor includes a memory for storing instructions and operands for digital signal computations and a core processor connected to the memory, where a data alignment buffer is provided between the memory banks and the computation blocks, allowing unaligned accesses to specified operands that are stored in different memory rows.
Abstract: A high performance digital signal processor includes a memory for storing instructions and operands for digital signal computations and a core processor connected to the memory. The memory may include first, second and third memory banks connected to the core processor by first, second and third data and address buses, respectively. The core processor includes a program sequencer and may include first and second computation blocks for performing first and second subsets, respectively, of the digital signal computations. A data alignment buffer is provided between the memory banks and the computation blocks. The data alignment buffer permits unaligned accesses to specified operands that are stored in different memory rows. The specified operands are supplied to one or both of the computation blocks in the same processor cycle.

50 citations


Patent
01 Oct 1999
TL;DR: In this paper, a method for effectuating multiplication in a processor core is presented, which supports multiplication instructions for two formats of data: integer-formatted data and fixed-point data, exclusive of a floating-point unit.
Abstract: A method is disclosed for effectuating multiplication in a processor core. The method supports multiplication instructions for two formats of data: integer-formatted data and fixed-point data, exclusive of a floating-point unit. The data can be packed data, including 16-bit packed data and 32-bit packed data.

Patent
10 Feb 1999
TL;DR: In this paper, a simple syntax for defining instructions, similar to that of the C programming language, is presented, and a method for executing instructions in a data processor and improvements to data processor design.
Abstract: A method for executing instructions in a data processor and improvements to data processor design, which combine the advantages of regular processor architecture and Very Long Instruction Word architecture to increase execution speed and ease of programming while reducing power consumption. Instructions, each consisting of a number of operations to be performed in parallel, are defined by the programmer, and their corresponding execution unit controls are generated at compile time and loaded prior to program execution into a dedicated array in processor memory. Subsequently, the programmer invokes reference instructions to call these defined instructions and passes parameters from regular instructions in program memory. As the regular instructions propagate down the processor's pipeline, they are replaced by the appropriate controls fetched from the dedicated array in processor memory, which then go directly to the execution unit for execution. These instructions may be redefined while the program is running. In this way the processor benefits from the speed of parallel processing without the chip area and power consumption overhead of a wide program memory bus and multiple instruction decoders. A simple syntax for defining instructions, similar to that of the C programming language, is presented.

Journal ArticleDOI
15 Feb 1999
TL;DR: A 250-MHz microprocessor intended for home computer entertainment consists of a CPU core with 128-b multimedia extensions, two single-instruction, multiple-data (SIMD) very long instruction word (VLIW) vector processors, an MPEG-2 decoder, a ten-channel direct memory access (DMA) controller, and other peripherals with 128 b internal buses on one die.
Abstract: A 250-MHz microprocessor intended for home computer entertainment consists of a CPU core with 128-b multimedia extensions, two single-instruction, multiple-data (SIMD) very long instruction word (VLIW) vector processors containing ten floating-point multiplier accelerators and four floating-point dividers, an MPEG-2 decoder, a ten-channel direct memory access (DMA) controller, and other peripherals with 128-b internal buses on one die. The core is a two-way superscalar MIPS-compatible microprocessor with 16-kB scratch-pad RAM. Each vector processor is a five-way SIMD-VLIW architecture, which is tightly dedicated to specific applications concerning three-dimensional geometry calculation and physical simulation. A DMA controller connects main memory and each processor's local memory to conceal the memory access penalty. It contains 10.5 M transistors in 17 × 14.1 mm and dissipates 15 W at 1.8 V.

Journal ArticleDOI
TL;DR: This work presents a simple coherence protocol that eliminates passive sharing using information from the compiler that is normally available in operating system kernels, and further limits the coherence-maintaining overhead by using information about access patterns to shared data exhibited in parallel applications.
Abstract: In high-performance general-purpose workstations and servers, the workload is typically constituted of both sequential and parallel applications. Shared-bus shared-memory multiprocessors can be used to speed up the execution of such workloads. In this environment, the scheduler takes care of load balancing by allocating a ready process to the first available processor, thus producing process migration. Process migration and the persistence of private data in different caches produce an undesired form of sharing, named passive sharing. The copies due to passive sharing produce useless coherence traffic on the bus, and coping with this problem can represent a challenging design issue for these machines. Many protocols use smart solutions to limit the overhead of maintaining coherence among shared copies. None of these studies treats passive sharing directly, although some indirect effect is present when dealing with the other kinds of sharing. Affinity scheduling can alleviate this problem, but the technique does not adapt to all load conditions, especially when the effects of migration are massive. We present a simple coherence protocol that eliminates passive sharing using information from the compiler that is normally available in operating system kernels. We evaluate the performance of this protocol and compare it against other solutions proposed in the literature by means of enhanced trace-driven simulation. We evaluate the complexity in terms of the number of protocol states, additional bus lines, and required software support. Our protocol further limits the coherence-maintaining overhead by using information about access patterns to shared data exhibited in parallel applications.
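The core idea can be sketched as follows (a toy model, not the paper's actual protocol): using compiler/OS knowledge of which data is private to a process, a migrating process's private lines are simply dropped from the old cache, so no stale passive copies remain to generate coherence traffic on the bus.

```python
# Toy model of eliminating passive sharing (hypothetical sketch, not the
# proposed protocol): lines known private to a process (information from
# the compiler/OS) are self-invalidated when that process migrates, so
# stale private copies never need coherence maintenance.

def migrate(cache, pid):
    """Drop the migrating process's private lines; keep everything else.

    `cache` maps address -> (owner_pid, kind), kind in {"private","shared"}.
    Returns the lines that survive the migration of process `pid`.
    """
    return {addr: (owner, kind)
            for addr, (owner, kind) in cache.items()
            if not (owner == pid and kind == "private")}

cache = {0x100: (1, "private"), 0x200: (1, "shared"), 0x300: (2, "private")}
after = migrate(cache, pid=1)   # process 1 moves to another CPU
```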

Patent
Jerald N. Hall1
12 Nov 1999
TL;DR: In this paper, a tenant processor module is shown comprising a processor core, a plurality of strapping devices, and an input bus coupled to the processor core; at a first time the input bus provides configuration information from the circuit board assembly to the processor core, and at a second time it provides operational data.
Abstract: A tenant processor module is shown comprising a processor core, a plurality of strapping devices, and an input bus. The plurality of strapping devices are configured to indicate configuration information to a receiving circuit board assembly coupled to the processor module. The input bus, coupled to the processor core, receives the configuration information back from the circuit board assembly and provides it to the processor core at a first time. At a second time, the input bus receives operational data from the circuit board assembly and provides it to the processor core.

Patent
03 Dec 1999
TL;DR: A processor has a flexible architecture that efficiently handles computing applications having a range of instruction-level parallelism from a very low degree to a very high degree of parallelism.
Abstract: A processor has a flexible architecture that efficiently handles computing applications having a range of instruction-level parallelism from a very low degree to a very high degree of instruction-level parallelism. The processor includes a plurality of processing units, an individual processing unit of the plurality of processing units including a multiple-instruction parallel execution path. For computing applications having a low degree of instruction-level parallelism, the processor includes control logic that controls the plurality of processing units to execute instructions mutually independently in a plurality of independent execution threads. For computing applications having a high degree of instruction-level parallelism, the processor further includes control logic that controls the plurality of processing units with a low thread synchronization to operate in combination using spatial software pipelining in the manner of a single wide-issue processor. The control logic in the processor alternatively controls the plurality of processing units to operate: (1) in a multiple-thread operation on the basis of a highly parallel structure including multiple independent parallel execution paths for executing in parallel across threads and a multiple-instruction parallel pathway within a thread, and (2) in a single-thread wide-issue operation on the basis of the highly parallel structure including multiple parallel execution paths with low level synchronization for executing the single wide-issue thread. The multiple independent parallel execution paths include functional units that execute an instruction set including special data-handling instructions that are advantageous in a multiple-thread environment.

Proceedings ArticleDOI
07 Nov 1999
TL;DR: This paper addresses the problem of how to partition a set of applications among processors, such that all the individual QoS requirements are met and the total energy consumption is minimized, and exploits the advantages provided by the variable voltage design methodology.
Abstract: Designing systems that provide various quality of service (QoS) guarantees has received a lot of attention due to the increasing popularity of real-time multimedia and wireless communication applications. Meanwhile, low power consumption is always one of the goals of system design, especially for battery-operated systems. With the design trend of integrating multiple processor cores and memory on a single chip, we address the problem of how to partition a set of applications among processors such that all individual QoS requirements are met and the total energy consumption is minimized. We exploit the advantages provided by the variable-voltage design methodology to choose the voltage for each application on the same processor optimally for this purpose. We also discuss how to partition applications among the processors to achieve the same goal. We formulate the problem on an abstract QoS model and present how to allocate resources (e.g., CPU time) and determine the voltage profile for every single processor. Experiments on media benchmarks are also presented.
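A rough intuition for why the partitioning matters (an illustrative model with an assumed cubic power law, not the paper's formulation): if dynamic power grows like s^3 at normalized speed s, a processor finishing load L by deadline T spends energy proportional to L^3/T^2, so by convexity a balanced partition lets every processor run slower and cheaper.

```python
# Illustrative partitioning sketch (assumption: dynamic power ~ s^3 at
# normalized speed s, so a processor finishing load L in time T spends
# energy ~ (L/T)^3 * T = L^3 / T^2). By convexity, balancing load across
# processors minimizes total energy. A greedy heuristic is shown here,
# not the paper's algorithm.

def greedy_partition(loads, n_procs):
    """Assign application loads to processors, largest first, always to
    the least-loaded processor; return per-processor totals."""
    totals = [0.0] * n_procs
    for load in sorted(loads, reverse=True):
        totals[totals.index(min(totals))] += load
    return totals

def energy(totals, deadline):
    """Total energy under the cubic-power model, each processor running
    just fast enough to finish its load by the deadline."""
    return sum((t / deadline) ** 3 * deadline for t in totals)

balanced = greedy_partition([4, 3, 3, 2], 2)    # load split evenly
skewed = [12.0, 0.0]                            # everything on one CPU
saving = energy(skewed, 1.0) / energy(balanced, 1.0)
```

Under this model the balanced split uses a quarter of the energy of piling all work onto one processor, which is the convexity argument in miniature.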

Patent
18 Jun 1999
TL;DR: In this paper, the authors propose a method to estimate the power consumption of a semiconductor integrated circuit when the circuit is designed with reuse of a processor core, etc., by calculating the power consumption needed for a processor to execute a program code from the power consumption of a CPU core part and a cache part.
Abstract: PROBLEM TO BE SOLVED: To provide a method that can quickly estimate the power consumption of a semiconductor integrated circuit when the circuit is designed with reuse of a processor core, etc., by calculating the power consumption needed for a processor to execute a program code from the power consumption of a CPU core part and a cache part. SOLUTION: In a cache part power consumption calculation process 104, the electric power consumed by the cache part is calculated from the power consumption information for every cache operation, in accordance with the operation of the cache part simulated in an instruction reading process 103. In an instruction execution process 105, the processor operation of a read instruction is simulated. In a CPU part power consumption calculation process 106, the power consumption is calculated for the instruction simulated in process 105. In a power consumption calculation process 107, the power consumption necessary for the current execution cycle of a processor model is calculated from the values computed in processes 104 and 106.
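The two-part estimation scheme can be sketched in a few lines (energy numbers and names are illustrative assumptions, not values from the patent): per executed instruction, add a CPU-part energy, plus a cache-part energy for any cache event the instruction triggered.

```python
# Minimal sketch of instruction-level power estimation in the spirit of
# the patent's two-part model. The opcode names and per-event energy
# figures below are illustrative assumptions only.

CPU_ENERGY = {"add": 0.8, "load": 1.2, "store": 1.1}   # nJ per instruction
CACHE_ENERGY = {"hit": 0.5, "miss": 4.0}               # nJ per cache event

def estimate_energy(trace):
    """Sum energy over a simulated trace of (opcode, cache_event) pairs,
    cache_event being 'hit', 'miss', or None for non-memory ops."""
    total = 0.0
    for opcode, cache_event in trace:
        total += CPU_ENERGY[opcode]          # CPU-part contribution
        if cache_event is not None:
            total += CACHE_ENERGY[cache_event]   # cache-part contribution
    return total

trace = [("load", "miss"), ("add", None), ("store", "hit")]
nanojoules = estimate_energy(trace)   # 1.2+4.0 + 0.8 + 1.1+0.5 = 7.6
```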


Proceedings ArticleDOI
07 Apr 1999
TL;DR: The FPPA principle, i.e., fault-tolerant large array of cells interconnected with an asynchronous communication scheme, is applicable on alternative structures for the cell architecture and makes it particularly interesting for portable devices that require quite complex algorithms.
Abstract: The crossbreeding between advanced microprocessor design and Field Programmable Gate Arrays (FPGAs) has produced the Field Programmable Processor Array, code-named FPPA. The first integrated version has been targeted at low-power parallel processing. The FPPA is composed of a 10 × 10 array of RISC microcontrollers offering up to 500 MIPS at 5 MHz for processors (20 MHz for communications). The very low power consumption of the core processor results in a 1 Watt power consumption for the whole array at 5 MHz and makes it particularly interesting for portable devices that require quite complex algorithms. In addition, the FPPA principle, i.e., a fault-tolerant large array of cells interconnected with an asynchronous communication scheme, is applicable to alternative structures for the cell architecture.

Patent
08 Mar 1999
TL;DR: In this paper, a processor core (102) is provided that is a programmable digital signal processor (DSP) with variable instruction length, offering both high code density and easy programming.
Abstract: A processor core (102) is provided that is a programmable digital signal processor (DSP) with variable instruction length, offering both high code density and easy programming. Architecture and instruction set are optimized for low power consumption and high efficiency execution of DSP algorithms, such as for wireless telephones, as well as pure control tasks. A cache (814) located within a megacell on a single integrated circuit (800) is provided to reduce instruction access time. Performance monitoring circuitry (852) is included within the megacell and monitors selected signals to collect benchmark events. The performance monitoring circuitry can be interrogated via a JTAG interface (850). A cache miss signal (816) is provided by the cache to the performance monitoring circuitry in order to determine the performance of the internal cache. Windowing circuitry (824) within the megacell allows benchmark events to be collected during selected windows of execution.

Proceedings ArticleDOI
20 Oct 1999
TL;DR: This processor has a 2-issue VLIW architecture with 64-bit SIMD arithmetic functional units to exploit the instruction-level and subword data parallelism found in multimedia applications and shows a comparable or higher performance when compared to the 8-issue TMS320C62xx.
Abstract: As the complexity of multimedia applications increases, the need for efficient and compiler-friendly processor architectures also grows. In this paper, a new multimedia processor architecture is proposed. This processor has a 2-issue VLIW architecture with 64-bit SIMD arithmetic functional units to exploit the instruction-level and subword data parallelism found in multimedia applications. Moreover, densely encoded instructions supporting memory operands, DSP-like addressing modes, and SIMD capability boost the performance while keeping the code size and hardware cost small. To maximally utilize this architecture, a software environment including a code converter, a VLIW compiler system, and a compiled simulator has also been developed. The processor core has been synthesized for the LSI Logic 0.25 μm library, resulting in a total gate count of 102 K. In spite of the relatively small issue rate, the proposed processor shows comparable or higher performance in terms of both cycle count and code size when compared to the 8-issue TMS320C62xx, for DSP benchmark kernels and an H.263 video encoder.

Patent
04 Aug 1999
TL;DR: In this paper, the cast-out portion of a combined operation including a data access related to the cast-out is canceled, and the combined response logic explicitly directs the storage device initiating the combined operation not to allocate storage for the target of the data access, thus deferring any latency associated with writing the cast-out victim to system memory while maximizing utilization of available storage with acceptable tradeoffs in data access latency.
Abstract: In cancelling the cast out portion of a combined operation including a data access related to the cast out, the combined response logic explicitly directs the storage device initiating the combined operation not to allocate storage for the target of the data access. Instead, the target of the data access may be passed directly to an in-line processor core without storage, may be stored in a horizontal storage device, or may be stored in an in-line, noninclusive, lower level storage device. Cancellation of the cast out thus defers any latency associated with writing the cast out victim to system memory while maximizing utilization of available storage with acceptable tradeoffs in data access latency.

Proceedings ArticleDOI
21 Feb 1999
TL;DR: The dedicated processor called MBP (Memory-Based Processor)-light to manage the DSM of JUMP-1 is introduced, and its preliminary performance with two protocol policies (update/invalidate) is evaluated.
Abstract: A massively parallel processor called JUMP-1 has been developed to build an efficient cache-coherent distributed shared memory (DSM) on a large system with more than 1000 processors. Here, the dedicated processor called MBP (Memory-Based Processor)-light that manages the DSM of JUMP-1 is introduced, and its preliminary performance with two protocol policies, update and invalidate, is evaluated. Simulation results show that simple operations like the tag check and the collection/generation of acknowledgment packets are mostly processed by the hardware mechanisms in MBP-light, without the aid of the core processor, under both policies. Also, the buffer-register architecture adopted by the core processor in MBP-light is exploited effectively in processing protocol transactions for both policies.

Book ChapterDOI
01 Dec 1999
TL;DR: Systems-on-a-chip will be suitable for embedded applications, such as consumer electronics products that perform sophisticated data and information processing, telecommunication equipment that performs video and audio transmission, and control systems for industrial manufacturing, automobiles, and avionics.
Abstract: Due to advancing semiconductor technology, it is becoming possible within ten years to fabricate a highly complex, high-performance VLSI that includes more than a hundred million transistors on a single silicon chip [1]. Such a technology enables so-called systems-on-a-chip, which include CPU cores, DSPs, memory blocks (RAM and ROM), application-specific hardware modules, FPGA blocks, as well as analog and radio-frequency blocks, as shown in Figure 1. Systems-on-a-chip will be suitable for embedded applications, such as consumer electronics products that perform sophisticated data and information processing, telecommunication equipment that performs video and audio transmission, and control systems for industrial manufacturing, automobiles, and avionics.

Proceedings ArticleDOI
23 Aug 1999
TL;DR: In this article, a chip-level redundant self-checking fail-safe microprocessor integrating two synthesizable processor cores was developed using a 0.35 μm CMOS embedded gate array.
Abstract: A chip-level redundant self-checking fail-safe microprocessor has been developed using a 0.35 μm CMOS embedded gate array. The microprocessor integrates two synthesizable processor cores and a self-checking comparator on a single chip. A full-custom processor core was transformed into each of the synthesizable cores for this purpose. Design methodologies suitable for reusing synthesizable processor cores have also been developed. The developed synthesizable processor cores and design methodologies reduce the cost of process migration for the chip; migrating to a newer process improves the performance of the developed microprocessor at low development cost.
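The dual-core-plus-comparator arrangement can be sketched at the behavioral level: both cores receive the same input, and the comparator only lets matching results through. The function name and the error-as-exception convention are assumptions for this sketch, not the chip's actual fail-safe mechanism.

```python
# Behavioral sketch of chip-level redundancy with a self-checking
# comparator: two identical cores execute the same input in lockstep;
# any mismatch forces the system into a fail-safe stop.

def lockstep_step(core_a, core_b, inputs):
    out_a = core_a(inputs)
    out_b = core_b(inputs)
    if out_a != out_b:
        # In hardware this would assert a fail-safe signal; here we
        # model it as an exception that halts further output.
        raise RuntimeError("output mismatch: fail-safe stop")
    return out_a

healthy = lambda x: x + 1          # two identical, correct cores
faulty = lambda x: x + 2           # a core with an injected fault

assert lockstep_step(healthy, healthy, 1) == 2   # agreement passes
```

A single faulty core is detected (`lockstep_step(healthy, faulty, 1)` raises), but note that two cores cannot tell *which* one failed; they can only refuse to emit a possibly wrong result, which is exactly the fail-safe property.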

Proceedings ArticleDOI
18 Jan 1999
TL;DR: Experimental results demonstrate that the system synthesizes processor cores effectively according to the features of an application program/data.
Abstract: A hardware/software cosynthesis system for digital signal processing processor cores has been developed. This paper focuses on the hardware/software partitioning algorithm, one of the key issues in the system. Given an input assembly code generated by the compiler in the system, the proposed partitioning algorithm first determines the types and numbers of hardware units required for a processor core, such as multiple functional units, hardware loop units, and particular addressing units (initial resource allocation). Second, the hardware units determined at initial resource allocation are removed one by one as long as the assembly code still meets a given timing constraint (configuration of a processor core); the execution time of the assembly code becomes longer, but the hardware cost of the processor core that executes it becomes smaller. Finally, the system outputs an optimized assembly code and a processor configuration. Experimental results demonstrate that the system synthesizes processor cores effectively according to the features of an application program and its data.
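The second step described above, shrinking the allocation while the timing constraint still holds, can be sketched as a greedy loop. The cost/penalty model, the removal order, and all unit names here are invented for illustration; the paper's actual algorithm and models may differ.

```python
# Greedy sketch of the unit-reduction step: start from the full initial
# allocation and drop hardware units one at a time while the estimated
# execution time of the assembly code still meets the timing constraint.

def reduce_units(units, exec_time, timing_constraint):
    """units: dict name -> (hw_cost, time_penalty_if_removed).

    Returns the surviving hardware configuration and the final
    estimated execution time.
    """
    config = dict(units)
    # One simple heuristic: try removing the most expensive units first.
    for name, (cost, penalty) in sorted(units.items(),
                                        key=lambda kv: -kv[1][0]):
        if exec_time + penalty <= timing_constraint:
            exec_time += penalty      # the code runs slower...
            del config[name]          # ...but the core gets cheaper
    return config, exec_time

# Hypothetical units for a small DSP core: (hardware cost, cycles lost
# if the unit is removed and its work is done in software instead).
units = {"mac": (100, 30), "hw_loop": (40, 10), "agu": (60, 50)}
config, t = reduce_units(units, exec_time=100, timing_constraint=150)
# "mac" and "hw_loop" are dropped (100 -> 140 cycles, still <= 150);
# removing "agu" would overshoot the constraint, so it survives.
```

The trade-off the abstract states, longer execution time in exchange for a cheaper core, is exactly what each accepted removal performs here.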

Patent
11 May 1999
TL;DR: In this paper, a processor core is retrofitted to support multiple machine states by converting 1-bit flip-flops in storage cells of the stalling vertical thread to N-bit global flip-flops, where N is the number of vertical threads.
Abstract: A processor improves throughput efficiency and exploits increased parallelism by introducing multithreading to an existing and mature processor core. The multithreading is implemented in two steps: vertical multithreading and horizontal multithreading. The processor core is retrofitted to support multiple machine states. System embodiments that exploit retrofitting of an existing processor core advantageously leverage hundreds of man-years of hardware and software development by extending the lifetime of a proven processor pipeline generation. The processor implements N-bit flip-flop global substitution: to implement multiple machine states, it converts 1-bit flip-flops in storage cells of the stalling vertical thread to N-bit global flip-flops, where N is the number of vertical threads.
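The N-bit flip-flop substitution can be pictured behaviorally: each 1-bit state cell becomes an N-entry cell selected by the active vertical-thread id, so one pipeline holds N independent machine states. The class below is a sketch of that idea only; all names are invented, and the patent's actual circuit-level substitution is not modeled.

```python
# Behavioral sketch of "N-bit global flip-flop" retrofitting: a 1-bit
# state cell becomes N copies, one per vertical thread, selected by the
# currently active thread id. Switching threads swaps machine state
# without saving/restoring anything.

class ThreadedRegister:
    def __init__(self, n_threads):
        self.bits = [0] * n_threads   # one copy of the state per thread
        self.active = 0               # currently selected vertical thread

    def switch(self, tid):
        self.active = tid             # e.g. on a long-latency stall

    def read(self):
        return self.bits[self.active]

    def write(self, value):
        self.bits[self.active] = value

reg = ThreadedRegister(n_threads=2)
reg.write(7)        # thread 0's state
reg.switch(1)       # stall on thread 0 -> run thread 1
reg.write(5)        # thread 1's state; thread 0's 7 is untouched
reg.switch(0)
assert reg.read() == 7
```

Because every state cell is replicated this way, a thread switch is just a change of the select index, which is what lets a stalled vertical thread yield the pipeline cheaply.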

Patent
29 Apr 1999
TL;DR: In this article, a cooperative initialization protocol is proposed under which first programming instructions stored on the processor and second programming instructions stored on the system board cooperate to initialize the computer system at power-on/reset.
Abstract: A computer system is provided with a processor and a system board. The processor includes a processor core, at least one other non-processor-core electronic component, and a first non-volatile memory device. The first non-volatile memory device stores first programming instructions that provide initialization support for the at least one other non-processor-core electronic component of the processor. The system board includes at least one non-processor electronic component and a second non-volatile memory device. The second non-volatile memory device stores second programming instructions that provide initialization support for the at least one non-processor electronic component of the system board. Both the first and the second programming instructions further support a cooperative initialization protocol under which they cooperate with each other to initialize the computer system at power-on/reset.

Patent
18 Jun 1999
TL;DR: In this article, a multicore DSP (digital signal processor) circuit is presented that mounts a plurality of DSP cores on one LSI to efficiently increase the number of processing channels.
Abstract: PROBLEM TO BE SOLVED: To provide a multicore DSP (digital signal processor) circuit that mounts a plurality of DSP cores on one LSI and efficiently increases the number of processing channels. SOLUTION: DSPs 5 to 8 execute digital signal processing. A ROM 13 stores the program that operates the DSPs. RAMs 9 to 12 hold the results of digital signal processing by the respective DSPs and also serve as working areas. A clock generator 20 generates a system clock 15 that drives the DSP cores. Program counters PCs 1 to 4 fetch the program that operates the respective DSP cores. A PC clock generator 21 generates a PC clock 14.
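How several cores on one LSI scale the number of processing channels can be sketched with a simple channel-to-core mapping. Round-robin assignment is just one plausible policy chosen for illustration; the patent does not specify this scheme, and the function name is invented.

```python
# Illustrative sketch: distributing processing channels across the DSP
# cores of a multicore LSI. Round-robin gives each core an equal share,
# so the channel count scales with the number of cores.

def assign_channels(n_channels, n_cores):
    """Map each channel to a core id by round-robin."""
    return {ch: ch % n_cores for ch in range(n_channels)}

mapping = assign_channels(n_channels=8, n_cores=4)
# Each of the four cores (like DSPs 5 to 8 above) handles two channels,
# keeping its intermediate results in its own working RAM.
```

Doubling the number of cores on the chip halves the per-core channel load under this policy, which is the efficiency argument the abstract makes for integrating multiple DSP cores on one LSI.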