
Showing papers on "Multi-core processor published in 2001"


Patent
01 Jun 2001
TL;DR: An integrated design environment (IDE) for forming virtual embedded systems is disclosed, which includes a design language for forming finite state machine models of hardware components that are coupled to simulators of processor cores, preferably instruction set accurate simulators.
Abstract: An integrated design environment (IDE) is disclosed for forming virtual embedded systems. The IDE includes a design language for forming finite state machine models of hardware components that are coupled to simulators of processor cores, preferably instruction set accurate simulators. A software debugger interface permits a software application to be loaded and executed on the virtual embedded system. A virtual test bench may be coupled to the simulation to serve as a human-machine interface. In one embodiment, the IDE is provided as a web-based service for the evaluation, development and procurement phases of an embedded system project. IP components, such as processor cores, may be evaluated using a virtual embedded system. In one embodiment, a virtual embedded system is used as an executable specification for the procurement of a good or service related to an embedded system.

231 citations


Proceedings ArticleDOI
01 Jul 2001
TL;DR: This work proposes a fault-tolerant approach to reliable microprocessor design that provides significant resistance to core processor design errors and operational faults such as supply voltage noise and energetic particle strikes, and shows through cycle-accurate simulation and timing analysis of a physical checker design that it preserves system performance while keeping area overheads and power demands low.
Abstract: We propose a fault-tolerant approach to reliable microprocessor design. Our approach, based on the use of an online checker component in the processor pipeline, provides significant resistance to core processor design errors and operational faults such as supply voltage noise and energetic particle strikes. We show through cycle-accurate simulation and timing analysis of a physical checker design that our approach preserves system performance while keeping area overheads and power demands low. Furthermore, analyses suggest that the checker is a fairly simple state machine that can be formally verified, scaled in performance, and reused. Further simulation analyses show virtually no performance impacts when our simple checker design is coupled with a high-performance microprocessor model. Timing analyses indicate that a fully synthesized unpipelined 4-wide checker component in 0.25 μm technology is capable of checking Alpha instructions at 288 MHz. Physical analyses also confirm that costs are quite modest; our prototype checker requires less than 6% of the area and 1.5% of the power of an Alpha 21264 processor in the same technology. Additional improvements to the checker component are described which allow for improved detection of design, fabrication and operational faults.
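The online-checker scheme summarized above can be sketched at a very high level. The following Python fragment is a hedged illustration only, not the paper's checker design: the three-operation golden model and all names are assumptions. It captures just the compare-and-recover control flow, in which a small trusted model re-executes each retired instruction and any mismatch with the core's result triggers a recovery.

```python
# Illustrative sketch of an online checker: a small, trusted functional
# model re-executes each instruction the core wants to commit.
# Assumption-laden toy, not the paper's actual checker.

def checker_execute(op, a, b):
    """Tiny golden model covering a few example 32-bit ALU ops."""
    if op == "add":
        return (a + b) & 0xFFFFFFFF
    if op == "sub":
        return (a - b) & 0xFFFFFFFF
    if op == "and":
        return a & b
    raise ValueError(f"unknown op: {op}")

def commit_with_check(retired):
    """Re-check retired instructions before commit.

    `retired` yields (op, a, b, core_result) tuples from the
    (possibly faulty) core pipeline.  Returns the committed values
    and the number of recoveries triggered by mismatches.
    """
    committed, recoveries = [], 0
    for op, a, b, core_result in retired:
        golden = checker_execute(op, a, b)
        if golden != core_result:
            recoveries += 1        # mismatch: flush and restart the core
        committed.append(golden)   # the checker's result is committed
    return committed, recoveries
```

In the paper's design the checker is a simple state machine placed near commit that can be formally verified; the sketch above only conveys the idea that correctness rests on the checker, not on the complex core.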

154 citations


Patent
02 Nov 2001
TL;DR: In this article, a multi-processor network processing environment is provided in which parallel processing may occur, while still maintaining ordered serialization between the input and the output of the network processor.
Abstract: A multi-processor network processing environment is provided in which parallel processing may occur. In one embodiment, a network processor having multiple processor cores may be utilized. Parallel processing at the front end of the network processor is encouraged while still maintaining ordered serialization between the input and the output of the network processor. The disclosed order serialization techniques obtain the benefits of parallel processing at the front end of the system while minimizing blocking times at the output.

137 citations


Patent
28 Dec 2001
TL;DR: In this article, a reconfigurable channel CODEC (encoder and decoder) processor for a wireless communication system is presented, which includes processor cores (210, 250) and algorithm-specific kernels (212, 214, 216, 252, 254, 256).
Abstract: A reconfigurable channel CODEC (encoder and decoder) processor for a wireless communication system is disclosed. A high degree of user programmability and reconfigurability is provided by the channel CODEC processor (200). In particular, the reconfigurable channel CODEC processor includes processor cores (210, 250) and algorithm-specific kernels (212, 214, 216, 252, 254, 256) that contain logic circuits tailored for carrying out predetermined but user-configurable decoding and encoding algorithms. The interconnects (230, 270) between the processor cores and the algorithm-specific kernels are also user-configurable. Thus, the same hardware can be reconfigured for many different wireless communication standards.

96 citations


Proceedings ArticleDOI
25 Apr 2001
TL;DR: OCAPI-xl is developed, a methodology in which the HW/SW partitioning decision can be made anywhere in the design flow, even just prior to doing code-generation for both HW and SW, made possible thanks to a refinable, implementable, architecture independent system description.
Abstract: The implementation of embedded networked appliances requires a mix of processor cores and HW accelerators on a single chip. When designing such complex and heterogeneous SoCs, the HW/SW partitioning decision traditionally needs to be made prior to refining the system description. With OCAPI-xl, we developed a methodology in which the partitioning decision can be made anywhere in the design flow, even just prior to code generation for both HW and SW. This is made possible by a refinable, implementable, architecture-independent system description. The OCAPI-xl model was used to develop a stand-alone, networked camera with an on-board GIF engine and network layer.

74 citations


Journal ArticleDOI
TL;DR: This work describes the use of a reconfigurable processor core based on a RISC architecture as a starting point for application-specific processor design and shows how hardware emulation based on programmable logic can be integrated into the hardware/software codesign flow.
Abstract: Application-specific processors offer an attractive option in the design of embedded systems by providing high performance for a specific application domain. In this work, we describe the use of a reconfigurable processor core based on a RISC architecture as a starting point for application-specific processor design. By using a common base instruction set, development cost can be reduced and design space exploration is focused on the application-specific aspects of performance. An important aspect of deploying any new architecture is verification, which usually requires lengthy software simulation of a design model. We show how hardware emulation based on programmable logic can be integrated into the hardware/software codesign flow. While hardware emulation previously required massive investment in design effort and special-purpose emulators, an emulation approach based on high-density field-programmable gate array (FPGA) devices now makes hardware emulation practical and cost effective for embedded processor designs. To reduce development cost and avoid duplication of design effort, FPGA prototypes and ASIC implementations are derived from a common source: we show how to perform targeted optimizations to fully exploit the capabilities of the target technology while maintaining a common source base.

70 citations


Proceedings ArticleDOI
29 Mar 2001
TL;DR: A novel test methodology for testing IP cores in SoCs with embedded processor cores that supports at-speed testing for delay faults and stuck-at testing of IP cores implementing full-scan is presented.
Abstract: We present a novel test methodology for testing IP cores in SoCs with embedded processor cores. A test program is run on the processor core that generates and delivers test patterns to the target IP cores in the SoC and analyzes the test responses. This provides tremendous flexibility in the type of patterns that can be applied to the IP cores without incurring significant hardware overhead. We use a bus based SoC simulation model to validate our test methodology. The test methodology involves addition of a test wrapper that can be configured for specific test needs. The methodology supports at-speed testing for delay faults and stuck-at testing of IP cores implementing full-scan.

64 citations


Patent
10 Dec 2001
TL;DR: In this article, the address depth of the global block RAM and the number of wait states of the local block RAM are selected by a user, and they can be set either prior to configuration of the FPGA or programmed using instructions of the processor core.
Abstract: A data processing system having a user configurable memory controller, local block RAMs, global block RAMs and a processor core can be configured in a single field programmable gate array (FPGA). The address depth of the global block RAMs and the number of wait states can be selected by a user, and they can be set either prior to configuration of the FPGA or programmed using instructions of the processor core. The number of wait states of the local block RAM is also user selectable. An algorithm that can optimize the address depth and the number of wait states to achieve a performance level is also disclosed. The present invention can be applied to designs having separate instruction and data sides.

62 citations


Patent
22 May 2001
TL;DR: In this article, a network interface unit is presented including a microcontroller having multiple blocks of programmable logic that are variably configurable to perform selected functions, such that it may be configured to assemble, transmit, and receive data units (i.e., frames) of one communication protocol, then later reconfigured to assemble, transmit, and receive frames of another protocol.
Abstract: A network interface unit is presented including a microcontroller having multiple blocks of programmable logic that are variably configurable to perform selected functions. The network interface unit may be configured to assemble, transmit, and receive data units (i.e., frames) of one communication protocol, then later reconfigured to assemble, transmit, and receive frames of another protocol. The microcontroller includes several components formed upon a single monolithic semiconductor substrate, among them an execution unit. The execution unit includes a processor core and multiple configurable logic blocks (CLBs) coupled to the processor core. The processor core is configured to execute instructions, for example x86 instructions. Each of the multiple CLBs includes programmable logic which may be, for example, PLA circuitry, PAL circuitry, or FPGA circuitry. The programmable logic includes programmable switching elements such as, for example, EPROM elements, EEPROM elements, or SRAM elements. During instruction execution, the processor core produces output signals. During a programming operation, the output signals include programming signals which configure the programmable logic within one or more of the multiple CLBs to perform selected functions. Once programmed, each CLB performs the selected function in response to output signals produced by the processor core. The network interface unit also includes one or more memory devices and an electrical interface unit. The one or more memory devices store instructions and data used by the processor core. The electrical interface unit is adapted for coupling to the network transmission medium and performs as an interface between the microcontroller and the network transmission medium.

59 citations


Proceedings ArticleDOI
13 Mar 2001
TL;DR: A deterministic software-based self-testing methodology for processor cores is introduced that efficiently tests the processor datapath modules without any modification of the processor structure to provide high fault coverage without repetitive fault simulation experiments.
Abstract: A deterministic software-based self-testing methodology for processor cores is introduced that efficiently tests the processor datapath modules without any modification of the processor structure. It provides a guaranteed high fault coverage without the repetitive fault simulation experiments that are necessary in pseudorandom software-based processor self-testing approaches. Test generation and output analysis are performed by utilizing the processor functional modules, like accumulators (the arithmetic part of the ALU) and shifters (if they exist), through processor instructions. No extra hardware is required and there is no performance degradation.
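The flavor of instruction-based self-test can be conveyed with a small sketch. Everything below is a hypothetical illustration, not the paper's method: the 16-bit ALU stand-in, the pattern set, and the rotate-and-XOR compaction are all assumptions. The point is that a test program applies deterministic patterns through ordinary instructions and compacts the responses into a signature using only the ALU and shifter themselves.

```python
# Hypothetical software-based self-test sketch: deterministic patterns
# are applied to a datapath module via normal instructions, and the
# responses are compacted into a signature using only ALU/shifter ops.

def alu_under_test(a, b):
    # Stand-in for the hardware adder exercised by ADD instructions.
    return (a + b) & 0xFFFF

def run_self_test(patterns, golden_signature):
    """Return True iff the compacted response signature matches."""
    signature = 0
    for a, b in patterns:                   # deterministic, precomputed
        response = alu_under_test(a, b)
        # Rotate-left-by-1 then XOR: compaction doable with a shifter
        # and the ALU, in the spirit of instruction-based testing.
        signature = ((signature << 1) | (signature >> 15)) & 0xFFFF
        signature ^= response
    return signature == golden_signature

# Example deterministic pattern set (illustrative only).
patterns = [(0x0000, 0x0000), (0xFFFF, 0x0001), (0xAAAA, 0x5555)]
```

Because the patterns and the expected signature are computed deterministically ahead of time, no fault simulation loop is needed at test-application time, which matches the methodology's stated advantage over pseudorandom approaches.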

55 citations


Proceedings ArticleDOI
17 Jun 2001
TL;DR: Detailed analysis indicates that the dominant overheads in an implicitly-threaded CMP are speculation state overflow due to limited L1 cache capacity, and load imbalance and data dependences in fine-grain threads.
Abstract: Recent proposals for Chip Multiprocessors (CMPs) advocate speculative, or implicit, threading in which the hardware employs prediction to peel off instruction sequences (i.e., implicit threads) from the sequential execution stream and speculatively executes them in parallel on multiple processor cores. These proposals augment a conventional multiprocessor, which employs explicit threading, with the ability to handle implicit threads. Current proposals focus on only implicitly-threaded code sections. This paper identifies, for the first time, the issues in combining explicit and implicit threading. We present the Multiplex architecture to combine the two threading models. Multiplex exploits the similarities between implicit and explicit threading, and provides unified support for the two threading models without additional hardware. Multiplex groups a subset of protocol states in an implicitly-threaded CMP to provide a write-invalidate protocol for explicit threads. Using a fully-integrated compiler infrastructure for automatic generation of Multiplex code, this paper presents a detailed performance analysis for entire benchmarks, instead of just implicitly-threaded sections, as done in previous papers. We show that neither threading model alone performs consistently better than the other across the benchmarks. A CMP with four dual-issue CPUs achieves a speedup of 1.48 and 2.17 over one dual-issue CPU, using implicit-only and explicit-only threading, respectively. Multiplex matches or outperforms the better of the two threading models for every benchmark, and a four-CPU Multiplex achieves a speedup of 2.63. Our detailed analysis indicates that the dominant overheads in an implicitly-threaded CMP are speculation state overflow due to limited L1 cache capacity, and load imbalance and data dependences in fine-grain threads.

Patent
Hara Katsuhiko
12 Jul 2001
TL;DR: System Bus Bridge (SBB) as mentioned in this paper is a multichannel bidirectional bus bridge and provides a mutual connection between a B bus (I/O bus), a G bus (graphic bus), an SC bus (processor bus), and an MC bus (local bus) by using a crossbar switch.
Abstract: The invention relates to an image processing apparatus that has an image-data coding function and inputs and outputs images, and to a method for such an apparatus; its object is to perform coding of the image data simultaneously in parallel in both a CPU and a coder/decoder. A system bus bridge (SBB) is a multichannel bidirectional bus bridge and provides a mutual connection between a B bus (I/O bus), a G bus (graphic bus), an SC bus (processor bus), and an MC bus (local bus) by using a crossbar switch. The connections of two systems can be established simultaneously by the crossbar switch. A high-speed data transfer of high parallel performance can be realized among a CPU core, a CODEC, and a DRAM.

Proceedings ArticleDOI
16 Nov 2001
TL;DR: SimBed, an execution-driven simulation testbed that measures the execution behavior and power consumption of embedded applications and RTOSs by executing them on an accurate architectural model of a microcontroller with simulated real-time stimuli is presented.
Abstract: This paper presents the modeling of embedded systems with SimBed, an execution-driven simulation testbed that measures the execution behavior and power consumption of embedded applications and RTOSs by executing them on an accurate architectural model of a microcontroller with simulated real-time stimuli. We briefly describe the simulation environment and present a study that compares three RTOSs: μC/OS-II, a popular public-domain embedded real-time operating system; Echidna, a sophisticated, industrial-strength (commercial) RTOS; and NOS, a bare-bones multi-rate task scheduler reminiscent of typical "roll-your-own" RTOSs found in many commercial embedded systems. The microcontroller simulated in this study is the Motorola M-CORE processor: a low-power, 32-bit CPU core with 16-bit instructions, running at 20 MHz.

Journal ArticleDOI
TL;DR: Data cache and direct memory access address the challenge of transferring data between off- and on-chip memories without slowing down the core processor's performance.
Abstract: Mediaprocessors provide high performance by using both instruction- and data-level parallelism. Because of the increased computing power, transferring data between off- and on-chip memories without slowing down the core processor's performance is challenging. Two methods, data cache and direct memory access, address this problem in different ways.

Patent
22 Jun 2001
TL;DR: In this article, a system and method are presented for enabling multithreading in an embedded processor, invoking zero-time context switching in a multithreading environment, scheduling multiple threads to permit numerous hard real-time and non-real-time priority levels, fetching data and instructions from multiple memory blocks in a multithreading environment, and enabling a particular thread to modify the multiple states of the multiple threads in the processor core.

Abstract: A system and method for enabling multithreading in an embedded processor, invoking zero-time context switching in a multithreading environment, scheduling multiple threads to permit numerous hard real-time and non-real-time priority levels, fetching data and instructions from multiple memory blocks in a multithreading environment, and enabling a particular thread to modify the multiple states of the multiple threads in the processor core.

Patent
05 Dec 2001
TL;DR: In this article, a semiconductor integrated circuit is provided with an instruction memory, an instruction storage portion that stores reserved instructions as F instructions and stores processing contents substantially equivalent to the F instructions as substitute instructions for processing by the CPU, a pre-fetch portion, a history storage portion, a diagnosing portion for diagnosing the types of instructions, and a reprogramming control portion for reprogramming the instructions.
Abstract: A semiconductor integrated circuit can send and receive signals to and from a configuration memory. The semiconductor integrated circuit is provided therein with an instruction memory, an instruction storage portion that stores reserved instructions as F instructions and stores processing contents substantially equivalent to the F instructions as substitute instructions for processing by the CPU, a pre-fetch portion, a history storage portion, a diagnosing portion for diagnosing the types of instructions, a reprogramming control portion for reprogramming the instructions, a CPU, an FPGA, a configuration data memory, a built-in memory, and a configuration data tag. When the configuration data of an F instruction does not exist in the FPGA, processing substantially equivalent to that of the FPGA is executed by the CPU by making use of the substitute instructions.

Patent
27 Jul 2001
TL;DR: In this paper, the address depth of global block RAMs and the number of wait states can be selected by a user, and they can be set either prior to configuration of the FPGA or programmed using instructions of the processor core.
Abstract: A data processing system having a user configurable memory controller, one or more local block RAMs, one or more global block RAMs and a processor core can be configured in a single field programmable gate array (FPGA). The address depth of the global block RAMs and the number of wait states can be selected by a user, and they can be set either prior to configuration of the FPGA or programmed using instructions of the processor core. The number of wait states of the local block RAM is also user selectable. An algorithm that can optimize the address depth and the number of wait states to achieve a performance level is also disclosed. The present invention can be applied to designs having separate instruction and data sides.

Patent
12 Nov 2001
TL;DR: In this article, a multi-core digital signal processor with a shared program memory (132), an emulation logic module (141), and multiple processor cores (11, 21) are disclosed.
Abstract: A multi-core digital signal processor is disclosed having a shared program memory (132) with conditional write protection. In one embodiment, the digital signal processor includes a shared program memory (132), an emulation logic module (141), and multiple processor cores (11, 21) each coupled to the shared program memory (132) by corresponding instruction buses (P1, P2). The emulation logic module (141) preferably determines the operating modes of each of the processors, e.g., whether they are operating in a normal mode or an emulation mode. In the emulation mode, the emulation logic can alter the states of various processor hardware and the contents of various registers and memory. The instruction buses (P1, P2) each include a read/write signal that, while their corresponding processor cores (11, 21) are in normal mode, is maintained in a read state. On the other hand, when the processor cores (11, 21) are in the emulation mode, the processor cores (11, 21) are allowed to determine the state of the instruction bus read/write signals. Each instruction bus read/write signal is preferably generated by a logic gate that prevents the processor core (11, 21) from affecting the read/write signal value in normal mode, but allows the processor core to determine the read/write signal value in emulation mode. In this manner, the logic gate prevents write operations to the shared program memory (132) when the emulation logic (141) de-asserts a signal indicative of emulation mode, and allows write operations to the shared program memory (132) when the emulation logic (141) asserts the signal indicative of emulation mode. The logic gate is preferably included in a bus interface module (31) in each processor core (11, 21).

Patent
28 Mar 2001
TL;DR: In this article, a performance monitor system includes a core processor (115), a core processor associated device such as a cache (123), and first logic (127) coupled to the device (123) that monitors a first signal (CACHE_PERF) in response to a second signal (WPT0,1).
Abstract: A performance monitor system includes a core processor (115), a core processor associated device, such as a cache (123), and first logic, such as performance logic (127). The core processor (115) is operable to execute information. The core processor associated device provides a first signal (CACHE_PERF), which defines performance of the core processor associated device (123) during operation of the core processor (115). The first logic (127) is coupled to the core processor associated device (123) and monitors the first signal (CACHE_PERF) in response to a second signal (WPT0,1), which defines a match of user-settable attributes associated with the operation of the core processor (115).

Patent
25 Jun 2001
TL;DR: In this paper, a data processing system that supports execution of both native instructions using a processor core and non-native instructions that are interpreted using either a hardware translator or a software interpreter is presented.
Abstract: A data processing system 118 is provided that supports execution of both native instructions using a processor core and non-native instructions that are interpreted using either a hardware translator 122 or a software interpreter. Separate explicit return to non-native instructions and return to native instructions are provided for terminating subroutines whereby intercalling between native and non-native code may be achieved with reduced processing overhead. Veneer non-native subroutines may be used between native code and non-native main subroutines. The veneer non-native subroutines may be dynamically created within the stack memory region of the native mode system.

Proceedings ArticleDOI
30 Sep 2001
TL;DR: An application specific multiprocessor system for SAT, utilizing the most recent results such as the development of highly efficient sequential SAT algorithms, the emergence of commercial configurable processor cores and the rapid progress in IC manufacturing techniques is presented.
Abstract: This paper presents our work in developing an application specific multiprocessor system for SAT, utilizing the most recent results such as the development of highly efficient sequential SAT algorithms, the emergence of commercial configurable processor cores and the rapid progress in IC manufacturing techniques. Based on an analysis of the basic SAT search algorithm, we propose a new parallel SAT algorithm that utilizes fine grain parallelism. This is then used to design a multiprocessor architecture in which each processing node consists of a processor and a communication assist node that deals with message processing. Each processor is an application specific processor built from a commercial configurable processor core. All the system configurations are determined based on the characteristics of SAT algorithms, and are supported by simulation results. While this hardware accelerator system does not change the inherent intractability of the SAT problems, it achieves a 30-60x speedup over and above the fastest known SAT solver - Chaff. We believe that this system can be used to expand the practical applicability of SAT in all its application areas.

Proceedings ArticleDOI
17 Dec 2001
TL;DR: An integrity checking architecture for superscalar processors that can achieve fault tolerance capability of a duplex system at much less cost than the traditional duplication approach is proposed.
Abstract: The paper proposes an integrity checking architecture for superscalar processors that can achieve the fault-tolerance capability of a duplex system at much less cost than the traditional duplication approach. The pipeline of the CPU core (P-pipeline) is combined in series with another pipeline (V-pipeline), which re-executes instructions processed in the P-pipeline. Operations in the two pipelines are compared and any mismatch triggers the recovery process. The V-pipeline design is based on replication of the P-pipeline, and is minimized in size and functionality by taking advantage of control flow and data dependency resolved in the P-pipeline. Idle cycles propagated from the P-pipeline become extra time for the V-pipeline to keep up with program re-execution. For a large-scale superscalar processor, the proposed architecture can bring up to a 61.4% reduction in die area while the average execution-time increase is 0.3%.

Patent
21 Mar 2001
TL;DR: A low power reconfigurable processor core includes one or more processing units, each unit having a clock input that controls the performance of the unit; a high-density memory array core coupled to the processing units.
Abstract: A low power reconfigurable processor core includes one or more processing units, each unit having a clock input that controls the performance of the unit; one or more clock controllers having clock outputs coupled to the clock inputs of the processing units, the controllers operating to vary the clock frequency of each processing unit to optimize speed and processing power for a task; and a high-density memory array core coupled to the processing units.

Journal ArticleDOI
TL;DR: A dynamic instruction steering logic for clustered architectures that decides at decode time the cluster where each instruction is executed, achieving an average speed-up of 35% over a conventional 8-way issue (4 int + 4 fp) machine and outperforming other previous proposals, either static or dynamic.

Abstract: Recent works show that delays introduced in the issue and bypass logic will become critical for wide-issue superscalar processors. One of the proposed solutions is clustering the processor core. Clustered architectures benefit from a less complex, partitioned processor core and thus incur less critical delays. In this paper, we propose a dynamic instruction steering logic for these clustered architectures that decides at decode time the cluster where each instruction is executed. The performance of clustered architectures depends on the inter-cluster communication overhead and the workload balance. We present a scheme that uses runtime information to optimize the trade-off between these figures. The evaluation shows that this scheme can achieve an average speed-up of 35% over a conventional 8-way issue (4 int + 4 fp) machine and that it outperforms other previous proposals, either static or dynamic.
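The steering trade-off described above (inter-cluster communication overhead versus workload balance) can be sketched as a simple decode-time policy. The sketch below is a hedged illustration in the general spirit of such schemes, not the paper's actual logic; the imbalance threshold, the voting rule, and all data structures are invented for illustration.

```python
# Illustrative dynamic steering policy: prefer the cluster that
# produces most of an instruction's source operands (fewer
# inter-cluster communications), unless the clusters' workloads are
# too imbalanced, in which case rebalance.  Purely a sketch.

def steer(sources, producer_cluster, load, imbalance_limit=4):
    """Pick a cluster for one decoded instruction.

    sources          : registers read by the instruction
    producer_cluster : dict mapping register -> producing cluster
    load             : pending-instruction count per cluster
    """
    if max(load) - min(load) > imbalance_limit:
        return load.index(min(load))       # rebalance the workload
    votes = [0] * len(load)
    for reg in sources:
        cluster = producer_cluster.get(reg)
        if cluster is not None:
            votes[cluster] += 1            # avoid inter-cluster bypasses
    # Most operand votes wins; ties broken toward the lighter cluster.
    return max(range(len(load)), key=lambda c: (votes[c], -load[c]))
```

Using runtime information in this way lets the steering logic favor locality when the clusters are balanced, and fall back to load balancing when they drift apart, which is the trade-off the abstract highlights.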

Proceedings ArticleDOI
04 Nov 2001
TL;DR: The heart of the proposed design exploration framework is a two-level simulation engine that combines detailed simulation for critical portions of the code with fast profiling for the rest, which is completely general and applicable to any microarchitectural power/performance simulation engine.
Abstract: This paper presents an efficient design exploration environment for high-end core processors. The heart of the proposed design exploration framework is a two-level simulation engine that combines detailed simulation for critical portions of the code with fast profiling for the rest. Our two-level simulation methodology relies on the inherent clustered structure of application programs and is completely general and applicable to any microarchitectural power/performance simulation engine. The proposed simulation methodology is 3-17× faster, while being sufficiently accurate (within 5%) when compared to the fully detailed simulator. The design exploration environment is able to vary different microarchitectural configurations and find the optimal one as far as the energy×delay product is concerned in a matter of minutes. The parameters that are found to affect the core processor power/performance metrics most drastically are issue width, instruction window size, and pipeline depth, along with correlated clock frequency. For very high-end configurations for which balanced pipelining may not be possible, opportunities for running faster stages at lower voltage exist. In such cases, by using up to 3 voltage levels, the energy×delay product is reduced by 23-30% when compared to the single-voltage implementation.
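The two-level idea above (detailed simulation for critical code, fast profiling for the rest) can be rendered as a toy estimator. This is an assumed, simplified sketch: the phase labels, the per-phase CPI model, and the function names are invented for illustration and are not the paper's engine.

```python
# Sketch of two-level simulation: simulate one representative interval
# per program phase in detail to learn its CPI, then fast-profile all
# intervals and extrapolate cycles from the learned per-phase CPI.

def two_level_estimate(intervals, detailed_sim, fast_profile):
    """Estimate total cycles for a run split into (phase, data) intervals.

    detailed_sim(data) -> CPI (slow; called once per distinct phase)
    fast_profile(data) -> instruction count (cheap; called everywhere)
    """
    phase_cpi = {}
    total_cycles = 0
    for phase, data in intervals:
        if phase not in phase_cpi:
            phase_cpi[phase] = detailed_sim(data)   # slow path, once
        total_cycles += fast_profile(data) * phase_cpi[phase]
    return total_cycles
```

Because the slow simulator runs only once per distinct phase while every interval is merely profiled, the speedup grows with how strongly the program's behavior clusters into repeating phases, which is the property the methodology relies on.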

Patent
01 Mar 2001
TL;DR: In this article, a method of deallocating multiple processor cores sharing a failing bank of memory is proposed, which allows new multiple-processor integrated circuits with on-chip shared memory to be de-allocated using existing technology designed for use with single-processor integrated circuit technology.
Abstract: A method of de-allocating multiple processor cores sharing a failing bank of memory is disclosed. The method allows new multiple-processor integrated circuits with on-chip shared memory to be de-allocated using existing technology designed for use with single-processor integrated circuit technology.

Journal ArticleDOI
01 Sep 2001
TL;DR: The mAgic-FPU core architecture satisfies the requirement of portability among silicon foundries and fits the requirements of 'Smart Antenna for Adaptive Beam-Forming processing' and 'Physical Sound Synthesis'.
Abstract: mAgic-FPU is the architecture of a family of VLIW cores for configurable system-level integration of floating- and fixed-point computing power. mAgic customization permits the designer to tune basic parameters, such as the computing power/memory access ratio of the core processor, the number of arithmetic operations available per cycle, and the register file size and number of ports, as well as the number of arithmetic operators. The reconfiguration (e.g., of register file size and number of ports, as well as of the number of arithmetic operators) is supported by the software environment MADE (Modular VLIW processor Architecture and Assembler Description Environment). MADE reads an architecture description file and produces a customized assembler-scheduler for the target VLIW architecture, configuring a general-purpose VLIW optimizer-scheduler engine. The mAgic-FPU core architecture satisfies the requirement of portability among silicon foundries. The first members of the mAgic-FPU core family fit the requirements of 'Smart Antenna for Adaptive Beam-Forming processing' and 'Physical Sound Synthesis'. The first 1 GigaFlops mAgic core will run at 100 MHz within an area of 40 mm² in 0.25 μm ATMEL CMOS technology in the first half of 2002.

Patent
22 Aug 2001
TL;DR: In this article, the virtual machine interpreter identifies an initial virtual machine instruction from a body of virtual machine instructions, where the body is expected to be executed repeatedly and writes native instructions for the body into the memory from said memory location.
Abstract: A data processing system has a processor core, memory and a virtual machine interpreter. The virtual machine interpreter receives virtual machine instructions selected dependent on program flow during execution of a virtual machine program. The virtual machine interpreter generates native machine instructions that implement the virtual machine instructions for execution by the processor core. The virtual machine interpreter identifies an initial virtual machine instruction from a body of virtual machine instructions, where the body is expected to be executed repeatedly. The virtual machine interpreter records a correspondence between the initial virtual machine instruction in the body and a memory location in the memory, and writes native instructions for the body into the memory from said memory location. The processor core executes the native instructions for the body and, on subsequent iterations, repeats execution by running the written native machine instructions from memory starting at said memory location.
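The mechanism resembles a simple translation cache: interpret a loop body once, record a translation keyed to the body's start, and dispatch to the cached form on later iterations. As an analogy only (the bytecode, the Translation record, and the use of a summed delta standing in for real generated native code are all invented here), the dispatch structure might look like:

```c
enum { OP_ADD, OP_END };
typedef struct { int op; int arg; } VmInsn;

/* Stand-in for recorded native code: just the loop body's net effect. */
typedef struct { int valid; long delta; } Translation;

/* Interpret the body on the first iteration, record a "translation",
 * then reuse it on every later iteration instead of re-interpreting. */
static long run_loop(const VmInsn *body, int reps) {
    Translation t = {0, 0};
    long acc = 0;
    for (int r = 0; r < reps; r++) {
        if (t.valid) { acc += t.delta; continue; }  /* cached fast path */
        long delta = 0;
        for (const VmInsn *p = body; p->op != OP_END; p++)
            delta += p->arg;                         /* interpret once */
        t.valid = 1;
        t.delta = delta;                             /* record correspondence */
        acc += delta;
    }
    return acc;
}
```

In the patent the cached artifact is real native machine code written to a recorded memory location and executed directly by the processor core; the cached delta above only mimics the interpret-once / reuse-thereafter control flow.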

Patent
23 Feb 2001
TL;DR: In this article, an application specific signal processor (ASSP) performs vectorized and nonvectorized operations using a saturated multiplication and accumulation operation, which is used in telecommunication interface devices such as a gateway.
Abstract: An application specific signal processor (ASSP) performs vectorized and nonvectorized operations. Nonvectorized operations may be performed using a saturated multiplication and accumulation operation. The ASSP includes a serial interface, a buffer memory, and a core processor for performing digital signal processing, which comprises a reduced instruction set computer (RISC) processor and four signal processing units. The four signal processing units execute the digital signal processing algorithms in parallel, including execution of the saturated multiplication and accumulation operation. The ASSP is utilized in telecommunication interface devices such as a gateway, and is well suited to handling voice and data compression/decompression in telecommunication systems where a packetized network is used to transceive packetized data and voice.
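A saturated multiply-accumulate clamps its result at the accumulator's range instead of wrapping on overflow, which avoids the gross distortion that wraparound causes in voice processing. The exact ASSP operand widths are not given in the abstract; the 16 × 16 → 32-bit sketch below is a common DSP convention, assumed here for illustration.

```c
#include <stdint.h>

/* Saturated MAC: acc + a*b, clamped to the int32_t range.
 * Operand widths are assumed, not taken from the patent. */
static int32_t sat_mac(int32_t acc, int16_t a, int16_t b) {
    int64_t r = (int64_t)acc + (int64_t)a * (int64_t)b;  /* widen first */
    if (r > INT32_MAX) return INT32_MAX;  /* clamp positive overflow */
    if (r < INT32_MIN) return INT32_MIN;  /* clamp negative overflow */
    return (int32_t)r;
}
```

Computing in a widened 64-bit intermediate before clamping is the usual way to make the saturation test itself overflow-free.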

Proceedings ArticleDOI
29 Mar 2001
TL;DR: Experimental results show that, for testing interconnects between a processor core and any other on-chip core, a 3 K-byte program is sufficient to achieve the complete coverage for crosstalk-induced logical and delay faults.
Abstract: Crosstalk effects on long interconnects are becoming significant for high-speed circuits. This paper addresses the problem of testing crosstalk-induced faults on on-chip buses in system-on-a-chip (SOC) designs. We propose a method to self-test on-chip buses at-speed, by executing an automatically synthesized program using on-chip processor cores. The test program, executed at system operational speed, can activate and capture the worst-case crosstalk effects on buses and achieve complete coverage of crosstalk-induced logical and delay faults. This paper discusses the method and the framework for synthesizing such a test program. Based on the bus protocol, the instruction set architecture of an on-chip processor core, and the system specification, the method generates deterministic tests in the form of instruction sequences. The synthesized test program is highly modularized and compact. The experimental results show that, for testing interconnects between a processor core and any other on-chip core, a 3-Kbyte program is sufficient to achieve complete coverage of crosstalk-induced logical and delay faults.
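For a single victim line, a worst-case coupling stimulus is commonly two consecutive bus words in which the victim transitions one way while every aggressor transitions the other. The helper below builds such a pair for an assumed 32-bit bus; it only illustrates the shape of the test vectors, not the paper's instruction-sequence synthesis that actually places them on the bus.

```c
#include <stdint.h>

/* Two consecutive 32-bit bus words: the victim bit falls 1->0 while all
 * aggressor bits rise 0->1, maximizing opposing-transition coupling on
 * the victim (the symmetric rising-victim case just swaps the words). */
static void worst_case_pair(int victim, uint32_t *first, uint32_t *second) {
    uint32_t vmask = 1u << victim;
    *first  = vmask;    /* victim high, all aggressors low */
    *second = ~vmask;   /* victim falls, all aggressors rise */
}
```

Covering every line then takes one such pair (plus its mirror) per bus bit, which is consistent with the compact, modular test programs the paper reports.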