
Showing papers on "Multi-core processor" published in 2005


Journal ArticleDOI
Herb Sutter, James R. Larus
TL;DR: The introductory article in this issue describes the hardware imperatives behind this shift in computer architecture from uniprocessors to multicore processors, also known as CMPs.
Abstract: Concurrency has long been touted as the "next big thing" and "the way of the future," but for the past 30 years, mainstream software development has been able to ignore it. Our parallel future has finally arrived: new machines will be parallel machines, and this will require major changes in the way we develop software. The introductory article in this issue describes the hardware imperatives behind this shift in computer architecture from uniprocessors to multicore processors, also known as CMPs.

565 citations


Journal ArticleDOI
TL;DR: Chip makers AMD, IBM, Intel, and Sun are now introducing multicore chips for servers, desktops, and laptops, which improve overall performance by handling more work in parallel.
Abstract: Computer performance has been driven largely by decreasing the size of chips while increasing the number of transistors they contain. In accordance with Moore's law, this has caused chip speeds to rise and prices to drop. This ongoing trend has driven much of the computing industry for years. Manufacturers are building chips with multiple cooler-running, more energy-efficient processing cores instead of one increasingly powerful core. The multicore chips don't necessarily run as fast as the highest performing single-core models, but they improve overall performance by handling more work in parallel. Multicores are a way to extend Moore's law so that the user gets more performance out of a piece of silicon. Chip makers AMD, IBM, Intel, and Sun are now introducing multicore chips for servers, desktops, and laptops.

501 citations


Journal ArticleDOI
01 May 2005
TL;DR: Examination of the area, power, performance, and design issues for the on-chip interconnects on a chip multiprocessor shows that designs that treat interconnect as an entity that can be independently architected and optimized would not arrive at the best multi-core design.
Abstract: This paper examines the area, power, performance, and design issues for the on-chip interconnects on a chip multiprocessor, attempting to present a comprehensive view of a class of interconnect architectures. It shows that the design choices for the interconnect have significant effect on the rest of the chip, potentially consuming a significant fraction of the real estate and power budget. This research shows that designs that treat interconnect as an entity that can be independently architected and optimized would not arrive at the best multi-core design. Several examples are presented showing the need for careful co-design. For instance, increasing interconnect bandwidth requires area that then constrains the number of cores or cache sizes, and does not necessarily increase performance. Also, shared level-2 caches become significantly less attractive when the overhead of the resulting crossbar is accounted for. A hierarchical bus structure is examined which negates some of the performance costs of the assumed base-line architecture.

402 citations


Journal ArticleDOI
01 May 2005
TL;DR: This paper evaluates the performance benefits of an EPI throttle on an asymmetric multiprocessor (AMP) prototyped from a physical 4-way Xeon SMP server, and makes a compelling case for varying the amount of energy expended to process instructions according to the amount of available parallelism.
Abstract: This paper is motivated by three recent trends in computer design. First, chip multi-processors (CMPs) with increasing numbers of CPU cores per chip are becoming common. Second, multi-threaded software that can take advantage of CMPs will soon become prevalent. Due to the nature of the algorithms, these multi-threaded programs inherently will have phases of sequential execution; Amdahl's law dictates that the speedup of such parallel programs will be limited by the sequential portion of the computation. Finally, increasing levels of on-chip integration coupled with a slowing rate of reduction in supply voltage make power consumption a first order design constraint. Given this environment, our goal is to minimize the execution times of multi-threaded programs containing nontrivial parallel and sequential phases, while keeping the CMP's total power consumption within a fixed budget. In order to mitigate the effects of Amdahl's law, in this paper we make a compelling case for varying the amount of energy expended to process instructions according to the amount of available parallelism. Using the equation Power = Energy per instruction (EPI) * Instructions per second (IPS), we propose that during phases of limited parallelism (low IPS) the chip multi-processor will spend more EPI; similarly, during phases of higher parallelism (high IPS) the chip multi-processor will spend less EPI; in both scenarios power is fixed. We evaluate the performance benefits of an EPI throttle on an asymmetric multiprocessor (AMP) prototyped from a physical 4-way Xeon SMP server. Using a wide range of multi-threaded programs, we show a 38% wall clock speedup on an AMP compared to a standard SMP that uses the same power. We also measure the supply current on a 4-way SMP server while running the multi-threaded programs and use the measured data as input to a software simulator that implements a more flexible EPI throttle. The results from the measurement-driven simulation show performance benefits comparable to the AMP prototype. We analyze the results from both techniques, explain why and when an EPI throttle works well, and conclude with a discussion of the challenges in building practical EPI throttles.

209 citations
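
The paper's governing identity, Power = EPI * IPS, makes the throttle policy easy to state in code. The C sketch below is a minimal illustration of that relationship, not the authors' implementation; the power budget and throughput figures are invented for the example.

```c
/* Minimal sketch of the EPI/IPS relationship; all figures are invented. */
#include <stdio.h>

/* Power = EPI * IPS, so with the budget fixed, the energy each
   instruction may spend scales inversely with aggregate throughput. */
static double epi_joules(double power_budget_w, double aggregate_ips) {
    return power_budget_w / aggregate_ips;
}

int main(void) {
    const double budget = 60.0;  /* watts, held fixed in both phases */
    /* Sequential phase: one thread, low aggregate IPS, so high EPI. */
    printf("sequential: %.2e J/insn\n", epi_joules(budget, 1.0e9));
    /* Parallel phase: four threads, high aggregate IPS, so low EPI. */
    printf("parallel:   %.2e J/insn\n", epi_joules(budget, 4.0e9));
    return 0;
}
```

Under a fixed budget the sequential phase may legitimately spend four times the energy per instruction of the four-thread phase, which is exactly what motivates running the lone thread on a faster, hotter core.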


Journal ArticleDOI
01 May 2005
TL;DR: This paper presents the design of a plug-and-play transparent accelerator system and evaluates the cost/performance implications of the design.
Abstract: Instruction set customization is an effective way to improve processor performance. Critical portions of application data-flow graphs are collapsed for accelerated execution on specialized hardware. Collapsing dataflow subgraphs compresses the latency along critical paths and reduces the number of intermediate results stored in the register file. While custom instructions can be effective, the time and cost of designing a new processor for each application is immense. To overcome this roadblock, this paper proposes a flexible architectural framework to transparently integrate custom instructions into a general-purpose processor. Hardware accelerators are added to the processor to execute the collapsed subgraphs. A simple microarchitectural interface is provided to support a plug-and-play model for integrating a wide range of accelerators into a pre-designed and verified processor core. The accelerators are exploited using an approach of static identification and dynamic realization. The compiler is responsible for identifying profitable subgraphs, while the hardware handles discovery, mapping, and execution of compatible subgraphs. This paper presents the design of a plug-and-play transparent accelerator system and evaluates the cost/performance implications of the design.

143 citations


Patent
30 Jun 2005
TL;DR: In this paper, the authors present a multiple-core processor with support for multiple virtual processors, where the processor may include a cache comprising a number of cache banks, a number of processor cores, and core/bank mapping logic coupled to the cache banks and processor cores.
Abstract: A multiple-core processor with support for multiple virtual processors. In one embodiment, a processor may include a cache including a number of cache banks, a number of processor cores and core/bank mapping logic coupled to the cache banks and processor cores. During a first mode of processor operation, each of the processor cores may be configurable to access any of the cache banks, and during a second mode of processor operation, the core/bank mapping logic may be configured to implement a plurality of virtual processors within the processor. A first virtual processor may include a first subset of the processor cores and a first subset of the banks, and a second virtual processor may include a second subset of the processor cores and a second subset of the cache banks. Subsets of processor cores and cache banks included in the first and second virtual processors may be distinct.

126 citations
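
A rough C sketch of the core/bank mapping idea follows. The hashing scheme, the core and bank counts, and the two-way partition are assumptions made for illustration, not details taken from the patent.

```c
/* Hypothetical core/bank mapping: unified mode lets any core reach any
   bank; partitioned mode confines each virtual processor to its subset. */
#include <stdint.h>
#include <stdio.h>

#define NCORES 8
#define NBANKS 8

static unsigned bank_for(uint64_t addr, unsigned core, int partitioned) {
    if (!partitioned)
        return (unsigned)(addr % NBANKS);          /* any core, any bank */
    unsigned vp = core < NCORES / 2 ? 0 : 1;       /* which virtual processor */
    unsigned base = vp * (NBANKS / 2);
    return base + (unsigned)(addr % (NBANKS / 2)); /* stay inside the subset */
}

int main(void) {
    printf("unified:     core 6, addr 0x1234 -> bank %u\n", bank_for(0x1234, 6, 0));
    printf("partitioned: core 6, addr 0x1234 -> bank %u\n", bank_for(0x1234, 6, 1));
    return 0;
}
```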


Journal ArticleDOI
TL;DR: The experimental results show that the performance loss due to activity migration on a multi-core with private L1s and a shared L2 can be minimized if: (a) a migrating thread continues its execution on a core that was previously visited by the thread, and (b) cores remember their predictor state since their previous activation.
Abstract: High performance multi-core processors are becoming an industry reality. Although multi-cores are suited for multithreaded and multi-programmed workloads, many applications are still mono-thread and multi-core performance with a single thread workload is an important issue. Furthermore, recent studies suggest that performance, power and temperature considerations of future multi-cores may necessitate activity-migration between cores. Motivated by the above, this paper investigates the performance implications of single thread migration on a multi-core. Specifically, the study considers the influence on the performance of a single thread of the following migration and multi-core parameters: frequency of migration, core warm-up modes, subset of resources that are warmed-up, number of cores, and cache hierarchy organization. The results of this study can provide insight to architects on how to design performance-efficient power and thermal strategies for a multi-core chip. The experimental results, for the benchmarks and microarchitectures used in this study, show that the performance loss due to activity migration on a multi-core with private L1s and a shared L2 can be minimized if: (a) a migrating thread continues its execution on a core that was previously visited by the thread, and (b) cores remember their predictor state since their previous activation (all other core resources can be cold). The analogous conclusions for a multi-core with private L1s and L2s and a shared L3 are: remembering the predictor state, maintaining the tags of the various L2 caches coherent and allowing L2-L2 data transfers from inactive cores to the active core. The data also show that when migration period is at least every 160K cycles, the transfer of register state between two cores and the flushing of dirty private L1 data have a negligible performance overhead.

120 citations
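
The headline result, that a migrating thread should return to a core it has already warmed, amounts to a small placement policy. Below is a hypothetical C sketch of such a policy; the availability mask and the history bookkeeping are invented for illustration.

```c
/* Prefer a core the thread has already run on, so retained predictor
   state is reused; fall back to any available (cold) core. */
#include <stdio.h>

#define NCORES 4

struct thread_hist { int visited[NCORES]; int last; };

static int pick_core(const struct thread_hist *h, const int available[NCORES]) {
    for (int c = 0; c < NCORES; c++)      /* first choice: a warm core */
        if (available[c] && h->visited[c]) return c;
    for (int c = 0; c < NCORES; c++)      /* fall back to any cold core */
        if (available[c]) return c;
    return h->last;                       /* nowhere to go: stay put */
}

int main(void) {
    struct thread_hist h = { .visited = {1, 0, 1, 0}, .last = 0 };
    int avail[NCORES] = {0, 1, 1, 1};     /* core 0 too hot, say */
    printf("migrate to core %d\n", pick_core(&h, avail)); /* -> 2, warm */
    return 0;
}
```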


Proceedings ArticleDOI
07 Mar 2005
TL;DR: Warp processing is proposed, a technique capable of optimizing a software application by dynamically and transparently re-implementing critical software kernels as custom circuits in on-chip configurable logic, and it is demonstrated that the soft-core based warp processor achieves average speedups of 5.8 and energy reductions of 57% compared to the soft core alone.
Abstract: Field programmable gate arrays (FPGAs) provide designers with the ability to quickly create hardware circuits. Increases in FPGA configurable logic capacity and decreasing FPGA costs have enabled designers to more readily incorporate FPGAs in their designs. FPGA vendors have begun providing configurable soft processor cores that can be synthesized onto their FPGA products. While FPGAs with soft processor cores provide designers with increased flexibility, such processors typically have degraded performance and energy consumption compared to hard-core processors. Previously, we proposed warp processing, a technique capable of optimizing a software application by dynamically and transparently re-implementing critical software kernels as custom circuits in on-chip configurable logic. In this paper, we study the potential of a MicroBlaze soft-core based warp processing system to eliminate the performance and energy overhead of a soft-core processor compared to a hard-core processor. We demonstrate that the soft-core based warp processor achieves average speedups of 5.8 and energy reductions of 57% compared to the soft core alone. Our data shows that a soft-core based warp processor yields performance and energy consumption competitive with existing hard-core processors, thus expanding the usefulness of soft processor cores on FPGAs to a broader range of applications.

117 citations


Patent
02 Aug 2005
TL;DR: In this article, the operating system may cause the first processor core to operate at a performance level that is lower than a system maximum performance level in response to detecting the first processor core operating below a utilization threshold.
Abstract: A processing node that is integrated onto a single integrated circuit chip includes a first processor core and a second processor core. The processing node also includes an operating system executing on either of the first processor core and the second processor core. The operating system may monitor the current utilization of the first processor core and the second processor core. The operating system may cause the first processor core to operate at a performance level that is lower than a system maximum performance level and the second processor core to operate at a performance level that is higher than the system maximum performance level in response to detecting the first processor core operating below a utilization threshold.

116 citations
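
A minimal C sketch of the patent's policy, assuming a 30% utilization threshold and small integer performance levels; both values are invented for the example.

```c
/* If one core is underutilized, run it below the system maximum so the
   other core can be boosted above it within the same power envelope. */
#include <stdio.h>

static void set_levels(double util0, double util1, int *lvl0, int *lvl1) {
    const double threshold = 0.30;  /* assumed utilization threshold */
    const int max_level = 3;        /* system maximum performance level */
    *lvl0 = *lvl1 = max_level;
    if (util0 < threshold)      { *lvl0 = 1; *lvl1 = max_level + 1; }
    else if (util1 < threshold) { *lvl1 = 1; *lvl0 = max_level + 1; }
}

int main(void) {
    int l0, l1;
    set_levels(0.10, 0.95, &l0, &l1);
    printf("core0 level %d, core1 level %d\n", l0, l1); /* 1 and 4 */
    return 0;
}
```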


Proceedings ArticleDOI
12 Nov 2005
TL;DR: A new solution is developed by exploiting helper threaded prefetching through dynamic optimization on the latest UltraSPARC chip-multiprocessing (CMP) processor, showing that by utilizing the otherwise idle processor core, a single user-level helper thread is sufficient to improve the runtime performance of the main thread without triggering multiple thread slices.
Abstract: Data prefetching via helper threading has been extensively investigated on Simultaneous Multi-Threading (SMT) or Virtual Multi-Threading (VMT) architectures. Although reportedly large cache latency can be hidden by helper threads at runtime, most techniques rely on hardware support to reduce context switch overhead between the main thread and helper thread as well as rely on static profile feedback to construct the helper thread code. This paper develops a new solution by exploiting helper threaded prefetching through dynamic optimization on the latest UltraSPARC Chip-Multiprocessing (CMP) processor. Our experiments show that by utilizing the otherwise idle processor core, a single user-level helper thread is sufficient to improve the runtime performance of the main thread without triggering multiple thread slices. Moreover, since the multiple cores are physically decoupled in the CMP, contention introduced by helper threading is minimal. This paper also discusses several key technical challenges of building a lightweight dynamic optimization/software scouting system on the UltraSPARC/Solaris platform.

111 citations


Journal ArticleDOI
TL;DR: In this article, the authors show that the microarchitecture necessary to support threads on a CMT can also achieve high single-thread performance, with hardware scouting improving single-thread performance by up to 40 percent.
Abstract: CMT processors offer a way to significantly improve the performance of computer systems. The return on investment for multithreading is among the highest in computer microarchitectural techniques. If you design a core from scratch to support multithreading, gains as high as 3x are possible for just a 20 percent increase in area. Even with throughput performance as the main target, we have shown that the microarchitecture necessary to support threads on a CMT can also achieve high single-thread performance. Hardware scouting, which Sun is implementing on the Rock microprocessor, can increase the single-thread performance of applications by up to 40 percent. Alternatively, scouting can be viewed as a technique that makes the on-chip caches appear much larger, serving as a performance-robustness technique that compensates for code tailored to different on-chip cache sizes or even to a different number and levels of caches.

Patent
08 Sep 2005
TL;DR: A processor for traversing deterministic finite automata (DFA) graphs with incoming packet data in real time is presented; it includes at least one processor core and a DFA module operating asynchronously to that core.
Abstract: A processor for traversing deterministic finite automata (DFA) graphs with incoming packet data in real time. The processor includes at least one processor core and a DFA module, operating asynchronously to the at least one processor core, for traversing at least one DFA graph stored in a non-cache memory with packet data stored in a cache-coherent memory.
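
Traversing a DFA graph against packet bytes reduces to repeated table lookups, which is why it offloads so naturally to a dedicated module. The self-contained C sketch below shows the software equivalent of one traversal, using a toy three-state automaton that matches the substring "ab"; the patented hardware instead keeps the transition tables in non-cache memory and walks them asynchronously.

```c
/* Table-driven DFA scan over a packet payload. */
#include <stdio.h>
#include <string.h>

enum { START, SAW_A, MATCH, NSTATES };

static int next[NSTATES][256];

static void build(void) {
    for (int s = 0; s < NSTATES; s++)
        for (int c = 0; c < 256; c++)
            next[s][c] = (s == MATCH) ? MATCH : START; /* MATCH absorbs */
    next[START]['a'] = SAW_A;
    next[SAW_A]['a'] = SAW_A;
    next[SAW_A]['b'] = MATCH;
}

static int scan(const unsigned char *pkt, size_t len) {
    int s = START;
    for (size_t i = 0; i < len; i++)
        s = next[s][pkt[i]];       /* one lookup per payload byte */
    return s == MATCH;
}

int main(void) {
    build();
    const char *payload = "xxabyy";
    printf("match: %d\n", scan((const unsigned char *)payload, strlen(payload)));
    return 0;
}
```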

Journal ArticleDOI
TL;DR: To provide a virtual connection between each processor core in the SoC and its corresponding probe control, the MED (multicore embedded debugging) software tool is proposed, which allows a contiguous analysis flow from system-level simulation models of SoC systems through FPGA and emulation prototyping and finally to debugging the silicon hardware.
Abstract: A multicore embedded debugging architecture for system-on-chip (SoC) design is presented. It poses an asymmetrical functional test problem. To analyze the problem and optimize performance in multicore operation, debug tools with interfaces are exercised for several cores. The HyperJTAG (joint test action group) interface reduces the I/O pin interfaces required for debugging several cores. To overcome the wiring problem in HyperJTAG, wire routing and debugging synchronization are proposed. Hyper debug action nodes at each core initiate global or local control actions that synchronously reset the cores. To provide a virtual connection between each processor core in the SoC and its corresponding probe control, the MED (multicore embedded debugging) software tool is proposed. This allows a contiguous analysis flow from system-level simulation models of SoC systems through FPGA and emulation prototyping and finally to debugging the silicon hardware.

Patent
15 Mar 2005
TL;DR: In this paper, a multi-core processor with active and inactive execution cores is presented: a plurality of core-distinguishing registers, each corresponding to one of the execution cores on a single integrated circuit, indicates whether the corresponding core is active or inactive.
Abstract: PROBLEM TO BE SOLVED: To provide a multi-core processor having active and inactive execution cores. SOLUTION: In one embodiment, the device has a processor provided with a plurality of execution cores on a single integrated circuit and a plurality of core-distinguishing registers, each core-distinguishing register corresponding to one of the execution cores and distinguishing whether that execution core is active.
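
As a rough illustration of the claim, one identification register per core with an active bit might look like the following; the bit layout and register array are assumptions for the example, not the patent's encoding.

```c
/* Hypothetical per-core identification registers with an active bit. */
#include <stdint.h>
#include <stdio.h>

#define NCORES 4
#define CORE_ACTIVE (1u << 0)   /* assumed bit position, illustrative only */

static uint32_t core_id_reg[NCORES] = {
    CORE_ACTIVE, CORE_ACTIVE, 0u, CORE_ACTIVE  /* core 2 inactive, say */
};

int main(void) {
    for (int c = 0; c < NCORES; c++)
        printf("core %d: %s\n", c,
               (core_id_reg[c] & CORE_ACTIVE) ? "active" : "inactive");
    return 0;
}
```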

Proceedings ArticleDOI
17 Sep 2005
TL;DR: This paper presents the helper threading experience on SUN's second dual-core SPARC microprocessor, the UltraSPARC IV+.
Abstract: Helper threading is a technique that utilizes a second core or logical processor in a multi-threaded system to improve the performance of the main thread. A helper thread executes in parallel with the main thread that it attempts to accelerate. In this paper, the helper thread merely prefetches data into a shared cache and does not incur any other programmer visible effects. Helper thread prefetching has been proposed as a viable solution in various scenarios where it is difficult to prefetch efficiently within the main thread itself. This paper presents our helper threading experience on SUN's second dual-core SPARC microprocessor, the UltraSPARC IV+. The two cores on this processor share an on-chip L2 and an off-chip L3 cache. We present a compiler framework to automatically construct helper threads and evaluate our scheme on the UltraSPARC IV+ processor. Our preliminary results using helper threads on the SPEC CPU2000 suite show gains of up to 22% on programs that suffer substantial L2 cache misses while at the same time incurring negligible losses on programs that do not suffer L2 cache misses.
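
A minimal C sketch of the prefetching pattern described here, assuming the two threads land on cores that share a cache (as with the UltraSPARC IV+'s shared L2). The array walk and look-ahead distance are invented, and a real helper thread would synchronize with the main thread's progress rather than run free.

```c
/* Helper-thread prefetching sketch: the helper touches the data stream
   ahead of the main thread, pulling lines toward the shared cache. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 22)
#define AHEAD 64            /* how far the helper runs ahead, in elements */

static long data[N];
static volatile long sink;

static void *helper(void *arg) {
    (void)arg;
    for (size_t i = AHEAD; i < N; i++)
        __builtin_prefetch(&data[i], 0 /* read */, 1 /* low locality */);
    return NULL;            /* computes nothing the program depends on */
}

int main(void) {
    pthread_t t;
    long sum = 0;
    pthread_create(&t, NULL, helper, NULL);
    for (size_t i = 0; i < N; i++)
        sum += data[i];     /* the main thread's actual work */
    pthread_join(t, NULL);
    sink = sum;
    printf("sum = %ld\n", sum);
    return 0;
}
```

Compile with -pthread; whether the helper wins anything depends entirely on the two threads sharing a cache level, which is the property the paper exploits.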

Patent
Jan Gray
04 Feb 2005
TL;DR: A priority register is provided for each of the multiple processor cores of a chip multiprocessor, where the priority register stores values that are used to bias resources available to the multiple processor cores.
Abstract: A priority register is provided for each of the multiple processor cores of a chip multiprocessor, where the priority register stores values that are used to bias resources available to the multiple processor cores. Even though such multiple processor cores have their own local resources, they must compete for shared resources. These shared resources may be located on the chip or off the chip. The priority register biases the arbitration process that arbitrates access to, or ongoing use of, the shared resources based on the values stored in the priority registers. It accomplishes this biasing by tagging operations issued from the multiple processor cores with the priority values and then comparing those values within each arbiter of the shared resources.
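
The arbitration step reduces to comparing priority tags across pending requests. A hypothetical C sketch follows, with the request format and tie-break rule invented for illustration.

```c
/* Grant the valid request carrying the largest priority tag; ties go to
   the lowest-numbered core (a real arbiter would also guard fairness). */
#include <stdio.h>

#define NCORES 4

struct request { int valid; unsigned priority; };  /* tagged at issue time */

static int arbitrate(const struct request req[NCORES]) {
    int winner = -1;
    for (int c = 0; c < NCORES; c++)
        if (req[c].valid &&
            (winner < 0 || req[c].priority > req[winner].priority))
            winner = c;
    return winner;
}

int main(void) {
    struct request req[NCORES] = { {1, 2}, {0, 0}, {1, 7}, {1, 7} };
    printf("grant core %d\n", arbitrate(req)); /* -> core 2 */
    return 0;
}
```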

Patent
21 Oct 2005
TL;DR: In this article, the authors present a method for implementing parameterizable processor cores and peripherals on a programmable chip using a wizard interface that allows selection and parameterization of processor cores, peripherals, as well as other modules.
Abstract: Methods and apparatus are provided for implementing parameterizable processor cores and peripherals on a programmable chip. An input interface such as a wizard allows selection and parameterization of processor cores, peripherals, as well as other modules. The logic description for implementing the modules on a programmable chip can be dynamically generated, allowing extensive parameterization of various modules. Dynamic generation also allows the delivery of device driver logic onto a programmable chip. The logic description can include information for configuring a dynamically generated bus module to allow connectivity between the modules as well as connectivity with other on-chip and off-chip components. The logic description, possibly comprising HDL files, can then be automatically synthesized and provided to tools for downloading the logic description onto a programmable chip.

Proceedings ArticleDOI
19 Sep 2005
TL;DR: In this article, the authors proposed an alternative solution based on Spatial Division Multiplexing (SDM) for NoC networks, which can provide throughput and latency guarantees by establishing virtual circuits between source and destination.
Abstract: To ensure low power consumption while maintaining flexibility and performance, future Systems-on-Chip (SoC) will combine several types of processor cores and data memory units of widely different sizes. To interconnect the IPs of these heterogeneous platforms, Networks-on-Chip (NoC) have been proposed as an efficient and scalable alternative to shared buses. NoCs can provide throughput and latency guarantees by establishing virtual circuits between source and destination. State-of-the-art NoCs currently exploit Time-Division Multiplexing (TDM) to share network resources among virtual circuits, but this typically results in high network area and energy overhead with long circuit set-up time. We propose an alternative solution based on Spatial Division Multiplexing (SDM). This paper describes our first design of an SDM-based network, discusses design alternatives for network implementation and shows why SDM should be better adapted to NoCs than TDM for a limited number of circuits. Our case study clearly illustrates the advantages of our technique over TDM in terms of energy consumption, area overhead, and flexibility. SDM thus deserves to be explored in more depth, and in particular in combination with TDM in a hybrid scheme.
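
A small C sketch of the wire-allocation arithmetic; the link width and circuit count are made-up figures. The point is that SDM's static wire stripes deliver the same average bandwidth as TDM slots without per-router slot tables or long circuit set-up.

```c
/* SDM: each virtual circuit permanently owns a stripe of the link's wires. */
#include <stdio.h>

#define WIRES 32   /* link width in wires, assumed */

int main(void) {
    const int circuits = 4;
    const int stripe = WIRES / circuits;
    for (int c = 0; c < circuits; c++)
        printf("circuit %d: wires %2d..%2d, %d bits every cycle\n",
               c, c * stripe, (c + 1) * stripe - 1, stripe);
    /* TDM reference point: all 32 wires, but only 1 slot in 4 -- the same
       average bandwidth, paid for with globally consistent slot tables. */
    return 0;
}
```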

Patent
30 Jun 2005
TL;DR: In this paper, a fine-grained multithreading in a multipipelined processor core is described, where a processor may include instruction fetch logic configured to assign a given one of a plurality of threads to a corresponding one of the thread groups.
Abstract: An apparatus and method for fine-grained multithreading in a multipipelined processor core. According to one embodiment, a processor may include instruction fetch logic configured to assign a given one of a plurality of threads to a corresponding one of a plurality of thread groups, where each of the plurality of thread groups may comprise a subset of the plurality of threads, to issue a first instruction from one of the plurality of threads during one execution cycle, and to issue a second instruction from another one of the plurality of threads during a successive execution cycle. The processor may further include a plurality of execution units, each configured to execute instructions issued from a respective thread group.
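
Per the claim, the fetch logic issues from a different thread of a group on successive cycles. A toy round-robin picker in C, with the ready bits and group size invented for the example:

```c
/* Issue one instruction per cycle from the next ready thread in the group. */
#include <stdio.h>

#define THREADS_PER_GROUP 4

struct thread_group { int ready[THREADS_PER_GROUP]; int last; };

static int pick_thread(struct thread_group *g) {
    for (int i = 1; i <= THREADS_PER_GROUP; i++) {
        int t = (g->last + i) % THREADS_PER_GROUP;
        if (g->ready[t]) { g->last = t; return t; }
    }
    return -1;  /* all threads stalled this cycle */
}

int main(void) {
    struct thread_group g = { .ready = {1, 1, 0, 1}, .last = 0 };
    for (int cycle = 0; cycle < 5; cycle++)
        printf("cycle %d: issue from thread %d\n", cycle, pick_thread(&g));
    return 0;
}
```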

Proceedings ArticleDOI
20 Feb 2005
TL;DR: This paper first presents a quantitative analysis of the data bandwidth limitation in configurable processors, and then proposes a novel low-cost architectural extension and associated compilation techniques to address the problem.
Abstract: Configurable processors are becoming increasingly popular for modern embedded systems (especially for the field-programmable system-on-a-chip). While steady progress has been made in the tools and methodologies of automatic instruction set extension for configurable processors, the limited data bandwidth available in the core processor (e.g., the number of simultaneous accesses to the register file) becomes a potential performance bottleneck. In this paper we first present a quantitative analysis of the data bandwidth limitation in configurable processors, and then propose a novel low-cost architectural extension and associated compilation techniques to address the problem. The application of our approach results in a promising performance improvement.

Proceedings ArticleDOI
10 Oct 2005
TL;DR: A flexible processor and compiler generation system, FPGA implementations of CUSTARD and performance/area results for media and cryptography benchmarks are presented.
Abstract: We propose CUSTARD - customisable threaded architecture - a soft processor design space that combines support for multiple hardware threads and automatically generated custom instructions. Multiple threads incur low additional hardware cost and allow fine-grained concurrency without multiple processor cores or software overhead. Custom instructions, generated for a specific application, accelerate frequently performed computations by implementing them as dedicated hardware. In this paper we present a flexible processor and compiler generation system, FPGA implementations of CUSTARD and performance/area results for media and cryptography benchmarks.

Book ChapterDOI
04 Apr 2005
TL;DR: A new approach in which a high-level program is separated from its partitioning into concurrent tasks, and an AMS script that partitions it into a form capable of running at 3Gb/s on an Intel IXP2400 Network Processor is presented.
Abstract: Network processors (NPs) typically contain multiple concurrent processing cores. State-of-the-art programming techniques for NPs are invariably low-level, requiring programmers to partition code into concurrent tasks early in the design process. This results in programs that are hard to maintain and hard to port to alternative architectures. This paper presents a new approach in which a high-level program is separated from its partitioning into concurrent tasks. Designers write their programs in a high-level, domain-specific, architecturally-neutral language, but also provide a separate Architecture Mapping Script (AMS). An AMS specifies semantics-preserving transformations that are applied to the program to re-arrange it into a set of tasks appropriate for execution on a particular target architecture. We (i) describe three such transformations: pipeline introduction, pipeline elimination and queue multiplexing; and (ii) specify when each can be safely applied. As a case study we describe an IP packet-forwarder and present an AMS script that partitions it into a form capable of running at 3Gb/s on an Intel IXP2400 Network Processor.

Proceedings ArticleDOI
12 Jun 2005
TL;DR: An existing hardware architecture—the network processor—already fits the model for multi-threaded, multi-core execution, demonstrating the effectiveness of using many simple processor cores, each with hardware support for multiple thread contexts.
Abstract: Database management systems (DBMSs) do not take full advantage of modern microarchitectural resources, such as wide-issue out-of-order processor pipelines. Increases in processor clock rate and instruction-level parallelism have left memory accesses as the dominant bottleneck in DBMS execution. Prior research indicates that simultaneous multithreading (SMT) can hide memory access latency from a single thread and improve throughput by increasing the number of outstanding memory accesses. Rather than expend chip area and power on out-of-order execution, as in current SMT processors, we demonstrate the effectiveness of using many simple processor cores, each with hardware support for multiple thread contexts. This paper shows that an existing hardware architecture—the network processor—already fits the model for multi-threaded, multi-core execution. Using an Intel IXP2400 network processor, we evaluate the performance of three key database operations and demonstrate improvements of 1.9X to 2.5X when compared to a general-purpose processor.

Journal ArticleDOI
TL;DR: Multicore is the new hot topic in the latest round of CPUs from Intel, AMD, Sun, etc.
Abstract: Multicore is the new hot topic in the latest round of CPUs from Intel, AMD, Sun, etc. With clock speed increases becoming more and more difficult to achieve, vendors have turned to multicore CPUs as the best way to gain additional performance. Customers are excited about the promise of more performance through parallel processors for the same real estate investment.

Patent
10 Feb 2005
TL;DR: Recovery circuits, as discussed by the authors, remove the processor core from the logical configuration of the symmetric multiprocessor system, potentially reducing propagation of errors to other parts of the system, by waiting for an error-free completion of any pending store-conditional instruction or cache-inhibited load before ceasing to checkpoint or back up the progress of the processor core.
Abstract: Recovery circuits react to errors in a processor core by waiting for an error-free completion of any pending store-conditional instruction or cache-inhibited load before ceasing to checkpoint or back up the progress of the processor core. Recovery circuits remove the processor core from the logical configuration of the symmetric multiprocessor system, potentially reducing propagation of errors to other parts of the system. The processor core is reset, and the checkpointed values may be restored to its registers. The processor core is allowed not just to resume execution immediately before the instructions that failed to execute correctly the first time, but to operate in a reduced execution mode for a preprogrammed number of instruction groups. If the preprogrammed number of instruction groups executes without error, the processor core is allowed to resume normal execution.

Proceedings ArticleDOI
02 Oct 2005
TL;DR: This paper uses data collected from performance counters on two different hardware implementations of Pentium-4 hyper-threading processors to demonstrate the effects of thread interaction, and techniques are described for fitting linear regression models and recursive partitioning to use the counters to make online predictions of performance.
Abstract: Simultaneous multithreading (SMT) seeks to improve the computation throughput of a processor core by sharing primary resources such as functional units, issue bandwidth, and caches. SMT designs increase utilization and generally improve overall throughput, but the amount of improvement is highly dependent on competition for shared resources between the scheduled threads. This variability has implications that relate to operating system scheduling, simulation techniques, and fairness. Although these techniques recognize the implications of thread interaction, they do little to profile and predict this interaction. The modeling approach presented in this paper uses data collected from performance counters on two different hardware implementations of Pentium-4 hyper-threading processors to demonstrate the effects of thread interaction. Techniques are described for fitting linear regression models and recursive partitioning to use the counters to make online predictions of performance (expressed as instructions per cycle); these predictions can be used by the operating system to guide scheduling decisions. A detailed analysis of the effectiveness of each of these techniques is presented.
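
As a flavor of the modeling approach: fitting IPC against one hardware counter by ordinary least squares. The sketch below uses a single predictor and fabricated sample points purely for illustration; the paper fits multivariate models over many Pentium-4 counters.

```c
/* Closed-form least squares: ipc ~= a + b * miss_rate. */
#include <stdio.h>

int main(void) {
    /* (L2 misses per 1K instructions, measured IPC) samples, made up. */
    double x[] = {1, 2, 4, 8, 16}, y[] = {1.9, 1.7, 1.4, 0.9, 0.5};
    int n = 5;
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx += x[i]; sy += y[i]; sxx += x[i] * x[i]; sxy += x[i] * y[i];
    }
    double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    double a = (sy - b * sx) / n;
    printf("ipc ~= %.3f %+.3f * misses\n", a, b);
    printf("predicted ipc at 6 misses/1K insn: %.2f\n", a + b * 6);
    return 0;
}
```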

Book ChapterDOI
01 Jun 2005
TL;DR: It is concluded that out-of-the-box OpenMP code scales better on CMPs than SMTs, and new capabilities are required by the runtime environment and/or the programming interface to maximize the efficiency of OpenMP on SMTs.
Abstract: Multiprocessors based on simultaneous multithreaded (SMT) or multicore (CMP) processors are continuing to gain a significant share in both high-performance and mainstream computing markets. In this paper we evaluate the performance of OpenMP applications on these two parallel architectures. We use detailed hardware metrics to identify architectural bottlenecks. We find that the high level of resource sharing in SMTs results in performance complications if more than one thread is assigned to a single physical processor. CMPs, on the other hand, are an attractive alternative. Our results show that exploiting the multiple processor cores on each chip yields significant performance benefits. We evaluate an adaptive, run-time mechanism which provides limited performance improvements on SMTs; however, the inherent bottlenecks remain difficult to overcome. We conclude that out-of-the-box OpenMP code scales better on CMPs than SMTs. To maximize the efficiency of OpenMP on SMTs, new capabilities are required from the runtime environment and/or the programming interface.
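
The kind of out-of-the-box OpenMP code the study measures is nothing exotic; a representative loop is shown below (illustrative, not one of the paper's benchmarks). Compile with -fopenmp and vary OMP_NUM_THREADS; with a modern OpenMP runtime (which postdates this paper), OMP_PLACES can pin threads to distinct cores for the CMP case or to sibling hardware contexts for the SMT case.

```c
/* A plain OpenMP parallel loop: independent writes plus a reduction. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int n = 10000000;
    double *a = malloc(n * sizeof *a);
    double sum = 0.0;
    if (!a) return 1;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        a[i] = i * 0.5;   /* disjoint writes, no sharing between threads */
        sum += a[i];
    }
    printf("threads=%d sum=%g\n", omp_get_max_threads(), sum);
    free(a);
    return 0;
}
```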

Patent
29 Dec 2005
TL;DR: In this article, a method and apparatus to implement monitor primitives when a processor employs an inclusive shared last level cache is presented, where the processor is almost always able to complete a monitor transaction without requiring self snooping through the system interconnect.
Abstract: A method and apparatus to implement monitor primitives when a processor employs an inclusive shared last-level cache. By employing an inclusive last-level cache, the processor is almost always able to complete a monitor transaction without requiring self-snooping through the system interconnect.

Proceedings ArticleDOI
11 Dec 2005
TL;DR: Two schemes, based on the hard-wired PowerPC processor core and the MicroBlaze soft processor core, have been compared and contrasted in terms of speed and FPGA resource usage.
Abstract: SRAM FPGAs are vulnerable to security breaches such as bitstream cloning, reverse-engineering, and tampering. Bitstream encryption and authentication are the two most effective and practical solutions to improve the security of FPGAs. In this paper, we investigate a method to perform a secure dynamic partial reconfiguration of SRAM FPGAs using embedded processor cores. Two schemes, based on the hard-wired PowerPC processor core and the MicroBlaze soft processor core, have been compared and contrasted in terms of speed and FPGA resource usage. A practical experiment demonstrating the feasibility, performance, and flexibility of both schemes has been conducted using a Xilinx ML310 board with a Xilinx Virtex-II Pro FPGA.

Patent
01 Sep 2005
TL;DR: In this paper, a method and apparatus for minimizing stalls in a pipelined processor is presented, where instructions in an out-of-order instruction scheduler are executed in order without stalling the pipeline by sending store data to external memory through an ordering queue.
Abstract: A method and apparatus for minimizing stalls in a pipelined processor is provided. Instructions in an out-of-order instruction scheduler are executed in order without stalling the pipeline by sending store data to external memory through an ordering queue.