
Showing papers in "IEEE Micro in 2006"


Journal ArticleDOI
TL;DR: The M5 simulator provides features necessary for simulating networked hosts, including full-system capability, a detailed I/O subsystem, and the ability to simulate multiple networked systems deterministically.
Abstract: The M5 simulator was developed specifically to enable research in TCP/IP networking. It provides the features necessary for simulating networked hosts, including full-system capability, a detailed I/O subsystem, and the ability to simulate multiple networked systems deterministically. M5's usefulness as a general-purpose architecture simulator and its liberal open-source license have led to its adoption by several academic and commercial groups.

839 citations


Journal ArticleDOI
Cary Gunn1
TL;DR: Luxtera has demonstrated the technology required to implement CMOS photonics, including everything needed for 10-Gbps operation and for scaling to 100 Gbps and 1 Tbps, and product development is underway.
Abstract: Luxtera has demonstrated the technology required to implement CMOS photonics, and product development is underway. It has also demonstrated all the technology required for 10-Gbps operation, in addition to that required to scale to 100 Gbps and 1 Tbps. A single 10-Gbps channel today integrates tens of optical components into a single die alongside circuitry of modest gate count, roughly 100,000 gates per transceiver. For the first time, high-speed optical communications directly between silicon die are possible at a price-performance point competitive with traditional electrical interconnects.

493 citations


Journal ArticleDOI
TL;DR: The streamlined architecture provides an efficient multithreaded execution environment for both scalar and SIMD threads and represents a reaffirmation of the RISC principles of combining leading-edge architecture and compiler optimizations.
Abstract: Eight synergistic processor units enable the Cell Broadband Engine's breakthrough performance. The SPU architecture implements a novel, pervasively data-parallel architecture combining scalar and SIMD processing on a wide data path. A large number of SPUs per chip provide high thread-level parallelism. The streamlined architecture provides an efficient multithreaded execution environment for both scalar and SIMD threads and represents a reaffirmation of the RISC principles of combining leading-edge architecture and compiler optimizations. These design decisions have enabled the Cell BE to deliver unprecedented supercomputer-class compute power for consumer applications.

463 citations


Journal ArticleDOI
TL;DR: The authors analyze the Cell processor's communication network, using a series of benchmarks involving various DMA traffic patterns and synchronization protocols, to illuminate this important aspect of multicore processor design.
Abstract: Multicore designs promise various power-performance and area-performance benefits. But inadequate design of the on-chip communication network can deprive applications of these benefits. To illuminate this important point in multicore processor design, the authors analyze the Cell processor's communication network, using a series of benchmarks involving various DMA traffic patterns and synchronization protocols.

391 citations


Journal ArticleDOI
TL;DR: Statistical sampling makes simulation-based studies feasible by providing ten-thousand-fold reductions in simulation runtime and enabling thousand-way simulation parallelism.
Abstract: Timing-accurate full-system multiprocessor simulations can take years because of architecture and application complexity. Statistical sampling makes simulation-based studies feasible by providing ten-thousand-fold reductions in simulation runtime and enabling thousand-way simulation parallelism

339 citations
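To make the scale of that claim concrete, here is a small back-of-the-envelope sketch in C. All the numbers in it (workload length, simulator speeds, sample count and size) are illustrative assumptions, not figures from the article; they are chosen only to show how detailed simulation of a tiny fraction of a workload, plus fast functional warming of the rest, yields a speedup on the order of ten thousand, and why independent samples also parallelize trivially.

/*
 * Back-of-the-envelope sketch of why statistical sampling pays off.
 * The workload size, sample counts, and simulator speeds below are
 * illustrative assumptions, not figures from the article.
 */
#include <stdio.h>

int main(void) {
    double total_insns = 1e12;   /* assumed full workload length            */
    double detail_rate = 1e5;    /* assumed detailed-sim speed (insns/sec)  */
    double func_rate   = 1e9;    /* assumed functional-warming speed        */
    double samples     = 10000;  /* number of measurement samples           */
    double sample_len  = 1000;   /* detailed instructions per sample        */

    double full_detail_s = total_insns / detail_rate;
    double sampled_s     = (total_insns / func_rate)             /* fast-forward */
                         + (samples * sample_len) / detail_rate; /* measure      */

    printf("full detailed simulation : %.1f days\n", full_detail_s / 86400.0);
    printf("sampled simulation       : %.1f hours\n", sampled_s / 3600.0);
    printf("speedup                  : %.0fx\n", full_detail_s / sampled_s);
    /* Each sample is statistically independent, so the detailed portion
       can also be farmed out to thousands of hosts, which is the
       "thousand-way simulation parallelism" the abstract refers to. */
    return 0;
}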


Journal ArticleDOI
TL;DR: A hardware implementation of unbounded transactional memory, called UTM, is described, which exploits the common case for performance without sacrificing correctness on transactions whose footprint can be nearly as large as virtual memory.
Abstract: This article advances the following thesis: transactional memory should be virtualized to support transactions of arbitrary footprint and duration. Such support should be provided through hardware and be made visible to software through the machine's instruction set architecture. We call a transactional memory system unbounded if the system can handle transactions of arbitrary duration that have footprints nearly as big as the system's virtual memory. The primary goal of unbounded transactional memory is to make concurrent programming easier without incurring much implementation overhead. Unbounded transactional-memory architectures can achieve high performance in the common case of small transactions, without sacrificing correctness in large transactions.

295 citations
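The sketch below illustrates the programming model such a system aims to support: a critical region whose read/write footprint can be arbitrarily large is simply bracketed by transaction boundaries. The begin_transaction/end_transaction names are hypothetical placeholders for ISA-level support, and they are emulated here with a pthread mutex purely so the example compiles and runs on ordinary hardware.

/*
 * Hedged sketch of the programming model an unbounded transactional
 * memory aims to support.  begin_transaction()/end_transaction() are
 * hypothetical placeholders for ISA-level transaction boundaries; here
 * they are emulated with a global mutex so the sketch runs anywhere.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t stand_in = PTHREAD_MUTEX_INITIALIZER;

static void begin_transaction(void) { pthread_mutex_lock(&stand_in); }
static void end_transaction(void)   { pthread_mutex_unlock(&stand_in); }

static long account[2] = { 1000, 1000 };

static void *transfer(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        begin_transaction();
        /* In an unbounded design, the read/write footprint of this
           region could grow nearly as large as virtual memory and still
           commit atomically; no lock-granularity decisions are needed. */
        account[0] -= 1;
        account[1] += 1;
        end_transaction();
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, transfer, NULL);
    pthread_create(&t2, NULL, transfer, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("balance: %ld (expected 2000)\n", account[0] + account[1]);
    return 0;
}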


Journal ArticleDOI
TL;DR: Richard Mateosian reviews old and new books, including Weinberg on Writing--The Fieldstone Method, The Art of Computer Programming, From Java to Ruby--Things Every Manager Should Know, and Introduction to DITA--A User Guide to the Darwin Information Typing Architecture.
Abstract: Richard Mateosian reviews old and new books, including Weinberg on Writing--The Fieldstone Method, The Art of Computer Programming, From Java to Ruby--Things Every Manager Should Know, and Introduction to DITA--A User Guide to the Darwin Information Typing Architecture.

228 citations


Journal ArticleDOI
TL;DR: This article challenges the commonly held view that IPC accurately reflects performance, at least for multithreaded workloads running on multiprocessors, and concludes that work-related metrics, such as time per transaction, are the most accurate and reliable way to estimate multiprocessor workload performance.
Abstract: Many architectural simulation studies use instructions per cycle (IPC) to analyze performance. In this article, we challenge the commonly held view that IPC accurately reflects performance, at least for multithreaded workloads running on multiprocessors. Work-related metrics, such as time per transaction, are the most accurate and reliable way to estimate multiprocessor workload performance.

150 citations
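A tiny worked example makes the argument concrete. The instruction, cycle, and transaction counts below are invented for illustration; they model a contended run in which threads burn extra instructions spin-waiting, so IPC rises even as useful work per unit time falls.

/*
 * Illustrative arithmetic only: the counts below are made up to show
 * how IPC can move in the opposite direction from real work on a
 * multithreaded workload.
 */
#include <stdio.h>

int main(void) {
    /* Run A: little lock contention, few spin-loop instructions. */
    double insns_a = 8.0e9,  cycles_a = 1.0e10, txns_a = 1.0e6;
    /* Run B: heavy contention; threads burn instructions spinning,
       so the instruction count rises but fewer transactions finish. */
    double insns_b = 1.2e10, cycles_b = 1.0e10, txns_b = 0.8e6;

    printf("Run A: IPC = %.2f, cycles/txn = %.0f\n",
           insns_a / cycles_a, cycles_a / txns_a);
    printf("Run B: IPC = %.2f, cycles/txn = %.0f\n",
           insns_b / cycles_b, cycles_b / txns_b);
    /* Run B has the higher IPC (1.20 vs 0.80) yet does less useful work
       per unit time (12500 vs 10000 cycles per transaction), which is
       why work-related metrics are the safer yardstick. */
    return 0;
}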


Journal ArticleDOI
TL;DR: Digitally assisted analog circuits can exploit digital circuits' high density and low energy per computation to enable a new generation of interface electronics based on minimal-precision, low-complexity analog building blocks.
Abstract: Today's interfaces between digital and "real world" analog signals rely mainly on complex analog circuit components that strictly limit achievable power efficiency and throughput. Digitally assisted analog circuits can exploit digital circuits' high density and low energy per computation to enable a new generation of interface electronics based on minimal-precision, low-complexity analog building blocks

145 citations


Journal ArticleDOI
TL;DR: Leakage current in the nanometer regime has become a significant portion of power dissipation in CMOS circuits as threshold voltage, channel length, and gate oxide thickness scale downward.
Abstract: Leakage current in the nanometer regime has become a significant portion of power dissipation in CMOS circuits as threshold voltage, channel length, and gate oxide thickness scale downward. Various techniques are available to reduce leakage power in high-performance systems

137 citations


Journal ArticleDOI
TL;DR: Variability must be considered at both the circuit and microarchitectural design levels to keep pace with performance scaling and to keep power consumption within reasonable limits. This article presents an overview of the main sources of variability and surveys variation-tolerant circuit and microarchitectural approaches.
Abstract: Parameter variations, which are increasing along with advances in process technologies, affect both timing and power. Variability must be considered at both the circuit and microarchitectural design levels to keep pace with performance scaling and to keep power consumption within reasonable limits. This article presents an overview of the main sources of variability and surveys variation-tolerant circuit and microarchitectural approaches

Journal ArticleDOI
TL;DR: Connecting a built-in current sensor to the bulk of a digital design increases sensitivity for detecting transient upsets in combinational and sequential logic.
Abstract: Connecting a built-in current sensor to the bulk of a digital design increases sensitivity for detecting transient upsets in combinational and sequential logic. SPICE simulations validate this approach and show only minor penalties in terms of area, performance, and power consumption.

Journal ArticleDOI
TL;DR: The SeaStar was designed specifically to support Sandia National Laboratories' ASC Red Storm, a distributed-memory parallel computing platform containing more than 11,000 network end-points and presented designers with several challenging goals that were commensurate with a high-performance network for a system of that scale.
Abstract: The SeaStar, a new ASIC from Cray, is a full system-on-chip design that integrates high-speed serial links, a 3D router, traditional network interface functionality, and an embedded processor in a single chip. Cray Inc. designed the SeaStar specifically to support Sandia National Laboratories' ASC Red Storm, a distributed-memory parallel computing platform containing more than 11,000 network end-points. SeaStar presented designers with several challenging goals that were commensurate with a high-performance network for a system of that scale. The primary challenge was to provide a well-balanced, highly scalable, highly reliable network. From the Red Storm perspective, a balanced network is one that maximizes network performance relative to the computational power of the network end-points. A main challenge for SeaStar was to maximize the bytes-to-flops ratio of network bandwidth; that is, to maximize the amount of network bandwidth relative to each node's floating-point capability.

Journal ArticleDOI
TL;DR: The Xbox 360 contains an aggressive hardware architecture and implementation targeted at game console workloads that implements the product designers' goal of providing game developers a hardware platform to implement their next-generation game ambitions.
Abstract: This article covers the Xbox 360's high-level technical requirements, a short system overview, and details of the CPU and the GPU. The Xbox 360 contains an aggressive hardware architecture and implementation targeted at game console workloads. The core silicon implements the product designers' goal of providing game developers a hardware platform to implement their next-generation game ambitions. The core chips include the standard conceptual blocks of CPU, graphics processing unit (GPU), memory, and I/O. Each of these components and their interconnections are customized to provide a user-friendly game console product. The authors describe their architectural trade-offs and summarize the system's software programming support

Journal ArticleDOI
TL;DR: For runahead execution to be efficiently implemented in current or future high-performance processors, which will be energy-constrained, processor designers must develop techniques to reduce the extra instructions it executes.
Abstract: Today's high-performance processors face main-memory latencies on the order of hundreds of processor clock cycles. As a result, even the most aggressive processors spend a significant portion of their execution time stalling and waiting for main-memory accesses to return data to the execution core. Runahead execution is a promising way to tolerate long main-memory latencies because it has modest hardware cost and doesn't significantly increase processor complexity. Runahead execution improves a processor's performance by speculatively pre-executing the application program while the processor services a long-latency (L2) data cache miss, instead of stalling the processor for the duration of the L2 miss. This pre-execution, however, significantly increases the number of instructions the processor executes. For runahead execution to be efficiently implemented in current or future high-performance processors, which will be energy-constrained, processor designers must develop techniques to reduce these extra instructions. Our solution to this problem includes both hardware and software mechanisms that are simple, implementable, and effective.
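The following toy timing model (not the authors' simulator) contrasts a core that stalls on every L2 miss with one that keeps pre-executing the instruction stream during the miss purely to prefetch. The cache geometry, miss latency, and address stream are assumptions, and the stream is deliberately prefetch-friendly, so the gap it shows is exaggerated relative to real workloads; the point is only to show where runahead's benefit, and its extra executed instructions, come from.

/*
 * Toy timing sketch: a single in-order core with a small direct-mapped
 * data cache, comparing "stall on every miss" against a simplified
 * runahead mode that pre-executes ahead of a miss purely to prefetch.
 * Latencies, cache size, and the address stream are all assumptions.
 */
#include <stdio.h>

#define N_INSNS     4000
#define CACHE_LINES 64
#define LINE_SHIFT  6            /* 64-byte lines */
#define MISS_CYCLES 200

static long stream[N_INSNS];     /* byte address per load, -1 = non-memory */

static void build_stream(void) {
    for (int i = 0; i < N_INSNS; i++)               /* one load every 4 insns,  */
        stream[i] = (i % 4 == 0) ? (long)(i / 4) * 64 : -1; /* new line each time */
}

static int lookup(long tags[], long addr, int install) {
    long line = addr >> LINE_SHIFT;
    int  set  = (int)(line % CACHE_LINES);
    if (tags[set] == line) return 1;                /* hit  */
    if (install) tags[set] = line;                  /* fill */
    return 0;
}

static long run(int runahead) {
    long tags[CACHE_LINES], cycles = 0;
    for (int i = 0; i < CACHE_LINES; i++) tags[i] = -1;
    for (int i = 0; i < N_INSNS; i++) {
        cycles++;                                   /* 1 cycle per instruction */
        if (stream[i] < 0) continue;
        if (lookup(tags, stream[i], 1)) continue;   /* cache hit               */
        cycles += MISS_CYCLES;                      /* blocking miss           */
        if (runahead) {
            /* Pre-execute up to MISS_CYCLES instructions ahead; any loads
               found there are turned into prefetches whose lines are
               installed; the pre-executed results themselves are discarded. */
            for (int j = i + 1; j < N_INSNS && j <= i + MISS_CYCLES; j++)
                if (stream[j] >= 0) lookup(tags, stream[j], 1);
        }
    }
    return cycles;
}

int main(void) {
    build_stream();
    long stall = run(0), ra = run(1);
    printf("stall-on-miss: %ld cycles\nrunahead     : %ld cycles (%.1fx faster)\n",
           stall, ra, (double)stall / (double)ra);
    return 0;
}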

Journal ArticleDOI
TL;DR: Two CGCT implementations are presented, RegionScout and Region Coherence Arrays, and simulation results for a broadcast-based multiprocessor system running commercial, scientific, and multiprogrammed workloads are provided.
Abstract: Cache-coherent shared-memory multiprocessors have wide-ranging applications, from commercial transaction processing and database services to large-scale scientific computing. Coarse-grain coherence tracking (CGCT) is a new technique that extends a conventional coherence mechanism and optimizes coherence enforcement. It monitors the coherence status of large regions of memory and uses that information to avoid unnecessary broadcasts and filter unnecessary cache tag lookups, thus improving system performance and power consumption. This article presents two CGCT implementations, RegionScout and Region Coherence Arrays, and provides simulation results for a broadcast-based multiprocessor system running commercial, scientific, and multiprogrammed workloads
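The sketch below captures the core idea in a few dozen lines of C: a per-node table records, for large memory regions rather than individual lines, whether any other node may be caching data from that region, and misses to regions known to be private skip the snoop broadcast. The region size, table organization, and traffic pattern are illustrative assumptions, not the RegionScout or Region Coherence Array designs.

/*
 * Minimal sketch of coarse-grain coherence tracking: a table records,
 * per large memory region, whether any other node may hold lines from
 * that region, so misses to "private" regions skip the snoop broadcast.
 * Region size, table layout, and the traffic below are assumptions.
 */
#include <stdio.h>
#include <stdbool.h>

#define REGION_SHIFT 10                 /* assumed 1-KB regions         */
#define N_REGIONS    1024               /* toy physical memory: 1 MB    */

static bool shared_region[N_REGIONS];   /* true = others may cache it   */
static long broadcasts, filtered;

/* Called on a local cache miss. */
static void miss(long addr) {
    long r = (addr >> REGION_SHIFT) % N_REGIONS;
    if (shared_region[r])
        broadcasts++;                   /* must snoop the other nodes   */
    else
        filtered++;                     /* region known private: go
                                           straight to memory           */
}

/* Called when a snoop reveals another node touching one of our regions. */
static void remote_touch(long addr) {
    shared_region[(addr >> REGION_SHIFT) % N_REGIONS] = true;
}

int main(void) {
    /* Pretend another node shares only the first 64 regions. */
    for (long a = 0; a < 64 * (1L << REGION_SHIFT); a += 1L << REGION_SHIFT)
        remote_touch(a);

    /* Local misses spread over the whole toy address space. */
    for (long a = 0; a < (long)N_REGIONS << REGION_SHIFT; a += 64)
        miss(a);

    printf("broadcasts sent    : %ld\n", broadcasts);
    printf("broadcasts avoided : %ld\n", filtered);
    return 0;
}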

Journal ArticleDOI
TL;DR: The Corning-IBM joint optical shared memory supercomputer interconnect system (Osmosis) project explores the opportunity to advance the role of optical-switching technologies in high-performance computing systems.
Abstract: A crucial part of any high-performance computing (HPC) system is its interconnection network. Corning and IBM are jointly developing a demonstration interconnect based on optical cell switching with electronic control. The Corning-IBM joint optical shared memory supercomputer interconnect system (Osmosis) project explores the opportunity to advance the role of optical-switching technologies in such systems. Key innovations in the scheduler architecture directly address the main HPC requirements: low latency, high throughput, efficient multicast support, and high reliability

Journal ArticleDOI
TL;DR: The authors target better coverage while incurring minimal performance degradation by opportunistically using redundancy in future commodity microprocessors.
Abstract: CMOS scaling continues to enable faster transistors and lower supply voltage, improving microprocessor performance and reducing per-transistor power. The downside of scaling is increased susceptibility to soft errors due to strikes by cosmic particles and radiation from packaging materials. The result is degraded reliability in future commodity microprocessors. The authors target better coverage while incurring minimal performance degradation by opportunistically using redundancy

Journal ArticleDOI
TL;DR: Among proposed strategies for congestion management, only the regional explicit congestion notification (RECN) mechanism achieves both the required efficiency and the scalability that emerging systems demand.
Abstract: Compared to the overdimensioned designs of the past, current interconnection networks operate closer to the point of saturation and run a higher risk of congestion. Among proposed strategies for congestion management, only the regional explicit congestion notification (RECN) mechanism achieves both the required efficiency and the scalability that emerging systems demand

Journal ArticleDOI
TL;DR: Through careful codesign and optimization of an architecture with a new string matching algorithm, the authors show it is possible to build a system that is almost 12 times more efficient than the currently best known approaches.
Abstract: String matching is a critical element of modern intrusion detection systems because it lets a system make decisions based not just on headers, but actual content flowing through the network. Through careful codesign and optimization of an architecture with a new string matching algorithm, the authors show it is possible to build a system that is almost 12 times more efficient than the currently best known approaches

Journal ArticleDOI
TL;DR: A dynamic-compiler-driven runtime voltage and frequency optimizer is proposed for microprocessors that achieves energy savings of up to 70 percent and can be implemented and deployed in a real system.
Abstract: A general dynamic-compilation environment offers power and performance control opportunities for microprocessors. The authors propose a dynamic-compiler-driven runtime voltage and frequency optimizer. A prototype of their design, implemented and deployed in a real system, achieves energy savings of up to 70 percent

Journal ArticleDOI
TL;DR: Sirius, a thermal modeling and simulation framework, combines with ThermalHerd, a distributed runtime scheme for thermal management, to offer a path to thermally efficient on-chip network design.
Abstract: On-chip networks are becoming increasingly popular as a way to connect high-performance single-chip computer systems, but thermal issues greatly limit network design. Sirius, a thermal modeling and simulation framework, combines with ThermalHerd, a distributed runtime scheme for thermal management, to offer a path to thermally efficient on-chip network design.

Journal ArticleDOI
TL;DR: Swich is introduced, an FPGA-based prototype of a new cache-level scheme that keeps two live checkpoints at all times, forming a sliding rollback window that maintains a large minimum and average length.
Abstract: Existing cache-level checkpointing schemes do not continuously support a large rollback window. Immediately after a checkpoint, the number of instructions that the processor can undo falls to zero. To address this problem, we introduce Swich, an FPGA-based prototype of a new cache-level scheme that keeps two live checkpoints at all times, forming a sliding rollback window that maintains a large minimum and average length
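A short sketch of the checkpointing policy the abstract describes, with assumed interval and instruction counts: two checkpoints are always live, and when the younger one reaches the checkpoint interval the older one is retired, so the number of undoable instructions slides between one and two intervals instead of collapsing to zero.

/*
 * Sketch of the sliding-rollback-window policy: keep two live
 * checkpoints at all times so the number of instructions that can be
 * undone never collapses to zero right after taking a checkpoint.
 * The interval and instruction counts are assumptions.
 */
#include <stdio.h>

#define CKPT_INTERVAL 1000     /* assumed instructions between checkpoints */

int main(void) {
    long older = 0, younger = 0;        /* instruction counts at the two
                                           live checkpoints                */
    long min_window = -1, max_window = 0;

    for (long insn = 1; insn <= 10000; insn++) {
        if (insn - younger == CKPT_INTERVAL) {
            older = younger;            /* retire the old checkpoint ...   */
            younger = insn;             /* ... and start a new one         */
        }
        long window = insn - older;     /* instructions we can still undo  */
        if (insn > 2 * CKPT_INTERVAL) { /* ignore the warm-up interval     */
            if (min_window < 0 || window < min_window) min_window = window;
            if (window > max_window) max_window = window;
        }
    }
    printf("rollback window: min %ld, max %ld instructions\n",
           min_window, max_window);
    /* With a single checkpoint the minimum would be 0 (right after each
       checkpoint); with two live checkpoints it never drops below one
       full interval. */
    return 0;
}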

Journal ArticleDOI
TL;DR: A new memory scheduler is presented that makes decisions based on the history of recently scheduled operations, providing two advantages: it can better reason about the delays associated with complex DRAM structures, and it can adapt to different observed workloads.
Abstract: Careful memory scheduling can increase memory bandwidth and overall system performance. We present a new memory scheduler that makes decisions based on the history of recently scheduled operations, providing two advantages: it can better reason about the delays associated with complex DRAM structures, and it can adapt to different observed workloads.
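The sketch below shows one way such a history-based scheduler can be structured: each queued DRAM request is scored against the last few operations that were actually scheduled (open-row reuse, read/write bus turnaround), and the cheapest request goes next. The request queue, cost weights, and two-entry history are illustrative assumptions, not the scheduler the article evaluates.

/*
 * Sketch of a history-based DRAM scheduler: the next request is chosen
 * by scoring each queued request against the operations most recently
 * scheduled (row reuse, read/write turnaround).  The cost model and the
 * queue contents below are illustrative assumptions.
 */
#include <stdio.h>

typedef struct { int row; int is_write; } Req;

#define QLEN    6
#define HISTORY 2

static Req history[HISTORY];             /* most recently scheduled first */

static int cost(Req r) {
    int c = 10;                                     /* base access cost  */
    if (r.row != history[0].row)           c += 30; /* row conflict      */
    if (r.is_write != history[0].is_write) c += 8;  /* bus turnaround    */
    if (r.is_write != history[1].is_write) c += 2;  /* older history     */
    return c;
}

int main(void) {
    Req queue[QLEN] = {
        {7, 1}, {3, 0}, {3, 0}, {7, 0}, {3, 1}, {7, 1},
    };
    int done[QLEN] = {0}, total = 0;
    history[0] = history[1] = (Req){3, 0};          /* assumed prior ops */

    for (int n = 0; n < QLEN; n++) {
        int best = -1, best_cost = 0;
        for (int i = 0; i < QLEN; i++) {            /* pick cheapest next */
            if (done[i]) continue;
            int c = cost(queue[i]);
            if (best < 0 || c < best_cost) { best = i; best_cost = c; }
        }
        done[best] = 1;
        total += best_cost;
        history[1] = history[0];                    /* slide the history */
        history[0] = queue[best];
        printf("scheduled row %d %s (cost %2d)\n", queue[best].row,
               queue[best].is_write ? "write" : "read", best_cost);
    }
    printf("total cost: %d cycles "
           "(FIFO order over the same queue costs 242 under this model)\n",
           total);
    return 0;
}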

Journal ArticleDOI
TL;DR: With software's increasing complexity, providing efficient hardware support for software debugging is critical and will allow developers to deterministically replay and debug an application to pinpoint the root cause of a bug.
Abstract: With software's increasing complexity, providing efficient hardware support for software debugging is critical. Hardware support is necessary to observe and capture, with little or no overhead, the exact execution of a program. Providing this ability to developers will allow them to deterministically replay and debug an application to pinpoint the root cause of a bug.

Journal ArticleDOI
TL;DR: This article refutes the claim that chip multiprocessors with thread-level speculation are necessarily too energy inefficient and proposes out-of-order task spawning to exploit more sources of speculative task-level parallelism.
Abstract: Chip multiprocessors with thread-level speculation have become the subject of intense research. This article refutes the claim that such a design is necessarily too energy inefficient. In addition, it proposes out-of-order task spawning to exploit more sources of speculative task-level parallelism.

Journal ArticleDOI
TL;DR: A microarchitecture-based, software-transparent mechanism offers protection against stack-based buffer overflow attacks with moderate hardware cost and negligible performance overhead.
Abstract: Although researchers have proposed several software approaches to preventing buffer overflow attacks, adversaries still extensively exploit this vulnerability. A microarchitecture-based, software-transparent mechanism offers protection against stack-based buffer overflow attacks with moderate hardware cost and negligible performance overhead

Journal ArticleDOI
Pradip Bose1
TL;DR: Three articles in this general issue of IEEE Micro address the challenge of reliable designs of the future.
Abstract: Many electronics experts predicted that component failures (in particular, tube failures) in the pioneering ENIAC machine would be so frequent that the machine would never be useful. But the engineers (system architects) and component manufacturers improved their art over time to improve the system’s availability. Their achievement of remarkably low failure rates should serve as an inspiration to chip- and system-level designers today. Three articles in this general issue of IEEE Micro address the challenge of reliable designs of the future.

Journal ArticleDOI
TL;DR: The accuracy and speed of various sampling startup techniques are compared, introducing touched memory image and memory hierarchy state, to reduce sampled benchmark simulation times from hours to minutes.
Abstract: Sampling techniques dramatically shorten simulation times for industry-standard benchmarks, but establishing the correct architecture and microarchitecture states at the beginning of each sample can be time-consuming. This article compares the accuracy and speed of various sampling startup techniques, introducing touched memory image and memory hierarchy state. Together, these two techniques reduce sampled benchmark simulation times from hours to minutes

Journal ArticleDOI
TL;DR: A software-configurable processor combines a traditional RISC processor with a field-programmable instruction extension unit that lets the system designer tailor the processor to a particular application.
Abstract: A software-configurable processor combines a traditional RISC processor with a field-programmable instruction extension unit that lets the system designer tailor the processor to a particular application. To add application-specific instructions to the processor, the programmer adds a pragma before a C or C++ function declaration, and the compiler then turns the function into a single instruction
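The fragment below illustrates that programming model. The pragma spelling (custom_instruction) and the sad16 kernel are placeholders invented for this sketch rather than the vendor's actual syntax; an ordinary C compiler ignores (or at most warns about) the unknown pragma, so the code still builds and runs as plain software.

/*
 * Illustration of the programming model the abstract describes: mark a
 * C function with a pragma and let the tool chain map it onto the
 * field-programmable instruction extension unit as a single custom
 * instruction.  The pragma name below is a hypothetical placeholder.
 */
#include <stdint.h>
#include <stdio.h>

/* Hypothetical marker; ordinary compilers ignore unknown pragmas. */
#pragma custom_instruction
static uint32_t sad16(const uint8_t *a, const uint8_t *b) {
    /* Sum of absolute differences over 16 bytes: a natural candidate to
       collapse into one wide, SIMD-style extension instruction. */
    uint32_t sum = 0;
    for (int i = 0; i < 16; i++)
        sum += (a[i] > b[i]) ? (uint32_t)(a[i] - b[i])
                             : (uint32_t)(b[i] - a[i]);
    return sum;
}

int main(void) {
    uint8_t x[16], y[16];
    for (int i = 0; i < 16; i++) { x[i] = (uint8_t)i; y[i] = (uint8_t)(2 * i); }
    printf("SAD = %u\n", sad16(x, y));   /* 0 + 1 + ... + 15 = 120 */
    return 0;
}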