
Showing papers in "IEEE Micro in 2003"


Journal ArticleDOI
TL;DR: Google's architecture features clusters of more than 15,000 commodity-class PCs with fault-tolerant software that achieves superior performance at a fraction of the cost of a system built from fewer, but more expensive, high-end servers.
Abstract: Amenable to extensive parallelization, Google's web search application lets different queries run on different processors and, by partitioning the overall index, also lets a single query use multiple processors. To handle this workload, Google's architecture features clusters of more than 15,000 commodity-class PCs with fault-tolerant software. This architecture achieves superior performance at a fraction of the cost of a system built from fewer, but more expensive, high-end servers.

1,129 citations


Journal ArticleDOI
TL;DR: The main trends and challenges in circuit reliability are discussed, and evolving techniques for dealing with them are explained.
Abstract: Deep-submicron technology is having a significant impact on permanent, intermittent, and transient classes of faults. This article discusses the main trends and challenges in circuit reliability, and explains evolving techniques for dealing with them.

622 citations


Journal ArticleDOI
TL;DR: It is argued that instead of assuming average behavior, it is now time to model and optimize phase-based program behavior.
Abstract: Understanding program behavior is at the foundation of computer architecture and program optimization. Many programs have wildly different behavior on even the largest of scales (that is, over the program's complete execution). During one part of the execution, a program can be completely memory bound; in another, it can repeatedly stall on branch mispredicts. Average statistics gathered about a program might not accurately picture where the real problems lie. This realization has ramifications for many architecture and compiler techniques, from how to best schedule threads on a multithreaded machine, to feedback-directed optimizations, power management, and the simulation and test of architectures. Taking advantage of time-varying behavior requires a set of automated analytic tools and hardware techniques that can discover similarities and changes in program behavior on the largest of time scales. The challenge in building such tools is that during a program's lifetime it can execute billions or trillions of instructions. How can high-level behavior be extracted from this sea of instructions? Some programs change behavior drastically, switching between periods of high and low performance, yet system design and optimization typically focus on average system behavior. It is argued that instead of assuming average behavior, it is now time to model and optimize phase-based program behavior.

279 citations


Journal ArticleDOI
David A. Koufaty1, Scott Rodgers1
TL;DR: Describes how hyperthreading technology works, that is, how a single physical processor appears as multiple logical processors to operating systems and software, and shows how this technology significantly improves performance on several relevant workloads.
Abstract: Hyperthreading technology, which brings the concept of simultaneous multithreading to the Intel architecture, was first introduced on the Intel Xeon processor in early 2002 for the server market. In November 2002, Intel launched the technology on the Intel Pentium 4 at clock frequencies of 3.06 GHz and higher, making the technology widely available to the consumer market. This technology signals a new direction in microarchitecture development and fundamentally changes the cost-benefit tradeoffs of microarchitecture design choices. This article describes how the technology works, that is, how we make a single physical processor appear as multiple logical processors to operating systems and software. We highlight the additional structures and die area needed to implement the technology and discuss the fundamental ideas behind the technology and why we can get a 25-percent boost in performance from a technology that costs less than 5 percent in added die area. We illustrate the importance of choosing the right sharing policy for each shared resource by describing, examining, and comparing three different sharing policies: partitioned resources, threshold sharing, and full sharing. The choice of policy depends on the traffic pattern, complexity and size of the resource, potential deadlock/livelock scenarios, and other considerations. Finally, we show how this technology significantly improves performance on several relevant workloads.

253 citations
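The three sharing policies the article compares can be illustrated with a toy allocation check for a queue shared by two logical processors. This is a minimal sketch under assumed sizes, not Intel's actual structures or numbers:

```python
# Toy model of three resource-sharing policies for a queue of `capacity`
# entries shared by two logical processors. All names and sizes here are
# illustrative assumptions, not taken from the article.

def can_allocate(policy, occupancy, thread, capacity=32, threshold=24):
    """Return True if `thread` (0 or 1) may take one more queue entry.

    occupancy: [entries held by thread 0, entries held by thread 1]
    """
    if policy == "partitioned":
        # Each thread owns a fixed half; one thread can never starve the other.
        return occupancy[thread] < capacity // 2
    if policy == "threshold":
        # Entries are shared, but no thread may exceed `threshold` of them.
        return occupancy[thread] < threshold and sum(occupancy) < capacity
    if policy == "full":
        # Any free entry goes to whichever thread asks first.
        return sum(occupancy) < capacity
    raise ValueError(policy)
```

The tradeoff the article describes falls out directly: partitioning is simplest and deadlock-free, full sharing uses capacity best when one thread is idle, and threshold sharing sits in between.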


Journal ArticleDOI
TL;DR: Representing AMD's entry into 64-bit computing, Opteron combines the backward compatibility of the x86-64 architecture with a DDR memory controller and HyperTransport links to deliver server-class performance.
Abstract: Representing AMD's entry into 64-bit computing, Opteron combines the backward compatibility of the x86-64 architecture with a DDR memory controller and HyperTransport links to deliver server-class performance. These features also make Opteron a flexible, modular, and easily connectable component for various multiprocessor configurations.

247 citations


Journal ArticleDOI
TL;DR: The Tera-op reliable intelligently adaptive processing system (TRIPS) architecture seeks to deliver system-level configurability to applications and runtime systems by employing the concept of polymorphism.
Abstract: The Tera-op reliable intelligently adaptive processing system (TRIPS) architecture seeks to deliver system-level configurability to applications and runtime systems. It does so by employing the concept of polymorphism, which permits the runtime system to configure the hardware execution resources to match the mode of execution and demands of the compiler and application.

206 citations


Journal ArticleDOI
C. McNairy1, D. Soltis2
TL;DR: The Itanium 2 processor extends the processing power of the Itanium processor family with a capable and balanced microarchitecture. Executing up to six instructions at a time, it provides both performance and binary compatibility for Itanium-based applications and operating systems.
Abstract: The Itanium 2 processor extends the processing power of the Itanium processor family with a capable and balanced microarchitecture. Executing up to six instructions at a time, it provides both performance and binary compatibility for Itanium-based applications and operating systems.

160 citations


Journal ArticleDOI
TL;DR: During the past decade, interconnects have replaced transistors as the dominant determiner of chip performance, but new and radically different interconnect technologies will become increasingly important to future gigascale microsystems.
Abstract: During the past decade, interconnects have replaced transistors as the dominant determiner of chip performance. To sustain the historical rate of advance in performance, monolithic interconnect technology has rapidly evolved to keep pace with advances in transistor density and performance. New and radically different interconnect technologies will become increasingly important to future gigascale microsystems.

160 citations


Journal ArticleDOI
TL;DR: Chip-level redundant threading with recovery for chip multiprocessors extends previous transient-fault detection schemes to provide fault recovery and uses the trailing thread state for recovery to hide interprocessor latency.
Abstract: Chip-level redundant threading with recovery (CRTR) for chip multiprocessors extends previous transient-fault detection schemes to provide fault recovery. To hide interprocessor latency, CRTR uses a long slack enabled by asymmetric commit and uses the trailing thread state for recovery. CRTR increases bandwidth supply by pipelining communication paths and reduces bandwidth demand by extending the dependence-based checking elision.

157 citations


Journal ArticleDOI
TL;DR: The authors define the role of architecture techniques and describe HotSpot, an accurate yet fast thermal model suitable for computer architecture research.
Abstract: Temperature-aware design techniques have an important role to play in addition to traditional techniques like power-aware design and package- and board-level thermal engineering. The authors define the role of architecture techniques and describe HotSpot, an accurate yet fast thermal model suitable for computer architecture research.

143 citations


Journal ArticleDOI
TL;DR: An architectural-level power model for interconnection network routers will let researchers and designers easily factor in power when exploring architectural tradeoffs.
Abstract: As interconnection networks proliferate to many new applications, a low-latency high-throughput fabric no longer suffices. An architectural-level power model for interconnection network routers will let researchers and designers easily factor in power when exploring architectural tradeoffs.

Journal ArticleDOI
TL;DR: In a comprehensive study using the Itsy pocket computer, the authors measure both total system power and power dissipated by individual subcircuits for representative workloads and suggest possible low-power design optimizations and power management strategies.
Abstract: In a comprehensive study using the Itsy pocket computer, the authors measure both total system power and power dissipated by individual subcircuits for representative workloads. The results suggest possible low-power design optimizations and power management strategies.

Journal ArticleDOI
TL;DR: Statistical simulation enables quick and accurate design decisions in the early stages of computer design, at the processor and system levels, reducing total design time and cost.
Abstract: Statistical simulation enables quick and accurate design decisions in the early stages of computer design, at the processor and system levels. It complements detailed but slower architectural simulations, reducing total design time and cost.

Journal ArticleDOI
TL;DR: An indispensable ingredient for future success is improvement in the design-manufacture interface; the semiconductor industry also needs continuous reduction of the k1 factor.
Abstract: With lithography parameters approaching their limits, continuous improvement requires increasing dialogue and compromise between the technology and design communities. Only with such communication can semiconductor manufacturers reach the 30 nm physical-gate-length era with optical lithography. Optical lithography is an enabling technology for transistor miniaturization. With the wavelength and numerical aperture of exposure systems approaching their limits, the semiconductor industry needs continuous reduction of the k1 factor. Challenges include image quality improvement, proximity effect correction, and cost control. An indispensable ingredient for future success is improvement in the design-manufacture interface.
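The k1 factor in the abstract comes from the Rayleigh resolution criterion, CD = k1 * wavelength / NA: once wavelength and numerical aperture (NA) are fixed near their limits, printing a smaller critical dimension (CD) forces k1 down. A small numeric check, with illustrative numbers that are not from the article:

```python
# Rayleigh criterion: CD = k1 * wavelength / NA, so k1 = CD * NA / wavelength.
# Shrinking the printed feature while wavelength and NA are near their
# limits forces k1 downward.

def k1_factor(cd_nm, wavelength_nm, na):
    return cd_nm * na / wavelength_nm

# Assumed example values: 193 nm ArF exposure, NA = 0.85, targeting the
# 30 nm physical gate length the abstract mentions.
print(round(k1_factor(30, 193, 0.85), 3))  # -> 0.132
```

A k1 this far below the classical single-exposure limit of 0.25 is why the abstract stresses proximity effect correction and design-manufacture cooperation.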

Journal ArticleDOI
TL;DR: The authors propose several designs that treat the cache as a network of banks and facilitate nonuniform accesses to different physical regions, offering lower latency, greater scalability, and more stable performance than conventional uniform-access cache architectures.
Abstract: Nonuniform cache access designs solve the on-chip wire delay problem for future large integrated caches. By embedding a network in the cache, NUCA designs let data migrate within the cache, clustering the working set nearest the processor. The authors propose several designs that treat the cache as a network of banks and facilitate nonuniform accesses to different physical regions. NUCA architectures offer lower latency, greater scalability, and more stable performance than conventional uniform-access cache architectures.
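The migration idea can be sketched as a chain of banks ordered by distance from the processor, where a hit swaps the block one bank closer. This is a simplification of my own, not the authors' exact promotion policy:

```python
# Minimal sketch of NUCA-style data migration (an assumed simplification,
# not the paper's exact design): banks are ordered by wire distance from
# the processor, and each hit swaps the block one bank closer, so the
# working set gradually clusters in the nearest, lowest-latency banks.

class NUCACache:
    def __init__(self, num_banks, bank_size):
        # banks[0] is closest to the processor (lowest access latency)
        self.banks = [dict() for _ in range(num_banks)]
        self.bank_size = bank_size

    def access(self, addr):
        """Return the index of the bank that served `addr`, or None on a miss."""
        for i, bank in enumerate(self.banks):
            if addr in bank:
                if i > 0:  # promote toward the processor on a hit
                    closer = self.banks[i - 1]
                    del bank[addr]
                    if len(closer) >= self.bank_size:
                        # swap: demote an arbitrary block from the closer bank
                        victim = next(iter(closer))
                        del closer[victim]
                        bank[victim] = True
                    closer[addr] = True
                return i
        return None

    def fill(self, addr):
        """Misses fill into the farthest bank, evicting arbitrarily if full."""
        far = self.banks[-1]
        if len(far) >= self.bank_size:
            del far[next(iter(far))]
        far[addr] = True
```

Repeated hits to the same block return ever-smaller bank indices, which is the latency-clustering effect the abstract describes.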

Journal ArticleDOI
TL;DR: A multiple clock domain (MCD) microarchitecture, which uses a globally asynchronous, locally synchronous (GALS) clocking style, permits future aggressive frequency increases, maintains a synchronous design methodology, and exploits the trend of making functional blocks more autonomous.
Abstract: Multiple clock domains is one solution to the increasing problem of propagating the clock signal across increasingly larger and faster chips. The ability to independently scale frequency and voltage in each domain creates a powerful means of reducing power dissipation. A multiple clock domain (MCD) microarchitecture, which uses a globally asynchronous, locally synchronous (GALS) clocking style, permits future aggressive frequency increases, maintains a synchronous design methodology, and exploits the trend of making functional blocks more autonomous. In MCD, each processor domain is internally synchronous, but domains operate asynchronously with respect to one another. Designers still apply existing synchronous design techniques to each domain, but global clock skew is no longer a constraint. Moreover, domains can have independent voltage and frequency control, enabling dynamic voltage scaling at the domain level.

Journal ArticleDOI
TL;DR: Runahead execution uses otherwise-idle clock cycles to achieve an average 22 percent performance improvement for processors with instruction windows of contemporary sizes.
Abstract: An instruction window that can tolerate latencies to DRAM memory is prohibitively complex and power hungry. To avoid having to build such large windows, runahead execution uses otherwise-idle clock cycles to achieve an average 22 percent performance improvement for processors with instruction windows of contemporary sizes. This technique incurs only a small hardware cost and does not significantly increase the processor's complexity.
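The cycle-recycling idea can be shown with a toy in-order model. This is a sketch under strong simplifying assumptions (unit-cost hits, a fixed miss latency, perfect prefetching during runahead), not the authors' out-of-order implementation:

```python
# Toy model of runahead execution: a load miss would normally stall the
# processor for the full miss latency. In runahead mode, those otherwise-
# idle cycles are instead spent running ahead down the instruction stream
# and prefetching later misses, overlapping their latencies with the
# first miss instead of serializing them. Numbers are illustrative.

MISS_LATENCY = 100  # cycles; an assumed DRAM latency

def run(loads, runahead=False):
    cache = set()
    cycles = 0
    for i, addr in enumerate(loads):
        cycles += 1  # one cycle per load that hits
        if addr not in cache:
            cycles += MISS_LATENCY
            cache.add(addr)
            if runahead:
                # while the miss is outstanding, run ahead and prefetch
                # the addresses of the next MISS_LATENCY loads
                for future in loads[i + 1 : i + 1 + MISS_LATENCY]:
                    cache.add(future)
    return cycles

stream = list(range(8))  # eight loads, all initially missing
print(run(stream), run(stream, runahead=True))  # -> 808 108
```

In this toy run, eight serialized misses cost 808 cycles, while runahead overlaps seven of them behind the first for 108 cycles, illustrating how idle stall cycles turn into useful prefetches.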

Journal ArticleDOI
TL;DR: This work evaluates the Vector IRAM architecture and shows that a compiler can vectorize embedded tasks automatically without compromising code density, and describes a prototype vector processor that outperforms high-end superscalar and VLIW designs by 1.5x to 100x for media tasks, without compromising power consumption.
Abstract: For embedded applications with data-level parallelism, a vector processor offers high performance at low power consumption and low design complexity. Unlike superscalar and VLIW designs, a vector processor is scalable and can optimally match specific application requirements. To demonstrate that vector architectures meet the requirements of embedded media processing, we evaluate the Vector IRAM, or VIRAM (pronounced "V-IRAM"), architecture developed at UC Berkeley, using benchmarks from the Embedded Microprocessor Benchmark Consortium (EEMBC). Our evaluation covers all three components of the VIRAM architecture: the instruction set, the vectorizing compiler, and the processor microarchitecture. We show that a compiler can vectorize embedded tasks automatically without compromising code density. We also describe a prototype vector processor that outperforms high-end superscalar and VLIW designs by 1.5x to 100x for media tasks, without compromising power consumption. Finally, we demonstrate that clustering and modular design techniques let a vector processor scale to tens of arithmetic data paths before wide instruction-issue capabilities become necessary.

Journal ArticleDOI
TL;DR: This article presents a method for generating accurate soft-error estimates early in the design cycle, letting designers weigh the benefits of error protection techniques against their costs.
Abstract: The continuous exponential growth in transistors per chip as described by Moore's law has spurred tremendous progress in the functionality and performance of semiconductor devices, particularly microprocessors. At the same time, each succeeding technology generation has introduced new obstacles to maintaining this growth rate. Transient faults caused by single-event upsets have emerged as a key challenge likely to gain significantly more importance in the next few design generations. Techniques for dealing with these faults exist, but they come at a cost. Designers need accurate soft-error estimates early in the design cycle to weigh the benefits of error protection techniques against their costs. This article presents a method for generating these estimates.

Journal ArticleDOI
Francois Abel1, Cyriel Minkenberg1, Ronald P. Luijten1, Mitch Gusat1, Ilias Iliadis1 
TL;DR: This 4-Tbps packet switch uses a combined input- and crosspoint-queued (CICQ) structure with virtual output queuing at the ingress to achieve the scalability of input-buffered switches, the performance of output-buffered switches, and low latency.
Abstract: This 4-Tbps packet switch uses a combined input- and crosspoint-queued (CICQ) structure with virtual output queuing at the ingress to achieve the scalability of input-buffered switches, the performance of output-buffered switches, and low latency.

Journal ArticleDOI
TL;DR: This new approach characterizes power dissipation on complex DSPs, relying on an initial functional-level power analysis of the target processor together with a characterization that qualifies the most significant architectural and algorithmic parameters for power dissipation, obtained from a simple profiling of the assembly code.
Abstract: This new approach characterizes power dissipation on complex DSPs. Its processor model relies on an initial functional-level power analysis of the target processor together with a characterization that qualifies the more significant architectural and algorithmic parameters for power dissipation. These parameters come from a simple profiling of the assembly code. This functional model accounts for deeply pipelined, superscalar, and hierarchical memory architectures.

Journal ArticleDOI
TL;DR: This paper presents algorithms that let a ternary content-addressable memory (TCAM) sort in O(N) memory cycles.
Abstract: Sorting and searching are classic problems in computing. Although several RAM-based solutions exist, algorithms using ternary content-addressable memories offer performance benefits. Using these algorithms, a TCAM can sort in O(N) memory cycles.
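One way such a scheme can work (a sketch of the general technique, not necessarily the paper's exact algorithm) is to treat each ternary lookup as a single memory cycle in which every stored entry is compared against a 0/1/don't-care pattern in parallel, and to extract the minimum by fixing bits from the most significant down:

```python
# Sketch of TCAM-based sorting: each ternary lookup models one memory
# cycle in which all entries match a 0/1/don't-care ('x') pattern in
# parallel. Extracting the minimum takes W lookups for W-bit words, so
# sorting N values costs N*W cycles -- O(N) for a fixed word width.
# Details here are illustrative, not necessarily the paper's algorithm.

WIDTH = 8  # word width in bits; an assumed TCAM entry size

def tcam_match(entries, pattern):
    """One 'memory cycle': does any valid entry match the ternary pattern?"""
    for value in entries:
        bits = format(value, f"0{WIDTH}b")
        if all(p in ("x", b) for p, b in zip(pattern, bits)):
            return True
    return False

def tcam_sort(values):
    entries = set(values)  # assume distinct W-bit keys for simplicity
    result = []
    while entries:
        prefix = ""
        for _ in range(WIDTH):  # fix bits MSB->LSB, preferring 0
            trial = prefix + "0" + "x" * (WIDTH - len(prefix) - 1)
            prefix += "0" if tcam_match(entries, trial) else "1"
        minimum = int(prefix, 2)
        entries.remove(minimum)  # invalidate the entry just read out
        result.append(minimum)
    return result

print(tcam_sort([42, 7, 255, 0, 19]))  # -> [0, 7, 19, 42, 255]
```

The software loop over entries stands in for the hardware's parallel compare; in a real TCAM that whole loop is one cycle, which is where the performance benefit over RAM-based sorting comes from.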

Journal ArticleDOI
TL;DR: Transactional lock removal can dynamically eliminate synchronization operations and achieve transparent transactional execution by treating lock-based critical sections as lock-free optimistic transactions.
Abstract: Although lock-based critical sections are the synchronization method of choice, they have significant performance limitations and lack certain properties, such as failure atomicity and stability. Addressing both these limitations requires considerable software overhead. Transactional lock removal can dynamically eliminate synchronization operations and achieve transparent transactional execution by treating lock-based critical sections as lock-free optimistic transactions.

Journal ArticleDOI
TL;DR: This flow-monitoring circuit delivers an ordered byte stream to a client application for every TCP/IP connection it processes, using an active flow-processing algorithm.
Abstract: This flow-monitoring circuit delivers an ordered byte stream to a client application for every TCP/IP connection it processes. Using an active flow-processing algorithm, TCP Splitter is a lightweight, efficient design that supports the monitoring of an almost unlimited number of flows at multigigabit line rates.

Journal ArticleDOI
TL;DR: Electronic components now control a car's movements, provide entertainment and communication, and help ensure safety; a new platform-based methodology can revolutionize the way a car is designed.
Abstract: Electronic components are now essential to control a car's movements and chemical, mechanical, and electrical processes; to provide entertainment and communication; and to ensure safety. A new platform-based methodology can revolutionize the way a car is designed.

Journal ArticleDOI
TL;DR: The token coherence framework directly enforces the coherence invariant by counting tokens, which enables more obviously correct protocols that do not rely on request ordering and can operate with alternative policies that seek to improve the performance of future multiprocessors.
Abstract: Commercial workload and technology trends are pushing existing shared-memory multiprocessor coherence protocols in divergent directions. Token coherence provides a framework for new coherence protocols that can reconcile these opposing trends. The token coherence framework directly enforces the coherence invariant by counting tokens (requiring all of a block's tokens to write and at least one token to read). This token-counting approach enables more obviously correct protocols that do not rely on request ordering and can operate with alternative policies that seek to improve the performance of future multiprocessors.
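The token-counting invariant in the abstract (all of a block's tokens to write, at least one to read) is simple enough to state directly in code. A minimal sketch, not the authors' full protocol with its performance policies:

```python
# The token-coherence invariant in miniature: each block has a fixed
# number of tokens; a processor needs all of them to write and at least
# one to read, so a writer and any concurrent reader are impossible.
# The toy system size is an assumption for illustration.

TOKENS = 4  # one token per processor in this toy system

class TokenBlock:
    def __init__(self, num_procs=TOKENS):
        self.held = [0] * num_procs
        self.held[0] = TOKENS  # processor 0 starts with all tokens

    def can_read(self, p):
        return self.held[p] >= 1

    def can_write(self, p):
        return self.held[p] == TOKENS

    def send_tokens(self, src, dst, n):
        assert self.held[src] >= n
        self.held[src] -= n
        self.held[dst] += n
        # invariant: tokens are conserved, never created or destroyed
        assert sum(self.held) == TOKENS
```

Because safety follows from conservation and counting alone, correctness never depends on the order in which requests arrive, which is what frees the framework to experiment with unordered interconnects and alternative performance policies.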

Journal ArticleDOI
TL;DR: The improved performance of current microprocessors brings with it increasingly complex and power-dissipating issue logic and a range of mechanisms for tackling this problem.
Abstract: The improved performance of current microprocessors brings with it increasingly complex and power-dissipating issue logic. Recent proposals introduce a range of mechanisms for tackling this problem.

Journal ArticleDOI
TL;DR: Market-related trends continue to drive innovation in the semiconductor industry; in particular, they are driving the design of systems on a chip, the new breed of complex, highly integrated systems.
Abstract: Market-related trends continue to drive innovation in the semiconductor industry today. In particular, they are driving the design of systems on a chip, the new breed of complex, highly integrated systems.

Journal ArticleDOI
TL;DR: A new technique, checkpoint processing and recovery, offers an efficient means of increasing the instruction window size without requiring large, cycle-critical structures, and provides a promising microarchitecture for future high-performance processors.
Abstract: Processors require a combination of large instruction windows and high clock frequency to achieve high performance. Traditional processors use reorder buffers, but these structures do not scale efficiently as window size increases. A new technique, checkpoint processing and recovery, offers an efficient means of increasing the instruction window size without requiring large, cycle-critical structures, and provides a promising microarchitecture for future high-performance processors.

Journal ArticleDOI
TL;DR: By adapting the branch target buffer's size and dynamically disabling a hybrid predictor's components, the authors create a customized branch predictor that saves a significant amount of energy with little performance degradation.
Abstract: To exploit instruction-level parallelism, high-end processors use branch predictors consisting of many large, often underutilized structures that cause unnecessary energy waste and high power consumption. By adapting the branch target buffer's size and dynamically disabling a hybrid predictor's components, the authors create a customized branch predictor that saves a significant amount of energy with little performance degradation.