
Showing papers in "IEEE Micro in 2012"


Journal ArticleDOI
TL;DR: A comprehensive study that projects the speedup potential of future multicores and examines the underutilization of integration capacity (dark silicon) is timely and crucial.
Abstract: A key question for the microprocessor research and design community is whether scaling multicores will provide the performance and value needed to scale down many more technology generations. To provide a quantitative answer to this question, a comprehensive study that projects the speedup potential of future multicores and examines the underutilization of integration capacity (dark silicon) is timely and crucial.

1,556 citations


Journal ArticleDOI
Efraim Rotem1, Alon Naveh1, Doron Rajwan1, Avinash N. Ananthakrishnan1, Eliezer Weissmann1 
TL;DR: This article describes power-management innovations introduced on Intel's Sandy Bridge microprocessor, and suggests that an architectural approach that's adaptive to and cognizant of workload behavior and platform physical constraints is indispensable to meeting performance and efficiency goals.
Abstract: Modern microprocessors are evolving into system-on-a-chip designs with high integration levels, catering to ever-shrinking form factors. Portability without compromising performance is a driving market need. An architectural approach that's adaptive to and cognizant of workload behavior and platform physical constraints is indispensable to meeting these performance and efficiency goals. This article describes power-management innovations introduced on Intel's Sandy Bridge microprocessor.

452 citations


Journal ArticleDOI
TL;DR: This article examines the architecture and design of the Compute chip, which combines processors, memory, and communication functions on a single chip to build a massively parallel high-performance computing system out of power-efficient processor chips.
Abstract: Blue Gene/Q aims to build a massively parallel high-performance computing system out of power-efficient processor chips, resulting in power-efficient, cost-efficient, and floor-space- efficient systems. Focusing on reliability during design helps with scaling to large systems and lowers the total cost of ownership. This article examines the architecture and design of the Compute chip, which combines processors, memory, and communication functions on a single chip.

280 citations


Journal ArticleDOI
TL;DR: The DySER (Dynamically Specializing Execution Resources) architecture supports both functionality specialization and parallelism specialization and outperforms an out-of-order CPU, Streaming SIMD Extensions (SSE) acceleration, and GPU acceleration while consuming less energy.
Abstract: The DySER (Dynamically Specializing Execution Resources) architecture supports both functionality specialization and parallelism specialization. By dynamically specializing frequently executing regions and applying parallelism mechanisms, DySER provides efficient functionality and parallelism specialization. It outperforms an out-of-order CPU, Streaming SIMD Extensions (SSE) acceleration, and GPU acceleration while consuming less energy. The full-system field-programmable gate array (FPGA) prototype of DySER integrated into OpenSparc demonstrates a practical implementation.

253 citations


Journal ArticleDOI
TL;DR: The Llano variant of the AMD Fusion accelerated processor unit (APU) deploys AMD Turbo CORE technology to maximize processor performance within the system's thermal design limits.
Abstract: The Llano variant of the AMD Fusion accelerated processor unit (APU) deploys AMD Turbo CORE technology to maximize processor performance within the system's thermal design limits. Low-power design and performance/watt ratio optimization were key design approaches, and power gating is implemented pervasively across the APU.

136 citations


Journal ArticleDOI
TL;DR: This article describes the IBM Blue Gene/Q interconnection network and message unit, which has new routing algorithms and techniques to parallelize the injection and reception of packets in the network interface.
Abstract: This article describes the IBM Blue Gene/Q interconnection network and message unit. Blue Gene/Q is the third generation in the IBM Blue Gene line of massively parallel supercomputers and can be scaled to 20 petaflops and beyond. For better application scalability and performance, Blue Gene/Q has new routing algorithms and techniques to parallelize the injection and reception of packets in the network interface.

103 citations


Journal ArticleDOI
Y. Ajima1, Tomohiro Inoue1, S. Hiramoto1, Toshiyuki Shimizu1, Y. Takagi 
TL;DR: The Tofu interconnect uses a 6D mesh/torus topology in which each cubic fragment of the network has the embeddability of a 3D torus graph, allowing users to run multiple topology-aware applications.
Abstract: The Tofu interconnect uses a 6D mesh/torus topology in which each cubic fragment of the network has the embeddability of a 3D torus graph, allowing users to run multiple topology-aware applications. This article describes the Tofu interconnect architecture, the Tofu network router, the Tofu network interface, and the Tofu barrier interface, and presents preliminary evaluation results.
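The 6D addressing described above can be sketched as a toy model: a node has coordinates in six dimensions, and a hop along a torus dimension wraps around modulo that dimension's extent, while a hop off the end of a mesh dimension has no destination. The extents and the choice of which dimensions wrap below are illustrative assumptions, not the actual Tofu configuration.

```python
# Toy model of 6D mesh/torus addressing. EXTENTS and TORUS are invented
# for illustration; the real Tofu topology differs.
EXTENTS = (6, 6, 6, 2, 3, 2)   # sizes of the six dimensions (assumed)
TORUS = (True, True, True, False, False, False)  # which dims wrap (assumed)

def neighbor(coord, dim, step):
    """Coordinate of the node one hop away along dimension `dim`,
    or None if the hop falls off the edge of a mesh dimension."""
    c = list(coord)
    n = c[dim] + step
    if TORUS[dim]:
        n %= EXTENTS[dim]          # torus dimension: wrap around
    elif not (0 <= n < EXTENTS[dim]):
        return None                # mesh dimension: no wraparound
    c[dim] = n
    return tuple(c)

print(neighbor((5, 0, 0, 1, 2, 0), 0, 1))  # wraps to (0, 0, 0, 1, 2, 0)
```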

97 citations


Journal ArticleDOI
TL;DR: This work efficiently enables conventional block sizes for very large die-stacked DRAM caches with two innovations: it makes hits faster with compound-access scheduling and misses faster with a MissMap.
Abstract: This work efficiently enables conventional block sizes for very large die-stacked DRAM caches with two innovations: it makes hits faster with compound-access scheduling and misses faster with a MissMap. The combination of these mechanisms enables the new organization to deliver performance comparable to that of an idealistic DRAM cache that employs an impractically large SRAM-based on-chip tag array.
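The miss-acceleration idea can be sketched as a presence filter: a small on-chip structure records, per block, whether that block currently resides in the DRAM cache, so a lookup that finds the bit clear can go straight to off-chip memory and skip the slow DRAM-cache tag access. This is a minimal sketch of the concept only; the segment granularity and names are assumptions, not the paper's actual design.

```python
BLOCK_SIZE = 64        # conventional 64-byte blocks
SEGMENT_BLOCKS = 64    # blocks tracked per presence-map entry (assumed)

class MissMap:
    """Toy presence filter for a die-stacked DRAM cache."""
    def __init__(self):
        self.entries = {}  # segment index -> presence bit vector

    def _split(self, addr):
        block = addr // BLOCK_SIZE
        return block // SEGMENT_BLOCKS, block % SEGMENT_BLOCKS

    def is_present(self, addr):
        # Bit clear => guaranteed miss: bypass the DRAM-cache tag access.
        seg, bit = self._split(addr)
        return bool(self.entries.get(seg, 0) >> bit & 1)

    def on_fill(self, addr):   # block installed in the DRAM cache
        seg, bit = self._split(addr)
        self.entries[seg] = self.entries.get(seg, 0) | (1 << bit)

    def on_evict(self, addr):  # block evicted from the DRAM cache
        seg, bit = self._split(addr)
        if seg in self.entries:
            self.entries[seg] &= ~(1 << bit)

mm = MissMap()
mm.on_fill(0x1000)
assert mm.is_present(0x1000)      # hit path: access the DRAM cache
assert not mm.is_present(0x2000)  # miss path: go straight to memory
```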

68 citations


Journal ArticleDOI
TL;DR: This article demonstrates that the coming era of CPU and GPU integration requires us to rethink the CPU's design and architecture, and shows that the code the CPU will run, once appropriate computations are mapped to the GPU, has significantly different characteristics than the original code.
Abstract: We've seen the quick adoption of GPUs as general-purpose computing engines in recent years, fueled by high computational throughput and energy efficiency. There is heavier integration of the CPU and GPU, including the GPU appearing on the same die, further decreasing barriers to the use of the GPU to offload the CPU. Much effort has been made to adapt GPU designs to anticipate this new partitioning of the computation space, including better programming models and more general processing units with support for control flow. However, researchers have placed little attention on the CPU and how it must adapt to this change. This article demonstrates that the coming era of CPU and GPU integration requires us to rethink the CPU's design and architecture. We show that the code the CPU will run, once appropriate computations are mapped to the GPU, has significantly different characteristics than the original code (which previously would have been mapped entirely to the CPU).

60 citations


Journal ArticleDOI
TL;DR: Pack and Cap is a novel, practical methodology to select thread packing and dynamic voltage and frequency scaling configurations by learning multithreaded workload characteristics and adapting to dynamic-power caps.
Abstract: Power capping in computer clusters enables energy budgeting, efficient power delivery, and management of operational and cooling costs. Pack and Cap is a novel, practical methodology to select thread packing and dynamic voltage and frequency scaling (DVFS) configurations by learning multithreaded workload characteristics and adapting to dynamic-power caps. Pack and Cap improves energy efficiency and achievable range of power caps.
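The selection step can be sketched as follows: given measured operating points for each (thread-packing, DVFS) pair, choose the highest-performing point whose power stays under the current cap. The table values here are invented for illustration; the actual system learns workload characteristics online rather than using a fixed table.

```python
# Hypothetical measurements: (threads packed, frequency GHz)
#   -> (power in watts, relative performance). Values are illustrative.
MEASURED = {
    (4, 2.4): (95.0, 1.00),
    (4, 1.8): (70.0, 0.82),
    (2, 2.4): (60.0, 0.63),
    (2, 1.8): (45.0, 0.52),
    (1, 1.2): (25.0, 0.30),
}

def select_config(power_cap_watts):
    """Pick the (threads, freq) point maximizing performance under the cap."""
    feasible = [(perf, cfg) for cfg, (pwr, perf) in MEASURED.items()
                if pwr <= power_cap_watts]
    if not feasible:
        # No point fits: fall back to the least-power configuration.
        return min(MEASURED, key=lambda c: MEASURED[c][0])
    return max(feasible)[1]

print(select_config(75.0))  # -> (4, 1.8) under a 75 W cap
```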

56 citations


Journal ArticleDOI
TL;DR: The authors describe Sparc T4's key features and detail the microarchitecture of the dynamically threaded S3 processor core implemented on Sparc T4.
Abstract: The Sparc T4 is the next generation of Oracle's multicore, multithreaded 64-bit Sparc server processor. It delivers significant performance improvements over its predecessor, the Sparc T3 processor. The authors describe Sparc T4's key features and detail the microarchitecture of the dynamically threaded S3 processor core, which is implemented on Sparc T4.

Journal ArticleDOI
TL;DR: The authors' message-passing service, based on scalable user-level communication and offloaded operations for large-scale, low-latency collective communication, has achieved a unidirectional bandwidth of 6,340 Mbytes/s.
Abstract: The petascale supercomputer Tianhe-1A, which features hybrid multicore CPU and GPU computing, achieves an optimized balance of computation and communication capabilities through a proprietary high-bandwidth, low-latency interconnect fabric. The authors' message-passing service, based on scalable user-level communication and offloaded operations for large-scale, low-latency collective communication, has achieved a unidirectional bandwidth of 6,340 Mbytes/s.

Journal ArticleDOI
TL;DR: In this article, a specialized Scale-Out Processor (SOP) architecture maximizes on-chip computing density to deliver the highest performance per TCO and performance per watt at the data-center level.
Abstract: Performance and total cost of ownership (TCO) are key optimization metrics in large-scale data centers. According to these metrics, data centers designed with conventional server processors are inefficient. Recently introduced processors based on low-power cores can improve both throughput and energy efficiency compared to conventional server chips. However, a specialized Scale-Out Processor (SOP) architecture maximizes on-chip computing density to deliver the highest performance per TCO and performance per watt at the data-center level.

Journal ArticleDOI
TL;DR: The authors focus on the resource management implications and propose a hierarchical approach for dynamically managing the real-time computing constraints of wireless communications systems that run on the SDR cloud.
Abstract: Software-defined radio (SDR) clouds combine SDR concepts with cloud computing technology for designing and managing future base stations. They provide a scalable solution for the evolution of wireless communications. The authors focus on the resource management implications and propose a hierarchical approach for dynamically managing the real-time computing constraints of wireless communications systems that run on the SDR cloud.

Journal ArticleDOI
TL;DR: The Vantage cache-partitioning technique enables configurability and quality-of-service guarantees in large-scale chip multiprocessors with shared caches.
Abstract: The Vantage cache-partitioning technique enables configurability and quality-of-service guarantees in large-scale chip multiprocessors with shared caches. Caches can have hundreds of partitions with sizes specified at cache line granularity, while maintaining high associativity and strict isolation among partitions.

Journal ArticleDOI
TL;DR: Bubble-Up enables the safe colocation of multiple workloads on a single machine for Web service applications that have quality of service constraints, thus greatly improving machine utilization in modern WSCs.
Abstract: Precisely predicting performance degradation due to colocating multiple executing applications on a single machine is critical for improving utilization in modern warehouse-scale computers (WSCs). Bubble-Up is the first mechanism for such precise prediction. As opposed to over-provisioning machines, Bubble-Up enables the safe colocation of multiple workloads on a single machine for Web service applications that have quality of service constraints, thus greatly improving machine utilization in modern WSCs.

Journal ArticleDOI
TL;DR: FabScalar aims to automate superscalar core design, opening up processor design to microarchitectural diversity and its many opportunities.
Abstract: Providing multiple superscalar core types on a chip, each tailored to different classes of instruction-level behavior, is an exciting direction for increasing processor performance and energy efficiency. Unfortunately, processor design and verification effort increases with each additional core type, limiting the microarchitectural diversity that can be practically implemented. FabScalar aims to automate superscalar core design, opening up processor design to microarchitectural diversity and its many opportunities.

Journal ArticleDOI
TL;DR: The D-RMTP SoC provides almost all functions required for humanoid robots, including a real-time processing unit, a real-time inter-node communication link with error correction, and various I/O peripherals.
Abstract: This article illustrates the design and implementation of the Dependable Responsive Multithreaded Processor (D-RMTP) for distributed real-time systems, especially humanoid robots. It presents the humanoid robot Kojiro, whose controllers are currently implemented with 16-bit H8 microprocessors on a USB network. The authors plan to replace them with D-RMTPs to improve dependability: small D-RMTP controllers are embedded at every joint of the robot and interconnected via a real-time network called Responsive Link for distributed control. The D-RMTP is therefore designed to meet severe requirements in terms of footprint, latency, scalability, and dependability. It applies priority-based control at all computation and communication levels, and it implements a hardware-based logging mechanism and error-correcting code (ECC) to improve dependability. The system on a chip (SoC), memory modules, and thermal and voltage sensors are integrated into a system in package (SiP).

Journal ArticleDOI
TL;DR: The algorithm operates on a program dependence graph in static-single-assignment form, extracting task, pipeline, and data parallelism from arbitrary control flow, and coarsening its granularity using a generalized form of typed fusion.
Abstract: This article presents a general algorithm for transforming sequential imperative programs into parallel data-flow programs. The algorithm operates on a program dependence graph in static-single-assignment form, extracting task, pipeline, and data parallelism from arbitrary control flow, and coarsening its granularity using a generalized form of typed fusion. A prototype based on GNU Compiler Collection (GCC) is applied to the automatic parallelization of recursive C programs.

Journal ArticleDOI
Michael Doggett1
TL;DR: This column examines the texture cache, an essential component of modern GPUs that plays an important role in achieving real-time performance when generating realistic images.
Abstract: This column examines the texture cache, an essential component of modern GPUs that plays an important role in achieving real-time performance when generating realistic images. GPUs have many components and the texture cache is only one of them. But it has a real impact on the performance of the GPU if rasterization and memory tiling are set up correctly.

Journal ArticleDOI
TL;DR: By creating a set of low-power modes, hardware mechanisms, and software policies, MemScale trades memory bandwidth for energy savings while tightly limiting the associated performance impact.
Abstract: Main memory accounts for a growing fraction of server energy usage. Investigating active low-power modes for managing main memory, with a system called MemScale, the authors offer a solution for performance-aware energy management. By creating a set of low-power modes, hardware mechanisms, and software policies, MemScale trades memory bandwidth for energy savings while tightly limiting the associated performance impact.
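The core trade-off can be sketched as an epoch-based policy: each epoch, choose the lowest-power memory mode whose predicted slowdown still fits within a user-set performance-loss budget. The frequency, power, and slowdown numbers below are invented for illustration, not taken from the paper.

```python
# Hypothetical memory modes: (frequency MHz, power W, predicted slowdown).
# The fastest mode has zero slowdown by definition; values are illustrative.
MODES = [
    (800, 10.0, 0.00),
    (667,  8.0, 0.01),
    (533,  6.5, 0.03),
    (400,  5.0, 0.07),
]

def pick_mode(allowed_slowdown):
    """Lowest-power mode whose predicted slowdown fits in the budget."""
    candidates = [(power, freq) for freq, power, slow in MODES
                  if slow <= allowed_slowdown]
    # The full-speed mode always qualifies, so candidates is never empty.
    return min(candidates)[1]

print(pick_mode(0.05))  # -> 533 MHz under a 5% performance-loss budget
```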

Journal ArticleDOI
David May1
TL;DR: The event-driven architecture supports energy-efficient multicore and multichip systems in which cores are active only when needed.
Abstract: The XMOS architecture scales from real-time systems with a single multithreaded processor to systems with thousands of processors. Concurrent processing, communications, and I/O are supported by the instruction set of the XCORE processors and by the message-routing techniques and protocols in the XMOS interconnect. The event-driven architecture supports energy-efficient multicore and multichip systems in which cores are active only when needed.

Journal ArticleDOI
TL;DR: Systematically exploring power, performance, and energy sheds new light on the clash of two trends that unfolded over the past decade: the rise of parallel processors in response to technology constraints on power, clock speed, and wire delay; and the rise of managed high-level, portable programming languages.
Abstract: Systematically exploring power, performance, and energy sheds new light on the clash of two trends that unfolded over the past decade: the rise of parallel processors in response to technology constraints on power, clock speed, and wire delay; and the rise of managed high-level, portable programming languages.

Journal ArticleDOI
TL;DR: Kremlin combines a novel dynamic program analysis, hierarchical critical-path analysis, with multicore processor models to evaluate thousands of potential parallelization strategies and estimate their performance outcomes.
Abstract: The Kremlin open-source tool helps programmers by automatically identifying regions in sequential programs that merit parallelization. Kremlin combines a novel dynamic program analysis, hierarchical critical-path analysis, with multicore processor models to evaluate thousands of potential parallelization strategies and estimate their performance outcomes.
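The intuition behind critical-path analysis can be sketched as follows: for a region's dynamic dependence graph, the ratio of total work to the longest dependence chain bounds the speedup available from parallelizing that region. The graph below is made up for illustration; Kremlin's hierarchical analysis and its multicore cost models are far more involved.

```python
def critical_path(deps, cost):
    """Length of the longest dependence chain.
    deps: node -> list of predecessor nodes; cost: node -> work units."""
    memo = {}
    def longest(n):
        if n not in memo:
            memo[n] = cost[n] + max((longest(p) for p in deps[n]), default=0)
        return memo[n]
    return max(longest(n) for n in deps)

# A small invented dependence graph: d depends on b and c, which depend on a.
deps = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
cost = {"a": 1, "b": 3, "c": 5, "d": 1}

work = sum(cost.values())       # 10 units of total work
cp = critical_path(deps, cost)  # longest chain a -> c -> d = 7 units
print(work / cp)                # work / critical path bounds the speedup
```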

Journal ArticleDOI
TL;DR: Godson-T is a research many-core processor designed for parallel scientific computing that delivers efficient performance and flexible programmability simultaneously and has many features to achieve high efficiency for on-chip resource utilization.
Abstract: Godson-T is a research many-core processor designed for parallel scientific computing that delivers efficient performance and flexible programmability simultaneously. It also has many features to achieve high efficiency for on-chip resource utilization, such as a region-based cache coherence protocol, data transfer agents, and hardware-supported synchronization mechanisms. Finally, it also features a highly efficient runtime system, a Pthreads-like programming model, and versatile parallel libraries, which make this many-core design flexibly programmable.

Journal ArticleDOI
TL;DR: Helix automatically parallelizes general-purpose programs without requiring any special hardware; avoids slowing down compiled programs, making it a suitable candidate for mainstream compilers; and outperforms the most similar historical technique that has been implemented in production compilers.
Abstract: Improving system performance increasingly depends on exploiting microprocessor parallelism, yet mainstream compilers still don't parallelize code automatically. Helix automatically parallelizes general-purpose programs without requiring any special hardware; avoids slowing down compiled programs, making it a suitable candidate for mainstream compilers; and outperforms the most similar historical technique that has been implemented in production compilers.

Journal ArticleDOI
TL;DR: New circuit-design techniques that drastically reduce the static RAM (SRAM) memories' energy consumption while still achieving tens of megahertz of operation are discussed.
Abstract: Medical diagnosis and healthcare are at the onset of a revolution fueled by improvements in smart sensors and body area networks. Those sensor nodes' computation and memory requirements are growing, but their energy resources do not increase; thus, more energy-efficient memories and processors are required. New circuit-design techniques that drastically reduce the static RAM (SRAM) memories' energy consumption while still achieving tens of megahertz of operation are discussed.

Journal ArticleDOI
TL;DR: A new low-power object-recognition processor achieves real-time robust recognition, satisfying modern mobile vision systems' requirements, and an attention-based object-recognition algorithm for energy efficiency, a heterogeneous multicore architecture for data- and thread-level parallelism, and a network on a chip for high on-chip bandwidth.
Abstract: A new low-power object-recognition processor achieves real-time robust recognition, satisfying modern mobile vision systems' requirements. The authors introduce an attention-based object-recognition algorithm for energy efficiency, a heterogeneous multicore architecture for data- and thread-level parallelism, and a network on a chip for high on-chip bandwidth. The fabricated chip achieves 30 frames/second throughput and an average 320 mW power consumption on test 720p video sequences, yielding 640 GOPS/W and 10.5 nJ/pixel energy efficiency.

Journal ArticleDOI
TL;DR: To improve GPUs' programmability and thus extend their usage to a wider range of applications, the authors propose to enable transactional memory (TM) on GPUs via Kilo TM, a novel hardware TM system that scales to thousands of concurrent transactions.
Abstract: Programming GPUs is challenging for applications with irregular fine-grained communication between threads. To improve GPUs' programmability and thus extend their usage to a wider range of applications, the authors propose to enable transactional memory (TM) on GPUs via Kilo TM, a novel hardware TM system that scales to thousands of concurrent transactions.

Journal ArticleDOI
TL;DR: A hybrid network-on-chip architecture called Kilo-NoC co-optimizes topology, flow control, and quality of service to achieve significant gains in efficiency.
Abstract: To meet rapidly growing performance demands and energy constraints, future chips will likely feature thousands of on-die resources. Existing network-on-chip solutions weren't designed for scalability and will be unable to meet future interconnect demands. A hybrid network-on-chip architecture called Kilo-NoC co-optimizes topology, flow control, and quality of service to achieve significant gains in efficiency.