
Showing papers in "IEEE Micro in 2012"


Journal ArticleDOI
TL;DR: A comprehensive study that projects the speedup potential of future multicores and examines the underutilization of integration capacity (dark silicon) is timely and crucial.
Abstract: A key question for the microprocessor research and design community is whether scaling multicores will provide the performance and value needed to scale down many more technology generations. To provide a quantitative answer to this question, a comprehensive study that projects the speedup potential of future multicores and examines the underutilization of integration capacity (dark silicon) is timely and crucial.

1,556 citations


Journal ArticleDOI
Efraim Rotem1, Alon Naveh1, Doron Rajwan1, Avinash N. Ananthakrishnan1, Eliezer Weissmann1 
TL;DR: This article describes power-management innovations introduced on Intel's Sandy Bridge microprocessor, and suggests that an architectural approach that's adaptive to and cognizant of workload behavior and platform physical constraints is indispensable to meeting performance and efficiency goals.
Abstract: Modern microprocessors are evolving into system-on-a-chip designs with high integration levels, catering to ever-shrinking form factors. Portability without compromising performance is a driving market need. An architectural approach that's adaptive to and cognizant of workload behavior and platform physical constraints is indispensable to meeting these performance and efficiency goals. This article describes power-management innovations introduced on Intel's Sandy Bridge microprocessor.

452 citations


Journal ArticleDOI
TL;DR: This article examines the architecture and design of the Compute chip, which combines processors, memory, and communication functions on a single chip to build a massively parallel high-performance computing system out of power-efficient processor chips.
Abstract: Blue Gene/Q aims to build a massively parallel high-performance computing system out of power-efficient processor chips, resulting in power-efficient, cost-efficient, and floor-space- efficient systems. Focusing on reliability during design helps with scaling to large systems and lowers the total cost of ownership. This article examines the architecture and design of the Compute chip, which combines processors, memory, and communication functions on a single chip.

280 citations


Journal ArticleDOI
TL;DR: The DySER (Dynamically Specializing Execution Resources) architecture supports both functionality specialization and parallelism specialization and outperforms an out-of-order CPU, Streaming SIMD Extensions (SSE) acceleration, and GPU acceleration while consuming less energy.
Abstract: The DySER (Dynamically Specializing Execution Resources) architecture supports both functionality specialization and parallelism specialization. By dynamically specializing frequently executing regions and applying parallelism mechanisms, DySER provides efficient functionality and parallelism specialization. It outperforms an out-of-order CPU, Streaming SIMD Extensions (SSE) acceleration, and GPU acceleration while consuming less energy. The full-system field-programmable gate array (FPGA) prototype of DySER integrated into OpenSparc demonstrates a practical implementation.

253 citations


Journal ArticleDOI
TL;DR: The Llano variant of the AMD Fusion accelerated processor unit (APU) deploys AMD Turbo CORE technology to maximize processor performance within the system's thermal design limits.
Abstract: The Llano variant of the AMD Fusion accelerated processor unit (APU) deploys AMD Turbo CORE technology to maximize processor performance within the system's thermal design limits. Low-power design and performance/watt ratio optimization were key design approaches, and power gating is implemented pervasively across the APU.

136 citations


Journal ArticleDOI
TL;DR: This article describes the IBM Blue Gene/Q interconnection network and message unit, which has new routing algorithms and techniques to parallelize the injection and reception of packets in the network interface.
Abstract: This article describes the IBM Blue Gene/Q interconnection network and message unit. Blue Gene/Q is the third generation in the IBM Blue Gene line of massively parallel supercomputers and can be scaled to 20 petaflops and beyond. For better application scalability and performance, Blue Gene/Q has new routing algorithms and techniques to parallelize the injection and reception of packets in the network interface.

103 citations


Journal ArticleDOI
Y. Ajima1, Tomohiro Inoue1, S. Hiramoto1, Toshiyuki Shimizu1, Y. Takagi 
TL;DR: The Tofu interconnect uses a 6D mesh/torus topology in which each cubic fragment of the network has the embeddability of a 3D torus graph, allowing users to run multiple topology-aware applications.
Abstract: The Tofu interconnect uses a 6D mesh/torus topology in which each cubic fragment of the network has the embeddability of a 3D torus graph, allowing users to run multiple topology-aware applications. This article describes the Tofu interconnect architecture, the Tofu network router, the Tofu network interface, and the Tofu barrier interface, and presents preliminary evaluation results.
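The 6D addressing described above can be sketched as a toy model: a node has coordinates in six dimensions, and a hop along a torus dimension wraps around modulo that dimension's extent, while a hop off the end of a mesh dimension has no destination. The extents and the choice of which dimensions wrap below are illustrative assumptions, not the actual Tofu configuration.

```python
# Toy model of 6D mesh/torus addressing. EXTENTS and TORUS are invented
# for illustration; the real Tofu topology differs.
EXTENTS = (6, 6, 6, 2, 3, 2)   # sizes of the six dimensions (assumed)
TORUS = (True, True, True, False, False, False)  # which dims wrap (assumed)

def neighbor(coord, dim, step):
    """Coordinate of the node one hop away along dimension `dim`,
    or None if the hop falls off the edge of a mesh dimension."""
    c = list(coord)
    n = c[dim] + step
    if TORUS[dim]:
        n %= EXTENTS[dim]          # torus dimension: wrap around
    elif not (0 <= n < EXTENTS[dim]):
        return None                # mesh dimension: no wraparound
    c[dim] = n
    return tuple(c)

print(neighbor((5, 0, 0, 1, 2, 0), 0, 1))  # wraps to (0, 0, 0, 1, 2, 0)
```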

97 citations


Journal ArticleDOI
TL;DR: This work efficiently enables conventional block sizes for very large die-stacked DRAM caches with two innovations: it makes hits faster with compound-access scheduling and misses faster with a MissMap.
Abstract: This work efficiently enables conventional block sizes for very large die-stacked DRAM caches with two innovations: it makes hits faster with compound-access scheduling and misses faster with a MissMap. The combination of these mechanisms enables the new organization to deliver performance comparable to that of an idealistic DRAM cache that employs an impractically large SRAM-based on-chip tag array.
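The miss-acceleration idea can be sketched as a presence filter: a small on-chip structure records, per block, whether that block currently resides in the DRAM cache, so a lookup that finds the bit clear can go straight to off-chip memory and skip the slow DRAM-cache tag access. This is a minimal sketch of the concept only; the segment granularity and names are assumptions, not the paper's actual design.

```python
BLOCK_SIZE = 64        # conventional 64-byte blocks
SEGMENT_BLOCKS = 64    # blocks tracked per presence-map entry (assumed)

class MissMap:
    """Toy presence filter for a die-stacked DRAM cache."""
    def __init__(self):
        self.entries = {}  # segment index -> presence bit vector

    def _split(self, addr):
        block = addr // BLOCK_SIZE
        return block // SEGMENT_BLOCKS, block % SEGMENT_BLOCKS

    def is_present(self, addr):
        # Bit clear => guaranteed miss: bypass the DRAM-cache tag access.
        seg, bit = self._split(addr)
        return bool(self.entries.get(seg, 0) >> bit & 1)

    def on_fill(self, addr):   # block installed in the DRAM cache
        seg, bit = self._split(addr)
        self.entries[seg] = self.entries.get(seg, 0) | (1 << bit)

    def on_evict(self, addr):  # block evicted from the DRAM cache
        seg, bit = self._split(addr)
        if seg in self.entries:
            self.entries[seg] &= ~(1 << bit)

mm = MissMap()
mm.on_fill(0x1000)
assert mm.is_present(0x1000)      # hit path: access the DRAM cache
assert not mm.is_present(0x2000)  # miss path: go straight to memory
```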

68 citations


Journal ArticleDOI
TL;DR: This article demonstrates that the coming era of CPU and GPU integration requires us to rethink the CPU's design and architecture, and shows that the code the CPU will run, once appropriate computations are mapped to the GPU, has significantly different characteristics than the original code.
Abstract: We've seen the quick adoption of GPUs as general-purpose computing engines in recent years, fueled by high computational throughput and energy efficiency. There is heavier integration of the CPU and GPU, including the GPU appearing on the same die, further decreasing barriers to the use of the GPU to offload the CPU. Much effort has been made to adapt GPU designs to anticipate this new partitioning of the computation space, including better programming models and more general processing units with support for control flow. However, researchers have placed little attention on the CPU and how it must adapt to this change. This article demonstrates that the coming era of CPU and GPU integration requires us to rethink the CPU's design and architecture. We show that the code the CPU will run, once appropriate computations are mapped to the GPU, has significantly different characteristics than the original code (which previously would have been mapped entirely to the CPU).

60 citations


Journal ArticleDOI
TL;DR: Pack and Cap is a novel, practical methodology to select thread packing and dynamic voltage and frequency scaling configurations by learning multithreaded workload characteristics and adapting to dynamic-power caps.
Abstract: Power capping in computer clusters enables energy budgeting, efficient power delivery, and management of operational and cooling costs. Pack and Cap is a novel, practical methodology to select thread packing and dynamic voltage and frequency scaling (DVFS) configurations by learning multithreaded workload characteristics and adapting to dynamic-power caps. Pack and Cap improves energy efficiency and achievable range of power caps.
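The selection step can be sketched as follows: given measured operating points for each (thread-packing, DVFS) pair, choose the highest-performing point whose power stays under the current cap. The table values here are invented for illustration; the actual system learns workload characteristics online rather than using a fixed table.

```python
# Hypothetical measurements: (threads packed, frequency GHz)
#   -> (power in watts, relative performance). Values are illustrative.
MEASURED = {
    (4, 2.4): (95.0, 1.00),
    (4, 1.8): (70.0, 0.82),
    (2, 2.4): (60.0, 0.63),
    (2, 1.8): (45.0, 0.52),
    (1, 1.2): (25.0, 0.30),
}

def select_config(power_cap_watts):
    """Pick the (threads, freq) point maximizing performance under the cap."""
    feasible = [(perf, cfg) for cfg, (pwr, perf) in MEASURED.items()
                if pwr <= power_cap_watts]
    if not feasible:
        # No point fits: fall back to the least-power configuration.
        return min(MEASURED, key=lambda c: MEASURED[c][0])
    return max(feasible)[1]

print(select_config(75.0))  # -> (4, 1.8) under a 75 W cap
```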

56 citations


Journal ArticleDOI
TL;DR: The authors describe Sparc T4's key features and detail the microarchitecture of the dynamically threaded S3 processor core implemented on Sparc T4.
Abstract: The Sparc T4 is the next generation of Oracle's multicore, multithreaded 64-bit Sparc server processor. It delivers significant performance improvements over its predecessor, the Sparc T3 processor. The authors describe Sparc T4's key features and detail the microarchitecture of the dynamically threaded S3 processor core, which is implemented on Sparc T4.

Journal ArticleDOI
TL;DR: The authors' message-passing service, based on scalable user-level communication and offloaded operations for large-scale, low-latency collective communication, has achieved a unidirectional bandwidth of 6,340 Mbytes/s.
Abstract: The petascale supercomputer Tianhe-1A, which features hybrid multicore CPU and GPU computing, achieves an optimized balance of computation and communication capabilities through a proprietary high-bandwidth, low-latency interconnect fabric. The authors' message-passing service, based on scalable user-level communication and offloaded operations for large-scale, low-latency collective communication, has achieved a unidirectional bandwidth of 6,340 Mbytes/s.

Journal ArticleDOI
TL;DR: In this article, a specialized Scale-Out Processor (SOP) architecture maximizes on-chip computing density to deliver the highest performance per TCO and performance per watt at the data-center level.
Abstract: Performance and total cost of ownership (TCO) are key optimization metrics in large-scale data centers. According to these metrics, data centers designed with conventional server processors are inefficient. Recently introduced processors based on low-power cores can improve both throughput and energy efficiency compared to conventional server chips. However, a specialized Scale-Out Processor (SOP) architecture maximizes on-chip computing density to deliver the highest performance per TCO and performance per watt at the data-center level.

Journal ArticleDOI
TL;DR: The authors focus on the resource management implications and propose a hierarchical approach for dynamically managing the real-time computing constraints of wireless communications systems that run on the SDR cloud.
Abstract: Software-defined radio (SDR) clouds combine SDR concepts with cloud computing technology for designing and managing future base stations. They provide a scalable solution for the evolution of wireless communications. The authors focus on the resource management implications and propose a hierarchical approach for dynamically managing the real-time computing constraints of wireless communications systems that run on the SDR cloud.

Journal ArticleDOI
TL;DR: The Vantage cache-partitioning technique enables configurability and quality-of-service guarantees in large-scale chip multiprocessors with shared caches.
Abstract: The Vantage cache-partitioning technique enables configurability and quality-of-service guarantees in large-scale chip multiprocessors with shared caches. Caches can have hundreds of partitions with sizes specified at cache line granularity, while maintaining high associativity and strict isolation among partitions.

Journal ArticleDOI
TL;DR: Bubble-Up enables the safe colocation of multiple workloads on a single machine for Web service applications that have quality of service constraints, thus greatly improving machine utilization in modern WSCs.
Abstract: Precisely predicting performance degradation due to colocating multiple executing applications on a single machine is critical for improving utilization in modern warehouse-scale computers (WSCs). Bubble-Up is the first mechanism for such precise prediction. As opposed to over-provisioning machines, Bubble-Up enables the safe colocation of multiple workloads on a single machine for Web service applications that have quality of service constraints, thus greatly improving machine utilization in modern WSCs.

Journal ArticleDOI
TL;DR: FabScalar aims to automate superscalar core design, opening up processor design to microarchitectural diversity and its many opportunities.
Abstract: Providing multiple superscalar core types on a chip, each tailored to different classes of instruction-level behavior, is an exciting direction for increasing processor performance and energy efficiency. Unfortunately, processor design and verification effort increases with each additional core type, limiting the microarchitectural diversity that can be practically implemented. FabScalar aims to automate superscalar core design, opening up processor design to microarchitectural diversity and its many opportunities.

Journal ArticleDOI
TL;DR: The D-RMTP SoC provides almost all functions required for humanoid robots, including a real-time processing unit, a real-time inter-node communication link with error correction, and various I/O peripherals.
Abstract: This article illustrates the design and implementation of the Dependable Responsive Multithreaded Processor (D-RMTP) for distributed real-time systems, especially humanoid robots. It presents the humanoid robot Kojiro, whose controllers are currently implemented with 16-bit H8 microprocessors on a USB network. The authors plan to replace them with D-RMTPs to improve dependability: small D-RMTP controllers are embedded at every joint of the robot and interconnected via a real-time network called Responsive Link for distributed control. The D-RMTP is therefore designed to meet severe requirements in terms of footprint, latency, scalability, and dependability. It applies priority-based control at all computation and communication levels, and it implements a hardware-based logging mechanism and error-correcting code (ECC) to improve dependability. The system on a chip (SoC), memory modules, and thermal and voltage sensors are integrated into a system in package (SiP).

Journal ArticleDOI
TL;DR: The algorithm operates on a program dependence graph in static-single-assignment form, extracting task, pipeline, and data parallelism from arbitrary control flow, and coarsening its granularity using a generalized form of typed fusion.
Abstract: This article presents a general algorithm for transforming sequential imperative programs into parallel data-flow programs. The algorithm operates on a program dependence graph in static-single-assignment form, extracting task, pipeline, and data parallelism from arbitrary control flow, and coarsening its granularity using a generalized form of typed fusion. A prototype based on GNU Compiler Collection (GCC) is applied to the automatic parallelization of recursive C programs.

Journal ArticleDOI
Michael Doggett1
TL;DR: This column examines the texture cache, an essential component of modern GPUs that plays an important role in achieving real-time performance when generating realistic images.
Abstract: This column examines the texture cache, an essential component of modern GPUs that plays an important role in achieving real-time performance when generating realistic images. GPUs have many components and the texture cache is only one of them. But it has a real impact on the performance of the GPU if rasterization and memory tiling are set up correctly.

Journal ArticleDOI
TL;DR: By creating a set of low-power modes, hardware mechanisms, and software policies, MemScale trades memory bandwidth for energy savings while tightly limiting the associated performance impact.
Abstract: Main memory accounts for a growing fraction of server energy usage. Investigating active low-power modes for managing main memory, with a system called MemScale, the authors offer a solution for performance-aware energy management. By creating a set of low-power modes, hardware mechanisms, and software policies, MemScale trades memory bandwidth for energy savings while tightly limiting the associated performance impact.
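The core trade-off can be sketched as an epoch-based policy: each epoch, choose the lowest-power memory mode whose predicted slowdown still fits within a user-set performance-loss budget. The frequency, power, and slowdown numbers below are invented for illustration, not taken from the paper.

```python
# Hypothetical memory modes: (frequency MHz, power W, predicted slowdown).
# The fastest mode has zero slowdown by definition; values are illustrative.
MODES = [
    (800, 10.0, 0.00),
    (667,  8.0, 0.01),
    (533,  6.5, 0.03),
    (400,  5.0, 0.07),
]

def pick_mode(allowed_slowdown):
    """Lowest-power mode whose predicted slowdown fits in the budget."""
    candidates = [(power, freq) for freq, power, slow in MODES
                  if slow <= allowed_slowdown]
    # The full-speed mode always qualifies, so candidates is never empty.
    return min(candidates)[1]

print(pick_mode(0.05))  # -> 533 MHz under a 5% performance-loss budget
```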

Journal ArticleDOI
David May1
TL;DR: The event-driven architecture supports energy-efficient multicore and multichip systems in which cores are active only when needed.
Abstract: The XMOS architecture scales from real-time systems with a single multithreaded processor to systems with thousands of processors. Concurrent processing, communications, and I/O are supported by the instruction set of the XCORE processors and by the message-routing techniques and protocols in the XMOS interconnect. The event-driven architecture supports energy-efficient multicore and multichip systems in which cores are active only when needed.

Journal ArticleDOI
TL;DR: Systematically exploring power, performance, and energy sheds new light on the clash of two trends that unfolded over the past decade: the rise of parallel processors in response to technology constraints on power, clock speed, and wire delay; and the rise of managed high-level, portable programming languages.
Abstract: Systematically exploring power, performance, and energy sheds new light on the clash of two trends that unfolded over the past decade: the rise of parallel processors in response to technology constraints on power, clock speed, and wire delay; and the rise of managed high-level, portable programming languages.

Journal ArticleDOI
TL;DR: Kremlin combines a novel dynamic program analysis, hierarchical critical-path analysis, with multicore processor models to evaluate thousands of potential parallelization strategies and estimate their performance outcomes.
Abstract: The Kremlin open-source tool helps programmers by automatically identifying regions in sequential programs that merit parallelization. Kremlin combines a novel dynamic program analysis, hierarchical critical-path analysis, with multicore processor models to evaluate thousands of potential parallelization strategies and estimate their performance outcomes.
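The intuition behind critical-path analysis can be sketched as follows: for a region's dynamic dependence graph, the ratio of total work to the longest dependence chain bounds the speedup available from parallelizing that region. The graph below is made up for illustration; Kremlin's hierarchical analysis and its multicore cost models are far more involved.

```python
def critical_path(deps, cost):
    """Length of the longest dependence chain.
    deps: node -> list of predecessor nodes; cost: node -> work units."""
    memo = {}
    def longest(n):
        if n not in memo:
            memo[n] = cost[n] + max((longest(p) for p in deps[n]), default=0)
        return memo[n]
    return max(longest(n) for n in deps)

# A small invented dependence graph: d depends on b and c, which depend on a.
deps = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
cost = {"a": 1, "b": 3, "c": 5, "d": 1}

work = sum(cost.values())       # 10 units of total work
cp = critical_path(deps, cost)  # longest chain a -> c -> d = 7 units
print(work / cp)                # work / critical path bounds the speedup
```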

Journal ArticleDOI
TL;DR: Godson-T is a research many-core processor designed for parallel scientific computing that delivers efficient performance and flexible programmability simultaneously and has many features to achieve high efficiency for on-chip resource utilization.
Abstract: Godson-T is a research many-core processor designed for parallel scientific computing that delivers efficient performance and flexible programmability simultaneously. It also has many features to achieve high efficiency for on-chip resource utilization, such as a region-based cache coherence protocol, data transfer agents, and hardware-supported synchronization mechanisms. Finally, it also features a highly efficient runtime system, a Pthreads-like programming model, and versatile parallel libraries, which make this many-core design flexibly programmable.

Journal ArticleDOI
TL;DR: Helix automatically parallelizes general-purpose programs without requiring any special hardware; avoids slowing down compiled programs, making it a suitable candidate for mainstream compilers; and outperforms the most similar historical technique that has been implemented in production compilers.
Abstract: Improving system performance increasingly depends on exploiting microprocessor parallelism, yet mainstream compilers still don't parallelize code automatically. Helix automatically parallelizes general-purpose programs without requiring any special hardware; avoids slowing down compiled programs, making it a suitable candidate for mainstream compilers; and outperforms the most similar historical technique that has been implemented in production compilers.

Journal ArticleDOI
TL;DR: New circuit-design techniques that drastically reduce the static RAM (SRAM) memories' energy consumption while still achieving tens of megahertz of operation are discussed.
Abstract: Medical diagnosis and healthcare are at the onset of a revolution fueled by improvements in smart sensors and body area networks. Those sensor nodes' computation and memory requirements are growing, but their energy resources do not increase; thus, more energy-efficient memories and processors are required. New circuit-design techniques that drastically reduce the static RAM (SRAM) memories' energy consumption while still achieving tens of megahertz of operation are discussed.

Journal ArticleDOI
TL;DR: A new low-power object-recognition processor achieves real-time robust recognition, satisfying modern mobile vision systems' requirements, and an attention-based object-recognition algorithm for energy efficiency, a heterogeneous multicore architecture for data- and thread-level parallelism, and a network on a chip for high on-chip bandwidth.
Abstract: A new low-power object-recognition processor achieves real-time robust recognition, satisfying modern mobile vision systems' requirements. The authors introduce an attention-based object-recognition algorithm for energy efficiency, a heterogeneous multicore architecture for data- and thread-level parallelism, and a network on a chip for high on-chip bandwidth. The fabricated chip achieves 30 frames/second throughput and an average 320 mW power consumption on test 720p video sequences, yielding 640 GOPS/W and 10.5 nJ/pixel energy efficiency.

Journal ArticleDOI
TL;DR: To improve GPUs' programmability and thus extend their usage to a wider range of applications, the authors propose to enable transactional memory (TM) on GPUs via Kilo TM, a novel hardware TM system that scales to thousands of concurrent transactions.
Abstract: Programming GPUs is challenging for applications with irregular fine-grained communication between threads. To improve GPUs' programmability and thus extend their usage to a wider range of applications, the authors propose to enable transactional memory (TM) on GPUs via Kilo TM, a novel hardware TM system that scales to thousands of concurrent transactions.

Journal ArticleDOI
TL;DR: A hybrid network-on-chip architecture called Kilo-NoC co-optimizes topology, flow control, and quality of service to achieve significant gains in efficiency.
Abstract: To meet rapidly growing performance demands and energy constraints, future chips will likely feature thousands of on-die resources. Existing network-on-chip solutions weren't designed for scalability and will be unable to meet future interconnect demands. A hybrid network-on-chip architecture called Kilo-NoC co-optimizes topology, flow control, and quality of service to achieve significant gains in efficiency.