Conference

High-Performance Computer Architecture

About: High-Performance Computer Architecture is an academic conference. The conference publishes majorly in the area(s): Cache & Cache pollution. Over the lifetime, 1176 publications have been published by the conference receiving 78614 citations.

...read moreread less

Topics: Cache, Cache pollution, Cache algorithms, Cache coloring, Shared memory ...read more

Papers published on a yearly basis

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

Evaluating MapReduce for Multi-core and Multiprocessor Systems

[...]

C. Ranger¹, R. Raghuraman¹, A. Penmetsa¹, Gary Bradski¹, Christos Kozyrakis¹ - Show less +1 more•Institutions (1)

Stanford University¹

10 Feb 2007

TL;DR: It is established that, given a careful implementation, MapReduce is a promising model for scalable performance on shared-memory systems with simple parallel code.

...read moreread less

Abstract: This paper evaluates the suitability of the MapReduce model for multi-core and multi-processor systems. MapReduce was created by Google for application development on data-centers with thousands of servers. It allows programmers to write functional-style code that is automatically parallelized and scheduled in a distributed system. We describe Phoenix, an implementation of MapReduce for shared-memory systems that includes a programming API and an efficient runtime system. The Phoenix runtime automatically manages thread creation, dynamic task scheduling, data partitioning, and fault tolerance across processor nodes. We study Phoenix with multi-core and symmetric multiprocessor systems and evaluate its performance potential and error recovery features. We also compare MapReduce code to code written in lower-level APIs such as P-threads. Overall, we establish that, given a careful implementation, MapReduce is a promising model for scalable performance on shared-memory systems with simple parallel code

...read moreread less

1,058 citations

Proceedings Article•DOI•

Dynamic thermal management for high-performance microprocessors

[...]

David Brooks¹, Margaret Martonosi¹•Institutions (1)

Princeton University¹

20 Jan 2001

TL;DR: This work investigates dynamic thermal management as a technique to control CPU power dissipation and explores the tradeoffs between several mechanisms for responding to periods of thermal trauma and the effects of hardware and software implementations.

...read moreread less

Abstract: With the increasing clock rate and transistor count of today's microprocessors, power dissipation is becoming a critical component of system design complexity. Thermal and power-delivery issues are becoming especially critical for high-performance computing systems. In this work, we investigate dynamic thermal management as a technique to control CPU power dissipation. With the increasing usage of clock gating techniques, the average power dissipation typically seen by common applications is becoming much less than the chip's rated maximum power dissipation. However system designers still must design thermal heat sinks to withstand the worse-case scenario. We define and investigate the major components of any dynamic thermal management scheme. Specifically we explore the tradeoffs between several mechanisms for responding to periods of thermal trauma and we consider the effects of hardware and software implementations. With approximate dynamic thermal management, the CPU can be designed for a much lower maximum power rating, with minimal performance impact for typical applications.

...read moreread less

882 citations

Proceedings Article•DOI•

System level analysis of fast, per-core DVFS using on-chip switching regulators

[...]

Wonyoung Kim¹, Meeta S. Gupta¹, Gu-Yeon Wei¹, David Brooks¹•Institutions (1)

Harvard University¹

24 Oct 2008

TL;DR: It is concluded that on-chip regulators can significantly improve DVFS effectiveness and lead to overall system energy savings in a CMP, but architects must carefully account for overheads and costs when designing next-generation DVFS systems and algorithms.

...read moreread less

Abstract: Portable, embedded systems place ever-increasing demands on high-performance, low-power microprocessor design. Dynamic voltage and frequency scaling (DVFS) is a well-known technique to reduce energy in digital systems, but the effectiveness of DVFS is hampered by slow voltage transitions that occur on the order of tens of microseconds. In addition, the recent trend towards chip-multiprocessors (CMP) executing multi-threaded workloads with heterogeneous behavior motivates the need for per-core DVFS control mechanisms. Voltage regulators that are integrated onto the same chip as the microprocessor core provide the benefit of both nanosecond-scale voltage switching and per-core voltage control. We show that these characteristics provide significant energy-saving opportunities compared to traditional off-chip regulators. However, the implementation of on-chip regulators presents many challenges including regulator efficiency and output voltage transient characteristics, which are significantly impacted by the system-level application of the regulator. In this paper, we describe and model these costs, and perform a comprehensive analysis of a CMP system with on-chip integrated regulators. We conclude that on-chip regulators can significantly improve DVFS effectiveness and lead to overall system energy savings in a CMP, but architects must carefully account for overheads and costs when designing next-generation DVFS systems and algorithms.

...read moreread less

758 citations

Proceedings Article•DOI•

LogTM: log-based transactional memory

[...]

Kevin E. Moore¹, Jayaram Bobba¹, M.J. Moravan¹, Mark D. Hill¹, Darien Wood¹ - Show less +1 more•Institutions (1)

University of Wisconsin-Madison¹

27 Feb 2006

TL;DR: This paper presents a new implementation of transactional memory, log-based transactionalMemory (LogTM), that makes commits fast by storing old values to a per-thread log in cacheable virtual memory and storing new values in place.

...read moreread less

Abstract: Transactional memory (TM) simplifies parallel programming by guaranteeing that transactions appear to execute atomically and in isolation. Implementing these properties includes providing data version management for the simultaneous storage of both new (visible if the transaction commits) and old (retained if the transaction aborts) values. Most (hardware) TM systems leave old values "in place" (the target memory address) and buffer new values elsewhere until commit. This makes aborts fast, but penalizes (the much more frequent) commits. In this paper, we present a new implementation of transactional memory, log-based transactional memory (LogTM), that makes commits fast by storing old values to a per-thread log in cacheable virtual memory and storing new values in place. LogTM makes two additional contributions. First, LogTM extends a MOESI directory protocol to enable both fast conflict detection on evicted blocks and fast commit (using lazy cleanup). Second, LogTM handles aborts in (library) software with little performance penalty. Evaluations running micro- and SPLASH-2 benchmarks on a 32-way multiprocessor support our decision to optimize for commit by showing that only 1-2% of transactions abort.

...read moreread less

724 citations

Proceedings Article•DOI•

PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning

[...]

Linghao Song¹, Xuehai Qian², Hai Li³, Yi Chen⁴•Institutions (4)

University of Pittsburgh¹, University of Southern California², Nanyang Technological University³, Chinese Academy of Sciences⁴

01 Feb 2017

TL;DR: PipeLayer is presented, a ReRAM-based PIM accelerator for CNNs that support both training and testing and proposes highly parallel design based on the notion of parallelism granularity and weight replication, which enables the highly pipelined execution of bothTraining and testing, without introducing the potential stalls in previous work.

...read moreread less

Abstract: Convolution neural networks (CNNs) are the heart of deep learning applications. Recent works PRIME [1] and ISAAC [2] demonstrated the promise of using resistive random access memory (ReRAM) to perform neural computations in memory. We found that training cannot be efficiently supported with the current schemes. First, they do not consider weight update and complex data dependency in training procedure. Second, ISAAC attempts to increase system throughput with a very deep pipeline. It is only beneficial when a large number of consecutive images can be fed into the architecture. In training, the notion of batch (e.g. 64) limits the number of images can be processed consecutively, because the images in the next batch need to be processed based on the updated weights. Third, the deep pipeline in ISAAC is vulnerable to pipeline bubbles and execution stall. In this paper, we present PipeLayer, a ReRAM-based PIM accelerator for CNNs that support both training and testing. We analyze data dependency and weight update in training algorithms and propose efficient pipeline to exploit inter-layer parallelism. To exploit intra-layer parallelism, we propose highly parallel design based on the notion of parallelism granularity and weight replication. With these design choices, PipeLayer enables the highly pipelined execution of both training and testing, without introducing the potential stalls in previous work. The experiment results show that, PipeLayer achieves the speedups of 42.45x compared with GPU platform on average. The average energy saving of PipeLayer compared with GPU implementation is 7.17x.

...read moreread less

633 citations

Collapse

Performance

Metrics

1,176

Papers

78,614

Citations

No. of papers from the Conference in previous years
Year	Papers
2021	69
2020	54
2019	58
2018	68
2017	63
2016	65