Journal ISSN: 1544-3566

ACM Transactions on Architecture and Code Optimization 

Association for Computing Machinery
About: ACM Transactions on Architecture and Code Optimization is an academic journal published by the Association for Computing Machinery. The journal publishes mainly in the areas of caches and compilers. It has the ISSN identifier 1544-3566 and is open access. Over its lifetime, 780 publications have appeared in the journal, receiving 13,514 citations. The journal is also known as Architecture and Code Optimization and Transactions on Architecture and Code Optimization.


Papers
Journal Article
TL;DR: This paper describes HotSpot, an accurate yet fast and practical thermal model based on an equivalent circuit of thermal resistances and capacitances that correspond to microarchitecture blocks and essential aspects of the thermal package. It also shows that power metrics are poor predictors of temperature, that sensor imprecision has a substantial impact on the performance of DTM, and that including lateral resistances for thermal diffusion is important for accuracy.
Abstract: With cooling costs rising exponentially, designing cooling solutions for worst-case power dissipation is prohibitively expensive. Chips that can autonomously modify their execution and power-dissipation characteristics permit the use of lower-cost cooling solutions while still guaranteeing safe temperature regulation. Evaluating techniques for this dynamic thermal management (DTM), however, requires a thermal model that is practical for architectural studies. This paper describes HotSpot, an accurate yet fast and practical model based on an equivalent circuit of thermal resistances and capacitances that correspond to microarchitecture blocks and essential aspects of the thermal package. Validation was performed using finite-element simulation. The paper also introduces several effective methods for DTM: "temperature-tracking" frequency scaling, "migrating computation" to spare hardware units, and a "hybrid" policy that combines fetch gating with dynamic voltage scaling. The latter two achieve their performance advantage by exploiting instruction-level parallelism, showing the importance of microarchitecture research in helping control the growth of cooling costs. Modeling temperature at the microarchitecture level also shows that power metrics are poor predictors of temperature, that sensor imprecision has a substantial impact on the performance of DTM, and that the inclusion of lateral resistances for thermal diffusion is important for accuracy.

786 citations
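
To make the equivalent-RC idea concrete, here is a minimal lumped-RC thermal sketch in Python: one node per microarchitecture block, a vertical resistance to the package/ambient, a thermal capacitance, and lateral resistances between neighbouring blocks, integrated with a forward-Euler step. The block names, R/C values, power numbers, and solver are illustrative assumptions, not HotSpot's actual parameters or implementation.

```python
# Minimal lumped-RC thermal sketch (illustrative; not the HotSpot code).
# Each block is one thermal node with a vertical resistance to ambient,
# a thermal capacitance, and lateral resistances to neighbouring blocks.

T_AMB = 45.0  # package/ambient temperature in Celsius (assumed)

blocks = {
    # name: P = power (W), R = vertical resistance (K/W),
    #       C = thermal capacitance (J/K), T = current temperature (C)
    "ALU":    dict(P=6.0, R=0.8, C=0.05, T=45.0),
    "ICache": dict(P=3.0, R=0.6, C=0.08, T=45.0),
    "DCache": dict(P=4.0, R=0.6, C=0.08, T=45.0),
}
# lateral thermal resistances between adjacent blocks (K/W, assumed values)
lateral = {("ALU", "ICache"): 2.0, ("ALU", "DCache"): 2.0}

def step(dt=1e-4):
    """One forward-Euler step of dT/dt = (heat in - heat out) / C per block."""
    flow = {name: b["P"] - (b["T"] - T_AMB) / b["R"] for name, b in blocks.items()}
    for (a, b), r in lateral.items():
        q = (blocks[a]["T"] - blocks[b]["T"]) / r   # lateral diffusion
        flow[a] -= q
        flow[b] += q
    for name, blk in blocks.items():
        blk["T"] += dt * flow[name] / blk["C"]

for _ in range(200_000):   # simulate 20 s of constant power
    step()
print({name: round(b["T"], 1) for name, b in blocks.items()})
```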

Journal Article
TL;DR: This article explores, analyzes, and compares the accuracy and simulation speed of high-abstraction core models, a potential solution to slow cycle-level simulation, and introduces the instruction-window centric (IW-centric) core model, a new mechanistic core model that bridges the gap between interval simulation and cycle-accurate simulation by enabling high-speed simulations with higher levels of detail.
Abstract: Large core counts and complex cache hierarchies are increasing the burden placed on commonly used simulation and modeling techniques. Although analytical models provide fast results, they do not apply to complex, many-core shared-memory systems. In contrast, detailed cycle-level simulation can be accurate but also tends to be slow, which limits the number of configurations that can be evaluated. A middle ground is needed that provides for fast simulation of complex many-core processors while still providing accurate results. In this article, we explore, analyze, and compare the accuracy and simulation speed of high-abstraction core models as a potential solution to slow cycle-level simulation. We describe a number of enhancements to interval simulation to improve its accuracy while maintaining simulation speed. In addition, we introduce the instruction-window centric (IW-centric) core model, a new mechanistic core model that bridges the gap between interval simulation and cycle-accurate simulation by enabling high-speed simulations with higher levels of detail. We also show that using accurate core models like these is important for memory subsystem studies, and that simple, naive models, like a one-IPC core model, can lead to misleading and incorrect results and conclusions in practical design studies. Validation against real hardware shows good accuracy, with an average single-core error of 11.1% and a maximum of 18.8% for the IW-centric model with a 1.5× slowdown compared to interval simulation.

283 citations
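
As a rough illustration of why the choice of core model matters, the sketch below contrasts a naive one-IPC estimate with an interval-style estimate in which the core sustains its dispatch width between miss events and each event adds a drain/refill penalty. The event counts, penalty values, and formulas are simplified assumptions; the paper's interval and IW-centric models are considerably more detailed.

```python
# Toy contrast between a naive one-IPC core model and an interval-style
# cycle estimate (illustrative only). All inputs below are assumed values.

def one_ipc_cycles(instructions, llc_misses, miss_penalty=200):
    # One instruction per cycle, plus a stall for every long-latency miss.
    return instructions + llc_misses * miss_penalty

def interval_cycles(instructions, dispatch_width, rob_size,
                    branch_mispredicts, llc_misses,
                    frontend_refill=10, miss_penalty=200):
    # Between miss events the core sustains its dispatch width; each event
    # starts a new "interval" whose drain/refill cost is added on top.
    base = instructions / dispatch_width
    branch_penalty = branch_mispredicts * (frontend_refill + rob_size / dispatch_width)
    memory_penalty = llc_misses * miss_penalty
    return base + branch_penalty + memory_penalty

stats = dict(instructions=10_000_000, branch_mispredicts=40_000, llc_misses=20_000)
print("one-IPC  cycles:", one_ipc_cycles(stats["instructions"], stats["llc_misses"]))
print("interval cycles:", interval_cycles(dispatch_width=4, rob_size=128, **stats))
```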

Journal Article
TL;DR: This paper designs a tool that carefully models I/O power in the memory system, explores the design space, and lets the user define new types of memory interconnects and topologies; it also introduces a new relay-on-board chip that partitions a DDR channel into multiple cascaded channels.
Abstract: Historically, server designers have opted for simple memory systems by picking one of a few commoditized DDR memory products. We are already witnessing a major upheaval in the off-chip memory hierarchy, with the introduction of many new memory products—buffer-on-board, LRDIMM, HMC, HBM, and NVMs, to name a few. Given the plethora of choices, it is expected that different vendors will adopt different strategies for their high-capacity memory systems, often deviating from DDR standards and/or integrating new functionality within memory systems. These strategies will likely differ in their choice of interconnect and topology, with a significant fraction of memory energy being dissipated in I/O and data movement. To make the case for memory interconnect specialization, this paper makes three contributions. First, we design a tool that carefully models I/O power in the memory system, explores the design space, and gives the user the ability to define new types of memory interconnects/topologies. The tool is validated against SPICE models, and is integrated into version 7 of the popular CACTI package. Our analysis with the tool shows that several design parameters have a significant impact on I/O power. We then use the tool to help craft novel specialized memory system channels. We introduce a new relay-on-board chip that partitions a DDR channel into multiple cascaded channels. We show that this simple change to the channel topology can improve performance by 22% for DDR DRAM and lower cost by up to 65% for DDR DRAM. This new architecture does not require any changes to DIMMs, and it efficiently supports hybrid DRAM/NVM systems. Finally, as an example of a more disruptive architecture, we design a custom DIMM and parallel bus that moves away from the DDR3/DDR4 standards. To reduce energy and improve performance, the baseline data channel is split into three narrow parallel channels and the on-DIMM interconnects are operated at a lower frequency. In addition, this allows us to design a two-tier error protection strategy that reduces data transfers on the interconnect. This architecture yields a performance improvement of 18% and a memory power reduction of 23%. The cascaded channel and narrow channel architectures serve as case studies for the new tool and show the potential for benefit from re-organizing basic memory interconnects.

217 citations
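
A hedged, first-order sketch of the kind of I/O power accounting such a tool performs: dynamic switching power (activity × C × V² × f per data line) plus a per-pin termination term, evaluated for one long channel versus two shorter cascaded segments behind a hypothetical relay chip. All electrical values are assumed for illustration and are not CACTI 7's SPICE-validated models.

```python
# First-order DDR channel I/O power sketch (illustrative; not CACTI 7).
# Capacitance, supply voltage, and termination power per pin are assumptions.

def io_power_watts(data_rate_gbps, bus_width_bits, utilization,
                   c_line_pf=5.0, vdd=1.2, term_mw_per_pin=8.0):
    # Dynamic switching power: utilization * C * V^2 * toggle rate per line,
    # plus a static on-die-termination term per active pin.
    toggle_rate_hz = data_rate_gbps * 1e9 / 2        # assume ~half the bits toggle
    p_dyn = utilization * bus_width_bits * c_line_pf * 1e-12 * vdd**2 * toggle_rate_hz
    p_term = bus_width_bits * term_mw_per_pin * 1e-3
    return p_dyn + p_term

# Compare one long multi-drop channel against two shorter cascaded segments
# behind a hypothetical relay chip (lower line capacitance, split utilization).
print("single channel      :", round(io_power_watts(3.2, 64, 0.6, c_line_pf=6.0), 2), "W")
print("cascaded (2 segments):", round(2 * io_power_watts(3.2, 64, 0.3, c_line_pf=3.0), 2), "W")
```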

Journal Article
TL;DR: Combining power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks for manycore designs at the 22nm technology node shows that 8-core clustering gives the best energy-delay product, whereas when die area is taken into account, 4-core clustering gives the best EDA²P and EDAP.
Abstract: This article introduces McPAT, an integrated power, area, and timing modeling framework that supports comprehensive design space exploration for multicore and manycore processor configurations ranging from 90nm to 22nm and beyond. At the microarchitectural level, McPAT includes models for the fundamental components of a complete chip multiprocessor, including in-order and out-of-order processor cores, networks-on-chip, shared caches, and integrated system components such as memory controllers and Ethernet controllers. At the circuit level, McPAT supports detailed modeling of critical-path timing, area, and power. At the technology level, McPAT models timing, area, and power for the device types forecast in the ITRS roadmap. McPAT has a flexible XML interface to facilitate its use with many performance simulators. Combined with a performance simulator, McPAT enables architects to accurately quantify the cost of new ideas and assess trade-offs of different architectures using new metrics such as Energy-Delay-Area² Product (EDA²P) and Energy-Delay-Area Product (EDAP). This article explores the interconnect options of future manycore processors by varying the degree of clustering over generations of process technologies. Clustering will bring interesting trade-offs between area and performance because the interconnects needed to group cores into clusters incur area overhead, but many applications can make good use of them due to synergies from cache sharing. Combining power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks for manycore designs at the 22nm technology node shows that 8-core clustering gives the best energy-delay product, whereas when die area is taken into account, 4-core clustering gives the best EDA²P and EDAP.

201 citations
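
The composite metrics mentioned above are straightforward to compute once energy, delay, and area are known; the sketch below shows EDP, EDAP, and EDA²P for a few hypothetical cluster configurations. The numbers are invented purely to show how the metrics weigh area against energy and delay; they are not McPAT results.

```python
# Sketch of the composite metrics named in the abstract (EDP, EDAP, EDA^2P).
# Per-cluster numbers are hypothetical, chosen only to exercise the formulas.

def edp(e, d):       return e * d          # energy-delay product
def edap(e, d, a):   return e * d * a      # energy-delay-area product
def eda2p(e, d, a):  return e * d * a**2   # energy-delay-area^2 product

designs = {
    # cores per cluster: (energy J, delay s, die area mm^2) -- assumed values
    2: (1.10, 1.00, 380.0),
    4: (1.00, 0.95, 400.0),
    8: (0.95, 0.92, 450.0),
}
for cores, (e, d, a) in designs.items():
    print(f"{cores}-core cluster: EDP={edp(e, d):.3f} "
          f"EDAP={edap(e, d, a):.1f} EDA2P={eda2p(e, d, a):.0f}")
```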

Journal Article
TL;DR: This paper outlines the new architecture of the client compiler and shows how it interacts with the VM, including the intermediate representation, which now uses static single-assignment (SSA) form, and the linear scan algorithm for global register allocation.
Abstract: Version 6 of Sun Microsystems' Java HotSpot™ VM ships with a redesigned version of the client just-in-time compiler that includes several research results of the last years. The client compiler is at the heart of the VM configuration used by default for interactive desktop applications. For such applications, low startup and pause times are more important than peak performance. This paper outlines the new architecture of the client compiler and shows how it interacts with the VM. It presents the intermediate representation that now uses static single-assignment (SSA) form and the linear scan algorithm for global register allocation. Efficient support for exception handling and deoptimization fulfills the demands that are imposed by the dynamic features of the Java programming language. The evaluation shows that the new client compiler generates better code in less time. The popular SPECjvm98 benchmark suite is executed 45% faster, while the compilation speed is also up to 40% better. This indicates that a carefully selected set of global optimizations can also be integrated in just-in-time compilers that focus on compilation speed and not on peak performance. In addition, the paper presents the impact of several optimizations on execution and compilation speed. As the source code is freely available, the Java HotSpot™ VM and the client compiler are the ideal basis for experiments with new feedback-directed optimizations in a production-level Java just-in-time compiler. The paper outlines research projects that add fast algorithms for escape analysis, automatic object inlining, and array bounds check elimination.

177 citations
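
As background for the register-allocation technique named above, here is a toy linear-scan allocator over live intervals in the spirit of Poletto and Sarkar; the HotSpot client compiler's allocator additionally handles SSA form, interval splitting, and spilling heuristics that this sketch omits. The intervals and register count below are invented for illustration.

```python
# Toy linear-scan register allocation over live intervals (illustrative;
# not the HotSpot client compiler's allocator).

def linear_scan(intervals, num_regs):
    """intervals: list of (name, start, end); returns name -> register or 'spill'."""
    intervals = sorted(intervals, key=lambda iv: iv[1])      # by start point
    free = list(range(num_regs))
    active = []                                              # (end, name, reg)
    assignment = {}
    for name, start, end in intervals:
        # Expire intervals that ended before this one starts.
        for iv in [iv for iv in active if iv[0] <= start]:
            active.remove(iv)
            free.append(iv[2])
        if free:
            reg = free.pop()
            active.append((end, name, reg))
            assignment[name] = reg
        else:
            # Spill whichever interval ends last: the furthest-ending active
            # interval, or the current one if it ends even later.
            victim = max(active, key=lambda iv: iv[0])
            if victim[0] > end:
                active.remove(victim)
                assignment[victim[1]] = "spill"
                active.append((end, name, victim[2]))
                assignment[name] = victim[2]
            else:
                assignment[name] = "spill"
    return assignment

print(linear_scan([("a", 0, 6), ("b", 1, 4), ("c", 2, 8), ("d", 5, 7)], num_regs=2))
```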

Performance Metrics
Number of papers from the journal in previous years:

Year    Papers
2023    19
2022    79
2021    48
2020    51
2019    67
2018    57