Author

Samantika Subramaniam

Bio: Samantika Subramaniam is an academic researcher from Georgia Institute of Technology. The author has contributed to research in topics: Cache & Out-of-order execution. The author has an h-index of 6 and has co-authored 8 publications receiving 264 citations. Previous affiliations of Samantika Subramaniam include Georgia Institute of Technology College of Computing.

Papers
Proceedings ArticleDOI
26 Apr 2009
TL;DR: A new timing simulator is presented that models a modern x86 microarchitecture at a very low level, including out-of-order scheduling and execution that much more closely mirrors current implementations, a detailed cache/memory hierarchy, as well as many x86-specific microarchitecture features (e.g., simple vs. complex decoders, micro-op decomposition and fusion).
Abstract: For academic computer architecture research, a large number of publicly available simulators make use of relatively simple abstractions for the microarchitecture of the processor pipeline. For some types of studies, such as those for multi-core cache coherence designs, a simple pipeline model may suffice. For detailed microarchitecture research, such as studies that are sensitive to the exact behavior of out-of-order scheduling, ALU and bypass network contention, and resource management (e.g., RS and ROB entries), an over-simplified model is not representative of modern processor organizations. We present a new timing simulator that models a modern x86 microarchitecture at a very low level, including out-of-order scheduling and execution that much more closely mirrors current implementations, a detailed cache/memory hierarchy, as well as many x86-specific microarchitecture features (e.g., simple vs. complex decoders, micro-op decomposition and fusion, microcode lookup overhead for long/complex x86 instructions).
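
As a rough illustration of the kind of x86 front-end detail such a simulator has to model, the sketch below distributes instructions over simple and complex decoders by micro-op count. It is a toy model written for this summary; the decoder counts, micro-op limits, and the microcode fallback rule are assumptions, not the simulator's actual parameters.

# Toy sketch (not the simulator's code): one decode cycle with several
# simple decoders, one complex decoder, and a microcode fallback.
def decode_cycle(uop_counts, num_simple=3, simple_limit=1, complex_limit=4):
    # uop_counts: micro-op counts of the next x86 instructions, in order.
    decoded, simple_used, complex_used = [], 0, False
    for uops in uop_counts:
        if uops <= simple_limit and simple_used < num_simple:
            decoded.append(uops)            # fits a simple decoder
            simple_used += 1
        elif uops <= complex_limit and not complex_used:
            decoded.append(uops)            # needs the single complex decoder
            complex_used = True
        else:
            # Long/complex instruction: decode stops for this cycle and the
            # microcode sequencer takes over.
            break
    return decoded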

129 citations

Proceedings ArticleDOI
09 Dec 2006
TL;DR: The original goal for FnF was to design a more scalable memory scheduling microarchitecture than the previously proposed approaches without degrading performance, and the relative infrequency of store-to-load forwarding, accurate LQ index prediction, and speculative cloaking actually combine to enable FnF to slightly outperform the competition.
Abstract: Modern processors use CAM-based load and store queues (LQ/SQ) to support out-of-order memory scheduling and store-to-load forwarding. However, the LQ and SQ scale poorly for the sizes required for large-window, high-ILP processors. Past research has proposed ways to make the SQ more scalable by reorganizing the CAMs or using non-associative structures. In particular, the Store Queue Index Prediction (SQIP) approach allows for load instructions to predict the exact SQ index of a sourcing store and access the SQ in a much simpler and more scalable RAM-based fashion. The reason why SQIP works is that loads that receive data directly from stores will usually receive the data from the same store each time. In our work, we take a slightly different view on the underlying observation used by SQIP: a store that forwards data to a load usually forwards to the same load each time. This subtle change in perspective leads to our "Fire-and-Forget" (FnF) scheme for load/store scheduling and forwarding that results in the complete elimination of the store queue. The idea is that stores issue out of the reservation stations like regular instructions, and any store that forwards data to a load will use a predicted LQ index to directly write the value to the LQ entry without any associative logic. Any mispredictions/misforwardings are detected by a low-overhead pre-commit re-execution mechanism. Our original goal for FnF was to design a more scalable memory scheduling microarchitecture than the previously proposed approaches without degrading performance. The relative infrequency of store-to-load forwarding, accurate LQ index prediction, and speculative cloaking actually combine to enable FnF to slightly outperform the competition. Specifically, our simulation results show that our SQ-less Fire-and-Forget provides a 3.3% speedup over a processor using a conventional fully-associative SQ.
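
To make the forwarding flow concrete, here is a minimal Python sketch of a Fire-and-Forget-style design written for this summary. The table sizes, field names, and the simple trained-on-last-forwarding predictor are illustrative assumptions, not the paper's exact hardware.

# Minimal sketch (illustrative only): stores write their value directly into
# a predicted load-queue entry instead of searching a store queue.
class FireAndForgetSketch:
    def __init__(self, lq_size=64):
        self.lq = [None] * lq_size   # load queue entries: {"addr", "value"}
        self.predictor = {}          # store PC -> LQ index it last forwarded to
        self.memory = {}             # architectural memory state

    def issue_load(self, lq_index, addr):
        # A load allocates an LQ entry; a later store may overwrite the value.
        self.lq[lq_index] = {"addr": addr, "value": self.memory.get(addr, 0)}

    def issue_store(self, store_pc, addr, value):
        # Stores issue out of the reservation stations like any instruction.
        self.memory[addr] = value
        idx = self.predictor.get(store_pc)
        if idx is not None and self.lq[idx] is not None:
            # Fire-and-forget: write into the predicted LQ entry, no CAM search.
            self.lq[idx]["value"] = value

    def precommit_check(self, lq_index):
        # Low-overhead re-execution: reload and compare to catch misforwardings.
        entry = self.lq[lq_index]
        return entry["value"] == self.memory.get(entry["addr"], 0)

    def train(self, store_pc, lq_index):
        # Remember which LQ entry this store last forwarded to.
        self.predictor[store_pc] = lq_index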

62 citations

Proceedings ArticleDOI
06 Mar 2009
TL;DR: The idea of criticality is revisited, and several processor enhancements are proposed that exploit criticality information and can be directly applied to modern x86 microarchitectures.
Abstract: Some instructions have more impact on processor performance than others. Identification of these critical instructions can be used to modify and improve instruction processing. Previous work has shown that the criticality of instructions can be dynamically predicted with high accuracy, and that this information can be leveraged to optimize the performance of load value prediction and instruction steering for clustered architectures. In this work, we revisit the idea of criticality, but we propose several processor enhancements that can exploit criticality information and can be directly applied to modern x86 microarchitectures. For the investment of a small (less than 1KB) criticality predictor, we can make a conventional single-read-port data cache achieve the performance of an ideal dual-read-port cache, yielding an average 10% performance improvement. Our remaining techniques can reuse the predictor (i.e., no additional overhead) to further optimize other aspects of load processing (e.g., caching decisions, store-to-load forwarding, etc.), yielding an overall performance improvement of 16% over a conventional processor. Some of these techniques also allow us to decrease power and area costs for several related hardware structures.
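
The single-read-port result can be pictured with the short sketch below: a small table of saturating counters marks loads as critical, and the lone cache port is arbitrated in their favor. The table size, counter width, and arbitration rule are assumptions made for illustration, not the predictor described in the paper.

# Illustrative sketch: a small PC-indexed criticality predictor arbitrating
# a single data-cache read port.
class CriticalityPredictor:
    def __init__(self, entries=256, max_count=3):
        self.table = [0] * entries   # small table of 2-bit saturating counters
        self.max_count = max_count

    def _index(self, pc):
        return pc % len(self.table)

    def is_critical(self, pc):
        return self.table[self._index(pc)] > self.max_count // 2

    def train(self, pc, was_critical):
        i = self._index(pc)
        if was_critical:
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            self.table[i] = max(self.table[i] - 1, 0)

def pick_load_for_port(ready_load_pcs, predictor):
    # With one read port, serve a predicted-critical load first; otherwise
    # fall back to the oldest ready load.
    critical = [pc for pc in ready_load_pcs if predictor.is_critical(pc)]
    return critical[0] if critical else ready_load_pcs[0]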

29 citations

Proceedings ArticleDOI
27 Feb 2006
TL;DR: This paper uses the idea of dependency vectors from matrix schedulers for non-memory instructions and adapts them to implement a new dependence prediction algorithm, which delivers an 8.4% speedup over blind speculation, achieves better performance than store sets, and has a considerably simpler matrix implementation.
Abstract: Allowing loads to issue out-of-order with respect to earlier unresolved store addresses is very important for extracting parallelism in large-window superscalar processors. Blindly allowing all loads to issue as soon as their addresses are ready can lead to a net performance loss due to a large number of load-store ordering violations. Previous research has proposed memory dependence prediction algorithms to prevent only loads with true memory dependencies from issuing in the presence of unresolved stores. Techniques such as load-store pair identification and store sets have been very successful in achieving performance levels close to that attained by an oracle dependence predictor. These techniques tend to employ relatively complex CAM-based designs, which we believe have been obstacles to the industrial adoption of these algorithms. In this paper, we use the idea of dependency vectors from matrix schedulers for non-memory instructions and adapt them to implement a new dependence prediction algorithm. For applications that experience frequent memory ordering violations, our "store vector" prediction algorithm delivers an 8.4% speedup over blind speculation (compared to 8.5% for perfect dependence prediction), achieves better performance than store sets (8.1%), and has a considerably simpler matrix implementation.
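
A bare-bones sketch of the store-vector idea, written for this summary: each load PC keeps a bit vector over the relative ages of older in-flight stores it has conflicted with, and the scheduler delays the load while any matching store is unresolved. Vector length, table size, and the age-based indexing are illustrative assumptions.

# Illustrative sketch of store-vector memory dependence prediction.
VEC_LEN = 16   # how many older in-flight stores a load can be guarded against

class StoreVectorPredictor:
    def __init__(self, entries=1024):
        # One bit vector per load PC; bit i set means this load previously
        # conflicted with the i-th youngest older unresolved store.
        self.vectors = [[False] * VEC_LEN for _ in range(entries)]

    def _index(self, load_pc):
        return load_pc % len(self.vectors)

    def must_wait(self, load_pc, unresolved_store_ages):
        # Delay the load if any unresolved older store matches a set bit.
        vec = self.vectors[self._index(load_pc)]
        return any(age < VEC_LEN and vec[age] for age in unresolved_store_ages)

    def train_violation(self, load_pc, store_age):
        # On an ordering violation, remember the offending store's relative
        # age so future instances of this load wait for it.
        if store_age < VEC_LEN:
            self.vectors[self._index(load_pc)][store_age] = True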

28 citations

Proceedings ArticleDOI
24 Oct 2008
TL;DR: An early parole (EP) mechanism is proposed that exploits the predictability of dependence-resolution delays to restart fetch of an excluded thread so that the instructions reach the execution core just as the original dependence resolves.
Abstract: Simultaneous multithreading (SMT) attempts to keep a dynamically scheduled processor's resources busy with work from multiple independent threads. Threads with long-latency stalls, however, can lead to a reduction in overall throughput because they occupy many of the critical processor resources. In this work, we first study the interaction between stalls caused by ambiguous memory dependences and SMT processing. We then propose the technique of proactive exclusion (PE) where the SMT fetch unit stops fetching from a thread when a memory dependence is predicted to exist. However, after the dependence has been resolved, the thread is delayed waiting for new instructions to be fetched and delivered down the front-end pipeline. So we introduce an early parole (EP) mechanism that exploits the predictability of dependence-resolution delays to restart fetch of an excluded thread so that the instructions reach the execution core just as the original dependence resolves. We show that combining these two techniques (PEEP) yields a 16.9% throughput improvement on a 4-way SMT processor that supports speculative memory disambiguation. These strong results indicate that a fetch policy that is cognizant of future stalls considerably improves the throughput of an SMT machine.
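
The interplay of proactive exclusion and early parole can be summarized with the small sketch below. The fixed front-end depth and the single per-thread resume cycle are simplifying assumptions for illustration; the paper's mechanism predicts dependence-resolution delays in hardware.

# Illustrative sketch: a fetch policy that excludes a stalled thread and
# paroles it early enough to hide the front-end refill latency.
FRONTEND_DEPTH = 8   # assumed cycles from fetch to the execution core

class PeepFetchPolicy:
    def __init__(self, num_threads):
        self.excluded_until = [0] * num_threads   # cycle at which fetch resumes

    def on_predicted_dependence(self, tid, now, predicted_resolve_cycle):
        # Proactive exclusion: stop fetching this thread, but schedule an
        # early parole so the front-end refill overlaps the remaining stall.
        resume = predicted_resolve_cycle - FRONTEND_DEPTH
        self.excluded_until[tid] = max(now, resume)

    def fetchable_threads(self, now, thread_ids):
        # Threads not currently excluded compete for fetch slots as usual.
        return [t for t in thread_ids if now >= self.excluded_until[t]]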

16 citations


Cited by
Proceedings ArticleDOI
20 Jun 2009
TL;DR: This work proposes a new cache management approach that combines dynamic insertion and promotion policies to provide the benefits of cache partitioning, adaptive insertion, and capacity stealing all with a single mechanism.
Abstract: Many multi-core processors employ a large last-level cache (LLC) shared among the multiple cores. Past research has demonstrated that sharing-oblivious cache management policies (e.g., LRU) can lead to poor performance and fairness when the multiple cores compete for the limited LLC capacity. Different memory access patterns can cause cache contention in different ways, and various techniques have been proposed to target some of these behaviors. In this work, we propose a new cache management approach that combines dynamic insertion and promotion policies to provide the benefits of cache partitioning, adaptive insertion, and capacity stealing all with a single mechanism. By handling multiple types of memory behaviors, our proposed technique outperforms techniques that target only either capacity partitioning or adaptive insertion.
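
A compact sketch of the single-mechanism idea, written for this summary: insertion position is set by each core's capacity allocation and hits promote a line by only one position, which together approximate partitioning without explicit way reservation. The stack representation, allocations, and promotion probability are illustrative assumptions.

# Illustrative sketch: one cache set managed with allocation-driven insertion
# and single-step promotion instead of plain LRU.
import random

class InsertionPromotionSet:
    def __init__(self, ways=16):
        self.stack = []      # index 0 = highest priority; last = next victim
        self.ways = ways

    def access(self, line, core_alloc, promote_prob=1.0):
        if line in self.stack:
            # Hit: promote by a single position (possibly probabilistically),
            # rather than jumping straight to the top as LRU would.
            i = self.stack.index(line)
            if i > 0 and random.random() < promote_prob:
                self.stack[i - 1], self.stack[i] = self.stack[i], self.stack[i - 1]
            return True
        # Miss: evict the lowest-priority line if the set is full, then insert
        # at a depth determined by the requesting core's capacity allocation.
        if len(self.stack) == self.ways:
            self.stack.pop()
        insert_pos = min(len(self.stack), self.ways - core_alloc)
        self.stack.insert(insert_pos, line)
        return False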

334 citations

Journal ArticleDOI
TL;DR: This article explores, analyze, and compares the accuracy and simulation speed of high-abstraction core models, a potential solution to slow cycle-level simulation, and introduces the instruction-window centric (IW-centric) core model, a new mechanistic core model that bridges the gap between interval simulation and cycle-accurate simulation by enabling high-speed simulations with higher levels of detail.
Abstract: Large core counts and complex cache hierarchies are increasing the burden placed on commonly used simulation and modeling techniques. Although analytical models provide fast results, they do not apply to complex, many-core shared-memory systems. In contrast, detailed cycle-level simulation can be accurate but also tends to be slow, which limits the number of configurations that can be evaluated. A middle ground is needed that provides for fast simulation of complex many-core processors while still providing accurate results. In this article, we explore, analyze, and compare the accuracy and simulation speed of high-abstraction core models as a potential solution to slow cycle-level simulation. We describe a number of enhancements to interval simulation to improve its accuracy while maintaining simulation speed. In addition, we introduce the instruction-window centric (IW-centric) core model, a new mechanistic core model that bridges the gap between interval simulation and cycle-accurate simulation by enabling high-speed simulations with higher levels of detail. We also show that using accurate core models like these is important for memory subsystem studies, and that simple, naive models, like a one-IPC core model, can lead to misleading and incorrect results and conclusions in practical design studies. Validation against real hardware shows good accuracy, with an average single-core error of 11.1% and a maximum of 18.8% for the IW-centric model with a 1.5× slowdown compared to interval simulation.

283 citations

Journal ArticleDOI
29 Mar 2011
TL;DR: The Structural Simulation Toolkit (SST) as discussed by the authors is an open, modular, parallel, multi-criteria, multiscale simulation framework for HPC systems that includes a number of processor, memory, and network models.
Abstract: As supercomputers grow, understanding their behavior and performance has become increasingly challenging. New hurdles in scalability, programmability, power consumption, reliability, cost, and cooling are emerging, along with new technologies such as 3D integration, GP-GPUs, silicon-photonics, and other "game changers". Currently, the HPC community lacks a unified toolset to evaluate these technologies and design for these challenges. To address this problem, a number of institutions have joined together to create the Structural Simulation Toolkit (SST), an open, modular, parallel, multi-criteria, multi-scale simulation framework. The SST includes a number of processor, memory, and network models. The SST has been used in a variety of network, memory, and application studies and aims to become the standard simulation framework for designing and procuring HPC systems.

270 citations

Proceedings ArticleDOI
12 Feb 2011
TL;DR: Dynamically Specialized Datapaths are proposed to improve the energy efficiency of general purpose programmable processors, and the results show that in most cases two DySER blocks can achieve the same performance as having a specialized hardware module for each path-tree.
Abstract: Due to limits in technology scaling, energy efficiency of logic devices is decreasing in successive generations. To provide continued performance improvements without increasing power, regardless of the sequential or parallel nature of the application, microarchitectural energy efficiency must improve. We propose Dynamically Specialized Datapaths to improve the energy efficiency of general purpose programmable processors. The key insights of this work are the following. First, applications execute in phases and these phases can be determined by creating a path-tree of basic-blocks rooted at the inner-most loop. Second, specialized datapaths corresponding to these path-trees, which we refer to as DySER blocks, can be constructed by interconnecting a set of heterogeneous computation units with a circuit-switched network. These blocks can be easily integrated with a processor pipeline. A synthesized RTL implementation using an industry 55nm technology library shows a 64-functional-unit DySER block occupies approximately the same area as a 64 KB single-ported SRAM and can execute at 2 GHz. We extend the GCC compiler to identify path-trees and map code to DySER, and evaluate the PARSEC, SPEC, and Parboil benchmark suites. Our results show that in most cases two DySER blocks can achieve the same performance (within 5%) as having a specialized hardware module for each path-tree. A 64-FU DySER block can cover 12% to 100% of the dynamically executed instruction stream. When integrated with a dual-issue out-of-order processor, two DySER blocks provide geometric mean speedup of 2.1X (1.15X to 10X), and geometric mean energy reduction of 40% (up to 70%), and 60% energy reduction if no performance improvement is required.
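
The path-tree notion from the first insight can be sketched as a small profile-guided traversal: starting at an inner-most loop header, follow hot forward edges and stop at back edges or cold paths. The CFG encoding, frequency threshold, and recursion are assumptions made for illustration, not the paper's GCC pass.

# Illustrative sketch: build a path-tree of basic blocks rooted at a loop
# header, keeping only frequently executed, acyclic paths.
def build_path_tree(cfg, block, edge_freq, on_path=frozenset(), min_freq=0.05):
    # cfg: basic block -> list of successor blocks
    # edge_freq: (block, successor) -> observed execution frequency (0..1)
    node = {"block": block, "children": []}
    for succ in cfg.get(block, []):
        # Skip back edges (blocks already on this path) and cold edges.
        if succ in on_path or edge_freq.get((block, succ), 0.0) < min_freq:
            continue
        child = build_path_tree(cfg, succ, edge_freq, on_path | {block}, min_freq)
        node["children"].append(child)
    return node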

229 citations

Proceedings ArticleDOI
12 Dec 2009
TL;DR: The idea is to divide processor execution time into phases, rank applications within a phase based on stall-time criticality, and have all routers in the network prioritize packets based on their applications' ranks.
Abstract: Network-on-Chips (NoCs) are likely to become a critical shared resource in future many-core processors. The challenge is to develop policies and mechanisms that enable multiple applications to efficiently and fairly share the network, to improve system performance. Existing local packet scheduling policies in the routers fail to fully achieve this goal, because they treat every packet equally, regardless of which application issued the packet. This paper proposes prioritization policies and architectural extensions to NoC routers that improve the overall application-level throughput, while ensuring fairness in the network. Our prioritization policies are application-aware, distinguishing applications based on the stall-time criticality of their packets. The idea is to divide processor execution time into phases, rank applications within a phase based on stall-time criticality, and have all routers in the network prioritize packets based on their applications' ranks. Our scheme also includes techniques that ensure starvation freedom and enable the enforcement of system-level application priorities. We evaluate the proposed prioritization policies on a 64-core CMP with an 8x8 mesh NoC, using a suite of 35 diverse applications. For a representative set of case studies, our proposed policy increases average system throughput by 25.6% over age-based arbitration and 18.4% over round-robin arbitration. Averaged over 96 randomly-generated multiprogrammed workload mixes, the proposed policy improves system throughput by 9.1% over the best existing prioritization policy, while also reducing application-level unfairness.
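
The ranking-plus-arbitration loop described above can be condensed into the short sketch below. Ranking applications by stall cycles per injected packet and breaking ties by packet batch are illustrative choices for this summary; the paper's stall-time-criticality estimation and batching hardware are more involved.

# Illustrative sketch: per-phase application ranking and rank-aware packet
# arbitration inside a router.
def rank_applications(stall_cycles, packets_injected):
    # Higher stall time per injected packet -> higher priority next phase.
    apps = sorted(stall_cycles,
                  key=lambda a: stall_cycles[a] / max(packets_injected[a], 1),
                  reverse=True)
    return {app: rank for rank, app in enumerate(apps)}   # 0 = highest priority

def arbitrate(candidates, ranks):
    # candidates: list of (app_id, batch_id, packet). Older batches win first
    # (starvation freedom), then the application's rank decides.
    return min(candidates, key=lambda c: (c[1], ranks.get(c[0], len(ranks))))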

223 citations