Author

Samantika Subramaniam

Bio: Samantika Subramaniam is an academic researcher from Georgia Institute of Technology. The author has contributed to research in topics: Cache & Out-of-order execution. The author has an h-index of 6 and has co-authored 8 publications receiving 264 citations. Previous affiliations of Samantika Subramaniam include Georgia Institute of Technology College of Computing.

Papers
Proceedings ArticleDOI
26 Apr 2009
TL;DR: A new timing simulator is presented that models a modern x86 microarchitecture at a very low level, including out-of-order scheduling and execution that much more closely mirrors current implementations, a detailed cache/memory hierarchy, as well as many x86-specific microarchitecture features (e.g., simple vs. complex decoders, micro-op decomposition and fusion).
Abstract: For academic computer architecture research, a large number of publicly available simulators make use of relatively simple abstractions for the microarchitecture of the processor pipeline. For some types of studies, such as those for multi-core cache coherence designs, a simple pipeline model may suffice. For detailed microarchitecture research, such as studies that are sensitive to the exact behavior of out-of-order scheduling, ALU and bypass network contention, and resource management (e.g., RS and ROB entries), an over-simplified model is not representative of modern processor organizations. We present a new timing simulator that models a modern x86 microarchitecture at a very low level, including out-of-order scheduling and execution that much more closely mirrors current implementations, a detailed cache/memory hierarchy, as well as many x86-specific microarchitecture features (e.g., simple vs. complex decoders, micro-op decomposition and fusion, microcode lookup overhead for long/complex x86 instructions).
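
As a rough illustration of the kind of x86 front-end detail such a simulator has to model, the sketch below distributes instructions over simple and complex decoders by micro-op count. It is a toy model written for this summary; the decoder counts, micro-op limits, and the microcode fallback rule are assumptions, not the simulator's actual parameters.

# Toy sketch (not the simulator's code): one decode cycle with several
# simple decoders, one complex decoder, and a microcode fallback.
def decode_cycle(uop_counts, num_simple=3, simple_limit=1, complex_limit=4):
    # uop_counts: micro-op counts of the next x86 instructions, in order.
    decoded, simple_used, complex_used = [], 0, False
    for uops in uop_counts:
        if uops <= simple_limit and simple_used < num_simple:
            decoded.append(uops)            # fits a simple decoder
            simple_used += 1
        elif uops <= complex_limit and not complex_used:
            decoded.append(uops)            # needs the single complex decoder
            complex_used = True
        else:
            # Long/complex instruction: decode stops for this cycle and the
            # microcode sequencer takes over.
            break
    return decoded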

129 citations

Proceedings ArticleDOI
09 Dec 2006
TL;DR: The original goal for FnF was to design a more scalable memory scheduling microarchitecture than the previously proposed approaches without degrading performance, and the relative infrequency of store-to-load forwarding, accurate LQ index prediction, and speculative cloaking actually combine to enable FnF to slightly outperform the competition.
Abstract: Modern processors use CAM-based load and store queues (LQ/SQ) to support out-of-order memory scheduling and store-to-load forwarding. However, the LQ and SQ scale poorly for the sizes required for large-window, high-ILP processors. Past research has proposed ways to make the SQ more scalable by reorganizing the CAMs or using non-associative structures. In particular, the Store Queue Index Prediction (SQIP) approach allows for load instructions to predict the exact SQ index of a sourcing store and access the SQ in a much simpler and more scalable RAM-based fashion. The reason why SQIP works is that loads that receive data directly from stores will usually receive the data from the same store each time. In our work, we take a slightly different view on the underlying observation used by SQIP: a store that forwards data to a load usually forwards to the same load each time. This subtle change in perspective leads to our "Fire-and-Forget" (FnF) scheme for load/store scheduling and forwarding that results in the complete elimination of the store queue. The idea is that stores issue out of the reservation stations like regular instructions, and any store that forwards data to a load will use a predicted LQ index to directly write the value to the LQ entry without any associative logic. Any mispredictions/misforwardings are detected by a low-overhead pre-commit re-execution mechanism. Our original goal for FnF was to design a more scalable memory scheduling microarchitecture than the previously proposed approaches without degrading performance. The relative infrequency of store-to-load forwarding, accurate LQ index prediction, and speculative cloaking actually combine to enable FnF to slightly outperform the competition. Specifically, our simulation results show that our SQ-less Fire-and-Forget provides a 3.3% speedup over a processor using a conventional fully-associative SQ.
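
To make the forwarding flow concrete, here is a minimal Python sketch of a Fire-and-Forget-style design written for this summary. The table sizes, field names, and the simple trained-on-last-forwarding predictor are illustrative assumptions, not the paper's exact hardware.

# Minimal sketch (illustrative only): stores write their value directly into
# a predicted load-queue entry instead of searching a store queue.
class FireAndForgetSketch:
    def __init__(self, lq_size=64):
        self.lq = [None] * lq_size   # load queue entries: {"addr", "value"}
        self.predictor = {}          # store PC -> LQ index it last forwarded to
        self.memory = {}             # architectural memory state

    def issue_load(self, lq_index, addr):
        # A load allocates an LQ entry; a later store may overwrite the value.
        self.lq[lq_index] = {"addr": addr, "value": self.memory.get(addr, 0)}

    def issue_store(self, store_pc, addr, value):
        # Stores issue out of the reservation stations like any instruction.
        self.memory[addr] = value
        idx = self.predictor.get(store_pc)
        if idx is not None and self.lq[idx] is not None:
            # Fire-and-forget: write into the predicted LQ entry, no CAM search.
            self.lq[idx]["value"] = value

    def precommit_check(self, lq_index):
        # Low-overhead re-execution: reload and compare to catch misforwardings.
        entry = self.lq[lq_index]
        return entry["value"] == self.memory.get(entry["addr"], 0)

    def train(self, store_pc, lq_index):
        # Remember which LQ entry this store last forwarded to.
        self.predictor[store_pc] = lq_index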

62 citations

Proceedings ArticleDOI
06 Mar 2009
TL;DR: The idea of criticality is revisited, and several processor enhancements are proposed that exploit criticality information and can be directly applied to modern x86 microarchitectures.
Abstract: Some instructions have more impact on processor performance than others. Identification of these critical instructions can be used to modify and improve instruction processing. Previous work has shown that the criticality of instructions can be dynamically predicted with high accuracy, and that this information can be leveraged to optimize the performance of load value prediction and instruction steering for clustered architectures. In this work, we revisit the idea of criticality, but we propose several processor enhancements that can exploit criticality information and can be directly applied to modern x86 microarchitectures. For the investment of a small (less than 1KB) criticality predictor, we can make a conventional single-read-port data cache achieve the performance of an ideal dual-read-port cache, yielding an average 10% performance improvement. Our remaining techniques can reuse the predictor (i.e., no additional overhead) to further optimize other aspects of load processing (e.g., caching decisions, store-to-load forwarding, etc.), yielding an overall performance improvement of 16% over a conventional processor. Some of these techniques also allow us to decrease power and area costs for several related hardware structures.
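
The single-read-port result can be pictured with the short sketch below: a small table of saturating counters marks loads as critical, and the lone cache port is arbitrated in their favor. The table size, counter width, and arbitration rule are assumptions made for illustration, not the predictor described in the paper.

# Illustrative sketch: a small PC-indexed criticality predictor arbitrating
# a single data-cache read port.
class CriticalityPredictor:
    def __init__(self, entries=256, max_count=3):
        self.table = [0] * entries   # small table of 2-bit saturating counters
        self.max_count = max_count

    def _index(self, pc):
        return pc % len(self.table)

    def is_critical(self, pc):
        return self.table[self._index(pc)] > self.max_count // 2

    def train(self, pc, was_critical):
        i = self._index(pc)
        if was_critical:
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            self.table[i] = max(self.table[i] - 1, 0)

def pick_load_for_port(ready_load_pcs, predictor):
    # With one read port, serve a predicted-critical load first; otherwise
    # fall back to the oldest ready load.
    critical = [pc for pc in ready_load_pcs if predictor.is_critical(pc)]
    return critical[0] if critical else ready_load_pcs[0]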

29 citations

Proceedings ArticleDOI
27 Feb 2006
TL;DR: This paper uses the idea of dependency vectors from matrix schedulers for non-memory instructions and adapts them to implement a new dependence prediction algorithm, which delivers an 8.4% speedup over blind speculation, achieves better performance than store sets, and has a considerably simpler matrix implementation.
Abstract: Allowing loads to issue out-of-order with respect to earlier unresolved store addresses is very important for extracting parallelism in large-window superscalar processors. Blindly allowing all loads to issue as soon as their addresses are ready can lead to a net performance loss due to a large number of load-store ordering violations. Previous research has proposed memory dependence prediction algorithms to prevent only loads with true memory dependencies from issuing in the presence of unresolved stores. Techniques such as load-store pair identification and store sets have been very successful in achieving performance levels close to that attained by an oracle dependence predictor. These techniques tend to employ relatively complex CAM-based designs, which we believe have been obstacles to the industrial adoption of these algorithms. In this paper, we use the idea of dependency vectors from matrix schedulers for non-memory instructions and adapt them to implement a new dependence prediction algorithm. For applications that experience frequent memory ordering violations, our "store vector" prediction algorithm delivers an 8.4% speedup over blind speculation (compared to 8.5% for perfect dependence prediction), achieves better performance than store sets (8.1%), and has a considerably simpler matrix implementation.
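
A bare-bones sketch of the store-vector idea, written for this summary: each load PC keeps a bit vector over the relative ages of older in-flight stores it has conflicted with, and the scheduler delays the load while any matching store is unresolved. Vector length, table size, and the age-based indexing are illustrative assumptions.

# Illustrative sketch of store-vector memory dependence prediction.
VEC_LEN = 16   # how many older in-flight stores a load can be guarded against

class StoreVectorPredictor:
    def __init__(self, entries=1024):
        # One bit vector per load PC; bit i set means this load previously
        # conflicted with the i-th youngest older unresolved store.
        self.vectors = [[False] * VEC_LEN for _ in range(entries)]

    def _index(self, load_pc):
        return load_pc % len(self.vectors)

    def must_wait(self, load_pc, unresolved_store_ages):
        # Delay the load if any unresolved older store matches a set bit.
        vec = self.vectors[self._index(load_pc)]
        return any(age < VEC_LEN and vec[age] for age in unresolved_store_ages)

    def train_violation(self, load_pc, store_age):
        # On an ordering violation, remember the offending store's relative
        # age so future instances of this load wait for it.
        if store_age < VEC_LEN:
            self.vectors[self._index(load_pc)][store_age] = True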

28 citations

Proceedings ArticleDOI
24 Oct 2008
TL;DR: An early parole (EP) mechanism is proposed that exploits the predictability of dependence-resolution delays to restart fetch of an excluded thread so that the instructions reach the execution core just as the original dependence resolves.
Abstract: Simultaneous multithreading (SMT) attempts to keep a dynamically scheduled processor's resources busy with work from multiple independent threads. Threads with long-latency stalls, however, can lead to a reduction in overall throughput because they occupy many of the critical processor resources. In this work, we first study the interaction between stalls caused by ambiguous memory dependences and SMT processing. We then propose the technique of proactive exclusion (PE) where the SMT fetch unit stops fetching from a thread when a memory dependence is predicted to exist. However, after the dependence has been resolved, the thread is delayed waiting for new instructions to be fetched and delivered down the front-end pipeline. So we introduce an early parole (EP) mechanism that exploits the predictability of dependence-resolution delays to restart fetch of an excluded thread so that the instructions reach the execution core just as the original dependence resolves. We show that combining these two techniques (PEEP) yields a 16.9% throughput improvement on a 4-way SMT processor that supports speculative memory disambiguation. These strong results indicate that a fetch policy that is cognizant of future stalls considerably improves the throughput of an SMT machine.
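
The interplay of proactive exclusion and early parole can be summarized with the small sketch below. The fixed front-end depth and the single per-thread resume cycle are simplifying assumptions for illustration; the paper's mechanism predicts dependence-resolution delays in hardware.

# Illustrative sketch: a fetch policy that excludes a stalled thread and
# paroles it early enough to hide the front-end refill latency.
FRONTEND_DEPTH = 8   # assumed cycles from fetch to the execution core

class PeepFetchPolicy:
    def __init__(self, num_threads):
        self.excluded_until = [0] * num_threads   # cycle at which fetch resumes

    def on_predicted_dependence(self, tid, now, predicted_resolve_cycle):
        # Proactive exclusion: stop fetching this thread, but schedule an
        # early parole so the front-end refill overlaps the remaining stall.
        resume = predicted_resolve_cycle - FRONTEND_DEPTH
        self.excluded_until[tid] = max(now, resume)

    def fetchable_threads(self, now, thread_ids):
        # Threads not currently excluded compete for fetch slots as usual.
        return [t for t in thread_ids if now >= self.excluded_until[t]]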

16 citations


Cited by
Proceedings ArticleDOI
20 Jun 2009
TL;DR: This work proposes a new cache management approach that combines dynamic insertion and promotion policies to provide the benefits of cache partitioning, adaptive insertion, and capacity stealing all with a single mechanism.
Abstract: Many multi-core processors employ a large last-level cache (LLC) shared among the multiple cores. Past research has demonstrated that sharing-oblivious cache management policies (e.g., LRU) can lead to poor performance and fairness when the multiple cores compete for the limited LLC capacity. Different memory access patterns can cause cache contention in different ways, and various techniques have been proposed to target some of these behaviors. In this work, we propose a new cache management approach that combines dynamic insertion and promotion policies to provide the benefits of cache partitioning, adaptive insertion, and capacity stealing all with a single mechanism. By handling multiple types of memory behaviors, our proposed technique outperforms techniques that target only either capacity partitioning or adaptive insertion.
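
A compact sketch of the single-mechanism idea, written for this summary: insertion position is set by each core's capacity allocation and hits promote a line by only one position, which together approximate partitioning without explicit way reservation. The stack representation, allocations, and promotion probability are illustrative assumptions.

# Illustrative sketch: one cache set managed with allocation-driven insertion
# and single-step promotion instead of plain LRU.
import random

class InsertionPromotionSet:
    def __init__(self, ways=16):
        self.stack = []      # index 0 = highest priority; last = next victim
        self.ways = ways

    def access(self, line, core_alloc, promote_prob=1.0):
        if line in self.stack:
            # Hit: promote by a single position (possibly probabilistically),
            # rather than jumping straight to the top as LRU would.
            i = self.stack.index(line)
            if i > 0 and random.random() < promote_prob:
                self.stack[i - 1], self.stack[i] = self.stack[i], self.stack[i - 1]
            return True
        # Miss: evict the lowest-priority line if the set is full, then insert
        # at a depth determined by the requesting core's capacity allocation.
        if len(self.stack) == self.ways:
            self.stack.pop()
        insert_pos = min(len(self.stack), self.ways - core_alloc)
        self.stack.insert(insert_pos, line)
        return False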

334 citations

Journal ArticleDOI
TL;DR: This article explores, analyze, and compares the accuracy and simulation speed of high-abstraction core models, a potential solution to slow cycle-level simulation, and introduces the instruction-window centric (IW-centric) core model, a new mechanistic core model that bridges the gap between interval simulation and cycle-accurate simulation by enabling high-speed simulations with higher levels of detail.
Abstract: Large core counts and complex cache hierarchies are increasing the burden placed on commonly used simulation and modeling techniques. Although analytical models provide fast results, they do not apply to complex, many-core shared-memory systems. In contrast, detailed cycle-level simulation can be accurate but also tends to be slow, which limits the number of configurations that can be evaluated. A middle ground is needed that provides for fast simulation of complex many-core processors while still providing accurate results. In this article, we explore, analyze, and compare the accuracy and simulation speed of high-abstraction core models as a potential solution to slow cycle-level simulation. We describe a number of enhancements to interval simulation to improve its accuracy while maintaining simulation speed. In addition, we introduce the instruction-window centric (IW-centric) core model, a new mechanistic core model that bridges the gap between interval simulation and cycle-accurate simulation by enabling high-speed simulations with higher levels of detail. We also show that using accurate core models like these is important for memory subsystem studies, and that simple, naive models, like a one-IPC core model, can lead to misleading and incorrect results and conclusions in practical design studies. Validation against real hardware shows good accuracy, with an average single-core error of 11.1% and a maximum of 18.8% for the IW-centric model with a 1.5× slowdown compared to interval simulation.

283 citations

Journal ArticleDOI
29 Mar 2011
TL;DR: The Structural Simulation Toolkit (SST) as discussed by the authors is an open, modular, parallel, multi-criteria, multiscale simulation framework for HPC systems that includes a number of processor, memory, and network models.
Abstract: As supercomputers grow, understanding their behavior and performance has become increasingly challenging. New hurdles in scalability, programmability, power consumption, reliability, cost, and cooling are emerging, along with new technologies such as 3D integration, GP-GPUs, silicon-photonics, and other "game changers". Currently, the HPC community lacks a unified toolset to evaluate these technologies and design for these challenges. To address this problem, a number of institutions have joined together to create the Structural Simulation Toolkit (SST), an open, modular, parallel, multi-criteria, multi-scale simulation framework. The SST includes a number of processor, memory, and network models. The SST has been used in a variety of network, memory, and application studies and aims to become the standard simulation framework for designing and procuring HPC systems.

270 citations

Proceedings ArticleDOI
12 Feb 2011
TL;DR: Dynamically Specialized Datapaths are proposed to improve the energy efficiency of general purpose programmable processors, and the results show that in most cases two DySER blocks can achieve the same performance as having a specialized hardware module for each path-tree.
Abstract: Due to limits in technology scaling, energy efficiency of logic devices is decreasing in successive generations. To provide continued performance improvements without increasing power, regardless of the sequential or parallel nature of the application, microarchitectural energy efficiency must improve. We propose Dynamically Specialized Datapaths to improve the energy efficiency of general purpose programmable processors. The key insights of this work are the following. First, applications execute in phases and these phases can be determined by creating a path-tree of basic-blocks rooted at the inner-most loop. Second, specialized datapaths corresponding to these path-trees, which we refer to as DySER blocks, can be constructed by interconnecting a set of heterogeneous computation units with a circuit-switched network. These blocks can be easily integrated with a processor pipeline. A synthesized RTL implementation using an industry 55nm technology library shows a 64-functional-unit DySER block occupies approximately the same area as a 64 KB single-ported SRAM and can execute at 2 GHz. We extend the GCC compiler to identify path-trees and map code to DySER, and evaluate the PARSEC, SPEC, and Parboil benchmark suites. Our results show that in most cases two DySER blocks can achieve the same performance (within 5%) as having a specialized hardware module for each path-tree. A 64-FU DySER block can cover 12% to 100% of the dynamically executed instruction stream. When integrated with a dual-issue out-of-order processor, two DySER blocks provide geometric mean speedup of 2.1X (1.15X to 10X), and geometric mean energy reduction of 40% (up to 70%), and 60% energy reduction if no performance improvement is required.
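
The path-tree notion from the first insight can be sketched as a small profile-guided traversal: starting at an inner-most loop header, follow hot forward edges and stop at back edges or cold paths. The CFG encoding, frequency threshold, and recursion are assumptions made for illustration, not the paper's GCC pass.

# Illustrative sketch: build a path-tree of basic blocks rooted at a loop
# header, keeping only frequently executed, acyclic paths.
def build_path_tree(cfg, block, edge_freq, on_path=frozenset(), min_freq=0.05):
    # cfg: basic block -> list of successor blocks
    # edge_freq: (block, successor) -> observed execution frequency (0..1)
    node = {"block": block, "children": []}
    for succ in cfg.get(block, []):
        # Skip back edges (blocks already on this path) and cold edges.
        if succ in on_path or edge_freq.get((block, succ), 0.0) < min_freq:
            continue
        child = build_path_tree(cfg, succ, edge_freq, on_path | {block}, min_freq)
        node["children"].append(child)
    return node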

229 citations

Proceedings ArticleDOI
12 Dec 2009
TL;DR: The idea is to divide processor execution time into phases, rank applications within a phase based on stall-time criticality, and have all routers in the network prioritize packets based on their applications' ranks.
Abstract: Network-on-Chips (NoCs) are likely to become a critical shared resource in future many-core processors. The challenge is to develop policies and mechanisms that enable multiple applications to efficiently and fairly share the network, to improve system performance. Existing local packet scheduling policies in the routers fail to fully achieve this goal, because they treat every packet equally, regardless of which application issued the packet. This paper proposes prioritization policies and architectural extensions to NoC routers that improve the overall application-level throughput, while ensuring fairness in the network. Our prioritization policies are application-aware, distinguishing applications based on the stall-time criticality of their packets. The idea is to divide processor execution time into phases, rank applications within a phase based on stall-time criticality, and have all routers in the network prioritize packets based on their applications' ranks. Our scheme also includes techniques that ensure starvation freedom and enable the enforcement of system-level application priorities. We evaluate the proposed prioritization policies on a 64-core CMP with an 8x8 mesh NoC, using a suite of 35 diverse applications. For a representative set of case studies, our proposed policy increases average system throughput by 25.6% over age-based arbitration and 18.4% over round-robin arbitration. Averaged over 96 randomly-generated multiprogrammed workload mixes, the proposed policy improves system throughput by 9.1% over the best existing prioritization policy, while also reducing application-level unfairness.
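
The ranking-plus-arbitration loop described above can be condensed into the short sketch below. Ranking applications by stall cycles per injected packet and breaking ties by packet batch are illustrative choices for this summary; the paper's stall-time-criticality estimation and batching hardware are more involved.

# Illustrative sketch: per-phase application ranking and rank-aware packet
# arbitration inside a router.
def rank_applications(stall_cycles, packets_injected):
    # Higher stall time per injected packet -> higher priority next phase.
    apps = sorted(stall_cycles,
                  key=lambda a: stall_cycles[a] / max(packets_injected[a], 1),
                  reverse=True)
    return {app: rank for rank, app in enumerate(apps)}   # 0 = highest priority

def arbitrate(candidates, ranks):
    # candidates: list of (app_id, batch_id, packet). Older batches win first
    # (starvation freedom), then the application's rank decides.
    return min(candidates, key=lambda c: (c[1], ranks.get(c[0], len(ranks))))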

223 citations