
Showing papers by "Moinuddin K. Qureshi published in 2018"


Proceedings ArticleDOI
20 Oct 2018
TL;DR: This paper provides the key insight that randomized mapping can be accomplished efficiently by accessing the cache with an encrypted address, as encryption would cause the lines that map to the same set of a conventional cache to get scattered to different sets.
Abstract: Modern processors share the last-level cache between all the cores to efficiently utilize the cache space. Unfortunately, such sharing makes the cache vulnerable to attacks whereby an adversary can infer the access pattern of a co-running application by carefully orchestrating evictions using cache conflicts. Conflict-based attacks can be mitigated by randomizing the location of the lines in the cache. Unfortunately, prior proposals for randomized mapping require storage-intensive tables and are effective only if the OS can classify the applications into protected and unprotected groups. The goal of this paper is to mitigate conflict-based attacks while incurring negligible storage and performance overheads, and without relying on OS support. This paper provides the key insight that randomized mapping can be accomplished efficiently by accessing the cache with an encrypted address, as encryption would cause the lines that map to the same set of a conventional cache to get scattered to different sets. This paper proposes CEASE, a design that uses Low-Latency Block-Cipher (LLBC) to translate the physical line-address into an encrypted line-address, and accesses the cache with this encrypted line-address. We analyze efficient designs for LLBC that can perform encryption and decryption within two cycles. We also propose CEASER, a design that periodically changes the encryption key and performs dynamic-remapping to improve robustness. CEASER provides strong security (tolerates 100+ years of attack), has low performance overhead (1% slowdown), requires a storage overhead of less than 24 bytes for the newly added structures, and does not need any OS support.
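
A minimal sketch of the core idea, assuming a toy 4-round Feistel permutation in place of the paper's Low-Latency Block-Cipher and made-up address widths and cache geometry: accessing the cache with an encrypted line address scatters conventionally conflicting lines across sets, and changing the key remaps them again.

```python
# Toy sketch of address-space encryption for cache indexing (CEASER-style idea).
# The Feistel rounds, key, and cache geometry below are illustrative assumptions,
# not the paper's Low-Latency Block-Cipher.
import hashlib

LINE_ADDR_BITS = 32          # physical line-address width (assumed)
HALF = LINE_ADDR_BITS // 2
NUM_SETS = 1024              # number of sets in the last-level cache (assumed)

def round_fn(half: int, round_key: int) -> int:
    """Pseudo-random round function on one half of the address."""
    digest = hashlib.sha256(f"{half}-{round_key}".encode()).digest()
    return int.from_bytes(digest[:2], "little") & ((1 << HALF) - 1)

def encrypt_line_addr(line_addr: int, key: int, rounds: int = 4) -> int:
    """Map a physical line address to an encrypted line address (a permutation)."""
    left, right = line_addr >> HALF, line_addr & ((1 << HALF) - 1)
    for r in range(rounds):
        left, right = right, left ^ round_fn(right, key + r)
    return (left << HALF) | right

def cache_set(line_addr: int, key: int) -> int:
    """Index the cache with the encrypted address instead of the physical one."""
    return encrypt_line_addr(line_addr, key) % NUM_SETS

# Two addresses that conflict in a conventional cache (same set index)
a, b = 0x1234_0400, 0x5678_0400
print(a % NUM_SETS == b % NUM_SETS)              # True: conflict in a conventional cache
print(cache_set(a, key=7), cache_set(b, key=7))  # typically scattered to different sets
print(cache_set(a, key=8))                       # re-keying (remap epoch) moves the line again
```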

195 citations


Proceedings ArticleDOI
20 Oct 2018
TL;DR: This paper provides a scalable solution to this problem through a compact integrity-tree design that requires fewer memory accesses for its traversal, enabled by new storage-efficient representations for the counters used for encryption and the integrity tree in secure memories.
Abstract: Securing off-chip main memory is essential for protection from adversaries with physical access to systems. However, current secure-memory designs incur considerable performance overheads, a major cause being the multiple memory accesses required for traversing an integrity tree, which provides protection against man-in-the-middle attacks or replay attacks. In this paper, we provide a scalable solution to this problem by proposing a compact integrity-tree design that requires fewer memory accesses for its traversal. We enable this by proposing new storage-efficient representations for the counters used for encryption and the integrity tree in secure memories. Our Morphable Counters are more cacheable on-chip, as they provide more counters per cacheline than existing split counters. Additionally, they incur lower overheads due to counter overflows, by dynamically switching between counter representations based on usage pattern. We show that using Morphable Counters enables a 128-ary integrity tree that can improve performance by 6.3% on average (up to 28.3%) and reduce system energy-delay product by 8.8% on average, compared to an aggressive baseline using split counters with a 64-ary integrity tree. These benefits come without any additional storage or reduction in security, and are derived from our compact counter representation, which reduces the integrity-tree size for a 16GB memory from 4MB in the baseline to 1MB. Compared to the recently proposed VAULT [1], our design provides a speedup of 13.5% on average (up to 47.4%).
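
The arithmetic behind the counter packing can be sketched as follows; the field widths and the overflow policy are illustrative assumptions, not the paper's exact Morphable Counter formats.

```python
# Illustrative split-counter packing for encryption counters (not the exact
# Morphable Counter formats from the paper). A 64-byte counter cacheline holds
# one shared "major" counter plus many small "minor" counters; the effective
# per-block counter is (major << minor_bits) | minor.
CACHELINE_BITS = 64 * 8

def counters_per_line(major_bits: int, minor_bits: int) -> int:
    """How many minor counters fit alongside one major counter in a 64B line."""
    return (CACHELINE_BITS - major_bits) // minor_bits

# A 64-bit major counter with 7-bit minors gives a 64-ary arrangement, while
# shrinking the minors (e.g., to 3 bits) lets well over 128 counters share a
# line, enabling a higher-arity (flatter) integrity tree with fewer accesses.
print(counters_per_line(major_bits=64, minor_bits=7))   # 64
print(counters_per_line(major_bits=64, minor_bits=3))   # 149 -> enough for 128-ary

class SplitCounterLine:
    """Minimal model of one counter cacheline with overflow handling."""
    def __init__(self, num_minors: int, minor_bits: int):
        self.major = 0
        self.minor_bits = minor_bits
        self.minors = [0] * num_minors

    def bump(self, idx: int) -> bool:
        """Increment counter idx; return True if the whole group must be
        re-encrypted because a minor counter overflowed."""
        self.minors[idx] += 1
        if self.minors[idx] >> self.minor_bits:          # minor-counter overflow
            self.major += 1
            self.minors = [0] * len(self.minors)         # reset group, re-encrypt all
            return True
        return False

    def value(self, idx: int) -> int:
        return (self.major << self.minor_bits) | self.minors[idx]
```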

69 citations


Proceedings ArticleDOI
01 Feb 2018
TL;DR: This paper proposes Synergy, a reliability-security co-design that improves the performance of secure execution while providing strong reliability for systems with 9-chip ECC-DIMMs, and increases reliability by 185x compared to ECC-DIMMs that provide Single-Error Correction, Double-Error Detection (SECDED) capability.
Abstract: Building trusted data-centers requires resilient memories which are protected from both adversarial attacks and errors. Unfortunately, state-of-the-art memory security solutions incur considerable performance overheads due to accesses for security metadata like Message Authentication Codes (MACs). At the same time, commercial secure memory solutions tend to be designed oblivious to the presence of memory reliability mechanisms (such as ECC-DIMMs) that provide tolerance to memory errors. Fortunately, ECC-DIMMs possess an additional chip for providing error correction codes (ECC), which is accessed in parallel with data and can be harnessed for security optimizations. If we can re-purpose the ECC-chip to store metadata useful for both security and reliability, it can prove beneficial to both. To this end, this paper proposes Synergy, a reliability-security co-design that improves the performance of secure execution while providing strong reliability for systems with 9-chip ECC-DIMMs. Synergy uses the insight that MACs, being capable of detecting data tampering, are also useful for detecting memory errors. Therefore, MACs are best suited for placement inside the ECC chip, to be accessed in parallel with each data access. By co-locating MAC and data, Synergy avoids a separate memory access for the MAC and thereby reduces the overall memory traffic for secure memory systems. Furthermore, Synergy is able to tolerate the failure of 1 chip out of 9 by using a parity constructed over 9 chips (8 data and 1 MAC), which is used for reconstructing the data of the failed chip. For memory-intensive workloads, Synergy provides a speedup of 20% and reduces system Energy-Delay Product by 31% compared to a secure memory baseline with ECC-DIMMs. At the same time, Synergy increases reliability by 185x compared to ECC-DIMMs that provide Single-Error Correction, Double-Error Detection (SECDED) capability. Synergy uses commercial ECC-DIMMs and does not incur any additional hardware overheads or reduction of security.
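
A simplified sketch of the reliability side, assuming a hypothetical MAC key, chunk sizes, and a trial-and-error recovery loop: an XOR parity over the 9 chips (8 data + 1 MAC) can rebuild any single failed chip, and the co-located MAC identifies which reconstruction is the consistent one.

```python
# Minimal sketch of chip-failure recovery when the MAC is co-located in the 9th
# chip (Synergy-style). Parity placement and the recovery loop below are
# simplified assumptions for illustration, not the paper's exact layout.
from functools import reduce
import hmac, hashlib

KEY = b"secret-key"                      # per-system MAC key (assumed)

def mac(data: bytes) -> bytes:
    return hmac.new(KEY, data, hashlib.sha256).digest()[:8]   # 8-byte MAC chunk

def make_line(data_chips: list[bytes]) -> tuple[list[bytes], bytes]:
    """9 chips hold 8 data chunks + 1 MAC chunk; an XOR parity covers all 9."""
    chips = data_chips + [mac(b"".join(data_chips))]
    parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chips)
    return chips, parity

def recover(chips: list[bytes], parity: bytes) -> list[bytes]:
    """If the MAC check fails, rebuild each chip from parity until it passes."""
    for bad in range(9):
        trial = list(chips)
        others = [c for i, c in enumerate(trial) if i != bad]
        trial[bad] = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
                            others + [parity])
        if mac(b"".join(trial[:8])) == trial[8]:
            return trial                                  # consistent reconstruction
    raise RuntimeError("uncorrectable: more than one chip failed")

data = [bytes([i] * 8) for i in range(8)]                 # 8 data chunks of 8B each
chips, parity = make_line(data)
chips[3] = bytes(8)                                       # chip 3 returns garbage
assert recover(chips, parity)[:8] == data                 # data rebuilt from parity
```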

67 citations


Posted Content
TL;DR: This paper proposes Variation-Aware Qubit Movement (VQM), policies that optimize the movement and allocation of qubits to avoid the weaker qubits and links, and guide more operations towards the stronger qubits and links.
Abstract: Recently, IBM, Google, and Intel showcased quantum computers ranging from 49 to 72 qubits. While these systems represent a significant milestone in the advancement of quantum computing, existing and near-term quantum computers are not yet large enough to fully support quantum error-correction. Such systems, with a few tens to a few hundreds of qubits, are termed Noisy Intermediate-Scale Quantum (NISQ) computers, and they can provide benefits for a class of quantum algorithms. In this paper, we study the problems of Qubit-Allocation (mapping of program qubits to machine qubits) and Qubit-Movement (routing qubits from one location to another to perform entanglement). We observe that there exists variation in the error rates of different qubits and links, which can have an impact on the decisions for qubit movement and qubit allocation. We analyze characterization data for the IBM-Q20 quantum computer gathered over 52 days to understand and quantify the variation in the error rates, and find that there is indeed significant variability in the error rates of the qubits and the links connecting them. We define reliability metrics for NISQ computers and show that device variability has a substantial impact on the overall system reliability. To exploit the variability in error rates, we propose Variation-Aware Qubit Movement (VQM) and Variation-Aware Qubit Allocation (VQA), policies that optimize the movement and allocation of qubits to avoid the weaker qubits and links and guide more operations towards the stronger qubits and links. We show that our variation-aware policies improve the reliability of the NISQ system by up to 2.5x.
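
A small sketch of the variation-aware idea, using a made-up 4-qubit coupling map and error rates: picking the routing path that maximizes the product of link success probabilities (Dijkstra with -log(1 - error) weights) steers qubit movement away from weak links.

```python
# Sketch of variation-aware qubit movement: choose the routing path that
# maximizes the product of link success probabilities. The 4-qubit topology
# and the error rates below are made-up illustrative numbers.
import heapq, math

# coupling map: (qubit_a, qubit_b) -> 2-qubit gate error rate on that link
link_error = {
    (0, 1): 0.020, (1, 2): 0.150,   # a noticeably weaker link
    (0, 3): 0.030, (3, 2): 0.025,
}

def neighbors(q):
    for (a, b), err in link_error.items():
        if a == q:
            yield b, err
        elif b == q:
            yield a, err

def most_reliable_path(src, dst):
    """Dijkstra with -log(1 - error) weights = maximize success probability."""
    best = {src: 0.0}
    heap = [(0.0, src, [src])]
    while heap:
        cost, q, path = heapq.heappop(heap)
        if q == dst:
            return path, math.exp(-cost)       # path and its success probability
        for nxt, err in neighbors(q):
            ncost = cost - math.log(1.0 - err)
            if ncost < best.get(nxt, float("inf")):
                best[nxt] = ncost
                heapq.heappush(heap, (ncost, nxt, path + [nxt]))
    return None, 0.0

# Both candidate routes from qubit 0 to qubit 2 take two hops, but the
# variation-aware choice avoids the weak 15%-error link and picks 0-3-2.
print(most_reliable_path(0, 2))    # ([0, 3, 2], ~0.946)
```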

48 citations


Proceedings ArticleDOI
02 Jun 2018
TL;DR: This paper proposes Associativity via Coordinated Way-Install and Way-Prediction (ACCORD), a design that steers an incoming line to a "preferred way" based on the line address and uses the preferred way as the default way prediction, and extends ACCORD to support highly-associative caches using a Skewed Way-Steering (SWS) design.
Abstract: Stacked-DRAM technology has enabled high-bandwidth gigascale DRAM caches. Since DRAM caches require a tag store of several tens of megabytes, commercial DRAM cache designs typically co-locate tag and data within the DRAM array. DRAM caches are organized as a direct-mapped structure so that the tag and data can be streamed out in a single access. While direct-mapped DRAM caches provide low hit-latency, they suffer from low hit-rate due to conflict misses. Ideally, we want the hit-rate of a set-associative DRAM cache without incurring the additional latency and bandwidth costs of increasing associativity. To address this problem, way prediction can be applied to a set-associative DRAM cache to achieve the latency and bandwidth of a direct-mapped DRAM cache. Unfortunately, conventional way-prediction policies typically require per-set storage, causing multi-megabyte storage overheads for gigascale DRAM caches. If we can obtain accurate way prediction without incurring significant storage overheads, we can efficiently enable set-associativity for DRAM caches. This paper proposes Associativity via Coordinated Way-Install and Way-Prediction (ACCORD), a design that steers an incoming line to a "preferred way" based on the line address and uses the preferred way as the default way prediction. We propose two way-steering policies that are effective for 2-way caches. First, Probabilistic Way-Steering (PWS), which steers lines to a preferred way with high probability, while still allowing lines to be installed in an alternate way in case of conflicts. Second, Ganged Way-Steering (GWS), which steers lines of a spatially contiguous region to the way where an earlier line from that region was installed. On a 2-way cache, ACCORD (PWS+GWS) obtains a way-prediction accuracy of 90% and retains a hit-rate similar to a baseline 2-way cache while incurring 320 bytes of storage overhead. We extend ACCORD to support highly-associative caches using a Skewed Way-Steering (SWS) design that steers a line to at most two ways in the highly-associative cache. This design retains the low latency of the 2-way ACCORD while obtaining most of the hit-rate benefits of a highly-associative design. Our studies with a 4GB DRAM cache backed by non-volatile memory show that ACCORD provides an average of 11% speedup (up to 54%) across a wide range of workloads.
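
A rough sketch of the two way-steering policies for a 2-way cache; the hash, steering probability, and region size are assumptions, and the region dictionary stands in for whatever small tracking structure the hardware would actually use.

```python
# Sketch of ACCORD-style way steering for a 2-way DRAM cache. The hash,
# steering probability, and region size are illustrative assumptions.
import random

STEER_PROB = 0.9          # probability of honoring the preferred way (assumed)
REGION_BITS = 6           # 64-line region granularity for ganged steering (assumed)

def preferred_way(line_addr: int) -> int:
    """Address-based preferred way; also used as the default way prediction."""
    return (line_addr ^ (line_addr >> 7)) & 1

class AccordSteering:
    def __init__(self):
        self.ganged = {}                  # region -> way where its earlier line went

    def install_way(self, line_addr: int) -> int:
        region = line_addr >> REGION_BITS
        if region in self.ganged:                        # Ganged Way-Steering (GWS):
            return self.ganged[region]                   # follow the earlier line of this region
        way = preferred_way(line_addr)
        if random.random() > STEER_PROB:                 # Probabilistic Way-Steering (PWS):
            way ^= 1                                     # occasionally install in the other way
        self.ganged[region] = way
        return way

    def predict_way(self, line_addr: int) -> int:
        region = line_addr >> REGION_BITS
        return self.ganged.get(region, preferred_way(line_addr))

steer = AccordSteering()
addr = 0x1F40
way = steer.install_way(addr)
assert steer.predict_way(addr) == way     # prediction matches where the line was installed
```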

31 citations


Posted Content
TL;DR: This paper proposes CRAM, a bandwidth-efficient design for memory compression that is entirely hardware-based, does not require any OS support or changes to the memory modules or interfaces, and uses a novel implicit-metadata mechanism whereby the compressibility of a line can be determined by scanning the line for a special marker word, eliminating the overheads of metadata access.
Abstract: This paper investigates hardware-based memory compression designs to increase the memory bandwidth. When lines are compressible, the hardware can store multiple lines in a single memory location, and retrieve all these lines in a single access, thereby increasing the effective memory bandwidth. However, relocating and packing multiple lines together depending on the compressibility causes a line to have multiple possible locations. Therefore, memory compression designs typically require metadata to specify the compressibility of the line. Unfortunately, even in the presence of dedicated metadata caches, maintaining and accessing this metadata incurs significant bandwidth overheads and can degrade performance by as much as 40%. Ideally, we want to implement memory compression while eliminating the bandwidth overheads of metadata accesses. This paper proposes CRAM, a bandwidth-efficient design for memory compression that is entirely hardware-based and does not require any OS support or changes to the memory modules or interfaces. CRAM uses a novel implicit-metadata mechanism, whereby the compressibility of the line can be determined by scanning the line for a special marker word, eliminating the overheads of metadata access. CRAM is equipped with a low-cost Line Location Predictor (LLP) that can determine the location of the line with 98% accuracy. Furthermore, we also develop a scheme that can dynamically enable or disable compression based on the bandwidth cost of storing compressed lines and the bandwidth benefits of obtaining compressed lines, ensuring no degradation for workloads that do not benefit from compression. Our evaluations, over a diverse set of 27 workloads, show that CRAM provides a speedup of up to 73% (average 6%) without causing slowdown for any of the workloads, while consuming a storage overhead of less than 300 bytes at the memory controller.
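
A toy sketch of the implicit-metadata read path, with an assumed marker value and zlib standing in for the real compressor; handling of incompressible data and of data that happens to collide with the marker word is omitted here.

```python
# Sketch of CRAM-style implicit metadata: compressibility is encoded in the
# data itself via a marker word, so no separate metadata access is needed.
# The marker value and the toy "compressor" (zlib) are illustrative assumptions.
import zlib

MARKER = b"\xC0\xFF\xEE\x00\xC0\xFF\xEE\x00"    # 8-byte marker word (assumed value)
LINE = 64                                        # cacheline size in bytes

def store(line_a: bytes, line_b: bytes) -> bytes:
    """Try to pack two 64B lines (plus the marker) into one 64B memory location."""
    packed = zlib.compress(line_a + line_b, 9)
    if len(MARKER) + len(packed) <= LINE:
        return MARKER + packed + bytes(LINE - len(MARKER) - len(packed))
    return line_a                                # incompressible: store only line_a here

def load(location: bytes) -> list[bytes]:
    """Read path: the marker alone tells us whether this location is compressed."""
    if location[:len(MARKER)] == MARKER:
        d = zlib.decompressobj()
        data = d.decompress(location[len(MARKER):])   # trailing pad bytes are ignored
        return [data[:LINE], data[LINE:]]             # one access yields two lines
    return [location]                                 # plain uncompressed line

a, b = bytes(64), b"\x11" * 64                   # two highly compressible neighbors
loc = store(a, b)
print(len(loc), load(loc) == [a, b])             # 64 True
```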

4 citations


Posted Content
TL;DR: This paper summarizes the idea of Low-Cost Interlinked Subarrays (LISA), which was published in HPCA 2016, examines the work's significance and future potential, and shows that each of LISA's three applications significantly improves performance and memory energy efficiency on a variety of workloads and system configurations.
Abstract: This paper summarizes the idea of Low-Cost Interlinked Subarrays (LISA), which was published in HPCA 2016, and examines the work's significance and future potential. Contemporary systems perform bulk data movement inefficiently, by transferring data from DRAM to the processor, and then back to DRAM, across a narrow off-chip channel. The use of this narrow channel results in high latency and energy consumption. Prior work proposes to avoid these high costs by exploiting the existing wide internal DRAM bandwidth for bulk data movement, but the limited connectivity of wires within DRAM allows fast data movement within only a single DRAM subarray. Each subarray is only a few megabytes in size, greatly restricting the range over which fast bulk data movement can happen within DRAM. Our HPCA 2016 paper proposes a new DRAM substrate, Low-Cost Inter-Linked Subarrays (LISA), whose goal is to enable fast and efficient data movement across a large range of memory at low cost. LISA adds low-cost connections between adjacent subarrays. By using these connections to interconnect the existing internal wires (bitlines) of adjacent subarrays, LISA enables wide-bandwidth data transfer across multiple subarrays with little (only 0.8%) DRAM area overhead. As a DRAM substrate, LISA is versatile, enabling a variety of new applications. We describe and evaluate three such applications in detail: (1) fast inter-subarray bulk data copy, (2) in-DRAM caching using a DRAM architecture whose rows have heterogeneous access latencies, and (3) accelerated bitline precharging by linking multiple precharge units together. Our extensive evaluations show that each of LISA's three applications significantly improves performance and memory energy efficiency on a variety of workloads and system configurations.

1 citation


Proceedings Article
20 Oct 2018
TL;DR: In navigating this performance-security tradeoff, SGX and SME end up at different ends of the spectrum as shown in Fig 1.
Abstract: Main memories are prone to attacks [1], [4], [12] that allow an adversary to take control of the system by reading and tampering with memory contents. Commercial solutions like Intel’s Software Guard Extensions (SGX) [3] and AMD’s Secure Memory Encryption (SME) [5] attempt to secure memory against such attacks. However, providing security requires accessing metadata, resulting in storage and performance overheads. In navigating this performance-security tradeoff, SGX and SME end up at different ends of the spectrum, as shown in Fig. 1.