
Showing papers by "Srinivas Devadas" published in 2014


Journal ArticleDOI
30 May 2014
TL;DR: This paper motivates the use of PUFs versus conventional secure nonvolatile memories, defines the two primary PUF types, and describes strong and weak PUF implementations and their use for low-cost authentication and key generation applications.
Abstract: This paper describes the use of physical unclonable functions (PUFs) in low-cost authentication and key generation applications. First, it motivates the use of PUFs versus conventional secure nonvolatile memories and defines the two primary PUF types: “strong PUFs” and “weak PUFs.” It describes strong PUF implementations and their use for low-cost authentication. After this description, the paper covers both attacks and protocols to address errors. Next, the paper covers weak PUF implementations and their use in key generation applications. It covers error-correction schemes such as pattern matching and index-based coding. Finally, this paper reviews several emerging concepts in PUF technologies such as public model PUFs and new PUF implementation technologies.

977 citations


Journal ArticleDOI
01 Nov 2014
TL;DR: In this article, the authors evaluate concurrency control for on-line transaction processing (OLTP) workloads on many-core chips and show that the complexity of coordinating competing accesses to data will likely diminish the gains from increased core counts.
Abstract: Computer architectures are moving towards an era dominated by many-core machines with dozens or even hundreds of cores on a single chip. This unprecedented level of on-chip parallelism introduces a new dimension to scalability that current database management systems (DBMSs) were not designed for. In particular, as the number of cores increases, the problem of concurrency control becomes extremely challenging. With hundreds of threads running in parallel, the complexity of coordinating competing accesses to data will likely diminish the gains from increased core counts. To better understand just how unprepared current DBMSs are for future CPU architectures, we performed an evaluation of concurrency control for on-line transaction processing (OLTP) workloads on many-core chips. We implemented seven concurrency control algorithms on a main-memory DBMS and using computer simulations scaled our system to 1024 cores. Our analysis shows that all algorithms fail to scale to this magnitude but for different reasons. In each case, we identify fundamental bottlenecks that are independent of the particular database implementation and argue that even state-of-the-art DBMSs suffer from these limitations. We conclude that rather than pursuing incremental solutions, many-core chips may require a completely redesigned DBMS architecture that is built from ground up and is tightly coupled with the hardware.

239 citations


Journal ArticleDOI
TL;DR: Novel robust and low-overhead physical unclonable function (PUF) authentication and key exchange protocols that are resilient against reverse-engineering attacks are proposed, and their low overhead and practicality are evaluated and confirmed by hardware implementation.
Abstract: This paper proposes novel robust and low-overhead physical unclonable function (PUF) authentication and key exchange protocols that are resilient against reverse-engineering attacks. The protocols are executed between a party with access to a physical PUF (prover) and a trusted party who has access to the PUF compact model (verifier). The proposed protocols do not follow the classic paradigm of exposing the full PUF responses or a transformation of them. Instead, random subsets of the PUF response strings are sent to the verifier so the exact position of the subset is obfuscated for the third-party channel observers. Authentication of the responses at the verifier side is done by matching the substring to the available full response string; the index of the matching point is the actual obfuscated secret (or key) and not the response substring itself. We perform a thorough analysis of resiliency of the protocols against various adversarial acts, including machine learning and statistical attacks. The attack analysis guides us in tuning the parameters of the protocol for an efficient and secure implementation. The low overhead and practicality of the protocols are evaluated and confirmed by hardware implementation.

160 citations


Proceedings ArticleDOI
19 Jun 2014
TL;DR: This paper shows how a secure processor can bound ORAM timing channel leakage to a user-controllable leakage limit, and presents a dynamic scheme that leaks at most 32 bits through the ORAM timing channel and introduces only 20% performance overhead and 12% power overhead relative to a baseline ORAM that has no timing channel protection.
Abstract: Oblivious RAM (ORAM) is an established cryptographic technique to hide a program's address pattern to an untrusted storage system. More recently, ORAM schemes have been proposed to replace conventional memory controllers in secure processor settings to protect against information leakage in external memory and the processor I/O bus.

98 citations


Proceedings ArticleDOI
06 May 2014
TL;DR: This work presents the first architecture for linear additive physical functions where the noise seen by the adversary and the noise seen by the verifier are bifurcated by using a randomized decimation technique and a novel response recovery method at an authentication verification server.
Abstract: Physical Unclonable Functions (PUFs) allow a silicon device to be authenticated based on its manufacturing variations using challenge/response evaluations. Popular realizations use linear additive functions as building blocks. Security is scaled up using non-linear mixing (e.g., adding XORs). Because the responses are physically derived and thus noisy, the resulting explosion in noise impacts both the adversary (which is desirable) as well as the verifier (which is undesirable). We present the first architecture for linear additive physical functions where the noise seen by the adversary and the noise seen by the verifier are bifurcated by using a randomized decimation technique and a novel response recovery method at an authentication verification server. We allow the adversary's noise η_a → 0.50 while keeping the verifier's noise η_v constant, using a parameter-based authentication modality that does not require explicit challenge/response pair storage at the server. We present supporting data using 28nm FPGA PUF noise results as well as machine learning attack results. We demonstrate that our architecture can also withstand recent side-channel attacks that filter the noise (to clean up training challenge/response labels) prior to machine learning.

72 citations



Proceedings ArticleDOI
19 Jun 2014
TL;DR: This work proposes a locality-aware selective data replication protocol for the last-level cache (LLC) that aims to lower memory access latency and energy by replicating only high locality cache lines in the LLC slice of the requesting core, while simultaneously keeping the off-chip miss rate low.
Abstract: Next generation multicores will process massive data with varying degree of locality. Harnessing on-chip data locality to optimize the utilization of cache and network resources is of fundamental importance. We propose a locality-aware selective data replication protocol for the last-level cache (LLC). Our goal is to lower memory access latency and energy by replicating only high locality cache lines in the LLC slice of the requesting core, while simultaneously keeping the off-chip miss rate low. Our approach relies on low overhead yet highly accurate in-hardware run-time classification of data locality at the cache line granularity, and only allows replication for cache lines with high reuse. Furthermore, our classifier captures the LLC pressure at the existing replica locations and adapts its replication decision accordingly. The locality tracking mechanism is decoupled from the sharer tracking structures that cause scalability concerns in traditional coherence protocols. Moreover, the complexity of our protocol is low since no additional coherence states are created. On a set of parallel benchmarks, our protocol reduces the overall energy by 16%, 14%, 13% and 21% and the completion time by 4%, 9%, 6% and 13% when compared to the previously proposed Victim Replication, Adaptive Selective Replication, Reactive-NUCA and Static-NUCA LLC management schemes.

36 citations


Patent
03 Jan 2014
TL;DR: Mechanisms are described by which a verifier device verifies the authenticity of a prover device using a computational model of a physical unclonable function (PUF).
Abstract: Mechanisms for operating a prover device and a verifier device so that the verifier device can verify the authenticity of the prover device. The prover device generates a data string by: (a) submitting a challenge to a physical unclonable function (PUF) to obtain a response string, (b) selecting a substring from the response string, (c) injecting the selected substring into the data string, and (d) injecting random bits into bit positions of the data string not assigned to the selected substring. The verifier: (e) generates an estimated response string by evaluating a computational model of the PUF based on the challenge; (f) performs a search process to identify the selected substring within the data string using the estimated response string; and (g) determines whether the prover device is authentic based on a measure of similarity between the identified substring and a corresponding substring of the estimated response string.

29 citations
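
The prover and verifier steps (a) to (g) in the abstract above can be illustrated with a minimal Python sketch. It is not the patented mechanism itself: the function names, the contiguous placement of the substring, the exhaustive search, and the Hamming-distance threshold `max_errors` are all assumptions chosen for illustration, and the PUF response and the model's estimated response are simply passed in as bit lists.

```python
import secrets

def hamming(a, b):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))

def prover_message(response, sub_len, total_len):
    """Steps (b)-(d): hide a randomly chosen substring of the response among random bits.
    Contiguous placement is an assumption; the abstract does not fix the layout."""
    start = secrets.randbelow(len(response) - sub_len + 1)
    offset = secrets.randbelow(total_len - sub_len + 1)
    data = [secrets.randbelow(2) for _ in range(total_len)]
    data[offset:offset + sub_len] = response[start:start + sub_len]
    return data

def verify(data, estimated, sub_len, max_errors):
    """Steps (f)-(g): try every placement and accept if the best match is close enough.
    `estimated` stands in for the output of the verifier's computational PUF model."""
    best = min(
        hamming(data[off:off + sub_len], estimated[start:start + sub_len])
        for off in range(len(data) - sub_len + 1)
        for start in range(len(estimated) - sub_len + 1)
    )
    return best <= max_errors
```

In this sketch the verifier never learns the substring position from the message; it recovers it by search, and the threshold absorbs a small number of noisy response bits.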



Posted Content
Abstract: This paper proposes a novel approach for automated implementation of an arbiter-based physical unclonable function (PUF) on field programmable gate arrays (FPGAs). We introduce a high resolution programmable delay logic (PDL) that is implemented by harnessing the FPGA lookup-table (LUT) internal structure. PDL allows automatic fine tuning of delays that can mitigate the timing skews caused by asymmetries in interconnect routing and systematic variations. To thwart the arbiter metastability problem, we present and analyze methods for majority voting of responses. A method to classify and group challenges into different robustness sets is introduced that enhances the corresponding responses’ stability in the face of operational variations. The trade-off between response stability and response entropy (uniqueness) is investigated through comprehensive measurements. We exploit the correlation between the impact of temperature and power supply on responses and perform less costly power measurements to predict the temperature impact on the PUF. The measurements are performed on 12 identical Virtex 5 FPGAs across 9 different accurately controlled operating temperature and voltage supply points. A database of challenge response pairs (CRPs) is collected and made openly available for the research community.

26 citations
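
The majority-voting idea mentioned in the abstract above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `evaluate` is a hypothetical callable that returns one noisy response bit per call, and the default of 11 repetitions is arbitrary rather than a value taken from the paper.

```python
from collections import Counter

def majority_vote(evaluate, challenge, repeats=11):
    """Evaluate the same challenge an odd number of times and return the most common bit."""
    if repeats % 2 == 0:
        raise ValueError("repeats must be odd to avoid ties")
    votes = Counter(evaluate(challenge) for _ in range(repeats))
    return votes.most_common(1)[0][0]
```

A challenge whose votes are nearly split is a natural candidate for a low-robustness set, whereas a unanimous vote suggests a stable response.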


Proceedings ArticleDOI
10 Jun 2014
TL;DR: Measurement results show that up to 8.4× energy savings can be achieved with DVFS and self-adaptation, and enable a software self-aware computation engine (SEEC) to dynamically adapt the processor to meet performance and energy goals.
Abstract: This paper presents a self-aware processor with energy monitoring circuits that can measure actual energy consumption of the key blocks. The monitors are embedded into on-chip DC/DC converters and generate results within 10% of accuracy with minimal power (<0.1%) and area (<1%) overhead. Our system, which is implemented in 0.18μm technology, is designed to be voltage scalable from 1.8V down to 0.6V. Low-voltage SRAM operation is made possible through the use of 8T bit-cells and write-assists. The d-caches are designed to be re-configurable in associativity and size to adapt to compute- versus cache-bound phases of applications. Cache configuration is performed in <3 clock cycles including tag invalidation. These hardware features enable a software self-aware computation engine (SEEC) to dynamically adapt the processor to meet performance and energy goals. Measurement results show that up to 8.4× energy savings can be achieved with DVFS and self-adaptation.

Posted Content
TL;DR: Unified ORAM improves performance both asymptotically and empirically and reduces data movement from ORAM by half and improves benchmark performance by 61% as compared to recursive Path ORAM.
Abstract: Oblivious RAM (ORAM) is a cryptographic primitive that hides memory access patterns to untrusted storage. ORAM may be used in secure processors for encrypted computation and/or software protection. While recursive Path ORAM is currently the most practical ORAM for secure processors, it still incurs large performance and energy overhead and is the performance bottleneck of recently proposed secure processors. In this paper, we propose two optimizations to recursive Path ORAM. First, we identify a type of program locality in its operations to improve performance. Second, we use a pseudorandom function to compress the position map. But applying these two techniques in recursive Path ORAM breaks ORAM security. To securely take advantage of the two ideas, we propose unified ORAM. Unified ORAM improves performance both asymptotically and empirically. Empirically, our experiments show that unified ORAM reduces data movement from ORAM by half and improves benchmark performance by 61% as compared to recursive Path ORAM.
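
One way to picture how a pseudorandom function can compress a position map is sketched below in Python. The abstract does not give the construction, so everything here is an assumption for illustration: the position map is taken to store a small per-block access counter instead of a full leaf label, the leaf is recomputed from the block address and counter, and HMAC-SHA256 stands in for the PRF. The name `leaf_for` and its parameters are hypothetical.

```python
import hashlib
import hmac

def leaf_for(key: bytes, block_addr: int, access_count: int, tree_height: int) -> int:
    """Derive a block's leaf label from a keyed PRF instead of storing the label itself."""
    msg = block_addr.to_bytes(8, "big") + access_count.to_bytes(8, "big")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    # Reduce the PRF output to one of the 2**tree_height leaves.
    return int.from_bytes(digest[:8], "big") % (1 << tree_height)
```

Under this assumption, remapping a block to a fresh leaf amounts to incrementing its counter, and a counter is typically much smaller than a leaf label.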

Journal ArticleDOI
TL;DR: This approach can better exploit shared data locality for NUCA designs by effectively replacing multiple round-trip remote cache accesses with a smaller number of migrations, and improves the performance by 24% on average over the shared-NUCA design that only uses remote accesses.
Abstract: Chip-multiprocessors (CMPs) have become the mainstream parallel architecture in recent years; for scalability reasons, designs with high core counts tend towards tiled CMPs with physically distributed shared caches. This naturally leads to a Non-Uniform Cache Access (NUCA) design, where on-chip access latencies depend on the physical distances between requesting cores and home cores where the data is cached. Improving data locality is thus key to performance, and several studies have addressed this problem using data replication and data migration. In this paper, we consider another mechanism, hardware-level thread migration. This approach, we argue, can better exploit shared data locality for NUCA designs by effectively replacing multiple round-trip remote cache accesses with a smaller number of migrations. High migration costs, however, make it crucial to use thread migrations judiciously; we therefore propose a novel, on-line prediction scheme which decides whether to perform a remote access (as in traditional NUCA designs) or to perform a thread migration at the instruction level. For a set of parallel benchmarks, our thread migration predictor improves the performance by 24% on average over the shared-NUCA design that only uses remote accesses.

Posted Content
TL;DR: In this paper, the authors propose an ORAM prefetching technique called dynamic super block scheme and comprehensively explore its design space, which detects data locality in the program working set at runtime, and exploits the locality in a data-independent way.
Abstract: Oblivious RAM (ORAM) is an established technique to hide the access pattern to an untrusted storage system. With ORAM, a curious adversary cannot tell what data address the user is accessing when observing the bits moving between the user and the storage system. All existing ORAM schemes achieve obliviousness by adding redundancy to the storage system, i.e., each access is turned into multiple random accesses. Such redundancy incurs a large performance overhead. Though traditional data prefetching techniques successfully hide memory latency in DRAM based systems, it turns out that they do not work well for ORAM. In this paper, we exploit ORAM locality by taking advantage of the ORAM internal structures. Though it might seem apparent that obliviousness and locality are two contradictory concepts, we challenge this intuition by exploiting data locality in ORAM without sacrificing provable security. In particular, we propose an ORAM prefetching technique called dynamic super block scheme and comprehensively explore its design space. The dynamic super block scheme detects data locality in the program’s working set at runtime, and exploits the locality in a data-independent way. Our simulation results show that with dynamic super block scheme, ORAM performance without super blocks can be significantly improved. After adding timing protection to ORAM, the average performance gain is 25.5% (up to 49.4%) over the baseline ORAM and 16.6% (up to 30.1%) over the best ORAM prefetching technique proposed previously.

Proceedings ArticleDOI
23 Mar 2014
TL;DR: Recent improvements to the Graphite simulator are described that make it well suited to exploring both power and performance in future multicore and manycore processors, especially those incorporating dynamic runtime monitoring and adaptation.
Abstract: This paper describes recent improvements to the Graphite simulator designed to help explore current and emerging research topics. With these improvements, Graphite is ideally suited to explore both power and performance in future multicore and manycore processors, especially those incorporating dynamic runtime monitoring and adaptation. Separate validation of Graphite has shown performance results within about 6% on average (18% worst case) of a cycle-level simulator and normalized power trends are predicted to within 10%. This makes Graphite accurate enough for medium- to long-term studies while maintaining very high performance. Graphite is freely available for anyone to use: http://graphite.csail.mit.edu.

Posted Content
TL;DR: A method of cryptographically-secure key extraction from a noisy biometric source using a fuzzy commitment scheme and shows how keys can be extracted securely and efficiently even under extreme environmental variation.

Proceedings ArticleDOI
10 Jun 2014
TL;DR: AEGIS is a single-chip secure processor that can be used to protect the integrity and confidentiality of an application program from both physical and software attacks.
Abstract: AEGIS is a single-chip secure processor that can be used to protect the integrity and confidentiality of an application program from both physical and software attacks. We briefly describe the history behind this architecture and its key features, discuss main observations and lessons from the project, and list limitations of AEGIS and how recent research addresses them.

Proceedings ArticleDOI
01 Sep 2014
TL;DR: This paper presents an ILP formulation and two non-iterative heuristics for scheduling a task-based application onto a heterogeneous many-core architecture; the heuristics target large problems for which the ILP convergence time may be too long.
Abstract: In this paper we present an Integer Linear Programming (ILP) formulation and two non-iterative heuristics for scheduling a task-based application onto a heterogeneous many-core architecture. Our ILP formulation is able to handle different application performance targets, e.g., low execution time, low memory miss rate, and different architectural features, e.g., cache sizes. For large size problem where the ILP convergence time may be too long, we propose a simple mapping algorithm which tries to spread tasks onto as many processing units as possible, and a more elaborate heuristic that shows good mapping performance when compared to the ILP formulation. We use two realistic power electronics applications to evaluate our mapping techniques on full RTL many-core systems consisting of eight different types of processor cores.
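
The abstract above does not spell out the simple mapping algorithm, so the following Python sketch substitutes a standard greedy technique (largest task first, always onto the least-loaded unit) to show what "spreading tasks onto as many processing units as possible" could look like. The function name, the per-task cost estimates, and the assumption of identical units are all hypothetical; the heterogeneous core types from the paper are not modeled.

```python
def spread_tasks(task_costs, num_units):
    """Greedy spread: assign each task (largest first) to the currently least-loaded unit."""
    loads = [0.0] * num_units
    mapping = {}
    for task, cost in sorted(task_costs.items(), key=lambda kv: -kv[1]):
        # Ties go to the lowest-numbered unit, so idle units are filled first.
        unit = min(range(num_units), key=loads.__getitem__)
        mapping[task] = unit
        loads[unit] += cost
    return mapping, loads
```

With at least as many units as tasks, each task lands on its own unit; with fewer units, the per-unit loads stay roughly balanced.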

Proceedings ArticleDOI
01 Sep 2014
TL;DR: In this paper, the authors proposed a network-on-chip router that provides predictable and deterministic communication latency for hard real-time data traffic while maintaining high concurrency and throughput for best-effort/general-purpose traffic with minimal hardware overhead.
Abstract: The increasing complexity of embedded systems is accelerating the use of multicore processors in these systems. This trend gives rise to new problems such as the sharing of on-chip network resources among hard real-time and normal best effort data traffic. We propose a network-on-chip router that provides predictable and deterministic communication latency for hard real-time data traffic while maintaining high concurrency and throughput for best-effort/general-purpose traffic with minimal hardware overhead. The proposed router requires less area than non-interfering networks, and provides better Quality of Service (QoS) in terms of predictability and determinism to hard real-time traffic than priority-based routers. We present a deadlock-free algorithm for decoupled routing of the two types of traffic. We compare the area and power estimates of three different router architectures with various QoS schemes using the IBM 45-nm SOI CMOS technology cell library. Performance evaluations are done using three realistic benchmark applications: a hybrid electric vehicle application, a utility grid connected photovoltaic converter system, and a variable speed induction motor drive application.

01 Jan 2014
TL;DR: This paper describes the use of physical unclonable functions (PUFs) in low-cost authentication and key generation applications and defines the two primary PUF types: “strong PUFs” and “weak PUFs.”
Abstract: This paper describes the use of physical unclonable functions (PUFs) in low-cost authentication and key generation applications. First, it motivates the use of PUFs versus conventional secure nonvolatile memories and defines the two primary PUF types: “strong PUFs” and “weak PUFs.” It describes strong PUF implementations and their use for low-cost authentication. After this description, the paper covers both attacks and protocols to address errors. Next, the paper covers weak PUF implementations and their use in key generation applications. It covers error-correction schemes such as pattern matching and index-based coding. Finally, this paper reviews several emerging concepts in PUF technologies such as public model PUFs and new PUF implementation technologies.

Proceedings ArticleDOI
01 Nov 2014
TL;DR: An EDA system is essential for completing an electronics design in due time, given continuously shorter design cycles, larger product spectra and high pressure on development costs from increasing competition on the world market.
Abstract: Electronic systems in modern cars account for more than 80% of innovation in the automotive industry and are probably already the most complex systems in today's products. This complexity is due not to the sheer number of components in each device but to the number of devices and their heterogeneous nature, combining analogue and digital circuits with sensors, actuators and software. In addition, the very high demand for robustness and reliability, to assure safety and availability at any time and everywhere under rough working conditions, requires specific effort in the quality management of the electronics. While in the past a car was more or less a closed system, today the use of any kind of multimedia and communication with the internet and, increasingly, with all parts of the surrounding traffic has become a key asset in the development of modern cars. All these aspects have to be addressed by an EDA system, which is essential for completing an electronics design in due time given continuously shorter design cycles, larger product spectra and high pressure on development costs due to increasing competition on the world market. A further challenge for an EDA environment in the automotive design chain is the management of a large number of players with different backgrounds over a broad spectrum of abstraction levels.

Journal ArticleDOI
TL;DR: partiFold-Align exploits sparsity in the set of super-secondary structure pairings and alignment candidates to achieve an effectively cubic running time for simultaneous pairwise alignment and folding.
Abstract: Accurate comparative analysis of low-homology proteins remains a difficult challenge in computational biology, especially for sequence alignment and consensus folding problems. We present partiFold-Align, the first algorithm for simultaneous alignment and consensus folding of unaligned protein sequences; the algorithm's complexity is polynomial in time and space. Algorithmically, partiFold-Align exploits sparsity in the set of super-secondary structure pairings and alignment candidates to achieve an effectively cubic running time for simultaneous pairwise alignment and folding. We demonstrate the efficacy of these techniques on transmembrane β-barrel proteins, an important yet difficult class of proteins with few known three-dimensional structures. Testing against structurally derived sequence alignments, partiFold-Align significantly outperforms state-of-the-art pairwise and multiple sequence alignment tools in the most difficult low-sequence homology case. It also improves secondary structur...

Posted Content
TL;DR: Ring ORAM is the first tree-based ORAM whose bandwidth is independent of the ORAM bucket size, a property that unlocks multiple performance improvements, including overall bandwidth 2.3× to 4× better than Path ORAM, the prior-art scheme for small client storage.
Abstract: Oblivious RAM (ORAM) is a cryptographic primitive that hides memory access patterns as seen by untrusted storage. This paper proposes Ring ORAM, the most bandwidth-efficient ORAM scheme for the small client storage setting in both theory and practice. Ring ORAM is the first tree-based ORAM whose bandwidth is independent of the ORAM bucket size, a property that unlocks multiple performance improvements. First, Ring ORAM's overall bandwidth is 2.3× to 4× better than Path ORAM, the prior-art scheme for small client storage. Second, if memory can perform simple untrusted computation, Ring ORAM achieves constant online bandwidth (∼ 60× improvement over Path ORAM for practical parameters). As a case study, we show Ring ORAM speeds up program completion time in a secure processor by 1.5× relative to Path ORAM. On the theory side, Ring ORAM features a tighter and significantly simpler analysis than Path ORAM.

Proceedings ArticleDOI
10 Jun 2014
TL;DR: The history of the work, primary observations and lessons that were learned from the modeling effort, and follow-up work to show how the research direction evolved over time are summarized.
Abstract: This paper presents the author retrospective on the analytical cache modeling work published in the 2001 International Conference on Supercomputing (ICS). We summarize the history of the work, revisit primary observations and lessons that we learned from the modeling effort, and also briefly describe follow-up work to show how the research direction evolved over time. Original Paper: http://dx.doi.org/10.1145/377792.377797