Key establishment in sensor networks is a challenging problem because asymmetric key cryptosystems are unsuitable for use in resource constrained sensor nodes, and also because the nodes could be physically compromised by an adversary. We present three new mechanisms for key establishment using the framework of pre-distributing a random set of keys to each node. First, in the q-composite keys scheme, we trade off the unlikeliness of a large-scale network attack in order to significantly strengthen random key predistribution's strength against smaller-scale attacks. Second, in the multipath-reinforcement scheme, we show how to strengthen the security between any two nodes by leveraging the security of other links. Finally, we present the random-pairwise keys scheme, which perfectly preserves the secrecy of the rest of the network when any node is captured, and also enables node-to-node authentication and quorum-based revocation.

/pdf/random-key-predistribution-schemes-for-sensor-networks-2w3j5ne7ef.pdf

Random key predistribution schemes for sensor networks

This paper examines simultaneous multithreading, a technique permitting several independent threads to issue instructions to a superscalar's multiple functional units in a single cycle. We present several models of simultaneous multithreading and compare them with alternative organizations: a wide superscalar, a fine-grain multithreaded processor, and single-chip, multiple-issue multiprocessing architectures. Our results show that both (single-threaded) superscalar and fine-grain multithreaded architectures are limited their ability to utilize the resources of a wide-issue processor. Simultaneous multithreading has the potential to achieve 4 times the throughput of a superscalar, and double that of fine-grain multithreading. We evaluate several cache configurations made possible by this type of organization and evaluate tradeoffs between them. We also show that simultaneous multithreading is an attractive alternative to single-chip multiprocessors; simultaneous multithreaded processors with a variety of organizations outperform corresponding conventional multiprocessors with similar execution resources.While simultaneous multithreading has excellent potential to increase processor utilization, it can add substantial complexity to the design. We examine many of these complexities and evaluate alternative organizations in the design space.

/pdf/simultaneous-multithreading-maximizing-on-chip-parallelism-5dmq3kue0t.pdf

Simultaneous multithreading: maximizing on-chip parallelism

The most exciting development in parallel computer architecture is the convergence of traditionally disparate approaches on a common machine structure. This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures. It then examines the design issues that are critical to all parallel architecture across the full range of modern design, covering data access, communication performance, coordination of cooperative work, and correct implementation of useful semantics. It not only describes the hardware and software techniques for addressing each of these issues but also explores how these techniques interact in the same system. Examining architecture from an application-driven perspective, it provides comprehensive discussions of parallel programming for high performance and of workload-driven evaluation, based on understanding hardware-software interactions.


* synthesizes a decade of research and development for practicing engineers, graduate students, and researchers in parallel computer architecture, system software, and applications development

* presents in-depth application case studies from computer graphics, computational science and engineering, and data mining to demonstrate sound quantitative evaluation of design trade-offs 

* describes the process of programming for performance, including both the architecture-independent and architecture-dependent aspects, with examples and case-studies

* illustrates bus-based and network-based parallel systems with case studies of more than a dozen important commercial designs

Table of Contents

1 Introduction 
2 Parallel Programs 
3 Programming for Performance 
4 Workload-Driven Evaluation 
5 Shared Memory Multiprocessors 
6 Snoop-based Multiprocessor Design 
7 Scalable Multiprocessors
8 Directory-based Cache Coherence
9 Hardware-Software Tradeoffs 
10 Interconnection Network Design 
11 Latency Tolerance 
12 Future Directions 
APPENDIX A Parallel Benchmark Suites

Parallel Computer Architecture: A Hardware/Software Approach

While high-level data parallel frameworks, like MapReduce, simplify the design and implementation of large-scale data processing systems, they do not naturally or efficiently support many important data mining and machine learning algorithms and can lead to inefficient learning systems. To help fill this critical void, we introduced the GraphLab abstraction which naturally expresses asynchronous, dynamic, graph-parallel computation while ensuring data consistency and achieving a high degree of parallel performance in the shared-memory setting. In this paper, we extend the GraphLab framework to the substantially more challenging distributed setting while preserving strong data consistency guarantees.We develop graph based extensions to pipelined locking and data versioning to reduce network congestion and mitigate the effect of network latency. We also introduce fault tolerance to the GraphLab abstraction using the classic Chandy-Lamport snapshot algorithm and demonstrate how it can be easily implemented by exploiting the GraphLab abstraction itself. Finally, we evaluate our distributed implementation of the GraphLab abstraction on a large Amazon EC2 deployment and show 1-2 orders of magnitude performance gains over Hadoop-based implementations.

/pdf/distributed-graphlab-a-framework-for-machine-learning-and-3837teraz4.pdf

Distributed GraphLab: a framework for machine learning and data mining in the cloud

Processing-in-memory (PIM) is a promising solution to address the "memory wall" challenges for future computer systems. Prior proposed PIM architectures put additional computation logic in or near memory. The emerging metal-oxide resistive random access memory (ReRAM) has showed its potential to be used for main memory. Moreover, with its crossbar array structure, ReRAM can perform matrix-vector multiplication efficiently, and has been widely studied to accelerate neural network (NN) applications. In this work, we propose a novel PIM architecture, called PRIME, to accelerate NN applications in ReRAM based main memory. In PRIME, a portion of ReRAM crossbar arrays can be configured as accelerators for NN applications or as normal memory for a larger memory space. We provide microarchitecture and circuit designs to enable the morphable functions with an insignificant area overhead. We also design a software/hardware interface for software developers to implement various NNs on PRIME. Benefiting from both the PIM architecture and the efficiency of using ReRAM for NN computation, PRIME distinguishes itself from prior work on NN acceleration, with significant performance improvement and energy saving. Our experimental results show that, compared with a state-of-the-art neural processing unit design, PRIME improves the performance by ~2360× and the energy consumption by ~895×, across the evaluated machine learning benchmarks.

PRIME: a novel processing-in-memory architecture for neural network computation in ReRAM-based main memory

: Instruction cache miss latency is becoming an increasingly important performance bottleneck, especially for commercial applications. Although instruction prefetching is an attractive technique for tolerating this latency, we find that existing prefetching schemes are insufficient for modern superscalar processors since they fail to issue prefetches early enough (particularly for non-sequential accesses). To overcome these limitations, we propose a new instruction prefetching technique whereby the hardware and software cooperate to hide the latency as follows. The hardware performs aggressive sequential prefetching combined with a novel prefetch filtering mechanism to allow it to get far ahead without polluting the cache. To hide the latency of non-sequential accesses, we propose and implement a novel compiler algorithm which automatically inserts instruction-prefetch instructions into the executable to prefetch the targets of control transfers far enough in advance. Our experimental results demonstrate that this new approach results in speedups ranging from 9.4% to 18.5% (13.3% on average) over the original execution time on an out-of-order superscalar processor, which is more than double the average speedup of the best existing schemes (6.5%). This is accomplished by hiding an average of 71% of the original instruction stall time, compared with only 36% for the best existing schemes. We find that both the prefetch filtering and compiler-inserted prefetching components of our design are essential and complementary, that the compiler can limit the code expansion to less than 10% on average, and that our scheme is robust with respect to variations in miss latency and bandwidth.

http://www.dtic.mil/dtic/tr/fulltext/u2/a352369.pdf

Compiler and Hardware Support for Automatic Instruction Prefetching: A Cooperative Approach

Modern graphics processing units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in utilization of resources in the cores. For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive.

A framework for accelerating bottlenecks in GPU execution with assist warps

Emerging sensing applications create an unprecedented need for energy efficiency in programmable processors. To achieve useful multi-year deployments on a small battery or energy harvester, these applications must avoid off-device communication and instead process most data locally. Recent work has proven coarse-grained reconfigurable arrays (CGRAs) as a promising architecture for this domain. Unfortunately, nearly all prior CGRAs support only computations with simple control flow and no memory aliasing (e.g., affine inner loops), causing an Amdahl efficiency bottleneck as non-trivial fractions of programs must run on an inefficient von Neumann core.RipTide is a co-designed compiler and CGRA architecture that achieves both high programmability and extreme energy efficiency, eliminating this bottleneck. RipTide provides a rich set of control-flow operators that support arbitrary control flow and memory access on the CGRA fabric. RipTide implements these primitives without tagged tokens to save energy; this requires careful ordering analysis in the compiler to guarantee correctness. RipTide further saves energy and area by offloading most control operations into its programmable on-chip network, where they can re-use existing network switches. RipTide’s compiler is implemented in LLVM, and its hardware is synthesized in Intel 22FFL. RipTide compiles applications written in C while saving 25% energy v. the state-of-the-art energy-minimal CGRA and 6.6 × energy v. a von Neumann core.

A programmable, energy-minimal dataflow compiler and architecture

Latency-tolerance techniques offer the potential for bridging the ever-increasing speed gap between the memory subsystem and today's high-performance processors. However, to fully exploit the benefit of these techniques, one must be careful to apply them only to the dynamic references that are likely to suffer cache misses-otherwise the runtime overheads can potentially offset any gains. In this paper, we focus on isolating dynamic miss instances in nonnumeric applications, which is a difficult but important problem. Although compilers cannot statically analyze data locality in nonnumeric applications, one viable approach is to use profiling information to measure the actual miss behavior. Unfortunately, the state-of-the-art in cache miss profiling (which we call summary profiling) is inadequate for references with intermediate miss ratios-it either misses opportunities to hide latency, or else inserts overhead that is unnecessary. To overcome this problem, we propose and evaluate a new profiling technique that helps predict which dynamic instances of a static memory reference will hit or miss in the cache: correlation profiling Our experimental results demonstrate that roughly half of the 21 nonnumeric applications we study can potentially enjoy significant reductions in memory stall time by exploiting at least one of the three forms of correlation profiling we consider: control-flow correlation, self correlation, and global correlation. In addition, our detailed case studies illustrate that self correlation succeeds because a given reference's cache outcomes often contain repeated patterns and control-flow correlation succeeds because cache outcomes are often call-chain dependent. Finally, we suggest a number of ways to exploit correlation profiling in practice and demonstrate that software prefetching can achieve better performance on a modern superscalar processor when directed by correlation profiling rather than summary profiling information.

Understanding why correlation profiling improves the predictability of data cache misses in nonnumeric applications

Computer systems have enjoyed an exponential growth in processor speed for the past 20 years, while main memory speed has improved only moderately. Today a cache miss to main memory takes hundreds of processor cycles. Recent studies have demonstrated that on commercial databases, about 50% or more of execution time in memory is often wasted due to cache misses. In light of this problem, a number of recent studies focused on reducing the number of cache misses of database algorithms. In this thesis, we investigate a different approach: reducing the impact of cache misses through a technique called cache prefetching. Since prefetching for sequential array accesses has been well studied, we are interested in studying non-contiguous access patterns found in two classes of database algorithms: the B+-Tree index algorithm and the hash join algorithm. We re-examine their designs with cache prefetching in mind, and combine prefetching and data locality optimizations to achieve good cache performance. 
For B+-Trees, we first propose and evaluate a novel main memory index structure, Prefetching B+Trees, which uses prefetching to accelerate two major access patterns of B+-Tree indices: searches and range scans. We then apply our findings in the development of a novel index structure, Fractal Prefetching B+-Trees, that optimizes index operations both for CPU cache performance and for disk performance in commercial database systems by intelligently embedding cache-optimized trees into disk pages. 
For hash joins, we first exploit cache prefetching separately for the I/O partition phase and the join phase of the algorithm. We propose and evaluate two techniques, Group Prefetching and Software-Pipelined Prefetching, that exploit inter-tuple parallelism to overlap cache misses across the processing of multiple tuples. Then we present a novel algorithm, Inspector Joins, that exploits the free information obtained from one pass of the hash join algorithm to improve the performance of a later pass. This new algorithm addresses the memory bandwidth sharing problem in shared-bus multiprocessor systems. 
We compare our techniques against state-of-the-art cache-friendly algorithms for B+-Trees and hash joins through both simulation studies and real machine experiments. Our experimental results demonstrate dramatic performance benefits of our cache prefetching enabled techniques.

Todd C. Mowry

Papers

Compiler and Hardware Support for Automatic Instruction Prefetching: A Cooperative Approach

A framework for accelerating bottlenecks in GPU execution with assist warps

A programmable, energy-minimal dataflow compiler and architecture

Understanding why correlation profiling improves the predictability of data cache misses in nonnumeric applications

Redesigning database systems in light of cpu cache prefetching