Probability Distributions.- Linear Models for Regression.- Linear Models for Classification.- Neural Networks.- Kernel Methods.- Sparse Kernel Machines.- Graphical Models.- Mixture Models and EM.- Approximate Inference.- Sampling Methods.- Continuous Latent Variables.- Sequential Data.- Combining Models.

Pattern Recognition and Machine Learning

Trusted execution environments, and particularly the Software Guard eXtensions (SGX) included in recent Intel x86 processors, gained significant traction in recent years. A long track of research papers, and increasingly also real-world industry applications, take advantage of the strong hardware-enforced confidentiality and integrity guarantees provided by Intel SGX. Ultimately, enclaved execution holds the compelling potential of securely offloading sensitive computations to untrusted remote platforms.

We present Foreshadow, a practical software-only microarchitectural attack that decisively dismantles the security objectives of current SGX implementations. Crucially, unlike previous SGX attacks, we do not make any assumptions on the victim enclave's code and do not necessarily require kernel-level access. At its core, Foreshadow abuses a speculative execution bug in modern Intel processors, on top of which we develop a novel exploitation methodology to reliably leak plaintext enclave secrets from the CPU cache. We demonstrate our attacks by extracting full cryptographic keys from Intel's vetted architectural enclaves, and validate their correctness by launching rogue production enclaves and forging arbitrary local and remote attestation responses. The extracted remote attestation keys affect millions of devices.

/pdf/foreshadow-extracting-the-keys-to-the-intel-sgx-kingdom-with-4pkx3bpx42.pdf

Foreshadow: extracting the keys to the intel SGX kingdom with transient out-of-order execution

At some point in the future, how far out we do not exactly know, wireless access to the Internet will outstrip all other forms of access bringing the freedom of mobility to the way we access the we...

ACM SIGCOMM computer communication review

GENI: A federated testbed for innovative network experiments

This paper presents a quantitative study on the performance of 3G mobile data offloading through WiFi networks. We recruited 97 iPhone users from metropolitan areas and collected statistics on their WiFi connectivity during a two-and-a-half-week period in February 2010. Our trace-driven simulation using the acquired whole-day traces indicates that WiFi already offloads about 65% of the total mobile data traffic and saves 55% of battery power without using any delayed transmission. If data transfers can be delayed with some deadline until users enter a WiFi zone, substantial gains can be achieved only when the deadline is fairly larger than tens of minutes. With 100-s delays, the achievable gain is less than only 2%-3%, whereas with 1 h or longer deadlines, traffic and energy saving gains increase beyond 29% and 20%, respectively. These results are in contrast to the substantial gain (20%-33%) reported by the existing work even for 100-s delayed transmission using traces taken from transit buses or war-driving. In addition, a distribution model-based simulator and a theoretical framework that enable analytical studies of the average performance of offloading are proposed. These tools are useful for network providers to obtain a rough estimate on the average performance of offloading for a given WiFi deployment condition.

Mobile data offloading: how much can WiFi deliver?

MICA is a scalable in-memory key-value store that handles 65.6 to 76.9 million key-value operations per second using a single general-purpose multi-core system. MICA is over 4-13.5x faster than current state-of-the-art systems, while providing consistently high throughput over a variety of mixed read and write workloads.MICA takes a holistic approach that encompasses all aspects of request handling, including parallel data access, network request handling, and data structure design, but makes unconventional choices in each of the three domains. First, MICA optimizes for multi-core architectures by enabling parallel access to partitioned data. Second, for efficient parallel data access, MICA maps client requests directly to specific CPU cores at the server NIC level by using client-supplied information and adopts a light-weight networking stack that bypasses the kernel. Finally, MICA's new data structures--circular logs, lossy concurrent hash indexes, and bulk chaining--handle both read-and write-intensive workloads at low overhead.

/pdf/mica-a-holistic-approach-to-fast-in-memory-key-value-storage-4k1ehzaddo.pdf

MICA: a holistic approach to fast in-memory key-value storage

Modern chip multiprocessor (CMP) systems employ multiple memory controllers to control access to main memory. The scheduling algorithm employed by these memory controllers has a significant effect on system throughput, so choosing an efficient scheduling algorithm is important. The scheduling algorithm also needs to be scalable — as the number of cores increases, the number of memory controllers shared by the cores should also increase to provide sufficient bandwidth to feed the cores. Unfortunately, previous memory scheduling algorithms are inefficient with respect to system throughput and/or are designed for a single memory controller and do not scale well to multiple memory controllers, requiring significant finegrained coordination among controllers. This paper proposes ATLAS (Adaptive per-Thread Least-Attained-Service memory scheduling), a fundamentally new memory scheduling technique that improves system throughput without requiring significant coordination among memory controllers. The key idea is to periodically order threads based on the service they have attained from the memory controllers so far, and prioritize those threads that have attained the least service over others in each period. The idea of favoring threads with least-attained-service is borrowed from the queueing theory literature, where, in the context of a single-server queue it is known that least-attained-service optimally schedules jobs, assuming a Pareto (or any decreasing hazard rate) workload distribution. After verifying that our workloads have this characteristic, we show that our implementation of least-attained-service thread prioritization reduces the time the cores spend stalling and significantly improves system throughput. Furthermore, since the periods over which we accumulate the attained service are long, the controllers coordinate very infrequently to form the ordering of threads, thereby making ATLAS scalable to many controllers. We evaluate ATLAS on a wide variety of multiprogrammed SPEC 2006 workloads and systems with 4–32 cores and 1–16 memory controllers, and compare its performance to five previously proposed scheduling algorithms. Averaged over 32 workloads on a 24-core system with 4 controllers, ATLAS improves instruction throughput by 10.8%, and system throughput by 8.4%, compared to PAR-BS, the best previous CMP memory scheduling algorithm. ATLAS's performance benefit increases as the number of cores increases.

/pdf/atlas-a-scalable-and-high-performance-scheduling-algorithm-586cczzqud.pdf

ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers

Scaling the performance of short TCP connections on multicore systems is fundamentally challenging Although many proposals have attempted to address various shortcomings, inefficiency of the kernel implementation still persists For example, even state-of-the-art designs spend 70% to 80% of CPU cycles in handling TCP connections in the kernel, leaving only small room for innovation in the user-level programThis work presents mTCP, a high-performance user-level TCP stack for multicore systems mTCP addresses the inefficiencies from the ground up--from packet I/O and TCP connection management to the application interface In addition to adopting well-known techniques, our design (1) translates multiple expensive system calls into a single shared memory reference, (2) allows efficient flow-level event aggregation, and (3) performs batched packet I/O for high I/O efficiency Our evaluations on an 8-core machine showed that mTCP improves the performance of small message transactions by a factor of 25 compared to the latest Linux TCP stack and a factor of 3 compared to the best-performing research system known so far It also improves the performance of various popular applications by 33% to 320% compared to those on the Linux stack

/pdf/mtcp-a-highly-scalable-user-level-tcp-stack-for-multicore-1s43pxactn.pdf

mTCP: a highly scalable user-level TCP stack for multicore systems

Many existing data center network (DCN) flow scheduling schemes minimize flow completion times (FCT) based on prior knowledge of flows and custom switch functions, making them superior in performance but hard to use in practice. By contrast, we seek to minimize FCT with no prior knowledge and existing commodity switch hardware.

To this end, we present PIAS, a DCN flow scheduling mechanism that aims to minimize FCT by mimicking Shortest Job First (SJF) on the premise that flow size is not known a priori. At its heart, PIAS leverages multiple priority queues available in existing commodity switches to implement a Multiple Level Feedback Queue (MLFQ), in which a PIAS flow is gradually demoted from higher-priority queues to lower-priority queues based on the number of bytes it has sent. As a result, short flows are likely to be finished in the first few high-priority queues and thus be prioritized over long flows in general, which enables PIAS to emulate SJF without knowing flow sizes beforehand.

We have implemented a PIAS prototype and evaluated PIAS through both testbed experiments and ns- 2 simulations. We show that PIAS is readily deployable with commodity switches and backward compatible with legacy TCP/IP stacks. Our evaluation results show that PIAS significantly outperforms existing information-agnostic schemes. For example, it reduces FCT by up to 50% and 40% over DCTCP [11] and L2DCT [27] respectively; and it only has a 4.9% performance gap to an ideal information-aware scheme, pFabric [13], for short flows under a production DCN workload.

/pdf/information-agnostic-flow-scheduling-for-commodity-data-1uu4bys2e7.pdf

Information-agnostic flow scheduling for commodity data centers

Traditional execution environments deploy Address Space Layout Randomization (ASLR) to defend against memory corruption attacks. However, Intel Software Guard Extension (SGX), a new trusted execution environment designed to serve security-critical applications on the cloud, lacks such an effective, well-studied feature. In fact, we find that applying ASLR to SGX programs raises non-trivial issues beyond simple engineering for a number of reasons: 1) SGX is designed to defeat a stronger adversary than the traditional model, which requires the address space layout to be hidden from the kernel; 2) the limited memory uses in SGX programs present a new challenge in providing a sufficient degree of entropy; 3) remote attestation conflicts with the dynamic relocation required for ASLR; and 4) the SGX specification relies on known and fixed addresses for key data structures that cannot be randomized. This paper presents SGX-Shield, a new ASLR scheme designed for SGX environments. SGX-Shield is built on a secure in-enclave loader to secretly bootstrap the memory space layout with a finer-grained randomization. To be compatible with SGX hardware (e.g., remote attestation, fixed addresses), SGX-Shield is designed with a software-based data execution protection mechanism through an LLVM-based compiler. We implement SGX-Shield and thoroughly evaluate it on real SGX hardware. It shows a high degree of randomness in memory layouts and stops memory corruption attacks with a high probability. SGX-Shield shows 7.61% performance overhead in running common microbenchmarks and 2.25% overhead in running a more realistic workload of an HTTPS server.

Dongsu Han

Papers

MICA: a holistic approach to fast in-memory key-value storage

ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers

mTCP: a highly scalable user-level TCP stack for multicore systems

Information-agnostic flow scheduling for commodity data centers

SGX-Shield: Enabling Address Space Layout Randomization for SGX Programs.