
Showing papers in "ACM Transactions on Computer Systems in 2017"


Journal ArticleDOI
TL;DR: Presents ix, a dataplane operating system that provides high I/O performance and high resource efficiency while maintaining the protection and isolation benefits of existing kernels.
Abstract: The conventional wisdom is that aggressive networking requirements, such as high packet rates for small messages and μs-scale tail latency, are best addressed outside the kernel, in a user-level networking stack. We present ix, a dataplane operating system that provides high I/O performance and high resource efficiency while maintaining the protection and isolation benefits of existing kernels. ix uses hardware virtualization to separate management and scheduling functions of the kernel (control plane) from network processing (dataplane). The dataplane architecture builds upon a native, zero-copy API and optimizes for both bandwidth and latency by dedicating hardware threads and networking queues to dataplane instances, processing bounded batches of packets to completion, and eliminating coherence traffic and multicore synchronization. The control plane dynamically adjusts core allocations and voltage/frequency settings to meet service-level objectives. We demonstrate that ix outperforms Linux and a user-space network stack significantly in both throughput and end-to-end latency. Moreover, ix improves the throughput of a widely deployed key-value store by up to 6.4× and reduces tail latency by more than 2×. With three varying load patterns, the control plane saves 46%--54% of processor energy, and it allows background jobs to run at 35%--47% of their standalone throughput.
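As a rough illustration of the run-to-completion, bounded-batch processing model described above (a Python sketch, not ix's native zero-copy C API; the queue, handler, and batch-size names are hypothetical):

from collections import deque

MAX_BATCH = 64  # bounded batch size (hypothetical value)

def run_to_completion(rx_queue, handle_packet, send):
    """Process at most MAX_BATCH packets fully before polling again."""
    batch = []
    while rx_queue and len(batch) < MAX_BATCH:
        batch.append(rx_queue.popleft())
    for pkt in batch:
        reply = handle_packet(pkt)   # protocol + application work, no context switch
        if reply is not None:
            send(reply)              # transmit before touching the next packet
    return len(batch)

# Toy usage: echo every packet.
q = deque([b"ping-%d" % i for i in range(100)])
sent = []
while q:
    run_to_completion(q, lambda p: p, sent.append)
print(len(sent), "packets processed")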

52 citations


Journal ArticleDOI
TL;DR: A new methodology to expose reliability issues in block devices under power faults is proposed, which includes specially designed hardware to inject power faults directly to devices, workloads to stress storage components, and techniques to detect various types of failures.
Abstract: Modern storage technology (solid-state disks (SSDs), NoSQL databases, commoditized RAID hardware, etc.) brings new reliability challenges to the already-complicated storage stack. Among other things, the behavior of these new components during power faults—which happen relatively frequently in data centers—is an important yet mostly ignored issue in this dependability-critical area. Understanding how new storage components behave under power fault is the first step towards designing new robust storage systems. In this article, we propose a new methodology to expose reliability issues in block devices under power faults. Our framework includes specially designed hardware to inject power faults directly to devices, workloads to stress storage components, and techniques to detect various types of failures. Applying our testing framework, we test 17 commodity SSDs from six different vendors using more than three thousand fault injection cycles in total. Our experimental results reveal that 14 of the 17 tested SSD devices exhibit surprising failure behaviors under power faults, including bit corruption, shorn writes, unserializable writes, metadata corruption, and total device failure.
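The failure types listed above can be illustrated with a small checker sketch; the on-record format assumed here (a CRC over the payload plus a monotonically increasing sequence number) is invented for illustration and is not the paper's actual detection harness:

import zlib

def classify_record(raw, expected_len):
    # Hypothetical record layout: 4-byte little-endian CRC32, then the payload.
    if len(raw) < expected_len:
        return "shorn write"          # only part of the record reached the media
    stored_crc = int.from_bytes(raw[:4], "little")
    if zlib.crc32(raw[4:expected_len]) != stored_crc:
        return "bit corruption"
    return "ok"

def check_write_order(persisted_seqs):
    # Records carry increasing sequence numbers; a surviving later record with an
    # earlier one missing indicates an unserializable write.
    if not persisted_seqs:
        return "ok"
    seqs = sorted(persisted_seqs)
    gap = any(s not in set(seqs) for s in range(seqs[0], seqs[-1]))
    return "unserializable writes" if gap else "ok"

print(classify_record(b"\x00\x00\x00\x00" + b"A" * 12, 16))  # mismatched CRC -> bit corruption
print(check_write_order([0, 1, 2, 5]))                       # gap -> unserializable writes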

30 citations


Journal ArticleDOI
TL;DR: An optimized transaction chopping algorithm is designed and implemented to decompose a set of large transactions into smaller pieces such that HTM is only required to protect each piece; the work also describes how DrTM supports common database features like read-only transactions and logging for durability.
Abstract: DrTM is a fast in-memory transaction processing system that exploits advanced hardware features such as remote direct memory access (RDMA) and hardware transactional memory (HTM). To achieve high efficiency, it mostly offloads concurrency control such as tracking read/write accesses and conflict detection into HTM in a local machine and leverages the strong consistency between RDMA and HTM to ensure serializability among concurrent transactions across machines. To mitigate the high probability of HTM aborts for large transactions, we design and implement an optimized transaction chopping algorithm to decompose a set of large transactions into smaller pieces such that HTM is only required to protect each piece. We further build an efficient hash table for DrTM by leveraging HTM and RDMA to simplify the design and notably improve the performance. We describe how DrTM supports common database features like read-only transactions and logging for durability. Evaluation using typical OLTP workloads including TPC-C and SmallBank shows that DrTM has better single-node efficiency and scales well on a six-node cluster; it achieves more than 1.51 and 34 million transactions per second for TPC-C and SmallBank, respectively, on a single node, and 5.24 and 138 million on the cluster. These numbers outperform a state-of-the-art single-node system (i.e., Silo) and a distributed transaction system (i.e., Calvin) by at least 1.9× and 29.6× for TPC-C.
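A minimal sketch of the chopping idea, assuming a hypothetical per-transaction operation list and a fixed HTM working-set limit; the hardware transactions and RDMA accesses that DrTM actually uses are emulated here with ordinary Python calls and a software fallback lock:

import random, threading

HTM_CAPACITY = 8         # max accesses one hardware transaction is assumed to hold
MAX_RETRIES = 3
ABORT_PROBABILITY = 0.2  # stand-in for capacity/conflict aborts
fallback_lock = threading.Lock()

def chop(ops):
    # Split a large transaction's operation list into HTM-sized pieces.
    return [ops[i:i + HTM_CAPACITY] for i in range(0, len(ops), HTM_CAPACITY)]

def apply_piece(piece, store):
    for op, key, value in piece:
        if op == "write":
            store[key] = value

def run_piece(piece, store):
    # Emulate an HTM attempt (a real runtime would use _xbegin()/_xend()).
    for _ in range(MAX_RETRIES):
        if random.random() < ABORT_PROBABILITY:
            continue                      # emulated abort: retry the piece
        apply_piece(piece, store)
        return
    with fallback_lock:                   # give up on HTM and serialize this piece
        apply_piece(piece, store)

store = {}
ops = [("write", f"k{i}", i) for i in range(20)]
for piece in chop(ops):
    run_piece(piece, store)
print(len(store), "keys committed")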

29 citations


Journal ArticleDOI
TL;DR: An error was discovered in the setup of the experiment: memcached started in an empty state, causing some GET requests to require less computation than intended; the performance differences between the original and corrected results are shown.
Abstract: On page 21 of “The IX Operating System: Combining Low Latency, High Throughput and Efficiency in a Protected Dataplane” we describe our use of the tool mutilate to evaluate the latency and throughput of memcached. We discovered an error in our setup: we did not load the initial key-value state into memcached before the start of the experiment. Instead, memcached started in an empty state, causing some GET requests to require less computation than intended. Table 1 shows the performance differences between our original and corrected memcached results.

26 citations


Journal ArticleDOI
TL;DR: This article proposes to combine hardware customization and specialization techniques to improve the performance and energy efficiency of mobile web applications, and proposes, synthesizes, and evaluates two new domain-specific specializations, called the Style Resolution Unit and the Browser Engine Cache.
Abstract: Mobile applications are increasingly being built using web technologies as a common substrate to achieve portability and to improve developer productivity. Unfortunately, web applications often incur large performance overhead, directly affecting the user quality-of-service (QoS) experience. Traditional approaches to improving mobile processor performance have mostly adopted desktop-like design techniques, such as increasing single-core microarchitecture complexity and aggressively integrating more cores. However, such a desktop-oriented strategy is likely coming to an end due to the stringent energy and thermal constraints that mobile devices impose. Therefore, we must pivot away from traditional mobile processor design techniques in order to provide sustainable performance improvement while maintaining energy efficiency. In this article, we propose to combine hardware customization and specialization techniques to improve the performance and energy efficiency of mobile web applications. We first perform design-space exploration (DSE) and identify opportunities in customizing existing general-purpose mobile processors, that is, tuning microarchitecture parameters. The thorough DSE also lets us discover sources of energy inefficiency in customized general-purpose architectures. To mitigate these inefficiencies, we propose, synthesize, and evaluate two new domain-specific specializations, called the Style Resolution Unit and the Browser Engine Cache. Our optimizations boost performance and energy efficiency at the same time while maintaining general-purpose programmability. As emerging mobile workloads rely increasingly on web technologies, the type of optimizations we propose will become important in the future and are likely to have a long-lasting and widespread impact.
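The hardware Style Resolution Unit and Browser Engine Cache cannot be reproduced in software, but the reuse they exploit can be sketched with a memoized style-resolution function; the stylesheet, selectors, and cache size below are hypothetical:

from functools import lru_cache

# Hypothetical stylesheet: selector -> computed property dict.
STYLESHEET = {
    "p":      {"font-size": "14px", "color": "black"},
    "p.note": {"font-size": "12px", "color": "gray"},
    "h1":     {"font-size": "28px", "color": "black"},
}

@lru_cache(maxsize=256)          # plays the role of a small, fast lookup structure
def resolve_style(tag, cls):
    # Repeated (tag, class) pairs hit the cache instead of re-running selector matching.
    style = dict(STYLESHEET.get(tag, {}))
    style.update(STYLESHEET.get(f"{tag}.{cls}", {}) if cls else {})
    return tuple(sorted(style.items()))   # hashable so the result can be cached

dom = [("p", "note"), ("p", None), ("p", "note"), ("h1", None)] * 1000
for tag, cls in dom:
    resolve_style(tag, cls)
print(resolve_style.cache_info())  # most lookups are hits, mirroring the reuse the specializations exploit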

16 citations


Journal ArticleDOI
TL;DR: This work proposes Seer, a software scheduler that addresses a key restriction of HTM by leveraging an online probabilistic inference technique that identifies the most likely conflict relations and establishes a dynamic locking scheme to serialize transactions in a fine-grained manner.
Abstract: The ubiquity of multicore processors has led programmers to write parallel and concurrent applications to take advantage of the underlying hardware and speed up their executions. In this context, Transactional Memory (TM) has emerged as a simple and effective synchronization paradigm, via the familiar abstraction of atomic transactions. After many years of intense research, major processor manufacturers (including Intel) have recently released mainstream processors with hardware support for TM (HTM). In this work, we study a relevant issue with great impact on the performance of HTM. Due to the optimistic and inherently limited nature of HTM, transactions may have to be aborted and restarted numerous times, without any progress guarantee. As a result, it is up to the software library that regulates the HTM usage to ensure progress and optimize performance. Transaction scheduling is probably one of the most well-studied and effective techniques to achieve these goals. However, these recent mainstream HTMs have some technical limitations that prevent the adoption of known scheduling techniques: unlike software implementations of TM used in the past, existing HTMs provide limited or no information on which memory regions or contending transactions caused the abort. To address this crucial issue for HTMs, we propose Seer, a software scheduler that addresses precisely this restriction of HTM by leveraging an online probabilistic inference technique that identifies the most likely conflict relations and establishes a dynamic locking scheme to serialize transactions in a fine-grained manner. The key idea of our solution is to constrain the portions of parallelism that negatively affect the whole system. As a result, this not only prevents performance degradation but in fact unlocks further scalability and performance for HTM. Via an extensive evaluation study, we show that Seer improves the performance of Intel's HTM by up to 3.6×, and by 65% on average across all concurrency degrees and benchmarks on a large processor with 28 cores.
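A minimal sketch of the scheduling idea, assuming transactions are grouped into types (atomic blocks) and that abort and co-activity counters are maintained by the software runtime; the threshold and lock layout are hypothetical, and the hardware transactions themselves are not shown:

import threading
from collections import defaultdict

co_active = defaultdict(int)   # times txn types i and j were running together
co_abort = defaultdict(int)    # times that overlap ended in an abort of i
locks = defaultdict(threading.Lock)
CONFLICT_THRESHOLD = 0.5

def record(i, j, aborted):
    # Update statistics after txn type i ran concurrently with type j.
    co_active[(i, j)] += 1
    if aborted:
        co_abort[(i, j)] += 1

def likely_conflicts(i, j):
    # HTM never reports *who* caused an abort, so estimate it from history.
    n = co_active[(i, j)]
    return n > 0 and co_abort[(i, j)] / n >= CONFLICT_THRESHOLD

def schedule(i, active_types, run_txn):
    # Before starting txn type i, serialize it against each concurrently active
    # type it probably conflicts with, using one lock per conflicting pair.
    pairs = sorted({tuple(sorted((i, j))) for j in active_types if likely_conflicts(i, j)})
    held = []
    try:
        for pair in pairs:             # fixed acquisition order avoids deadlock
            locks[pair].acquire()
            held.append(locks[pair])
        return run_txn()
    finally:
        for lock in reversed(held):
            lock.release()

record("tx_pay", "tx_order", aborted=True)
record("tx_pay", "tx_order", aborted=True)
print(schedule("tx_pay", {"tx_order"}, lambda: "committed"))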

15 citations


Journal ArticleDOI
TL;DR: Introduces Hipster, a technique that combines heuristics and reinforcement learning to improve resource efficiency in cloud systems; it improves the QoS guarantee for Web-Search and Memcached while reducing energy consumption by up to 18%.
Abstract: In 2013, U.S. data centers accounted for 2.2% of the country’s total electricity consumption, a figure that is projected to increase rapidly over the next decade. Many important data center workloads in cloud computing are interactive, and they demand strict levels of quality-of-service (QoS) to meet user expectations, making it challenging to optimize power consumption along with increasing performance demands. This article introduces Hipster, a technique that combines heuristics and reinforcement learning to improve resource efficiency in cloud systems. Hipster explores heterogeneous multi-cores and dynamic voltage and frequency scaling to reduce energy consumption while managing the QoS of latency-critical workloads. To improve data center utilization and make the best use of the available resources, Hipster can dynamically assign remaining cores to batch workloads without violating the QoS constraints for the latency-critical workloads. We perform experiments using a 64-bit ARM big.LITTLE platform and show that, compared to prior work, Hipster improves the QoS guarantee for Web-Search from 80% to 96%, and for Memcached from 92% to 99%, while reducing the energy consumption by up to 18%. Hipster also learns and adapts automatically to the specific requirements of new incoming workloads, doing just enough to meet the QoS targets while optimizing resource consumption.
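A minimal Q-learning sketch of the kind of mapping Hipster learns (measured-latency state to core/DVFS configuration); the action set, power model, QoS target, and learning constants below are invented for illustration:

import random
from collections import defaultdict

ACTIONS = [("little", 0.6), ("little", 1.4), ("big", 1.0), ("big", 2.0)]  # (core type, GHz)
Q = defaultdict(float)            # Q[(state, action)]
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
QOS_TARGET_MS = 5.0

def state_of(latency_ms):
    # Bucket the measured tail latency relative to the QoS target.
    return min(int(latency_ms / QOS_TARGET_MS * 4), 7)

def reward(latency_ms, core, ghz):
    power = (1.0 if core == "little" else 3.0) * ghz        # crude power model
    return (-10.0 if latency_ms > QOS_TARGET_MS else 1.0) - 0.1 * power

def choose(state):
    if random.random() < EPSILON:                            # explore occasionally
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])         # otherwise exploit

def step(state, action, latency_ms, next_state):
    # Standard Q-learning update after observing one control interval.
    r = reward(latency_ms, *action)
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (r + GAMMA * best_next - Q[(state, action)])

s = state_of(6.2)                 # over the 5 ms target
a = choose(s)
step(s, a, 6.2, state_of(3.1))    # one control interval's update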

11 citations


Journal ArticleDOI
TL;DR: The Supercloud encapsulates applications in a virtual cloud under users’ full control and can incorporate one or more availability zones within a cloud provider or across different providers, allowing applications to optimize performance.
Abstract: Infrastructure-as-a-Service (IaaS) cloud providers hide available interfaces for virtual machine (VM) placement and migration, CPU capping, memory ballooning, page sharing, and I/O throttling, limiting the ways in which applications can optimally configure resources or respond to dynamically shifting workloads. If given these interfaces, applications could migrate VMs in response to diurnal workloads or changing prices, adjust resources in response to load changes, and so on. This article proposes a new abstraction that we call a Library Cloud and that allows users to customize the diverse available cloud resources to best serve their applications. We built a prototype of a Library Cloud that we call the Supercloud. The Supercloud encapsulates applications in a virtual cloud under users’ full control and can incorporate one or more availability zones within a cloud provider or across different providers. The Supercloud provides virtual machines, storage, and networking, complete with a full set of management operations, allowing applications to optimize performance. In this article, we demonstrate various innovations enabled by the Library Cloud.
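As a toy illustration of one policy such an abstraction enables, here is a price-driven migration check; the zone names, prices, migration cost, and migrate callback are hypothetical and do not correspond to the Supercloud's actual interfaces:

MIGRATION_COST_USD = 0.12            # amortized cost of one live migration (hypothetical)

def cheapest_zone(prices):
    # prices: {zone_name: $/hour for the VM's instance type}
    return min(prices, key=prices.get)

def maybe_migrate(vm, current_zone, prices, remaining_hours, migrate):
    # Migrate only if the projected savings outweigh the migration cost.
    target = cheapest_zone(prices)
    savings = (prices[current_zone] - prices[target]) * remaining_hours
    if target != current_zone and savings > MIGRATION_COST_USD:
        migrate(vm, target)          # placeholder for a Library Cloud migration call
        return target
    return current_zone

# Toy usage across two providers' zones.
prices = {"provider-a/us-east": 0.096, "provider-b/us-west": 0.067}
zone = maybe_migrate("web-vm-1", "provider-a/us-east", prices, 24, lambda vm, z: None)
print("vm now in", zone)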

11 citations


Journal ArticleDOI
TL;DR: This work proposes Adrenaline, an approach to leverage finer-granularity (tens of nanoseconds) voltage boosting to effectively rein in the tail latency with query-level precision, and demonstrates the effectiveness of the methodology under various workload configurations.
Abstract: Reducing the long tail of the query latency distribution in modern warehouse scale computers is critical for improving performance and quality of service (QoS) of workloads such as Web Search and Memcached. Traditional turbo boost increases a processor’s voltage and frequency during a coarse-grained sliding window, boosting all queries that are processed during that window. However, the inability of such a technique to pinpoint tail queries for boosting limits its tail reduction benefit. In this work, we propose Adrenaline, an approach to leverage finer-granularity (tens of nanoseconds) voltage boosting to effectively rein in the tail latency with query-level precision. Two key insights underlie this work. First, emerging finer granularity voltage/frequency boosting is an enabling mechanism for intelligent allocation of the power budget to precisely boost only the queries that contribute to the tail latency; second, per-query characteristics can be used to design indicators for proactively pinpointing these queries, triggering boosting accordingly. Based on these insights, Adrenaline effectively pinpoints and boosts queries that are likely to increase the tail distribution and can reap more benefit from the voltage/frequency boost. By evaluating under various workload configurations, we demonstrate the effectiveness of our methodology. We achieve up to a 2.50× tail latency improvement for Memcached and up to 3.03× for Web Search over coarse-grained dynamic voltage and frequency scaling (DVFS) given a fixed boosting power budget. When optimizing for energy reduction, Adrenaline achieves up to a 1.81× improvement for Memcached and up to 1.99× for Web Search over coarse-grained DVFS. By using carefully chosen boost thresholds, Adrenaline further improves the tail latency reduction to 4.82× over coarse-grained DVFS.
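A sketch of the per-query indicator idea: predict each query's service time from a cheap feature and boost only predicted-tail queries within a boosting budget. The threshold, budget, and latency model below are hypothetical, and the real mechanism operates in hardware at nanosecond granularity:

TAIL_THRESHOLD_US = 400.0     # queries predicted beyond this are "tail" candidates
BOOST_BUDGET_US = 1_000.0     # total boosted execution time allowed per interval

def predict_service_time(query):
    # Cheap per-query indicator, e.g. for memcached: GETs on large values run longer.
    return 50.0 + 0.02 * query["value_bytes"]

def plan_boosts(queries):
    # Decide, query by query, whether to boost; stop when the budget runs out.
    spent, plan = 0.0, []
    for q in queries:
        est = predict_service_time(q)
        boost = est > TAIL_THRESHOLD_US and spent + est <= BOOST_BUDGET_US
        if boost:
            spent += est
        plan.append((q["id"], boost))
    return plan

queries = [{"id": i, "value_bytes": 1000 if i % 10 else 40000} for i in range(30)]
print(sum(b for _, b in plan_boosts(queries)), "of", len(queries), "queries boosted")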

11 citations


Journal ArticleDOI
TL;DR: REEF provides mechanisms that facilitate resource reuse for data caching and state management abstractions that greatly ease the development of elastic data processing pipelines on cloud platforms that support a Resource Manager service.
Abstract: Resource Managers like YARN and Mesos have emerged as a critical layer in the cloud computing system stack, but the developer abstractions for leasing cluster resources and instantiating application logic are very low level. This flexibility comes at a high cost in terms of developer effort, as each application must repeatedly tackle the same challenges (e.g., fault tolerance, task scheduling and coordination) and reimplement common mechanisms (e.g., caching, bulk-data transfers). This article presents REEF, a development framework that provides a control plane for scheduling and coordinating task-level (data-plane) work on cluster resources obtained from a Resource Manager. REEF provides mechanisms that facilitate resource reuse for data caching and state management abstractions that greatly ease the development of elastic data processing pipelines on cloud platforms that support a Resource Manager service. We illustrate the power of REEF by showing applications built atop it: a distributed shell application, a machine-learning framework, a distributed in-memory caching system, and a port of the CORFU system. REEF is currently an Apache top-level project that has attracted contributors from several institutions, and it is being used to develop several commercial offerings such as the Azure Stream Analytics service.
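The resource-reuse idea can be sketched as a driver that leases containers once and pushes successive tasks onto them so state is cached across tasks; the class and method names below are invented for illustration and are not REEF's actual (Java) API:

class Evaluator:
    """Stand-in for a leased container that outlives individual tasks."""
    def __init__(self, cid):
        self.cid = cid
        self.cache = {}            # state retained across tasks (the reuse REEF enables)

    def run(self, task):
        return task(self.cache)

def driver(resource_manager, stages):
    # A control-plane loop: lease evaluators once, then push successive
    # data-plane stages onto them instead of re-acquiring containers.
    evaluators = [Evaluator(cid) for cid in resource_manager.allocate(n=2)]
    for stage in stages:
        for ev in evaluators:
            ev.run(stage)

class FakeRM:
    def allocate(self, n):
        return range(n)

def load(cache):
    cache.setdefault("data", list(range(1000)))

def train(cache):
    cache["model"] = sum(cache["data"]) / len(cache["data"])

driver(FakeRM(), [load, train])   # the second stage reuses data cached by the first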

10 citations


Journal ArticleDOI
Seyed Majid Zahedi, Songchun Fan, Matthew Faw, Elijah Cole, Benjamin C. Lee
TL;DR: This work describes a sprinting architecture in which many independent chip multiprocessors share a power supply and sprints are constrained by the chips’ thermal limits and the rack’s power limits, and presents the computational sprinting game, a multi-agent perspective on managing sprints.
Abstract: Computational sprinting is a class of mechanisms that boost performance but dissipate additional power. We describe a sprinting architecture in which many independent chip multiprocessors share a power supply and sprints are constrained by the chips’ thermal limits and the rack’s power limits. Moreover, we present the computational sprinting game, a multi-agent perspective on managing sprints. Strategic agents decide whether to sprint based on application phases and system conditions. The game produces an equilibrium that improves task throughput for data analytics workloads by 4--6× over prior greedy heuristics and performs within 90% of an upper bound on throughput from a globally optimized policy.
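A toy simulation of the kind of threshold strategies the sprinting game studies: each agent sprints when its current phase's expected speedup exceeds a threshold, and oversubscribing the shared supply trips the breaker and forces a recovery period. The population size, speedup distribution, and penalty below are hypothetical, and the search over thresholds only approximates a symmetric equilibrium:

import random

N_CHIPS, BREAKER_FRACTION, RECOVERY_ROUNDS = 100, 0.25, 3

def simulate(threshold, rounds=1000):
    # Each agent sprints when its phase speedup exceeds `threshold`; if more than
    # BREAKER_FRACTION of chips sprint at once, the breaker trips.
    throughput, cooldown = 0.0, 0
    for _ in range(rounds):
        gains = [random.uniform(1.0, 6.0) for _ in range(N_CHIPS)]  # per-phase sprint speedups
        if cooldown:
            cooldown -= 1
            throughput += N_CHIPS * 1.0          # everyone forced to nominal speed
            continue
        sprinters = [g for g in gains if g > threshold]
        if len(sprinters) > BREAKER_FRACTION * N_CHIPS:
            cooldown = RECOVERY_ROUNDS           # tripped: future rounds are penalized
            throughput += N_CHIPS * 1.0
        else:
            throughput += sum(sprinters) + (N_CHIPS - len(sprinters)) * 1.0
    return throughput

best = max((simulate(t), t) for t in [2.0, 3.0, 4.0, 5.0])
print("best symmetric threshold:", best[1])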

Journal ArticleDOI
TL;DR: An automated technique that performs hardware–software coanalysis of the application and ultra-low-power processor in an embedded system to determine application-specific peak power and energy requirements is presented.
Abstract: Many emerging applications such as the Internet of Things, wearables, implantables, and sensor networks are constrained by power and energy. These applications rely on ultra-low-power processors that have rapidly become the most abundant type of processor manufactured today. In the ultra-low-power embedded systems used by these applications, peak power and energy requirements are the primary factors that determine critical system characteristics, such as size, weight, cost, and lifetime. While the power and energy requirements of these systems tend to be application specific, conventional techniques for rating peak power and energy cannot accurately bound the power and energy requirements of an application running on a processor, leading to overprovisioning that increases system size and weight. In this article, we present an automated technique that performs hardware–software coanalysis of the application and ultra-low-power processor in an embedded system to determine application-specific peak power and energy requirements. Our technique provides more accurate, tighter bounds than conventional techniques for determining peak power and energy requirements. Also, unlike conventional approaches, our technique reports guaranteed bounds on peak power and energy independent of an application’s input set. Tighter bounds on peak power and energy can be exploited to reduce system size, weight, and cost.
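The core of the bounding argument can be sketched as follows: instead of rating peak power over the whole ISA, bound it over the instructions the application can actually issue. The per-instruction numbers and program representation below are hypothetical stand-ins for the paper's hardware-software coanalysis:

# Hypothetical per-instruction peak power (mW), e.g. from gate-level analysis.
PEAK_POWER_MW = {"add": 1.1, "mul": 2.4, "load": 3.0, "store": 3.2, "branch": 0.9}

def reachable_instructions(program):
    # In the real technique this comes from hardware-software coanalysis; here we
    # simply collect the opcodes that appear in the program.
    return {opcode for opcode, *_ in program}

def peak_power_bound(program):
    # Application-specific bound: the max over instructions the program can issue,
    # rather than the max over the whole ISA (the conventional rating).
    used = reachable_instructions(program)
    return max(PEAK_POWER_MW[i] for i in used)

sensor_loop = [("load", "r1"), ("add", "r1"), ("branch", "loop")]
print("ISA-wide rating :", max(PEAK_POWER_MW.values()), "mW")
print("app-specific    :", peak_power_bound(sensor_loop), "mW")   # tighter bound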

Journal ArticleDOI
TL;DR: A framework to extract the directory access stream from parallel least recently used stacks is developed, enabling rapid analysis of the directory's accesses and contents across both core count and cache size scaling, and the notion of relative reuse distance between sharers is introduced.
Abstract: Researchers have proposed numerous techniques to improve the scalability of coherence directories. The effectiveness of these techniques not only depends on application behavior, but also on the CPU's configuration, for example, its core count and cache size. As CPUs continue to scale, it is essential to explore the directory's application and architecture dependencies. However, this is challenging given the slow speed of simulators. While it is common practice to simulate different applications, previous research on directory designs has explored only a few—and in most cases, only one—CPU configuration, which can lead to an incomplete and inaccurate view of the directory's behavior. This article proposes to use multicore reuse distance analysis to study coherence directories. We develop a framework to extract the directory access stream from parallel least recently used (LRU) stacks, enabling rapid analysis of the directory's accesses and contents across both core count and cache size scaling. A key part of our framework is the notion of relative reuse distance between sharers, which defines sharing in a capacity-dependent fashion and facilitates our analyses along the data cache size dimension. We implement our framework in a profiler and then apply it to gain insights into the impact of multicore CPU scaling on directory behavior. Our profiling results show that directory accesses decrease by 3.3× when scaling the data cache size from 16KB to 1MB, despite an increase in sharing-based directory accesses. We also show that increased sharing caused by data cache scaling allows the portion of on-chip memory occupied by the directory to be reduced by 43.3%, compared to a reduction of only 2.6% when scaling the number of cores. We further show that certain directory entries exhibit high temporal reuse. In addition to gaining insights, we also validate our profile-based results, and find they are within 2--10% of cache simulations on average, across different validation experiments. Finally, we conduct four case studies that illustrate our insights on existing directory techniques. In particular, we demonstrate our directory occupancy insights on a Cuckoo directory; we apply our sharing insights to provide bounds on the size of Scalable Coherence Directories (SCD) and Dual-Grain Directories (DGD); and, we demonstrate our directory entry reuse insights on a multilevel directory design.
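A single-core sketch of the underlying reuse distance machinery: compute LRU stack distances from an address trace and count an access as reaching the directory when its distance exceeds the private cache's capacity in blocks. The multicore extensions in the article (parallel LRU stacks, relative reuse distance between sharers) are not reproduced here:

def profile(trace, private_cache_blocks):
    # trace: list of block addresses from one core. Returns (reuse distances,
    # number of accesses that miss the private cache and hence reach the directory).
    stack, distances, directory_accesses = [], [], 0
    for addr in trace:
        if addr in stack:
            d = stack.index(addr)          # LRU stack distance (0 = most recently used)
            stack.remove(addr)
        else:
            d = float("inf")               # cold access
        distances.append(d)
        if d >= private_cache_blocks:      # would miss any LRU cache of this size
            directory_accesses += 1
        stack.insert(0, addr)
    return distances, directory_accesses

trace = [0, 1, 2, 0, 1, 2, 3, 0] * 100
for size in (2, 4, 8):
    _, dir_acc = profile(trace, size)
    print(f"{size}-block private cache -> {dir_acc} directory accesses")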