
Showing papers in "ACM Transactions on Computer Systems in 2021"


Journal ArticleDOI
TL;DR: A novel method for neural network quantization that emulates a non-uniform k-quantile quantizer adapting to the distribution of the quantized parameters, providing an alternative to existing uniform quantization techniques for neural networks.
Abstract: We present a novel method for neural network quantization. Our method, named UNIQ, emulates a non-uniform k-quantile quantizer and adapts the model to perform well with quantized weights by injecting noise into the weights at training time. As a by-product of injecting noise into the weights, we find that activations can also be quantized to as low as 8 bits with only a minor accuracy degradation. Our non-uniform quantization approach provides a novel alternative to the existing uniform quantization techniques for neural networks. We further propose a novel complexity metric, the number of bit operations performed (BOPs), and we show that this metric has a linear relation with logic utilization and power. We suggest evaluating the trade-off of accuracy vs. complexity (BOPs). The proposed method, when evaluated on ResNet18/34/50 and MobileNet on ImageNet, outperforms the prior state of the art both in the low-complexity regime and the high-accuracy regime. We demonstrate the practical applicability of this approach by implementing our non-uniformly quantized CNN on an FPGA.
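
To make the idea concrete, here is a minimal sketch of a k-quantile quantizer with training-time noise injection, assuming NumPy weight tensors; the function names and the exact noise model are illustrative, not the authors' implementation.

```python
import numpy as np

def kquantile_quantize(weights, k=16):
    """Map each weight to a representative value of its k-quantile bin,
    so every bin holds roughly 1/k of the weights (non-uniform levels)."""
    flat = weights.ravel()
    edges = np.quantile(flat, np.linspace(0.0, 1.0, k + 1))   # bin boundaries
    reps = np.quantile(flat, (np.arange(k) + 0.5) / k)        # one level per bin
    bins = np.clip(np.searchsorted(edges, flat, side="right") - 1, 0, k - 1)
    return reps[bins].reshape(weights.shape)

def noisy_weights_for_training(weights, k=16, rng=np.random.default_rng(0)):
    """Emulate the quantizer during training by perturbing weights with
    noise of the same magnitude as the quantization error."""
    err = kquantile_quantize(weights, k) - weights
    return weights + rng.permutation(err.ravel()).reshape(weights.shape)
```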

50 citations


Journal ArticleDOI
TL;DR: In this article, the authors present simulation software for the vulnerability evaluation of RMSs and illustrate three case studies in which the tool was effectively used to model and assess state-of-the-art distributed reputation management systems.
Abstract: Multi-agent distributed systems are characterized by autonomous entities that interact with each other to provide, and/or request, different kinds of services. In several contexts, especially when a reward is offered according to the quality of service, individual agents (or coordinated groups) may act in a selfish way. To prevent such behaviours, distributed Reputation Management Systems (RMSs) provide every agent with the capability of computing the reputation of the others according to direct past interactions, as well as indirect opinions reported by their neighbourhood. This reliance on gossiped information introduces a weakness that makes RMSs vulnerable to malicious agents intent on disseminating false reputation values. Given the variety of application scenarios in which RMSs can be adopted, as well as the multitude of behaviours that agents can implement, designers need RMS evaluation tools that allow them to predict the robustness of the system to security attacks before its actual deployment. To this aim, we present simulation software for the vulnerability evaluation of RMSs and illustrate three case studies in which this tool was effectively used to model and assess state-of-the-art RMSs.
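
As a sketch of the mechanism under attack, the toy aggregation below (the weighting scheme and parameter names are assumptions, not a specific RMS from the paper) shows how an agent combines direct experience with gossiped opinions, weighting each reporter by its own reputation so that a malicious reporter's influence is capped:

```python
def update_reputation(direct, gossiped, reporter_rep, alpha=0.7):
    """direct: score in [0,1] from past interactions with the target.
    gossiped: dict reporter -> opinion in [0,1] about the target.
    reporter_rep: dict reporter -> how much we trust that reporter."""
    w = sum(reporter_rep.get(r, 0.0) for r in gossiped)
    if w > 0:
        indirect = sum(reporter_rep.get(r, 0.0) * op
                       for r, op in gossiped.items()) / w
    else:
        indirect = 0.5  # neutral prior when no trusted reporters exist
    # alpha balances first-hand evidence against hearsay; low-reputation
    # reporters contribute little even if they gossip extreme values.
    return alpha * direct + (1 - alpha) * indirect
```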

8 citations


Journal ArticleDOI
TL;DR: KylinX provides the pVM (process-like VM) abstraction, taking the hypervisor as an OS and the Unikernel appliance as a process, combining the strong isolation of Unikernel appliances with the flexibility and efficiency of conventional processes.
Abstract: Unikernel specializes a minimalistic LibOS and a target application into a standalone single-purpose virtual machine (VM) running on a hypervisor, which is referred to as a (virtual) appliance. Compared to traditional VMs, Unikernel appliances have a smaller memory footprint and lower overhead while guaranteeing the same level of isolation. On the downside, Unikernel strips the process abstraction out of its monolithic appliance and thus sacrifices flexibility, efficiency, and applicability. In this article, we examine whether there is a balance embracing the best of both Unikernel appliances (strong isolation) and processes (high flexibility/efficiency). We present KylinX, a dynamic library operating system for simplified and efficient cloud virtualization that provides the pVM (process-like VM) abstraction. A pVM takes the hypervisor as an OS and the Unikernel appliance as a process, allowing both page-level and library-level dynamic mapping. At the page level, KylinX supports pVM fork plus a set of APIs for inter-pVM communication (IpC), which is compatible with conventional UNIX IPC. At the library level, KylinX allows shared libraries to be linked to a Unikernel appliance at runtime. KylinX enforces mapping restrictions against potential threats. We implement a prototype of KylinX by modifying MiniOS and the Xen tools. Extensive experimental results show that KylinX achieves performance similar to conventional processes in both micro benchmarks (fork, IpC, library update, etc.) and applications (Redis, web server, and DNS server), while retaining the strong isolation benefit of VMs/Unikernels.
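
The pVM abstraction can be pictured with a small conceptual model; this is pure illustration in Python (the real KylinX implementation is C code against MiniOS and the Xen toolchain), showing page-level fork semantics and library-level linking guarded by mapping restrictions:

```python
class PVM:
    """Toy model of a process-like VM: forkable like a process, linkable
    at runtime, with library mappings checked against a restriction list."""
    def __init__(self, appliance, allowed_libs):
        self.appliance = appliance
        self.allowed_libs = set(allowed_libs)  # checked at link time
        self.linked = {}

    def fork(self):
        # Page-level: the child starts from the parent's image, mirroring
        # UNIX fork but at VM granularity (copy-on-write in the real system).
        child = PVM(self.appliance, self.allowed_libs)
        child.linked = dict(self.linked)
        return child

    def dlopen(self, lib_name, lib):
        # Library-level: runtime linking is refused for libraries outside
        # the restriction list, guarding against malicious mappings.
        if lib_name not in self.allowed_libs:
            raise PermissionError(f"{lib_name} not permitted for this pVM")
        self.linked[lib_name] = lib
```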

5 citations


Journal ArticleDOI
TL;DR: The Latency-Tolerant Register File (LTRF) as discussed by the authors prefetches the estimated register working-set from the main register file to the register file cache under software control at the beginning of each interval, overlapping the prefetch latency with the execution of other warps.
Abstract: Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes a hierarchical register file that reduces register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache. In this article, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical structure while keeping power consumption low. We observe that compile-time interval analysis enables us to divide GPU program execution into intervals with an accurate estimate of a warp's aggregate register working-set within each interval. The key idea of LTRF is to prefetch the estimated register working-set from the main register file to the register file cache under software control, at the beginning of each interval, and overlap the prefetch latency with the execution of other warps. We observe that register bank conflicts while prefetching the registers could greatly reduce the effectiveness of LTRF. Therefore, we devise a compile-time register renumbering technique to reduce the likelihood of register bank conflicts. Our experimental results show that LTRF enables high-capacity yet long-latency main GPU register files, paving the way for various optimizations. As an example optimization, we implement the main register file with emerging high-density, high-latency memory technologies, enabling 8× larger capacity and improving overall GPU performance by 34%.
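
Two pieces of the design lend themselves to a back-of-the-envelope sketch under assumed parameters: how much prefetch latency stays exposed once other warps' execution overlaps it, and how compile-time renumbering spreads a working-set across banks. The numbers and naming are illustrative, not LTRF's actual parameters:

```python
def exposed_prefetch_cycles(prefetch_cycles, other_warps_exec_cycles):
    """Latency left visible after overlapping a working-set prefetch with
    the execution of other warps (idealised: no bank conflicts)."""
    return max(0, prefetch_cycles - other_warps_exec_cycles)

def renumber_registers(working_set_order, num_banks=8):
    """Compile-time renumbering sketch: registers prefetched together get
    consecutive new ids, so with bank = id % num_banks a wide prefetch
    touches each bank as evenly as possible."""
    return {reg: new_id for new_id, reg in enumerate(working_set_order)}

# E.g. a 400-cycle main-register-file prefetch is fully hidden once other
# warps supply at least 400 cycles of execution:
assert exposed_prefetch_cycles(400, 6 * 80) == 0
```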

4 citations


Journal ArticleDOI
TL;DR: SmartIO as discussed by the authors is a distributed I/O solution that enables cost-effective scaling of resources between networked hosts by allowing devices to become part of the same global address space as cluster applications.
Abstract: The large variety of compute-heavy and data-driven applications accelerates the need for a distributed I/O solution that enables cost-effective scaling of resources between networked hosts. For example, in a cluster system, different machines may have various devices available at different times, but moving workloads to remote units over the network is often costly and introduces large overheads compared to accessing local resources. To facilitate I/O disaggregation and device sharing among hosts connected using Peripheral Component Interconnect Express (PCIe) non-transparent bridges, we present SmartIO. NVMe drives, GPUs, network adapters, or any other standard PCIe device may be borrowed and accessed directly, as if they were locally attached to the borrowing hosts. We provide capabilities beyond existing disaggregation solutions by combining traditional I/O with distributed shared-memory functionality, allowing devices to become part of the same global address space as cluster applications. Software is entirely removed from the data path, and simultaneous sharing of a device among application processes running on remote hosts is enabled. Our experimental results show that I/O devices can be shared with remote hosts, achieving native PCIe performance. Thus, compared to existing device distribution mechanisms, SmartIO provides more efficient, low-cost resource sharing, increasing the overall system performance.
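
The borrow-and-map usage model can be sketched as follows. All class and method names here are invented for illustration (SmartIO itself is implemented in kernel space and C libraries), but the flow mirrors the description above: a lent device becomes usable as if local, and its DMA targets buffers mapped into the global address space with no software on the data path.

```python
class Fabric:
    """Toy model of device lending over PCIe non-transparent bridges."""
    def __init__(self):
        self.lendable = {}    # device_id -> Device offered by its owner
        self.global_as = {}   # global I/O address -> mapped buffer

    def lend(self, device_id, device):
        self.lendable[device_id] = device

    def borrow(self, device_id):
        return self.lendable[device_id]   # usable as if locally attached

    def map_to_global(self, buf):
        gaddr = id(buf)                   # stand-in for a real global address
        self.global_as[gaddr] = buf
        return gaddr

class Device:
    def __init__(self, fabric):
        self.fabric = fabric

    def dma_write(self, gaddr, data):
        # The device's DMA engine writes straight into the mapped buffer;
        # no host software touches the data path.
        self.fabric.global_as[gaddr][:len(data)] = data

fabric = Fabric()
fabric.lend("nvme0", Device(fabric))   # owner offers its NVMe to the cluster
dev = fabric.borrow("nvme0")           # a remote host borrows it
buf = bytearray(16)
dev.dma_write(fabric.map_to_global(buf), b"hello")
```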

4 citations


Journal ArticleDOI
TL;DR: Metron as mentioned in this paper exploits the underlying network and commodity servers' resources to offload part of the packet processing logic to the network, and then uses tag-based hardware dispatching to carry out the remaining packet processing at the speed of the servers' cores, with zero inter-core communication.
Abstract: Deployment of 100 Gigabit Ethernet (GbE) links challenges the packet processing limits of commodity hardware used for Network Functions Virtualization (NFV). Moreover, realizing chained network functions (i.e., service chains) necessitates the use of multiple CPU cores, or even multiple servers, to process packets from such high-speed links. Our system, Metron, jointly exploits the underlying network and commodity servers' resources: (i) to offload part of the packet processing logic to the network, (ii) by using smart tagging to set up and exploit the affinity of traffic classes, and (iii) by using tag-based hardware dispatching to carry out the remaining packet processing at the speed of the servers' cores, with zero inter-core communication. Moreover, Metron transparently integrates, manages, and load balances proprietary "blackboxes" together with Metron service chains. Metron realizes stateful network functions at the speed of 100GbE network cards on a single server, while elastically and rapidly adapting to changing workload volumes. Our experiments demonstrate that Metron service chains can coexist with heterogeneous blackboxes, while still leveraging Metron's accurate dispatching and load balancing. In summary, Metron has (i) 2.75–8× better efficiency, (ii) up to 4.7× lower latency, and (iii) 7.8× higher throughput than OpenBox, a state-of-the-art NFV system.
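
A schematic of tag-based dispatching, assuming the network has already stamped each packet with a traffic-class tag (the table layout is an illustrative assumption): the tag alone selects the core, so a traffic class always lands on the same core and cores never hand packets to each other.

```python
def dispatch(packet, tag_table, num_cores):
    """Pick the processing core from the tag written by the programmable
    NIC/switch; the stable tag -> core mapping gives class-core affinity."""
    return tag_table[packet["tag"]] % num_cores

# Example: two traffic classes pinned to distinct cores.
tag_table = {0x01: 0, 0x02: 1}
assert dispatch({"tag": 0x01, "payload": b""}, tag_table, num_cores=2) == 0
assert dispatch({"tag": 0x02, "payload": b""}, tag_table, num_cores=2) == 1
```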

3 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose graph processing frameworks for distributed systems and Processing-in-Memory (PIM) architectures that precisely enforce loop-carried dependencies, i.e., when a condition is satisfied by a neighbor, all following neighbors can be skipped.
Abstract: To hide the complexity of the underlying system, graph processing frameworks ask programmers to specify graph computations in user-defined functions (UDFs) of a graph-oriented programming model. Due to the nature of distributed execution, current frameworks cannot precisely enforce the semantics of UDFs, leading to unnecessary computation and communication. This exemplifies a gap between the programming model and runtime execution. This article proposes novel graph processing frameworks for distributed systems and the Processing-in-Memory (PIM) architecture that precisely enforce loop-carried dependency; i.e., when a condition is satisfied by a neighbor, all following neighbors can be skipped. Our approach instruments the UDFs to express the loop-carried dependency; the distributed execution framework then enforces the precise semantics by performing dependency propagation dynamically. Enforcing loop-carried dependency requires sequential processing of the neighbors of each vertex distributed across different nodes. We propose circulant scheduling in the framework to allow different nodes to process disjoint sets of edges/vertices in parallel while satisfying the sequential requirement. The technique achieves an excellent trade-off between precise semantics and parallelism: the benefits of eliminating unnecessary computation and communication offset the reduced parallelism. We implement a new distributed graph processing framework, SympleGraph, and two variants of runtime systems, GraphS and GraphSR, for PIM-based graph processing architecture, which significantly outperform the state-of-the-art.
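
The loop-carried dependency in question is the kind sketched below: once one neighbour satisfies the condition, all following neighbours must be skipped, even when they live in other partitions. This is an illustrative sketch of the semantics, not the SympleGraph API; the second function shows how propagating the "done" flag lets whole remote slices be skipped.

```python
def process_vertex(v, neighbors, cond, apply):
    """Sequential semantics of the UDF: the break is the loop-carried
    dependency, so neighbour order matters."""
    for u in neighbors:
        if cond(u):
            apply(v, u)
            break  # all following neighbours must be skipped

def process_partitioned(v, partitions, cond, apply):
    """Dependency propagation: each partition handles its slice of v's
    neighbours only if no earlier partition already matched. Circulant
    scheduling rotates which node goes first so all nodes stay busy."""
    done = False
    for part in partitions:
        if done:
            break  # whole remote slices skipped: saved work and messages
        for u in part:
            if cond(u):
                apply(v, u)
                done = True
                break
    return done
```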

2 citations


Journal ArticleDOI
TL;DR: Graspan as mentioned in this paper is a disk-based parallel graph system that uses an edge-pair centric computation model to compute dynamic transitive closures on very large program graphs, which can be used to uncover many real-world bugs in large-scale systems code.
Abstract: There is more than a decade-long history of using static analysis to find bugs in systems such as Linux. Most of the existing static analyses developed for these systems are simple checkers that find bugs based on pattern matching. Despite the presence of many sophisticated interprocedural analyses, few of them have been employed to improve checkers for systems code due to their complex implementations and poor scalability. In this article, we revisit the scalability problem of interprocedural static analysis from a "Big Data" perspective. That is, we turn sophisticated code analysis into Big Data analytics and leverage novel data processing techniques to solve this traditional programming language problem. We propose Graspan, a disk-based parallel graph system that uses an edge-pair centric computation model to compute dynamic transitive closures on very large program graphs. We develop two backends for Graspan, namely Graspan-C, running on CPUs, and Graspan-G, running on GPUs, and present their designs in this article. Graspan-C can analyze large-scale systems code on any commodity PC, while, if GPUs are available, Graspan-G can be readily used to achieve orders-of-magnitude speedups by harnessing a GPU's massive parallelism. We have implemented fully context-sensitive pointer/alias and dataflow analyses on Graspan. An evaluation of these analyses on large codebases such as Linux and Apache Hadoop, written in different languages, demonstrates that their Graspan implementations are language-independent, scale to millions of lines of code, and are much simpler than their original implementations. Moreover, we show that these analyses can be used to uncover many real-world bugs in large-scale systems code.
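
The edge-pair centric model amounts to a worklist fixpoint that joins adjacent edges under grammar rules, as in context-free-language reachability. The in-memory sketch below is illustrative (Graspan's engine is disk-based and partitioned) but captures the core computation:

```python
def transitive_closure(edges, rules):
    """edges: set of (src, label, dst); rules: dict (l1, l2) -> l3,
    meaning an l1-edge followed by an l2-edge induces an l3-edge."""
    all_edges = set(edges)
    by_src, by_dst = {}, {}

    def index(e):
        by_src.setdefault(e[0], set()).add(e)
        by_dst.setdefault(e[2], set()).add(e)

    for e in all_edges:
        index(e)
    worklist = list(all_edges)
    while worklist:
        a, l1, b = worklist.pop()
        produced = []
        for _, l2, c in list(by_src.get(b, ())):   # (a,l1,b) . (b,l2,c)
            if (l3 := rules.get((l1, l2))):
                produced.append((a, l3, c))
        for z, l0, _ in list(by_dst.get(a, ())):   # (z,l0,a) . (a,l1,b)
            if (l3 := rules.get((l0, l1))):
                produced.append((z, l3, b))
        for e in produced:                         # add new edges until fixpoint
            if e not in all_edges:
                all_edges.add(e)
                index(e)
                worklist.append(e)
    return all_edges
```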

1 citation


Journal ArticleDOI
TL;DR: By specializing for AI applications, the authors show that a purpose-built edge data center can be designed for the stresses of accelerated AI at 15% lower TCO than one derived from homogeneous servers and infrastructure.
Abstract: Artificial intelligence and machine learning are experiencing widespread adoption in industry and academia. This has been driven by rapid advances in the applications and accuracy of AI through increasingly complex algorithms and models; this, in turn, has spurred research into specialized hardware AI accelerators. Given the rapid pace of advances, it is easy to forget that accelerators are often developed and evaluated in a vacuum, without considering the full application environment. This article emphasizes the need for a holistic, end-to-end analysis of artificial intelligence (AI) workloads and reveals the "AI tax." We deploy and characterize Face Recognition in an edge data center: an AI-centric edge video analytics application built using popular open source infrastructure and machine learning (ML) tools. Despite using state-of-the-art AI and ML algorithms, the application relies heavily on pre- and post-processing code. As AI-centric applications benefit from the acceleration promised by accelerators, we find that they impose stresses on the hardware and software infrastructure: storage and network bandwidth become major bottlenecks with increasing AI acceleration. By specializing for AI applications, we show that a purpose-built edge data center can be designed for the stresses of accelerated AI at 15% lower TCO than one derived from homogeneous servers and infrastructure.
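
The Amdahl-style effect behind the "AI tax" fits in a few lines; the stage timings below are illustrative assumptions, not the paper's measurements:

```python
def end_to_end_ms(decode=12.0, preprocess=8.0, inference=30.0,
                  postprocess=6.0, ai_speedup=1.0):
    """End-to-end latency of one frame: only the inference stage benefits
    from the AI accelerator; the pre/post stages are untouched."""
    return decode + preprocess + inference / ai_speedup + postprocess

baseline = end_to_end_ms()                  # 56 ms
accelerated = end_to_end_ms(ai_speedup=10)  # 29 ms with a 10x accelerator
print(baseline / accelerated)               # ~1.9x end to end, not 10x
```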

1 citation


Journal ArticleDOI
TL;DR: In this paper, the authors evaluate two state-of-the-art distributed management paradigms and, motivated by their drawbacks, propose a new one called Management Application (MA), along with a DMOM framework based on MA.
Abstract: Many-Core Systems-on-Chip increasingly require Dynamic Multi-objective Management (DMOM) of resources. DMOM uses different management components for objectives and resources to implement comprehensive and self-adaptive system resource management. DMOMs are challenging because they require a scalable and well-organized framework to make each component modular, allowing it to be instantiated or redesigned with limited impact on other components. This work evaluates two state-of-the-art distributed management paradigms and, motivated by their drawbacks, proposes a new one called Management Application (MA), along with a DMOM framework based on MA. An MA is a distributed application, specific to management, where each task implements a management role. This paradigm favors scalability and modularity because the management design assumes different and parallel modules, decoupled from the OS. An experiment with a task mapping case study shows that MA reduces management resource overhead (-61.5%), latency (-66%), and communication volume (-96%) compared to state-of-the-art per-application management. Compared to cluster-based management (CBM) implemented directly as part of the OS, MA is similar in resources and communication volume, increasing only the mapping latency (+16%). Results targeting a complete DMOM control loop addressing up to three different objectives show scalability with respect to system size and adaptation frequency compared to CBM, with an overall management latency reduction of 17.2% and an overall monitoring message latency reduction of 90.2%.
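
The paradigm can be pictured as ordinary message-passing tasks, each owning one management role and decoupled from the OS; the roles and message fields below are invented for illustration, not the paper's framework API:

```python
class MATask:
    """One task of the Management Application, implementing a single role
    (e.g. monitoring, mapping, actuation) and communicating by messages."""
    def __init__(self, role):
        self.role = role
        self.inbox = []

    def send(self, other, msg):
        other.inbox.append((self.role, msg))  # on silicon: a NoC message

monitor = MATask("monitoring")
mapper = MATask("mapping")
monitor.send(mapper, {"core": 3, "load": 0.92})  # observation drives a decision
overloaded = [m for _, m in mapper.inbox if m["load"] > 0.9]
```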