Showing papers in "arXiv: Performance in 2019"


Posted Content
TL;DR: This paper evaluates the performance and compares the results of all chipsets from Qualcomm, HiSilicon, Samsung, MediaTek and Unisoc that are providing hardware acceleration for AI inference and discusses the recent changes in the Android ML pipeline.
Abstract: The performance of mobile AI accelerators has been evolving rapidly in the past two years, nearly doubling with each new generation of SoCs. The current 4th generation of mobile NPUs is already approaching the results of CUDA-compatible Nvidia graphics cards presented not long ago, which together with the increased capabilities of mobile deep learning frameworks makes it possible to run complex and deep AI models on mobile devices. In this paper, we evaluate the performance and compare the results of all chipsets from Qualcomm, HiSilicon, Samsung, MediaTek and Unisoc that are providing hardware acceleration for AI inference. We also discuss the recent changes in the Android ML pipeline and provide an overview of the deployment of deep learning models on mobile devices. All numerical results provided in this paper can be found and are regularly updated on the official project website: this http URL.

88 citations


Posted Content
TL;DR: This paper presents the BenchCouncil's coordinated effort on edge AI benchmarks, named Edge AIBench, which models four typical application scenarios with the focus on data distribution and workload collaboration on three layers.
Abstract: In edge computing scenarios, the distribution of data and collaboration of workloads on different layers are serious concerns for performance, privacy, and security issues. So for edge computing benchmarking, we must take an end-to-end view, considering all three layers: client-side devices, edge computing layer, and cloud servers. Unfortunately, the previous work ignores this most important point. This paper presents the BenchCouncil's coordinated effort on edge AI benchmarks, named Edge AIBench. In total, Edge AIBench models four typical application scenarios: ICU Patient Monitor, Surveillance Camera, Smart Home, and Autonomous Vehicle with the focus on data distribution and workload collaboration on three layers. Edge AIBench is a part of the open-source AIBench project, publicly available from this http URL. We also build an edge computing testbed with a federated learning framework to resolve performance, privacy, and security issues.

52 citations


Proceedings ArticleDOI
TL;DR: In this article, the performance and power values are plotted on a scatter graph and a number of dimensions and observations from the trends on this plot are discussed and analyzed, including power consumption, numerical precision, and inference versus training.
Abstract: Advances in multicore processors and accelerators have opened the flood gates to greater exploration and application of machine learning techniques to a variety of applications. These advances, along with breakdowns of several trends including Moore's Law, have prompted an explosion of processors and accelerators that promise even greater computational and machine learning capabilities. These processors and accelerators are coming in many forms, from CPUs and GPUs to ASICs, FPGAs, and dataflow accelerators. This paper surveys the current state of these processors and accelerators that have been publicly announced with performance and power consumption numbers. The performance and power values are plotted on a scatter graph and a number of dimensions and observations from the trends on this plot are discussed and analyzed. For instance, there are interesting trends in the plot regarding power consumption, numerical precision, and inference versus training. We then select and benchmark two commercially-available low size, weight, and power (SWaP) accelerators as these processors are the most interesting for embedded and mobile machine learning inference applications that are most applicable to the DoD and other SWaP constrained users. We determine how they actually perform with real-world images and neural network models, compare those results to the reported performance and power consumption values and evaluate them against an Intel CPU that is used in some embedded applications.

46 citations


Proceedings ArticleDOI
TL;DR: It is observed that just having either resource-based or workload-based (request per second (RPS) or concurrent requests) auto-scaling is inadequate to address the needs of the serverless platforms.
Abstract: Serverless computing is increasingly popular because of the promise of lower cost and the convenience it provides to users who do not need to focus on server management. This has resulted in the availability of a number of proprietary and open-source serverless solutions. We seek to understand how the performance of serverless computing depends on a number of design issues using several popular open-source serverless platforms. We identify the idiosyncrasies affecting performance (throughput and latency) for different open-source serverless platforms. Further, we observe that just having either resource-based (CPU and memory) or workload-based (request per second (RPS) or concurrent requests) auto-scaling is inadequate to address the needs of the serverless platforms.

36 citations


Posted Content
TL;DR: In this paper, the authors investigate the impact of GPU Dynamic Voltage and Frequency Scaling (DVFS) on the energy consumption and performance of deep learning and find that DVFS has great potential to help develop energy-efficient DNN training/inference schemes.
Abstract: Over the past years, great progress has been made in improving the computing power of general-purpose graphics processing units (GPGPUs), which facilitates the prosperity of deep neural networks (DNNs) in multiple fields like computer vision and natural language processing. A typical DNN training process repeatedly updates tens of millions of parameters, which not only requires huge computing resources but also consumes significant energy. In order to train DNNs in a more energy-efficient way, we empirically investigate the impact of GPU Dynamic Voltage and Frequency Scaling (DVFS) on the energy consumption and performance of deep learning. Our experiments cover a wide range of GPU architectures, DVFS settings, and DNN configurations. We observe that, compared to the default core frequency settings of three tested GPUs, the optimal core frequency can help conserve 8.7% to 23.1% of energy consumption for different DNN training cases. Regarding inference, the benefits range from 19.6% to 26.4%. Our findings suggest that GPU DVFS has great potential to help develop energy-efficient DNN training/inference schemes.

36 citations


Posted Content
TL;DR: This paper surveys the state of the art of theoretical research on retrial queues, including exact solutions, stability, asymptotic analyses and multidimensional models, and gives an overview of retrial models arising from real-world applications.
Abstract: The retrial phenomenon naturally arises in various systems such as call centers, cellular networks and random access protocols in local area networks. This paper gives a comprehensive survey on theory and applications of retrial queues in these systems. We investigate the state of the art of theoretical research including exact solutions, stability, asymptotic analyses and multidimensional models. We present an overview on retrial models arising from real-world applications. Some open problems and promising research directions are also discussed.

26 citations


Posted Content
TL;DR: It is shown how big-data workloads suffer from significant slowdowns and lack predictability and replicability, even when state-of-the-art experimentation techniques are used, and guidelines for practitioners to reduce the volatility of big data performance are provided.
Abstract: Performance variability has been acknowledged as a problem for over a decade by cloud practitioners and performance engineers. Yet, our survey of top systems conferences reveals that the research community regularly disregards variability when running experiments in the cloud. Focusing on networks, we assess the impact of variability on cloud-based big-data workloads by gathering traces from mainstream commercial clouds and private research clouds. Our data collection consists of millions of datapoints gathered while transferring over 9 petabytes of data. We characterize the network variability present in our data and show that, even though commercial cloud providers implement mechanisms for quality-of-service enforcement, variability still occurs, and is even exacerbated by such mechanisms and service provider policies. We show how big-data workloads suffer from significant slowdowns and lack predictability and replicability, even when state-of-the-art experimentation techniques are used. We provide guidelines for practitioners to reduce the volatility of big data performance, making experiments more repeatable.

24 citations


Posted Content
TL;DR: This paper analyzes the average time necessary to download a block of data under the Poisson request arrival model in two service/scheduling scenarios and indicates that availability codes can reduce the download time in some settings, but are not always optimal.
Abstract: Availability codes have recently been proposed to facilitate efficient storage, management, and retrieval of frequently accessed data in distributed storage systems. Such codes provide multiple disjoint recovery groups for each data object, which makes it possible for multiple users to access the same object in a non-overlapping way. However in the presence of server-side performance variability, downloading an object using a recovery group takes longer than using a single server hosting the object. Therefore it is not immediately clear whether availability codes reduce latency to access hot data. Accordingly, the goal of this paper is to analyze, using a queuing theoretical approach, the download time in storage systems that employ availability codes. For data access, we consider the widely adopted Fork-Join model with redundancy. In this model, each request arrival splits into multiple copies and completes as soon as any one of the copies finishes service. We first carry out the analysis under the low-traffic regime in which case the system consists of at most one download request at any time. In this setting, we compare the download time in systems with availability, maximum distance separable (MDS), and replication codes. Our results indicate that availability codes can reduce download time in some settings, but are not always optimal. When the low-traffic assumption does not hold, system consists of multiple inter-dependent Fork-Join queues operating in parallel, which makes the exact analysis intractable. For this case we present upper and lower bounds on the download time. These bounds yield insight on system performance with respect to varying popularities over the stored objects. We also derive an M/G/1 queue approximation for the system, and show with simulations that it performs well in estimating the actual system performance.
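The low-traffic comparison described in this abstract can be illustrated with a small Monte Carlo sketch. It is not the paper's analysis: it assumes i.i.d. exponential service times with rate 1, a single request in the system, and hypothetical function names. A download finishes when either the server holding the object finishes or every server in any one recovery group finishes.

```python
import random


def download_time(rng, n_groups, group_size, rate=1.0):
    """One sample of the download time for a single request (low-traffic regime).

    Assumption: service times are i.i.d. exponential with the given rate.
    The request is forked to the systematic server and to n_groups disjoint
    recovery groups; a group is done only when all of its servers are done.
    """
    systematic = rng.expovariate(rate)
    groups = [
        max(rng.expovariate(rate) for _ in range(group_size))
        for _ in range(n_groups)
    ]
    # The request completes as soon as any one copy finishes.
    return min([systematic] + groups)


def mean_download_time(n_groups, group_size, trials=100_000, seed=0):
    """Monte Carlo estimate of the mean download time."""
    rng = random.Random(seed)
    total = sum(download_time(rng, n_groups, group_size) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    print("single server       :", mean_download_time(0, 1))
    print("2-way replication   :", mean_download_time(1, 1))
    print("1 group of 2 servers:", mean_download_time(1, 2))
```

Setting `group_size=1` reduces the model to plain replication, which makes it easy to compare the two schemes under the same service-time assumption.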

23 citations


Posted Content
TL;DR: An optimizing compiler, GraphZero, is proposed to completely address limitations through symmetry breaking based on group theory and demonstrates up to 40X performance improvement and up to 197X reduction on schedule generation overhead over AutoMine.
Abstract: Graph mining for structural patterns is a fundamental task in many applications. Compilation-based graph mining systems, represented by AutoMine, generate specialized algorithms for the provided patterns and substantially outperform other systems. However, the generated code causes substantial computation redundancy and the compilation process incurs too much overhead to be used online, both due to the inherent symmetry in the structural patterns. In this paper, we propose an optimizing compiler, GraphZero, to completely address these limitations through symmetry breaking based on group theory. GraphZero implements three novel techniques. First, its schedule explorer efficiently prunes the schedule space without missing any high-performance schedule. Second, it automatically generates and enforces a set of restrictions to eliminate computation redundancy. Third, it generalizes orientation, a surprisingly effective optimization that was mainly used for clique patterns, to apply to arbitrary patterns. Evaluated on multiple graph mining applications and complex patterns with 7 real-world graph datasets, GraphZero demonstrates up to 40X performance improvement and up to 197X reduction on schedule generation overhead over AutoMine.
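The redundancy that symmetry breaking removes can be seen on the simplest pattern, the triangle. The sketch below is only an illustration and is not GraphZero's generated code; the function names and the adjacency-set representation are assumptions. Without restrictions, every triangle is matched once per automorphism of the pattern (6 times); enforcing the ordering a < b < c on the matched vertices leaves exactly one match per triangle.

```python
def triangles_naive(adj):
    """Count triangle matches with no restrictions.

    adj maps each vertex to the set of its neighbours (undirected, no self-loops).
    Every triangle is found 6 times, once per ordering of its vertices.
    """
    count = 0
    for a in adj:
        for b in adj[a]:
            for c in adj[b]:
                if c != a and c in adj[a]:
                    count += 1
    return count


def triangles_restricted(adj):
    """Count triangles with the symmetry-breaking restriction a < b < c."""
    count = 0
    for a in adj:
        for b in adj[a]:
            if b <= a:
                continue  # restriction a < b prunes this branch early
            for c in adj[b]:
                if c > b and c in adj[a]:
                    count += 1
    return count


if __name__ == "__main__":
    # Complete graph on 4 vertices, which contains 4 triangles.
    k4 = {v: {u for u in range(4) if u != v} for v in range(4)}
    print(triangles_naive(k4), triangles_restricted(k4))
```

The restricted version also prunes whole branches of the search (the `continue`), so the saving is in work done and not only in duplicate results.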

21 citations


Posted Content
TL;DR: ALERT, the runtime scheduler, uses a probabilistic model to detect environmental volatility and then simultaneously select both a DNN and a system resource configuration to meet latency, accuracy, and energy constraints, and achieves more than 13% energy reduction, and 27% error reduction over prior approaches that adapt solely at the application or system level.
Abstract: An increasing number of software applications incorporate runtime Deep Neural Networks (DNNs) to process sensor data and return inference results to humans. Effective deployment of DNNs in these interactive scenarios requires meeting latency and accuracy constraints while minimizing energy, a problem exacerbated by common system dynamics. Prior approaches handle dynamics through either (1) system-oblivious DNN adaptation, which adjusts DNN latency/accuracy tradeoffs, or (2) application-oblivious system adaptation, which adjusts resources to change latency/energy tradeoffs. In contrast, this paper improves on the state-of-the-art by coordinating application- and system-level adaptation. ALERT, our runtime scheduler, uses a probabilistic model to detect environmental volatility and then simultaneously select both a DNN and a system resource configuration to meet latency, accuracy, and energy constraints. We evaluate ALERT on CPU and GPU platforms for image and speech tasks in dynamic environments. ALERT's holistic approach achieves more than 13% energy reduction, and 27% error reduction over prior approaches that adapt solely at the application or system level. Furthermore, ALERT incurs only 3% more energy consumption and 2% higher DNN-inference error than an oracle scheme with perfect application and system knowledge.

18 citations


Posted Content
TL;DR: LoadSpy, a whole-program profiler that pinpoints redundant memory load operations, which are often a symptom of many redundant operations in programs, is developed and used to optimize several well-known benchmarks and real-world applications, yielding significant speedups.
Abstract: Modern software packages have become increasingly complex with millions of lines of code and references to many external libraries. Redundant operations are a common performance limiter in these code bases. Missed compiler optimization opportunities, inappropriate data structure and algorithm choices, and developers' inattention to performance are some common reasons for the existence of redundant operations. Developers mainly depend on compilers to eliminate redundant operations. However, compilers' static analysis often misses optimization opportunities due to ambiguities and limited analysis scope; automatic optimizations to algorithmic and data structural problems are out of scope. We develop LoadSpy, a whole-program profiler to pinpoint redundant memory load operations, which are often a symptom of many redundant operations. The strength of LoadSpy lies in identifying and quantifying redundant load operations in programs and associating the redundancies with program execution contexts and scopes to focus developers' attention on problematic code. LoadSpy works on fully optimized binaries, adopts various optimization techniques to reduce its overhead, and provides a rich graphical user interface, which makes it a complete developer tool. Applying LoadSpy showed that a large fraction of redundant loads is common in modern software packages despite the highest levels of automatic compiler optimizations. Guided by LoadSpy, we optimize several well-known benchmarks and real-world applications, yielding significant speedups.

Proceedings ArticleDOI
TL;DR: This paper presents a performance evaluation of Docker and Singularity on bare metal nodes in the Chameleon cloud, a mechanism for mapping Docker containers to InfiniBand hardware with RDMA communication, and an analysis of mapping elements of parallel workloads to containers for optimal resource management with container-ready orchestration tools.
Abstract: The HPC community is actively researching and evaluating tools to support execution of scientific applications in cloud-based environments. Among the various technologies, containers have recently gained importance as they have significantly better performance compared to full-scale virtualization, support for microservices and DevOps, and work seamlessly with workflow and orchestration tools. Docker is currently the leader in containerization technology because it offers low overhead, flexibility, portability of applications, and reproducibility. Singularity is another container solution that is of interest as it is designed specifically for scientific applications. It is important to conduct performance and feature analysis of the container technologies to understand their applicability for each application and target execution environment. This paper presents a (1) performance evaluation of Docker and Singularity on bare metal nodes in the Chameleon cloud (2) mechanism by which Docker containers can be mapped with InfiniBand hardware with RDMA communication and (3) analysis of mapping elements of parallel workloads to the containers for optimal resource management with container-ready orchestration tools. Our experiments are targeted toward application developers so that they can make informed decisions on choosing the container technologies and approaches that are suitable for their HPC workloads on cloud infrastructure. Our performance analysis shows that scientific workloads for both Docker and Singularity based containers can achieve near-native performance. Singularity is designed specifically for HPC workloads. However, Docker still has advantages over Singularity for use in clouds as it provides overlay networking and an intuitive way to run MPI applications with one container per rank for fine-grained resources allocation.

Posted Content
TL;DR: A one-unit repairable system, supported by two identical spare units on cold standby, and serviced by two types of repairers is studied, finding the optimum number of repairs the expert should complete and the optimum patience time given to the regular repairer in order to maximize $\omega$.
Abstract: We study a one-unit repairable system, supported by two identical spare units on cold standby, and serviced by two types of repairers. The model applies, for instance, to ANSI (American National Standard Institute) centrifugal pumps in a chemical plant. The failed unit undergoes repair either by an in-house repairer within a random or deterministic patience time, or else by a visiting expert repairer. The expert repairs one or all failed units before leaving, and does so faster but at a higher cost rate than the regular repairer. Four models arise depending on the number of repairs done by the expert and the nature of the patience time. We compare these models based on the limiting availability $A_{\infty}$, and the limiting profit per unit time $\omega$, using semi-Markov processes, when all distributions are exponential. As anticipated, to maximize $A_{\infty}$, the expert should repair all failed units. To maximize $\omega$, a suitably chosen deterministic patience time is better than a random patience time. Furthermore, given all cost parameters, we determine the optimum number of repairs the expert should complete, and the optimum patience time given to the regular repairer in order to maximize $\omega$.

Proceedings ArticleDOI
TL;DR: nanoBench is a tool for evaluating small microbenchmarks using hardware performance counters on Intel and AMD x86 systems; it is specifically designed to evaluate small, isolated pieces of code.
Abstract: We present nanoBench, a tool for evaluating small microbenchmarks using hardware performance counters on Intel and AMD x86 systems. Most existing tools and libraries are intended to either benchmark entire programs, or program segments in the context of their execution within a larger program. In contrast, nanoBench is specifically designed to evaluate small, isolated pieces of code. Such code is common in microbenchmark-based hardware analysis techniques. Unlike previous tools, nanoBench can execute microbenchmarks directly in kernel space. This allows benchmarking of privileged instructions, and it enables more accurate measurements. The reading of the performance counters is implemented with minimal overhead, avoiding function calls and branches. As a consequence, nanoBench is precise enough to measure individual memory accesses. We illustrate the utility of nanoBench by means of two case studies. First, we briefly discuss how nanoBench has been used to determine the latency, throughput, and port usage of more than 13,000 instruction variants on recent x86 processors. Second, we show how to generate microbenchmarks to precisely characterize the cache architectures of eleven Intel Core microarchitectures. This includes the most comprehensive analysis of the employed cache replacement policies to date.

Posted Content
TL;DR: Scylla is presented, which integrates Mesos with Docker Swarm to enable orchestration of MPI jobs on a cluster of VMs acquired from the Chameleon cloud and uses Docker Swarm for communication between containerized tasks and Apache Mesos for resource pooling and allocation.
Abstract: Open source cloud technologies provide a wide range of support for creating customized compute node clusters to schedule tasks and managing resources. In cloud infrastructures such as Jetstream and Chameleon, which are used for scientific research, users receive complete control of the Virtual Machines (VM) that are allocated to them. Importantly, users get root access to the VMs. This provides an opportunity for HPC users to experiment with new resource management technologies such as Apache Mesos that have proven scalability, flexibility, and fault tolerance. To ease the development and deployment of HPC tools on the cloud, the containerization technology has matured and is gaining interest in the scientific community. In particular, several well known scientific code bases now have publicly available Docker containers. While Mesos provides support for Docker containers to execute individually, it does not provide support for container inter-communication or orchestration of the containers for a parallel or distributed application. In this paper, we present the design, implementation, and performance analysis of a Mesos framework, Scylla, which integrates Mesos with Docker Swarm to enable orchestration of MPI jobs on a cluster of VMs acquired from the Chameleon cloud [1]. Scylla uses Docker Swarm for communication between containerized tasks (MPI processes) and Apache Mesos for resource pooling and allocation. Scylla allows a policy-driven approach to determine how the containers should be distributed across the nodes depending on the CPU, memory, and network throughput requirement for each application.

Posted Content
TL;DR: This paper proposes the Temporal-Carry-Deferring MAC (TCD-MAC), which gains significant energy and performance benefits when processing a stream of input data, and uses it to build a reconfigurable, high-speed, low-power Neural Processing Engine.
Abstract: In this paper, we first propose the design of the Temporal-Carry-Deferring MAC (TCD-MAC) and illustrate how our proposed solution can gain significant energy and performance benefits when utilized to process a stream of input data. We then propose using the TCD-MAC to build a reconfigurable, high-speed, and low-power Neural Processing Engine (TCD-NPE). We further propose a novel scheduler that lists the sequence of needed processing events to process an MLP model in the least number of computational rounds in our proposed TCD-NPE. We illustrate that our proposed TCD-NPE significantly outperforms similar neural processing solutions that use conventional MACs in terms of both energy consumption and execution time.

Posted Content
TL;DR: This work uses simulation-based studies to consider variations of the supermarket model where service times for a customer are predicted, as might be done in modern settings using machine learning techniques or related mechanisms, and finds that in many settings simply using the number of jobs to choose a queue is better when predicted service times are used to order jobs in a queue.
Abstract: The supermarket model refers to a system with a large number of queues, where arriving customers choose $d$ queues at random and join the queue with the fewest customers. The supermarket model demonstrates the power of even small amounts of choice, as compared to simply joining a queue chosen uniformly at random, for load balancing systems. In this work we perform simulation-based studies to consider variations where service times for a customer are predicted, as might be done in modern settings using machine learning techniques or related mechanisms. Our primary takeaway is that using even seemingly weak predictions of service times can yield significant benefits over blind First In First Out queueing in this context. However, some care must be taken when using predicted service time information to both choose a queue and order elements for service within a queue; while in many cases using the information for both choosing and ordering is beneficial, in many of our simulation settings we find that simply using the number of jobs to choose a queue is better when using predicted service times to order jobs in a queue. Although this study is simulation based, our study leaves many natural theoretical open questions for future work.
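The baseline supermarket model described in the first sentences of this abstract can be sketched as a short event-driven simulation. It does not reproduce the paper's predicted-service-time variants. It assumes Poisson arrivals at rate `arrival_rate` per queue, exponential service with unit rate, candidate queues sampled with replacement, and hypothetical function and parameter names.

```python
import random


def simulate_supermarket(n_queues, arrival_rate, d, n_events, seed=0):
    """Return the time-averaged number of jobs per queue.

    Assumptions: Poisson arrivals with total rate arrival_rate * n_queues,
    exponential service with rate 1 at each busy queue, and each arrival
    joins the shortest of d queues sampled uniformly with replacement.
    """
    rng = random.Random(seed)
    queues = [0] * n_queues
    busy = 0          # number of non-empty queues
    total_jobs = 0    # jobs currently in the system
    area = 0.0        # time integral of total_jobs
    clock = 0.0
    lam_total = arrival_rate * n_queues

    for _ in range(n_events):
        rate = lam_total + busy           # total event rate
        dt = rng.expovariate(rate)        # time until the next event
        area += total_jobs * dt
        clock += dt
        if rng.random() < lam_total / rate:
            # Arrival: sample d queues and join the one with fewest jobs.
            candidates = [rng.randrange(n_queues) for _ in range(d)]
            target = min(candidates, key=lambda i: queues[i])
            if queues[target] == 0:
                busy += 1
            queues[target] += 1
            total_jobs += 1
        else:
            # Departure: pick a busy queue uniformly by rejection sampling.
            while True:
                i = rng.randrange(n_queues)
                if queues[i] > 0:
                    break
            queues[i] -= 1
            total_jobs -= 1
            if queues[i] == 0:
                busy -= 1

    return area / clock / n_queues


if __name__ == "__main__":
    for d in (1, 2):
        print(d, simulate_supermarket(50, 0.8, d, 100_000, seed=1))
```

Running it with `d=1` and `d=2` at the same load shows the effect the abstract calls the power of even small amounts of choice: the average queue length drops noticeably once each arrival compares two queues.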

Posted Content
TL;DR: A new algorithmic framework, \textsf{CR-Pursuit}, is presented, and it is proved that it achieves the minimal competitive ratio among all deterministic algorithms (up to a problem-dependent constant factor) for inventory-constrained online optimization.
Abstract: This paper studies online optimization under inventory (budget) constraints. While online optimization is a well-studied topic, versions with inventory constraints have proven difficult. We consider a formulation of inventory-constrained optimization that is a generalization of the classic one-way trading problem and has a wide range of applications. We present a new algorithmic framework, \textsf{CR-Pursuit}, and prove that it achieves the minimal competitive ratio among all deterministic algorithms (up to a problem-dependent constant factor) for inventory-constrained online optimization. Our algorithm and its analysis not only simplify and unify the state-of-the-art results for the standard one-way trading problem, but they also establish novel bounds for generalizations including concave revenue functions. For example, for one-way trading with price elasticity, the \textsf{CR-Pursuit} algorithm achieves a competitive ratio that is within a small additive constant (i.e., 1/3) to the lower bound of $\ln \theta+1$, where $\theta$ is the ratio between the maximum and minimum base prices.

Proceedings ArticleDOI
TL;DR: In this article, the authors investigate various traces collected from synthetic benchmarks that mimic real applications on simulated and real message-passing systems in order to pinpoint the mechanisms behind delay propagation.
Abstract: Analytic, first-principles performance modeling of distributed-memory applications is difficult due to a wide spectrum of random disturbances caused by the application and the system. These disturbances (commonly called "noise") destroy the assumptions of regularity that one usually employs when constructing simple analytic models. Despite numerous efforts to quantify, categorize, and reduce such effects, a comprehensive quantitative understanding of their performance impact is not available, especially for long delays that have global consequences for the parallel application. In this work, we investigate various traces collected from synthetic benchmarks that mimic real applications on simulated and real message-passing systems in order to pinpoint the mechanisms behind delay propagation. We analyze the dependence of the propagation speed of idle waves emanating from injected delays with respect to the execution and communication properties of the application, study how such delays decay under increased noise levels, and how they interact with each other. We also show how fine-grained noise can make a system immune against the adverse effects of propagating idle waves. Our results contribute to a better understanding of the collective phenomena that manifest themselves in distributed-memory parallel applications.

Posted Content
TL;DR: This study demonstrates the potential of transient servers with a speedup of 7.7X with more than 62.9% monetary savings for some cluster configurations, and identifies a number of important challenges and opportunities for redesigning distributed training frameworks to be transient-aware.
Abstract: Distributed training frameworks, like TensorFlow, have been proposed as a means to reduce the training time of deep learning models by using a cluster of GPU servers. While such speedups are often desirable---e.g., for rapidly evaluating new model designs---they often come with significantly higher monetary costs due to sublinear scalability. In this paper, we investigate the feasibility of using training clusters composed of cheaper transient GPU servers to get the benefits of distributed training without the high costs. We conduct the first large-scale empirical analysis, launching more than a thousand GPU servers of various capacities, aimed at understanding the characteristics of transient GPU servers and their impact on distributed training performance. Our study demonstrates the potential of transient servers with a speedup of 7.7X with more than 62.9% monetary savings for some cluster configurations. We also identify a number of important challenges and opportunities for redesigning distributed training frameworks to be transient-aware. For example, the dynamic cost and availability characteristics of transient servers suggest the need for frameworks to dynamically change cluster configurations to best take advantage of current conditions.

Journal ArticleDOI
Yoshiaki Inoue
TL;DR: This paper formulates GPU-based inference servers as a batch service queueing model with batch-size dependent processing times and derives a closed-form upper bound for the mean latency, which provides a simple characterization of the latency performance.
Abstract: GPU-accelerated computing is a key technology to realize high-speed inference servers using deep neural networks (DNNs). An important characteristic of GPU-based inference is that the computational efficiency, in terms of the processing speed and energy consumption, drastically increases by processing multiple jobs together in a batch. In this paper, we formulate GPU-based inference servers as a batch service queueing model with batch-size dependent processing times. We first show that the energy efficiency of the server monotonically increases with the arrival rate of inference jobs, which suggests that it is energy-efficient to operate the inference server under a utilization level as high as possible within a latency requirement of inference jobs. We then derive a closed-form upper bound for the mean latency, which provides a simple characterization of the latency performance. Through simulation and numerical experiments, we show that the exact value of the mean latency is well approximated by this upper bound. We further compare this upper bound with the latency curve measured in real implementation of GPU-based inference servers and we show that the real performance curve is well explained by the derived simple formula.

Posted Content
TL;DR: This paper summarizes the challenges of benchmarking modern workloads as FIDSS (Fragmented, Isolated, Dynamic, Service-based, and Stochastic) and proposes the PRDAERS benchmarking rules, under which benchmarks should, among other requirements, be specified in a paper-and-pencil manner.
Abstract: This paper outlines BenchCouncil's view on the challenges, rules, and vision of benchmarking modern workloads like Big Data, AI or machine learning, and Internet Services. We summarize the challenges of benchmarking modern workloads as FIDSS (Fragmented, Isolated, Dynamic, Service-based, and Stochastic), and propose the PRDAERS benchmarking rules: benchmarks should be specified in a paper-and-pencil manner, relevant, diverse, containing different levels of abstractions, specifying the evaluation metrics and methodology, repeatable, and scalable. We believe that proposing simple but elegant abstractions that help achieve both efficiency and generality is the final target of benchmarking in the future, although it may not be pressing. In light of this vision, we briefly discuss BenchCouncil's related projects.

Posted Content
TL;DR: In this paper, the authors propose a systematic approach to construct priority-aware analytical performance models using micro-architecture specifications and input traffic, which decomposes the given NoC into individual queues with modified service time to enable accurate and scalable latency computations.
Abstract: Networks-on-chip (NoCs) have become the standard for interconnect solutions in industrial designs ranging from client CPUs to many-core chip-multiprocessors. Since NoCs play a vital role in system performance and power consumption, pre-silicon evaluation environments include cycle-accurate NoC simulators. Long simulations increase the execution time of evaluation frameworks, which are already notoriously slow, and prohibit design-space exploration. Existing analytical NoC models, which assume fair arbitration, cannot replace these simulations since industrial NoCs typically employ priority schedulers and multiple priority classes. To address this limitation, we propose a systematic approach to construct priority-aware analytical performance models using micro-architecture specifications and input traffic. Our approach consists of developing two novel transformations of queuing systems and designing an algorithm that iteratively uses these two transformations to estimate end-to-end latency. Our approach decomposes the given NoC into individual queues with modified service times to enable accurate and scalable latency computations. Specifically, we introduce novel transformations along with an algorithm that iteratively applies these transformations to decompose the queuing system. Experimental evaluations using real architectures and applications show high accuracy of 97% and up to 2.5x speedup in full-system simulation.

Proceedings Article
TL;DR: In this article, the authors provide a comprehensive, memory-centric characterization of the SPEC CPU2017 benchmark suite, using a number of mechanisms including dynamic binary instrumentation, measurements on native hardware using hardware performance counters and OS based tools.
Abstract: In this paper we provide a comprehensive, memory-centric characterization of the SPEC CPU2017 benchmark suite, using a number of mechanisms including dynamic binary instrumentation, measurements on native hardware using hardware performance counters, and OS-based tools. We present a number of results including working set sizes, memory capacity consumption, and memory bandwidth utilization of various workloads. Our experiments reveal that the SPEC CPU2017 workloads are surprisingly memory intensive, with approximately 50% of all dynamic instructions being memory intensive ones. We also show that there is a large variation in the memory footprint and bandwidth utilization profiles of the entire suite, with some benchmarks using as much as 16 GB of main memory and up to 2.3 GB/s of memory bandwidth. We also perform instruction execution and distribution analysis of the suite and find that the average instruction count for SPEC CPU2017 workloads is an order of magnitude higher than for SPEC CPU2006 ones. In addition, we find that FP benchmarks of the SPEC 2017 suite have higher compute requirements: on average, FP workloads execute three times the number of compute operations as compared to INT workloads.

Posted Content
TL;DR: In this article, the Palm inversion formula is used to characterize the distribution of the Laplace transforms of the arrival process, message processing times and admission control in bufferless message processing systems.
Abstract: The idea behind the recently introduced "age of information" performance measure of a networked message processing system is that it indicates our knowledge regarding the "freshness" of the most recent piece of information that can be used as a criterion for real-time control. In this foundational paper, we examine two such measures, one that has been extensively studied in the recent literature and a new one that could be more relevant from the point of view of the processor. Considering these measures as stochastic processes in a stationary environment (defined by the arrival processes, message processing times and admission controls in bufferless systems), we characterize their distributions using the Palm inversion formula. Under renewal assumptions we derive explicit solutions for their Laplace transforms and show some interesting decomposition properties. Previous work has mostly focused on computation of expectations in very particular cases. We argue that using bufferless or very small buffer systems is best and support this by simulation. We also pose some open problems including assessment of enqueueing policies that may be better in cases where one wishes to minimize more general functionals of the age of information measures.
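The notion of age in a bufferless system can be made concrete with a short simulation. The sketch below is an illustration under stated assumptions, not the paper's analysis: arrivals are Poisson with rate `lam`, processing times are exponential with rate `mu`, a message that arrives while the processor is busy is dropped, and the age at time t is taken to be t minus the arrival time of the most recently completed message. The function name and this specific definition of age are assumptions.

```python
import random


def mean_age_bufferless(lam, mu, n_arrivals, seed=0):
    """Estimate the time-average age of information in a bufferless system.

    Assumed model: Poisson arrivals (rate lam), exponential processing times
    (rate mu), and no buffer, so a message arriving at a busy processor is
    discarded. Age at time t = t - arrival time of the last completed message.
    """
    rng = random.Random(seed)
    t = 0.0
    busy_until = 0.0
    deliveries = []  # (completion_time, arrival_time) of processed messages
    for _ in range(n_arrivals):
        t += rng.expovariate(lam)
        if t >= busy_until:
            # Processor is idle: admit the message and process it.
            busy_until = t + rng.expovariate(mu)
            deliveries.append((busy_until, t))
        # Otherwise the message is dropped (bufferless admission control).

    # Between two consecutive completions d0 < d1 the age is t - g0, where g0
    # is the arrival time of the message completed at d0. Integrate exactly.
    area = 0.0
    for (d0, g0), (d1, _) in zip(deliveries, deliveries[1:]):
        area += 0.5 * ((d1 - g0) ** 2 - (d0 - g0) ** 2)
    horizon = deliveries[-1][0] - deliveries[0][0]
    return area / horizon
```

The estimate averages the sawtooth age path over the interval between the first and last completions, so it only approximates the stationary mean for a finite number of arrivals. Other admission controls, such as preempting the message in service, would need a different update rule in the loop.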

Posted Content
TL;DR: In this paper, the authors focus on Hyperledger Fabric, the first blockchain in the market tailored for a private environment, allowing businesses to create a permissioned network, and demonstrate the negative impact of network delays on a PBFT-based blockchain.
Abstract: Blockchain has become one of the most attractive technologies for applications, with a large range of deployments such as production, economy, or banking. Under the hood, blockchain technology is a type of distributed database that supports untrusted parties. In this paper we focus on Hyperledger Fabric, the first blockchain in the market tailored for a private environment, allowing businesses to create a permissioned network. Hyperledger Fabric implements a PBFT consensus in order to maintain a non-forking blockchain at the application level. We deployed this framework over an area network between France and Germany in order to evaluate its performance when potentially large network delays are observed. Overall, we found that when the network delay increases significantly (i.e., up to 3.5 seconds at the network layer between two clouds), the blocks added to our blockchain had an offset of up to 134 seconds from one cloud to another after the 100th block. Thus, by delaying block propagation, we demonstrated that Hyperledger Fabric does not provide sufficient consistency guarantees to be deployed in critical environments. Our work is the first to provide evidence of the negative impact of network delays on a PBFT-based blockchain.

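Journal Article
TL;DR: The load balance in a system of $n$ nodes in which each object is stored at $d$ different nodes improves multiplicatively with $d$ as long as $d = o\left(\log(n)\right)$, and improves exponentially once $d = \Theta\left(\log(n)\right)$; the results follow from expressing the load balance in terms of the maximum of $d$ consecutive spacings between the order statistics of uniform random variables.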
Abstract: To facilitate load balancing, distributed systems store data redundantly. We evaluate the load balancing performance of storage schemes in which each object is stored at $d$ different nodes, and each node stores the same number of objects. In our model, the load offered for the objects is sampled uniformly at random from all the load vectors with a fixed cumulative value. We find that the load balance in a system of $n$ nodes improves multiplicatively with $d$ as long as $d = o\left(\log(n)\right)$, and improves exponentially once $d = \Theta\left(\log(n)\right)$. We show that the load balance improves in the same way with $d$ when the service choices are created with XORs of $r$ objects rather than object replicas. In such redundancy schemes, storage overhead is reduced multiplicatively by $r$. However, recovery of an object requires downloading content from $r$ nodes. At the same time, the load balance increases additively by $r$. We express the system's load balance in terms of the maximal spacing or maximum of $d$ consecutive spacings between the order statistics of uniform random variables. Using this connection and the limit results on the maximal $d$-spacings, we derive our main results.
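The quantity named at the end of this abstract, the maximum of $d$ consecutive spacings between uniform order statistics, is easy to compute numerically. The sketch below is only an illustration: the function names are hypothetical, windows do not wrap around, and the normalization in `normalized_peak` (largest $d$-window average divided by the overall average $1/n$) is an assumed imbalance indicator rather than the paper's load balance metric.

```python
import random


def spacings(n, rng):
    """Return the n spacings of n - 1 uniform points on [0, 1]; they sum to 1.

    Such a vector is uniformly distributed over all non-negative vectors with
    a fixed total, matching the load model described in the abstract.
    """
    points = sorted(rng.random() for _ in range(n - 1))
    edges = [0.0] + points + [1.0]
    return [b - a for a, b in zip(edges, edges[1:])]


def max_d_spacing(gaps, d):
    """Maximum sum of d consecutive spacings (windows do not wrap around)."""
    return max(sum(gaps[i:i + d]) for i in range(len(gaps) - d + 1))


def normalized_peak(n, d, seed=0):
    """Largest average over d consecutive spacings, relative to the mean 1/n.

    This normalization is an assumption of the sketch, chosen so that a value
    of 1 would correspond to a perfectly even load vector.
    """
    rng = random.Random(seed)
    return n * max_d_spacing(spacings(n, rng), d) / d
```

Evaluating `normalized_peak(n, d)` for a fixed `n` and increasing `d`, averaged over several seeds, gives a rough empirical view of how the peak evens out as more consecutive spacings are pooled.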

Proceedings Article
TL;DR: This work heavily extended OSACA to support ARM instructions and critical path prediction including the detection of loop-carried dependencies, which turns it into a versatile cross-architecture modeling tool.
Abstract: Useful models of loop kernel runtimes on out-of-order architectures require an analysis of the in-core performance behavior of instructions and their dependencies. While an instruction throughput prediction sets a lower bound to the kernel runtime, the critical path defines an upper bound. Such predictions are an essential part of analytic (i.e., white-box) performance models like the Roofline and Execution-Cache-Memory (ECM) models. They enable a better understanding of the performance-relevant interactions between hardware architecture and loop code. The Open Source Architecture Code Analyzer (OSACA) is a static analysis tool for predicting the execution time of sequential loops. It previously supported only x86 (Intel and AMD) architectures and simple, optimistic full-throughput execution. We have heavily extended OSACA to support ARM instructions and critical path prediction including the detection of loop-carried dependencies, which turns it into a versatile cross-architecture modeling tool. We show runtime predictions for code on Intel Cascade Lake, AMD Zen, and Marvell ThunderX2 micro-architectures based on machine models from available documentation and semi-automatic benchmarking. The predictions are compared with actual measurements.

Posted Content
TL;DR: It is proved that a bang-bang control is always optimal for this optimization problem, and the monotonicity and optimality of the long-run average profit of the data center with respect to the asynchronous dynamic policy under different service prices are characterized.
Abstract: In this paper, we use a Markov decision process to find the optimal asynchronous dynamic policy of an energy-efficient data center with two groups of heterogeneous servers, a finite buffer, and a fast setup process at the sleep state. Servers in Group 1 always work. Servers in Group 2 may either work or sleep, and a fast setup process occurs when a server's state changes from sleep to work. In such a data center, an asynchronous dynamic policy is designed as two sub-policies: the setup policy and the sleep policy, which determine the switch rule between the work and sleep states for the servers in Group 2. To analyze the optimal asynchronous dynamic policy, we apply the Markov decision process to establish a policy-based Poisson equation, which provides an expression for the unique solution of the performance potential by means of the RG-factorization. Based on this, we can characterize the monotonicity and optimality of the long-run average profit of the data center with respect to the asynchronous dynamic policy under different service prices. Furthermore, we prove that the bang-bang control is always optimal for this optimization problem, which supports a threshold-type dynamic control in the energy-efficient data center. We hope that the methodology and results derived in this paper can shed light on the study of more general energy-efficient data centers.

Posted Content
TL;DR: This analysis focuses on the tradeoffs between the total number of remote agents, the reliability of the remote support system, and the resulting safety of the driverless vehicles, and develops a numerical method to compute the exact staffing level needed to achieve various performance measures.
Abstract: Driverless vehicles promise a host of societal benefits including dramatically improved safety, increased accessibility, greater productivity, and higher quality of life. As this new technology approaches widespread deployment, both industry and government are making provisions for teleoperations systems, in which remote human agents provide assistance to driverless vehicles. This assistance can involve real-time remote operation and even ahead-of-time input via human-in-the-loop artificial intelligence systems. In this paper, we address the problem of staffing such a remote support center. Our analysis focuses on the tradeoffs between the total number of remote agents, the reliability of the remote support system, and the resulting safety of the driverless vehicles. By establishing a novel connection between queues with large batch arrivals and storage processes, we determine the probability of the system exceeding its service capacity. This connection drives our staffing methodology. We also develop a numerical method to compute the exact staffing level needed to achieve various performance measures. This moment generating function based technique may be of independent interest, and our overall staffing analysis may be of use in other applications that combine human expertise and automated systems.