scispace - formally typeset
Search or ask a question

Showing papers in "arXiv: Performance in 2018"


Journal ArticleDOI
TL;DR: In this paper, the stationary distribution of the AoI in information update systems is analyzed in terms of the stationary distributions of the system delay and the peak AoI for different service disciplines.
Abstract: This paper considers the stationary distribution of the age of information (AoI) in information update systems. We first derive a general formula for the stationary distribution of the AoI, which holds for a wide class of information update systems. The formula indicates that the stationary distribution of the AoI is given in terms of the stationary distributions of the system delay and the peak AoI. To demonstrate its applicability and usefulness, we analyze the AoI in single-server queues with four different service disciplines: first-come first-served (FCFS), preemptive last-come first-served (LCFS), and two variants of non-preemptive LCFS service disciplines. For the FCFS and the preemptive LCFS service disciplines, the GI/GI/1, M/GI/1, and GI/M/1 queues are considered, and for the non-preemptive LCFS service disciplines, the M/GI/1 and GI/M/1 queues are considered. With these results, we further show comparison results for the mean AoI's in the M/GI/1 and GI/M/1 queues under those service disciplines.

134 citations


Proceedings ArticleDOI
TL;DR: This paper is the first to try to broaden the class of algorithms expressible as TCU operations and is theFirst to show benefits of this mapping in terms of: program simplicity, efficiency, and performance.
Abstract: Driven by deep learning, there has been a surge of specialized processors for matrix multiplication, referred to as TensorCore Units (TCUs). These TCUs are capable of performing matrix multiplications on small matrices (usually 4x4 or 16x16) to accelerate the convolutional and recurrent neural networks in deep learning workloads. In this paper we leverage NVIDIA's TCU to express both reduction and scan with matrix multiplication and show the benefits -- in terms of program simplicity, efficiency, and performance. Our algorithm exercises the NVIDIA TCUs which would otherwise be idle, achieves 89%-98% of peak memory copy bandwidth, and is orders of magnitude faster (up to 100x for reduction and 3x for scan) than state-of-the-art methods for small segment sizes -- common in machine learning and scientific applications. Our algorithm achieves this while decreasing the power consumption by up to 22% for reduction and16%for scan.

42 citations


Proceedings ArticleDOI
TL;DR: This paper presents the design and implementation of a tool to construct faithful models of the latency, throughput, and port usage of x86 instructions, and introduces a more precise definition of latency that considers dependencies between different pairs of input and output operands.
Abstract: Modern microarchitectures are some of the world's most complex man-made systems. As a consequence, it is increasingly difficult to predict, explain, let alone optimize the performance of software running on such microarchitectures. As a basis for performance predictions and optimizations, we would need faithful models of their behavior, which are, unfortunately, seldom available. In this paper, we present the design and implementation of a tool to construct faithful models of the latency, throughput, and port usage of x86 instructions. To this end, we first discuss common notions of instruction throughput and port usage, and introduce a more precise definition of latency that, in contrast to previous definitions, considers dependencies between different pairs of input and output operands. We then develop novel algorithms to infer the latency, throughput, and port usage based on automatically-generated microbenchmarks that are more accurate and precise than existing work. To facilitate the rapid construction of optimizing compilers and tools for performance prediction, the output of our tool is provided in a machine-readable format. We provide experimental results for processors of all generations of Intel's Core architecture, i.e., from Nehalem to Coffee Lake, and discuss various cases where the output of our tool differs considerably from prior work.

38 citations


Posted Content
TL;DR: This work studies the performance of the LL(d) policy which assigns a job to a server that currently has the least workload among d randomly selected servers in large-scale homogeneous clusters and proposes a partial integro-differential equation to describe the evolution of the system.
Abstract: Motivated by distributed schedulers that combine the power-of-d-choices with late binding and systems that use replication with cancellation-on-start, we study the performance of the LL(d) policy which assigns a job to a server that currently has the least workload among d randomly selected servers in large-scale homogeneous clusters. We consider general service time distributions and propose a partial integro-differential equation to describe the evolution of the system. This equation relies on the earlier proven ansatz for LL(d) which asserts that the workload distribution of any finite set of queues becomes independent of one another as the number of servers tends to infinity. Based on this equation we propose a fixed point iteration for the limiting workload distribution and study its convergence. For exponential job sizes we present a simple closed form expression for the limiting workload distribution that is valid for any work-conserving service discipline as well as for the limiting response time distribution in case of first-come-first-served scheduling. We further show that for phase-type distributed job sizes the limiting workload and response time distribution can be expressed via the unique solution of a simple set of ordinary differential equations. Numerical and analytical results that compare response time of the classic power-of-d-choices algorithm and the LL(d) policy are also presented and the accuracy of the limiting response time distribution for finite systems is illustrated using simulation.

26 citations


Posted Content
TL;DR: In this article, the authors propose an algorithm for multi-stage big data processing platforms to adaptively determine and cache the most valuable intermediate datasets that can be reused in the future.
Abstract: In the era of big data and cloud computing, large amounts of data are generated from user applications and need to be processed in the datacenter. Data-parallel computing frameworks, such as Apache Spark, are widely used to perform such data processing at scale. Specifically, Spark leverages distributed memory to cache the intermediate results, represented as Resilient Distributed Datasets (RDDs). This gives Spark an advantage over other parallel frameworks for implementations of iterative machine learning and data mining algorithms, by avoiding repeated computation or hard disk accesses to retrieve RDDs. By default, caching decisions are left at the programmer's discretion, and the LRU policy is used for evicting RDDs when the cache is full. However, when the objective is to minimize total work, LRU is woefully inadequate, leading to arbitrarily suboptimal caching decisions. In this paper, we design an algorithm for multi-stage big data processing platforms to adaptively determine and cache the most valuable intermediate datasets that can be reused in the future. Our solution automates the decision of which RDDs to cache: this amounts to identifying nodes in a direct acyclic graph (DAG) representing computations whose outputs should persist in the memory. Our experiment results show that our proposed cache optimization solution can improve the performance of machine learning applications on Spark decreasing the total work to recompute RDDs by 12%.

26 citations


Journal ArticleDOI
TL;DR: A time series compression algorithm that achieves state-of-the-art compression ratios while requiring less than 1KB of memory and adding virtually no latency is introduced, suitable not only for low-power devices collecting data, but also for servers storing and querying data.
Abstract: Thanks to the rapid proliferation of connected devices, sensor-generated time series constitute a large and growing portion of the world's data. Often, this data is collected from distributed, resource-constrained devices and centralized at one or more servers. A key challenge in this setup is reducing the size of the transmitted data without sacrificing its quality. Lower quality reduces the data's utility, but smaller size enables both reduced network and storage costs at the servers and reduced power consumption in sensing devices. A natural solution is to compress the data at the sensing devices. Unfortunately, existing compression algorithms either violate the memory and latency constraints common for these devices or, as we show experimentally, perform poorly on sensor-generated time series. We introduce a time series compression algorithm that achieves state-of-the-art compression ratios while requiring less than 1KB of memory and adding virtually no latency. This method is suitable not only for low-power devices collecting data, but also for servers storing and querying data; in the latter context, it can decompress at over 3GB/s in a single thread, even faster than many algorithms with much lower compression ratios. A key component of our method is a high-speed forecasting algorithm that can be trained online and significantly outperforms alternatives such as delta coding. Extensive experiments on datasets from many domains show that these results hold not only for sensor data but also across a wide array of other time series.

24 citations


Proceedings ArticleDOI
TL;DR: The Open Source Architecture Code Analyzer (OSACA) as mentioned in this paper is a static analysis tool for predicting the execution time of sequential loops comprising x86 instructions under the assumption of an infinite first-level cache and perfect out-of-order scheduling.
Abstract: An accurate prediction of scheduling and execution of instruction streams is a necessary prerequisite for predicting the in-core performance behavior of throughput-bound loop kernels on out-of-order processor architectures. Such predictions are an indispensable component of analytical performance models, such as the Roofline and the Execution-Cache-Memory (ECM) model, and allow a deep understanding of the performance-relevant interactions between hardware architecture and loop code. We present the Open Source Architecture Code Analyzer (OSACA), a static analysis tool for predicting the execution time of sequential loops comprising x86 instructions under the assumption of an infinite first-level cache and perfect out-of-order scheduling. We show the process of building a machine model from available documentation and semi-automatic benchmarking, and carry it out for the latest Intel Skylake and AMD Zen micro-architectures. To validate the constructed models, we apply them to several assembly kernels and compare runtime predictions with actual measurements. Finally we give an outlook on how the method may be generalized to new architectures.

20 citations


Posted Content
TL;DR: This work evaluates the impact of compiler pass sequences specialization (also known as compiler phase ordering) as a means to reduce the energy consumed by a set of programs/functions when comparing with the use of the standard compiler phase orders provided by, e.g., -OX flags.
Abstract: Compiler writers typically focus primarily on the performance of the generated program binaries when selecting the passes and the order in which they are applied in the standard optimization levels, such as GCC -O3. In some domains, such as embedded systems and High-Performance Computing (HPC), it might be sometimes acceptable to slowdown computations if the energy consumed can be significantly decreased. Embedded systems often rely on a battery and besides energy also have power dissipation limitations, while HPC centers have a growing concern with electricity and cooling costs. Relying on power policies to apply frequency/voltage scaling and/or change the CPU to idle states (e.g., alternate between power levels in bursts) as the main method to reduce energy leaves potential for improvement using other orthogonal approaches. In this work we evaluate the impact of compiler pass sequences specialization (also known as compiler phase ordering) as a means to reduce the energy consumed by a set of programs/functions when comparing with the use of the standard compiler phase orders provided by, e.g., -OX flags. We use our phase selection and ordering framework to explore the design space in the context of a Clang+LLVM compiler targeting a multicore ARM processor in an ODROID board and a dual x86 desktop representative of a node in a Supercomputing center. Our experiments with a set of representative kernels show that there we can reduce energy consumption by up to 24% and that some of these improvements can only be partially explained by improvements to execution time. The experiments show cases where applications that run faster consume more energy. Additionally, we make an effort to characterize the compiler sequence exploration space in terms of their impact on performance and energy.

17 citations


Posted Content
TL;DR: The results show that although some specific functions can have execution times decreased by as much as 74%, the majority of individual metrics indicates little to no decrease in performance.
Abstract: In this work we examine how the updates addressing Meltdown and Spectre vulnerabilities impact the performance of HPC applications. To study this we use the application kernel module of XDMoD to test the performance before and after the application of the vulnerability patches. We tested the performance difference for multiple application and benchmarks including: NWChem, NAMD, HPCC, IOR, MDTest and IMB. The results show that although some specific functions can have execution times decreased by as much as 74%, the majority of individual metrics indicates little to no decrease in performance. The real-world applications show a 2-3% decrease in performance for single node jobs and a 5-11% decrease for parallel multi node jobs.

16 citations


Posted Content
TL;DR: A Roofline performance model is used to analyze the three highly optimized implementations of convolutional neural networks on modern multi-- and many--core CPUs and explains why, and under what conditions, the FFT--based implementations outperform the Winograd--based one, on modern CPUs.
Abstract: Winograd-based convolution has quickly gained traction as a preferred approach to implement convolutional neural networks (ConvNet) on various hardware platforms because it requires fewer floating point operations than FFT-based or direct convolutions This paper compares three highly optimized implementations (regular FFT--, Gauss--FFT--, and Winograd--based convolutions) on modern multi-- and many--core CPUs Although all three implementations employed the same optimizations for modern CPUs, our experimental results with two popular ConvNets (VGG and AlexNet) show that the FFT--based implementations generally outperform the Winograd--based approach, contrary to the popular belief To understand the results, we use a Roofline performance model to analyze the three implementations in detail, by looking at each of their computation phases and by considering not only the number of floating point operations, but also the memory bandwidth and the cache sizes The performance analysis explains why, and under what conditions, the FFT--based implementations outperform the Winograd--based one, on modern CPUs

15 citations


Posted Content
TL;DR: A way to compute classical Softmax with fewer memory accesses is proposed and it is hypothesized that this reduction in memoryAccesses should improve Softmax performance on actual hardware.
Abstract: The Softmax function is ubiquitous in machine learning, multiple previous works suggested faster alternatives for it. In this paper we propose a way to compute classical Softmax with fewer memory accesses and hypothesize that this reduction in memory accesses should improve Softmax performance on actual hardware. The benchmarks confirm this hypothesis: Softmax accelerates by up to 1.3x and Softmax+TopK combined and fused by up to 5x.

Book ChapterDOI
TL;DR: In this article, the authors present refinements to the execution-cache-memory performance model and a previously published power model for multicore processors, which enables a very accurate prediction of performance and energy consumption of contemporary multi-core processors as a function of relevant parameters such as number of active cores and core and uncore frequencies.
Abstract: This paper presents refinements to the execution-cache-memory performance model and a previously published power model for multicore processors. The combination of both enables a very accurate prediction of performance and energy consumption of contemporary multicore processors as a function of relevant parameters such as number of active cores as well as core and Uncore frequencies. Model validation is performed on the Sandy Bridge-EP and Broadwell-EP microarchitectures. Production-related variations in chip quality are demonstrated through a statistical analysis of the fit parameters obtained on one hundred Broadwell-EP CPUs of the same model. Insights from the models are used to explain the performance- and energy-related behavior of the processors for scalable as well as saturating (i.e., memory-bound) codes. In the process we demonstrate the models' capability to identify optimal operating points with respect to highest performance, lowest energy-to-solution, and lowest energy-delay product and identify a set of best practices for energy-efficient execution.

Posted Content
TL;DR: It is argued that the time until a Bitcoin transaction is confirmed can be modelled as a particular stochastic fluid queueing process (to be precise: a Cramer-Lundberg process), which becomes increasingly accurate as traffic intensifies.
Abstract: We study the probabilistic distribution of the confirmation time of Bitcoin transactions, conditional on the current memory pool (i.e., the queue of transactions awaiting confirmation). The results of this paper are particularly interesting for users that want to make a Bitcoin transaction during `heavy-traffic situations', when the transaction demand exceeds the block capacity. In such situations, Bitcoin users tend to bid up the transaction fees, in order to gain priority over other users that pay a lower fee. We argue that the time until a Bitcoin transaction is confirmed can be modelled as a particular stochastic fluid queueing process (to be precise: a Cramer-Lundberg process). We approximate the queueing process in two different ways. The first approach leads to a lower bound on the confirmation probability, which becomes increasingly tight as traffic decreases. The second approach relies on a diffusion approximation with a continuity correction, which becomes increasingly accurate as traffic intensifies. The accuracy of the approximations under different traffic loads are evaluated in a simulation study.

Posted Content
TL;DR: Hard, using two NVMe disks per node and a distributed file system for caching, achieves a 2.1x speed-up over a 10Gb/s NFS central storage system on a 16 GPU (4 nodes, 4 GPUs per node) cluster for a challenging AlexNet ImageNet image classification benchmark.
Abstract: Deep Learning system architects strive to design a balanced system where the computational accelerator -- FPGA, GPU, etc, is not starved for data. Feeding training data fast enough to effectively keep the accelerator utilization high is difficult when utilizing dedicated hardware like GPUs. As accelerators are getting faster, the storage media \& data buses feeding the data have not kept pace and the ever increasing size of training data further compounds the problem. We describe the design and implementation of a distributed caching system called Hoard that stripes the data across fast local disks of multiple GPU nodes using a distributed file system that efficiently feeds the data to ensure minimal degradation in GPU utilization due to I/O starvation. Hoard can cache the data from a central storage system before the start of the job or during the initial execution of the job and feeds the cached data for subsequent epochs of the same job and for different invocations of the jobs that share the same data requirements, e.g. hyper-parameter tuning. Hoard exposes a POSIX file system interface so the existing deep learning frameworks can take advantage of the cache without any modifications. We show that Hoard, using two NVMe disks per node and a distributed file system for caching, achieves a 2.1x speed-up over a 10Gb/s NFS central storage system on a 16 GPU (4 nodes, 4 GPUs per node) cluster for a challenging AlexNet ImageNet image classification benchmark with 150GB of input dataset. As a result of the caching, Hoard eliminates the I/O bottlenecks introduced by the shared storage and increases the utilization of the system by 2x compared to using the shared storage without the cache.

Journal ArticleDOI
TL;DR: This measurement study provides in-depth empirical comparison and analysis of four representative DL frameworks, and offers practical guidance for service providers to deploying and delivering DL as a Service (DLaaS) and for application developers and DLaaS consumers to select the rightDL frameworks for the right DL workloads.
Abstract: Big data powered Deep Learning (DL) and its applications have blossomed in recent years, fueled by three technological trends: a large amount of digitized data openly accessible, a growing number of DL software frameworks in open source and commercial markets, and a selection of affordable parallel computing hardware devices. However, no single DL framework, to date, dominates in terms of performance and accuracy even for baseline classification tasks on standard datasets, making the selection of a DL framework an overwhelming task. This paper takes a holistic approach to conduct empirical comparison and analysis of four representative DL frameworks with three unique contributions. First, given a selection of CPU-GPU configurations, we show that for a specific DL framework, different configurations of its hyper-parameters may have a significant impact on both performance and accuracy of DL applications. Second, to the best of our knowledge, this study is the first to identify the opportunities for improving the training time performance and the accuracy of DL frameworks by configuring parallel computing libraries and tuning individual and multiple hyper-parameters. Third, we also conduct a comparative measurement study on the resource consumption patterns of four DL frameworks and their performance and accuracy implications, including CPU and memory usage, and their correlations to varying settings of hyper-parameters under different configuration combinations of hardware, parallel computing libraries. We argue that this measurement study provides in-depth empirical comparison and analysis of four representative DL frameworks, and offers practical guidance for service providers to deploying and delivering DL as a Service (DLaaS) and for application developers and DLaaS consumers to select the right DL frameworks for the right DL workloads.

Posted Content
TL;DR: This paper presents an adaptive scheme to determine which DNN model to use for a given input, by considering the desired accuracy and inference time, and considers a range of influential DNN models.
Abstract: The recent ground-breaking advances in deep learning networks ( DNNs ) make them attractive for embedded systems. However, it can take a long time for DNNs to make an inference on resource-limited embedded devices. Offloading the computation into the cloud is often infeasible due to privacy concerns, high latency, or the lack of connectivity. As such, there is a critical need to find a way to effectively execute the DNN models locally on the devices. This paper presents an adaptive scheme to determine which DNN model to use for a given input, by considering the desired accuracy and inference time. Our approach employs machine learning to develop a predictive model to quickly select a pre-trained DNN to use for a given input and the optimization constraint. We achieve this by first training off-line a predictive model, and then use the learnt model to select a DNN model to use for new, unseen inputs. We apply our approach to the image classification task and evaluate it on a Jetson TX2 embedded deep learning platform using the ImageNet ILSVRC 2012 validation dataset. We consider a range of influential DNN models. Experimental results show that our approach achieves a 7.52% improvement in inference accuracy, and a 1.8x reduction in inference time over the most-capable single DNN model.

Posted Content
TL;DR: An optimal algorithm for scheduling multistage jobs in an M/G/1 queue is given and an exact response time analysis of the algorithm is given.
Abstract: Scheduling to minimize mean response time in an M/G/1 queue is a classic problem. The problem is usually addressed in one of two scenarios. In the perfect-information scenario, the scheduler knows each job's exact size, or service requirement. In the zero-information scenario, the scheduler knows only each job's size distribution. The well-known shortest remaining processing time (SRPT) policy is optimal in the perfect-information scenario, and the more complex Gittins policy is optimal in the zero-information scenario. In real systems the scheduler often has partial but incomplete information about each job's size. We introduce a new job model, that of multistage jobs, to capture this partial-information scenario. A multistage job consists of a sequence of stages, where both the sequence of stages and stage sizes are unknown, but the scheduler always knows which stage of a job is in progress. We give an optimal algorithm for scheduling multistage jobs in an M/G/1 queue and an exact response time analysis of our algorithm.

Posted Content
TL;DR: This work considers a connection-level model proposed by Massoulié and Roberts for bandwidth sharing among file transfer flows in a communication network and develops a novel inner product in the state space that allows for a multiplicative type of state-space collapse in steady state.
Abstract: We consider a connection-level model proposed by Massouli\'{e} and Roberts for bandwidth sharing among file transfer flows in a communication network. We study weighted proportionally fair sharing policies and establish explicit-form bounds on the weighted sum of the expected numbers of flows on different routes in heavy traffic. The bounds are linear in the number of critically loaded links in the network, and they hold for a class of phase-type file-size distributions; i.e., the bounds are heavy-traffic insensitive to the distributions in this class. Our approach is Lyapunov-drift based, which is different from the widely used diffusion approximation approach. A key technique we develop is to construct a novel inner product in the state space, which then allows us to obtain a multiplicative type of state-space collapse in steady state. Furthermore, this state-space collapse result implies the interchange of limits as a by-product for the diffusion approximation of the equal-weight case under phase-type file-size distributions, demonstrating the heavy-traffic insensitivity of the stationary distribution.

Posted Content
TL;DR: In this paper, the authors propose a novel neighborhood method for constrained shortest path (SSP) management in virtual networks, which uses novel search space reduction techniques and has a theoretical quadratic speed-up making it practically faster than recent branch-and-bound exhaustive search solutions.
Abstract: Virtual network services that span multiple data centers are important to support emerging data-intensive applications in fields such as bioinformatics and retail analytics. Successful virtual network service composition and maintenance requires flexible and scalable 'constrained shortest path management' both in the management plane for virtual network embedding (VNE) or network function virtualization service chaining (NFV-SC), as well as in the data plane for traffic engineering (TE). In this paper, we show analytically and empirically that leveraging constrained shortest paths within recent VNE, NFV-SC and TE algorithms can lead to network utilization gains (of up to 50%) and higher energy efficiency. The management of complex VNE, NFV-SC and TE algorithms can be, however, intractable for large scale substrate networks due to the NP-hardness of the constrained shortest path problem. To address such scalability challenges, we propose a novel, exact constrained shortest path algorithm viz., 'Neighborhoods Method' (NM). Our NM uses novel search space reduction techniques and has a theoretical quadratic speed-up making it practically faster (by an order of magnitude) than recent branch-and-bound exhaustive search solutions. Finally, we detail our NM-based SDN controller implementation in a real-world testbed to further validate practical NM benefits for virtual network services.

Posted Content
TL;DR: In this paper, the authors consider learning-based variants of the $c \mu$ rule for scheduling in single and parallel server settings of multi-class queueing systems, and show that the greedy algorithm that uses empirically learned service rates results in a constant holding-cost regret.
Abstract: We consider learning-based variants of the $c \mu$ rule for scheduling in single and parallel server settings of multi-class queueing systems. In the single server setting, the $c \mu$ rule is known to minimize the expected holding-cost (weighted queue-lengths summed over classes and a fixed time horizon). We focus on the problem where the service rates $\mu$ are unknown with the holding-cost regret (regret against the $c \mu$ rule with known $\mu$) as our objective. We show that the greedy algorithm that uses empirically learned service rates results in a constant holding-cost regret (the regret is independent of the time horizon). This free exploration can be explained in the single server setting by the fact that any work-conserving policy obtains the same number of samples in a busy cycle. In the parallel server setting, we show that the $c \mu$ rule may result in unstable queues, even for arrival rates within the capacity region. We then present sufficient conditions for geometric ergodicity under the $c \mu$ rule. Using these results, we propose an almost greedy algorithm that explores only when the number of samples falls below a threshold. We show that this algorithm delivers constant holding-cost regret because a free exploration condition is eventually satisfied.

Proceedings ArticleDOI
TL;DR: In this paper, the authors present benchmark data for Intel Memory Drive Technology (IMDT), which is a new generation of Software-defined Memory (SDM) based on Intel ScaleMP collaboration and using 3D XPointTM based Intel Solid-State Drives (SSDs) called Optane.
Abstract: In this paper, we present benchmark data for Intel Memory Drive Technology (IMDT), which is a new generation of Software-defined Memory (SDM) based on Intel ScaleMP collaboration and using 3D XPointTM based Intel Solid-State Drives (SSDs) called Optane. We studied IMDT performance for synthetic benchmarks, scientific kernels, and applications. We chose these benchmarks to represent different patterns for computation and accessing data on disks and memory. To put performance of IMDT in comparison, we used two memory configurations: hybrid IMDT DDR4/Optane and DDR4 only systems. The performance was measured as a percentage of used memory and analyzed in detail. We found that for some applications DDR4/Optane hybrid configuration outperforms DDR4 setup by up to 20%.

Proceedings ArticleDOI
TL;DR: In this paper, the authors show that by performing fewer optimizations available in a standard compiler optimization level such as -O2, while preserving their original ordering, significant savings can be achieved in both execution time and energy consumption.
Abstract: This paper presents the interesting observation that by performing fewer of the optimizations available in a standard compiler optimization level such as -O2, while preserving their original ordering, significant savings can be achieved in both execution time and energy consumption. This observation has been validated on two embedded processors, namely the ARM Cortex-M0 and the ARM Cortex-M3, using two different versions of the LLVM compilation framework; v3.8 and v5.0. Experimental evaluation with 71 embedded benchmarks demonstrated performance gains for at least half of the benchmarks for both processors. An average execution time reduction of 2.4% and 5.3% was achieved across all the benchmarks for the Cortex-M0 and Cortex-M3 processors, respectively, with execution time improvements ranging from 1% up to 90% over the -O2. The savings that can be achieved are in the same range as what can be achieved by the state-of-the-art compilation approaches that use iterative compilation or machine learning to select flags or to determine phase orderings that result in more efficient code. In contrast to these time consuming and expensive to apply techniques, our approach only needs to test a limited number of optimization configurations, less than 64, to obtain similar or even better savings. Furthermore, our approach can support multi-criteria optimization as it targets execution time, energy consumption and code size at the same time.

Posted ContentDOI
TL;DR: A lightweight implementation of random shuffling (LIRS) is proposed to randomly shuffle the indexes of the entire training dataset, and the selected training instances are directly accessed from the storage and packed into batches.
Abstract: Machine learning algorithms, such as Support Vector Machine (SVM) and Deep Neural Network (DNN), have gained a lot of interests recently. When training a machine learning algorithm, randomly shuffle all the training data can improve the testing accuracy and boost the convergence rate. Nevertheless, realizing training data random shuffling in a real system is not a straightforward process due to the slow random accesses in hard disk drive (HDD). To avoid frequent random disk access, the effect of random shuffling is often limited in existing approaches. With the emerging non-volatile memory-based storage device, such as Intel Optane SSD, which provides fast random accesses, we propose a lightweight implementation of random shuffling (LIRS) to randomly shuffle the indexes of the entire training dataset, and the selected training instances are directly accessed from the storage and packed into batches. Experimental results show that LIRS can reduce the total training time of SVM and DNN by 49.9% and 43.5% on average, and improve the final testing accuracy on DNN by 1.01%.

Journal ArticleDOI
TL;DR: In this paper, the authors propose an approach based on inherently distributed data formats such as HDF5 to avoid long latencies when storing the data and to support fast (random) access when retrieving the data for visual processing.
Abstract: More and more massive parallel codes running on several hundreds of thousands of cores enter the computational science and engineering domain, allowing high-fidelity computations on up to trillions of unknowns for very detailed analyses of the underlying problems. During such runs, typically gigabytes of data are being produced, hindering both efficient storage and (interactive) data exploration. Here, advanced approaches based on inherently distributed data formats such as HDF5 become necessary in order to avoid long latencies when storing the data and to support fast (random) access when retrieving the data for visual processing. Avoiding file locking and using collective buffering, write bandwidths to a single file close to the theoretical peak on a modern supercomputing cluster were achieved. The structure of the output file supports a very fast interactive visualisation and introduces additional steering functionality.

Posted Content
TL;DR: The asymptotic miss ratio of data item requests on a LRU cluster with consistent hashing is derived and it is shown that these individual cache spaces on different servers can be effectively viewed as if they could be pooled together to form a single virtual LRU cache space parametrized by an appropriate cache size.
Abstract: To efficiently scale data caching infrastructure to support emerging big data applications, many caching systems rely on consistent hashing to group a large number of servers to form a cooperative cluster. These servers are organized together according to a random hash function. They jointly provide a unified but distributed hash table to serve swift and voluminous data item requests. Different from the single least-recently-used (LRU) server that has already been extensively studied, theoretically characterizing a cluster that consists of multiple LRU servers remains yet to be explored. These servers are not simply added together; the random hashing complicates the behavior. To this end, we derive the asymptotic miss ratio of data item requests on a LRU cluster with consistent hashing. We show that these individual cache spaces on different servers can be effectively viewed as if they could be pooled together to form a single virtual LRU cache space parametrized by an appropriate cache size. This equivalence can be established rigorously under the condition that the cache sizes of the individual servers are large. For typical data caching systems this condition is common. Our theoretical framework provides a convenient abstraction that can directly apply the results from the simpler single LRU cache to the more complex LRU cluster with consistent hashing.

Posted Content
TL;DR: In this article, a systematic literature review for identifying the existing works discussing how to measure process performance based on information retrieved from event logs is presented, and a framework for categorizing and selecting performance analysis approaches based on existing research is introduced.
Abstract: Process mining has gained traction over the past decade and an impressive body of research has resulted in the introduction of a variety of process mining approaches measuring process performance. Having this set of techniques available, organizations might find it difficult to identify which approach is best suited considering context, performance indicator, and data availability. In light of this challenge, this paper aims at introducing a framework for categorizing and selecting performance analysis approaches based on existing research. We start from a systematic literature review for identifying the existing works discussing how to measure process performance based on information retrieved from event logs. Then, the proposed framework is built starting from the information retrieved from these studies taking into consideration different aspects of performance analysis.

Posted Content
TL;DR: It is proved that the policy is optimal under deterministic service time and binomial reward distribution, and the policy processes the maximal number of packets while collecting rewards higher than the expected reward.
Abstract: In this paper we consider a real time queuing system with rewards and deadlines. We assume that packet processing time is known upon arrival, as is the case in communication networks. This assumption allows us to demonstrate that the well known Earliest-Deadline-First policy performance can be improved. We then propose a scheduling policy that provides excellent results for packets with rewards and deadlines. We prove that the policy is optimal under deterministic service time and binomial reward distribution. In the more general case we prove that the policy processes the maximal number of packets while collecting rewards higher than the expected reward. We present simulation results that show its high performance in more generic cases compared to the most commonly used scheduling policies.

Posted Content
TL;DR: A new adaptive framework for data-driven applications which uses a predictive model to select the optimal algorithmic parameters by training with synthetic and real datasets is presented and the effectiveness of a BLAS library is demonstrated.
Abstract: Efficient high-performance libraries often expose multiple tunable parameters to provide highly optimized routines. These can range from simple loop unroll factors or vector sizes all the way to algorithmic changes, given that some implementations can be more suitable for certain devices by exploiting hardware characteristics such as local memories and vector units. Traditionally, such parameters and algorithmic choices are tuned and then hard-coded for a specific architecture and for certain characteristics of the inputs. However, emerging applications are often data-driven, thus traditional approaches are not effective across the wide range of inputs and architectures used in practice. In this paper, we present a new adaptive framework for data-driven applications which uses a predictive model to select the optimal algorithmic parameters by training with synthetic and real datasets. We demonstrate the effectiveness of a BLAS library and specifically on its matrix multiplication routine. We present experimental results for two GPU architectures and show significant performance gains of up to 3x (on a high-end NVIDIA Pascal GPU) and 2.5x (on an embedded ARM Mali GPU) when compared to a traditionally optimized library.

Posted Content
TL;DR: In this article, simple monotonicity arguments can be used to prove global attraction for a broad class of ordinary differential equations that capture the evolution of mean field models with hyperexponential job sizes.
Abstract: Mean field modeling is a popular approach to assess the performance of large scale computer systems. The evolution of many mean field models is characterized by a set of ordinary differential equations that have a unique fixed point. In order to prove that this unique fixed point corresponds to the limit of the stationary measures of the finite systems, the unique fixed point must be a global attractor. While global attraction was established for various systems in case of exponential job sizes, it is often unclear whether these proof techniques can be generalized to non-exponential job sizes. In this paper we show how simple monotonicity arguments can be used to prove global attraction for a broad class of ordinary differential equations that capture the evolution of mean field models with hyperexponential job sizes. This class includes both existing as well as previously unstudied load balancing schemes and can be used for systems with either finite or infinite buffers. The main novelty of the approach exists in using a Coxian representation for the hyperexponential job sizes and a partial order that is stronger than the componentwise partial order used in the exponential case.

Journal ArticleDOI
TL;DR: In this article, the authors proposed a hybrid I/O caching architecture that uses DRAM, Read-Optimized SSD, and Write-optimized SSD in a three-level cache hierarchy and tries to efficiently redirect read requests to either DRAM or RO-SSD while sending writes to WO-SSDs.
Abstract: SSDs are emerging storage devices which unlike HDDs, do not have mechanical parts and therefore, have superior performance compared to HDDs. Due to the high cost of SSDs, entirely replacing HDDs with SSDs is not economically justified. Additionally, SSDs can endure a limited number of writes before failing. To mitigate the shortcomings of SSDs while taking advantage of their high performance, SSD caching is practiced in both academia and industry. Previously proposed caching architectures have only focused on either performance or endurance and neglected to address both parameters in suggested architectures. Moreover, the cost, reliability, and power consumption of such architectures is not evaluated. This paper proposes a hybrid I/O caching architecture that while offers higher performance than previous studies, it also improves power consumption with a similar budget. The proposed architecture uses DRAM, Read-Optimized SSD, and Write-Optimized SSD in a three-level cache hierarchy and tries to efficiently redirect read requests to either DRAM or RO-SSD while sending writes to WO-SSD. To provide high reliability, dirty pages are written to at least two devices which removes any single point of failure. The power consumption is also managed by reducing the number of accesses issued to SSDs. The proposed architecture reconfigures itself between performance- and endurance-optimized policies based on the workload characteristics to maintain an effective tradeoff between performance and endurance. We have implemented the proposed architecture on a server equipped with industrial SSDs and HDDs. The experimental results show that as compared to state-of-the-art studies, the proposed architecture improves performance and power consumption by an average of 8% and 28%, respectively, and reduces the cost by 5% while increasing the endurance cost by 4.7% and negligible reliability penalty.