
Showing papers on "Latency (engineering)" published in 2016


Proceedings ArticleDOI
14 Jun 2016
TL;DR: Flexible-LatencY DRAM (FLY-DRAM) is proposed, a mechanism that exploits latency variation across DRAM cells within a DRAM chip to improve system performance: it exploits the spatial locality of slower cells within DRAM and accesses the faster DRAM regions with reduced latencies for the fundamental operations.
Abstract: Long DRAM latency is a critical performance bottleneck in current systems. DRAM access latency is defined by three fundamental operations that take place within the DRAM cell array: (i) activation of a memory row, which opens the row to perform accesses; (ii) precharge, which prepares the cell array for the next memory access; and (iii) restoration of the row, which restores the values of cells in the row that were destroyed due to activation. There is significant latency variation for each of these operations across the cells of a single DRAM chip due to irregularity in the manufacturing process. As a result, some cells are inherently faster to access, while others are inherently slower. Unfortunately, existing systems do not exploit this variation. The goal of this work is to (i) experimentally characterize and understand the latency variation across cells within a DRAM chip for these three fundamental DRAM operations, and (ii) develop new mechanisms that exploit our understanding of the latency variation to reliably improve performance. To this end, we comprehensively characterize 240 DRAM chips from three major vendors, and make several new observations about latency variation within DRAM. We find that (i) there is large latency variation across the cells for each of the three operations; (ii) variation characteristics exhibit significant spatial locality: slower cells are clustered in certain regions of a DRAM chip; and (iii) the three fundamental operations exhibit different reliability characteristics when the latency of each operation is reduced. Based on our observations, we propose Flexible-LatencY DRAM (FLY-DRAM), a mechanism that exploits latency variation across DRAM cells within a DRAM chip to improve system performance. The key idea of FLY-DRAM is to exploit the spatial locality of slower cells within DRAM, and access the faster DRAM regions with reduced latencies for the fundamental operations. Our evaluations show that FLY-DRAM improves the performance of a wide range of applications by 13.3%, 17.6%, and 19.5%, on average, for each of the three different vendors' real DRAM chips, in a simulated 8-core system. We conclude that the experimental characterization and analysis of latency variation within modern DRAM, provided by this work, can lead to new techniques that improve DRAM and system performance.
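As a rough illustration of the key idea, the sketch below shows how a memory controller could consult a per-region timing profile, built by prior characterization, and apply reduced timing parameters only to regions known to hold fast cells. The region granularity, timing values, and the `profile_lookup` helper are illustrative assumptions, not the paper's actual design.

```python
# Hypothetical sketch of FLY-DRAM's key idea: the memory controller keeps a
# per-region timing profile (built by offline characterization) and issues
# reduced timing parameters for regions known to contain only fast cells.
DEFAULT_TIMINGS = {"tRCD": 13.75, "tRP": 13.75, "tRAS": 35.0}   # ns, standard
REDUCED_TIMINGS = {"tRCD": 10.0,  "tRP": 10.0,  "tRAS": 27.5}   # ns, fast regions

REGION_BITS = 12  # map each 4096-row region to a "fast"/"slow" label

def profile_lookup(fast_regions: set, row_addr: int) -> dict:
    """Return the timing set to use for an access to row_addr."""
    region = row_addr >> REGION_BITS
    return REDUCED_TIMINGS if region in fast_regions else DEFAULT_TIMINGS

# Usage: regions 0 and 3 were characterized as reliably fast.
fast = {0, 3}
print(profile_lookup(fast, row_addr=0x0042))   # region 0 -> reduced timings
print(profile_lookup(fast, row_addr=0x2042))   # region 2 -> default timings
```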

203 citations


Journal ArticleDOI
TL;DR: A broad survey of techniques aimed at tackling latency in the literature up to August 2014 is offered; classifying techniques according to the sources of delay they alleviate is found to provide the best insight into the issues involved.
Abstract: Latency is increasingly becoming a performance bottleneck for Internet Protocol (IP) networks, but historically, networks have been designed with aims of maximizing throughput and utilization. This paper offers a broad survey of techniques aimed at tackling latency in the literature up to August 2014, as well as their merits. A goal of this work is to be able to quantify and compare the merits of the different Internet latency reducing techniques, contrasting their gains in delay reduction versus the pain required to implement and deploy them. We found that classifying techniques according to the sources of delay they alleviate provided the best insight into the following issues: 1) The structural arrangement of a network, such as placement of servers and suboptimal routes, can contribute significantly to latency; 2) each interaction between communicating endpoints adds a Round Trip Time (RTT) to latency, particularly significant for short flows; 3) in addition to base propagation delay, several sources of delay accumulate along transmission paths, today intermittently dominated by queuing delays; 4) it takes time to sense and use available capacity, with overuse inflicting latency on other flows sharing the capacity; and 5) within end systems, delay sources include operating system buffering, head-of-line blocking, and hardware interaction. No single source of delay dominates in all cases, and many of these sources are spasmodic and highly variable. Solutions addressing these sources often both reduce the overall latency and make it more predictable.

176 citations


Proceedings ArticleDOI
12 Mar 2016
TL;DR: This work develops a low-cost mechanism, called ChargeCache, that enables faster access to recently-accessed rows in DRAM, with no modifications to DRAM chips, based on the key observation that a recently-accessed row has more charge, so a subsequent access to the same row can be performed faster.

Abstract: DRAM latency continues to be a critical bottleneck for system performance. In this work, we develop a low-cost mechanism, called ChargeCache, that enables faster access to recently-accessed rows in DRAM, with no modifications to DRAM chips. Our mechanism is based on the key observation that a recently-accessed row has more charge and thus the following access to the same row can be performed faster. To exploit this observation, we propose to track the addresses of recently-accessed rows in a table in the memory controller. If a later DRAM request hits in that table, the memory controller uses lower timing parameters, leading to reduced DRAM latency. Row addresses are removed from the table after a specified duration to ensure rows that have leaked too much charge are not accessed with lower latency. We evaluate ChargeCache on a wide variety of workloads and show that it provides significant performance and energy benefits for both single-core and multi-core systems.
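A minimal sketch of this tracking scheme, with illustrative capacity and caching-window values (the paper's actual parameters and timings differ):

```python
import time
from collections import OrderedDict

class ChargeCache:
    """Remember recently-activated rows; a hit within the window means the
    row's cells are still highly charged, so lowered timings may be used."""

    def __init__(self, capacity=128, window_s=1e-3):
        self.capacity = capacity      # entries per memory controller
        self.window_s = window_s      # how long a row stays "highly charged"
        self.table = OrderedDict()    # row address -> last activation time

    def access(self, row_addr, now=None):
        """Record an activation; return True if lowered timings may be used."""
        now = time.monotonic() if now is None else now
        hit = (row_addr in self.table and
               now - self.table[row_addr] <= self.window_s)
        self.table[row_addr] = now
        self.table.move_to_end(row_addr)
        if len(self.table) > self.capacity:
            self.table.popitem(last=False)   # evict least-recently used
        return hit

cc = ChargeCache()
cc.access(0x1A2B)            # miss: first activation, default timings
print(cc.access(0x1A2B))     # hit within window -> True, lowered timings
```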

167 citations


Proceedings ArticleDOI
10 May 2016
TL;DR: This paper presents an algorithm that uses a Kalman filter to estimate the end-to-end one-way delay variation experienced by packets traveling from a sender to a destination; the estimate is compared to an adaptive threshold to dynamically throttle the sending rate.

Abstract: Video conferencing applications require low latency and high bandwidth. Standard TCP is not suitable for video conferencing since its reliability and in-order delivery mechanisms induce large latency. Recently, the idea of using the delay gradient to infer congestion has resurfaced and is gaining momentum. In this paper we present an algorithm that is based on estimating, through a Kalman filter, the end-to-end one-way delay variation experienced by packets traveling from a sender to a destination. This estimate is compared to an adaptive threshold to dynamically throttle the sending rate. The control algorithm has been implemented over the RTP/RTCP protocol and is currently used in Google Hangouts and in the Chrome WebRTC stack. Experiments have been carried out to evaluate the algorithm performance in the case of variable link capacity, presence of heterogeneous or homogeneous concurrent traffic, and backward path traffic.
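A toy sketch of the delay-gradient mechanism: a scalar Kalman filter smooths the measured one-way delay variation, and the filtered estimate is compared against a threshold to produce the overuse signal that drives the rate controller. The gains, fixed threshold, and state names are illustrative assumptions; the paper's filter and adaptive threshold are more elaborate.

```python
class DelayGradientEstimator:
    """Scalar Kalman filter over one-way delay variation samples."""

    def __init__(self, q=1e-3, r=0.1):
        self.m = 0.0              # estimated delay variation m(t)
        self.var = 1.0            # estimate variance
        self.q, self.r = q, r     # process / measurement noise

    def update(self, measured_dm):
        """Fold in one delay-variation sample (ms); return the estimate."""
        self.var += self.q
        k = self.var / (self.var + self.r)   # Kalman gain
        self.m += k * (measured_dm - self.m)
        self.var *= (1 - k)
        return self.m

def overuse_signal(m, threshold):
    if m > threshold:  return "overuse"    # queue building: decrease rate
    if m < -threshold: return "underuse"   # queue draining: hold rate
    return "normal"                        # room available: increase rate

est = DelayGradientEstimator()
for sample in [0.2, 0.5, 1.1, 1.8, 2.4]:   # ms, rising one-way delay
    m = est.update(sample)
print(overuse_signal(m, threshold=1.0))     # -> "overuse"
```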

140 citations


Journal ArticleDOI
TL;DR: This review focuses on new studies suggesting that distinct sites of cellular latency could exist in the human host, which argues for multiple latent phenotypes that could impact differently on the biology of this virus in vivo.
Abstract: Human cytomegalovirus (HCMV) infection remains a major cause of morbidity in patient populations. In certain clinical settings, it is the reactivation of the pre-existing latent infection in the host that poses the health risk. The prevailing view of HCMV latency was that the virus was essentially quiescent in myeloid progenitor cells and that terminal differentiation resulted in the initiation of the lytic lifecycle and reactivation of infectious virus. However, our understanding of HCMV latency and reactivation at the molecular level has been greatly enhanced through recent advancements in systems biology approaches to perform global analyses of both experimental and natural latency. These approaches, in concert with more classical reductionist experimentation, are furnishing researchers with new concepts in cytomegalovirus latency and suggest that latent infection is far more active than first thought. In this review, we will focus on new studies that suggest that distinct sites of cellular latency could exist in the human host, which, when coupled with recent observations that report different transcriptional programmes within cells of the myeloid lineage, argues for multiple latent phenotypes that could impact differently on the biology of this virus in vivo. Finally, we will also consider how the biology of the host cell where the latent infection persists further contributes to the concept of a spectrum of latent phenotypes in multiple cell types that can be exploited by the virus.

140 citations


Journal ArticleDOI
TL;DR: This paper performs an analytical study of the latency performance of redundant requests, with the primary goals of characterizing under what scenarios sending redundant requests will help (and under what scenarios it will not), and of designing optimal redundant-requesting policies.
Abstract: Many systems possess the flexibility to serve requests in more than one way, such as distributed storage systems that store multiple copies of the data. In such systems, the latency of serving the requests may potentially be reduced by sending redundant requests : a request may be sent to more servers than needed and deemed served when the requisite number of servers complete service. Such a mechanism trades off the possibility of faster execution of the request with the increase in the load on the system. Several recent works empirically evaluate the latency performance of redundant requests in diverse settings. In this paper, we perform an analytical study of the latency performance of redundant requests, with the primary goals of characterizing under what scenarios sending redundant requests will help (and under what scenarios it will not), and of designing optimal redundant-requesting policies. We show that when service times are i.i.d. memoryless or “heavier,” and when the additional copies of already-completed jobs can be removed instantly, maximally scheduling redundant requests achieves the optimal average latency. On the other hand, when service times are i.i.d. “lighter” or when service times are memoryless and removal of jobs is not instantaneous, then not having any redundancy in the requests is optimal under high loads. Our results are applicable to arbitrary arrival processes.
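The memoryless case can be illustrated with a small Monte Carlo sketch: with i.i.d. exponential service times and instant removal of extra copies, the completion time of a request sent to r servers is the minimum of r exponentials, whose mean shrinks as 1/r. This deliberately ignores the added load on the system, which is exactly the trade-off the paper analyzes.

```python
import random

def avg_latency(r, mu=1.0, trials=100_000):
    """Average completion time when the fastest of r copies wins."""
    return sum(min(random.expovariate(mu) for _ in range(r))
               for _ in range(trials)) / trials

# The minimum of r exponentials with rate mu is exponential with rate r*mu,
# so the mean should track 1/r when mu = 1.
for r in (1, 2, 4):
    print(f"copies={r}: avg latency ~ {avg_latency(r):.3f}  (theory {1/r:.3f})")
```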

131 citations


Proceedings ArticleDOI
01 Sep 2016
TL;DR: This paper proposes an integer form of stochastic computation and introduces some elementary circuits, proposes an efficient implementation of a DNN based on integral stochastic computing, and considers a quasi-synchronous implementation that yields a 33% reduction in energy consumption with respect to the binary radix implementation without any compromise on performance.
Abstract: The hardware implementation of deep neural networks (DNNs) has recently received tremendous attention since many applications require high-speed operations. However, numerous processing elements and complex interconnections are usually required, leading to a large area occupation and a high power consumption. Stochastic computing has shown promising results for area-efficient hardware implementations, even though existing stochastic algorithms require long streams that exhibit long latency. In this paper, we propose an integer form of stochastic computation and introduce some elementary circuits. We then propose an efficient implementation of a DNN based on integral stochastic computing. The proposed architecture uses integer stochastic streams and a modified Finite State Machine-based tanh function to improve the performance and reduce the latency compared to existing stochastic architectures for DNN. The simulation results show the negligible performance loss of the proposed integer stochastic DNN for different network sizes compared to their floating point versions.
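A toy sketch of the integer-stream encoding (illustrative only, not the paper's circuits): a value in [0, m] is carried by a stream of small integers whose mean equals the value, here formed by summing m conventional binary stochastic streams. Elementwise products of independent streams then decode to the product of the encoded values, using a wider range per symbol than single-bit stochastic computing.

```python
import random

def int_stream(x, m, n):
    """Encode x in [0, m] as n integers, each the sum of m Bernoulli bits."""
    p = x / m
    return [sum(random.random() < p for _ in range(m)) for _ in range(n)]

N = 10_000
a = int_stream(1.5, m=2, n=N)   # encodes 1.5 with 2-bit integer symbols
b = int_stream(0.5, m=2, n=N)   # encodes 0.5

# For independent streams, E[a_i * b_i] = E[a_i] * E[b_i] = 1.5 * 0.5.
prod = sum(x * y for x, y in zip(a, b)) / N
print(f"decoded product ~ {prod:.3f} (exact 0.75)")
```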

129 citations


01 Jan 2016
TL;DR: The Process and Effects of Mass Communication is universally compatible with any device, and online access to it is set as public so you can download it instantly.

Abstract: The Process and Effects of Mass Communication is available in our book collection; online access to it is set as public so you can download it instantly. Our book servers host it in multiple locations, allowing you to get the lowest latency time to download any of our books, like this one. Merely said, The Process and Effects of Mass Communication is universally compatible with any device to read.

128 citations


Journal ArticleDOI
TL;DR: This paper proposes a novel active skeleton representation towards low latency human action recognition that is robust in calculating features related to joint positions, and effective in handling unsegmented sequences.

Abstract: With the development of depth sensors, low latency 3D human action recognition has become increasingly important in various interaction systems, where response with minimal latency is a critical process. High latency not only significantly degrades the interaction experience of users, but also makes certain interaction systems, e.g., gesture control or electronic gaming, unattractive. In this paper, we propose a novel active skeleton representation towards low latency human action recognition. First, we encode each limb of the human skeleton into a state through a Markov random field. The active skeleton is then represented by aggregating the encoded features of individual limbs. Finally, we propose a multi-channel multiple instance learning with maximum-pattern-margin to further boost the performance of the existing model. Our method is robust in calculating features related to joint positions, and effective in handling unsegmented sequences. Experiments on the MSR Action3D, the MSR DailyActivity3D, and the Huawei/3DLife-2013 dataset demonstrate the effectiveness of the model with the proposed novel representation, and its superiority over the state-of-the-art low latency recognition approaches.

92 citations


Journal ArticleDOI
TL;DR: In this paper, an upper bound on the average service delay of erasure-coded storage with arbitrary service time distributions and multiple heterogeneous files is provided, which enables a novel problem of joint latency and storage cost minimization over three dimensions: selecting the erasure code, placing encoded chunks, and optimizing the scheduling policy.
Abstract: Modern distributed storage systems offer large capacity to satisfy the exponentially increasing need of storage space. They often use erasure codes to protect against disk and node failures to increase reliability, while trying to meet the latency requirements of the applications and clients. This paper provides an insightful upper bound on the average service delay of such erasure-coded storage with arbitrary service time distribution and consisting of multiple heterogeneous files. Not only does the result supersede known delay bounds that only work for a single file or homogeneous files, it also enables a novel problem of joint latency and storage cost minimization over three dimensions: selecting the erasure code, placement of encoded chunks, and optimizing scheduling policy. The problem is efficiently solved via the computation of a sequence of convex approximations with provable convergence. We further prototype our solution in an open-source cloud storage deployment over three geographically distributed data centers. Experimental results validate our theoretical delay analysis and show significant latency reduction, providing valuable insights into the proposed latency-cost tradeoff in erasure-coded storage.
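To build intuition for why coding affects latency, the Monte Carlo sketch below models an (n, k) coded read in isolation: the file is recovered once the fastest k of n chunk requests complete, i.e., after the k-th order statistic of the chunk service times. Queuing and load, which the paper's bound does account for, are deliberately left out here.

```python
import random

def read_latency(n, k, mu=1.0):
    """Latency of one (n, k) coded read with i.i.d. exponential chunk times."""
    times = sorted(random.expovariate(mu) for _ in range(n))
    return times[k - 1]          # k-th fastest chunk completes the read

trials = 100_000
for n, k in [(4, 4), (6, 4), (8, 4)]:
    avg = sum(read_latency(n, k) for _ in range(trials)) / trials
    print(f"(n={n}, k={k}): avg read latency ~ {avg:.3f}")
# More parity chunks (larger n for fixed k) cut the read latency,
# at the price of extra storage and extra requests in flight.
```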

90 citations


Journal ArticleDOI
TL;DR: Here, the use of pharmacological agents to reverse HIV-1 latency is explored as a therapeutic strategy towards a cure; clinical trials of latency-reversing agents (LRAs) have demonstrated their ability to increase production of latent HIV-1, but such interventions have not had an effect on the size of the latent virus reservoir.

Proceedings ArticleDOI
20 Jun 2016
TL;DR: The design and deployment of WiFiSeer, a framework to measure and characterize WiFi latency at large scale, is presented; the measurement results quantitatively confirm some anecdotal perceptions about impacting factors and disprove others.

Abstract: WiFi latency is a key factor impacting the user experience of modern mobile applications, but it has not been well studied at large scale. In this paper, we design and deploy WiFiSeer, a framework to measure and characterize WiFi latency at large scale. WiFiSeer comprises a systematic methodology for modeling the complex relationships between WiFi latency and a diverse set of WiFi performance metrics, device characteristics, and environmental factors. WiFiSeer was deployed on Tsinghua campus to conduct a WiFi latency measurement study of unprecedented scale with more than 47,000 unique user devices. We observe that WiFi latency follows a long tail distribution and the 90th (99th) percentile is around 20 ms (250 ms). Furthermore, our measurement results quantitatively confirm some anecdotal perceptions about impacting factors and disprove others. We deploy three practical solutions for improving WiFi latency in Tsinghua, and the results show significantly improved WiFi latencies. In particular, over 1,000 devices use our AP selection service based on a predictive WiFi latency model for 2.5 months, and 72% of their latencies are reduced by over half after they re-associate to the suggested APs.

Journal ArticleDOI
18 Jun 2016
TL;DR: A methodology for statistically rigorous performance evaluation and performance factor attribution for server workloads is developed, which shows that careful design of the server load tester can ensure high-quality performance evaluation, and empirically demonstrates the inaccuracy of load testers in previous work.

Abstract: Managing tail latency of requests has become one of the primary challenges for large-scale Internet services. Data centers are quickly evolving and service operators frequently desire to make changes to the deployed software and production hardware configurations. Such changes demand a confident understanding of the impact on one's service, in particular its effect on tail latency (e.g., 95th- or 99th-percentile response latency of the service). Evaluating the impact on the tail is challenging because of its inherent variability. Existing tools and methodologies for measuring these effects suffer from a number of deficiencies including poor load tester design, statistically inaccurate aggregation, and improper attribution of effects. As shown in the paper, these pitfalls can often result in misleading conclusions. In this paper, we develop a methodology for statistically rigorous performance evaluation and performance factor attribution for server workloads. First, we find that careful design of the server load tester can ensure high quality performance evaluation, and empirically demonstrate the inaccuracy of load testers in previous work. Learning from the design flaws in prior work, we design and develop a modular load tester platform, Treadmill, that overcomes pitfalls of existing tools. Next, utilizing Treadmill, we construct measurement and analysis procedures that can properly attribute performance factors. We rely on statistically-sound performance evaluation and quantile regression, extending it to accommodate the idiosyncrasies of server systems. Finally, we use our augmented methodology to evaluate the impact of common server hardware features with Facebook production workloads on production hardware. We decompose the effects of these features on request tail latency and demonstrate that our evaluation methodology provides superior results, particularly in capturing complicated and counter-intuitive performance behaviors. By tuning the hardware features as suggested by the attribution, we reduce the 99th-percentile latency by 43% and its variance by 93%.
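One ingredient of statistically sound tail evaluation can be sketched simply: report an uncertainty interval for the tail quantile rather than a single noisy point estimate. The bootstrap below is a generic stand-in over synthetic lognormal latencies; the paper's methodology goes further, adding quantile regression for factor attribution.

```python
import random

def p99(samples):
    """Empirical 99th-percentile latency."""
    return sorted(samples)[int(0.99 * len(samples)) - 1]

def bootstrap_p99_ci(samples, resamples=1000, alpha=0.05):
    """Percentile-bootstrap confidence interval for the p99 estimate."""
    estimates = sorted(
        p99(random.choices(samples, k=len(samples)))
        for _ in range(resamples)
    )
    lo = estimates[int(alpha / 2 * resamples)]
    hi = estimates[int((1 - alpha / 2) * resamples) - 1]
    return lo, hi

latencies = [random.lognormvariate(1, 0.8) for _ in range(10_000)]  # fake data
print("p99 point estimate:", round(p99(latencies), 2))
print("95% bootstrap CI:  ", tuple(round(x, 2) for x in bootstrap_p99_ci(latencies)))
```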

Proceedings ArticleDOI
02 Nov 2016
TL;DR: Results show that motor performance and simultaneity perception are affected by latencies above 75 ms; sense of agency and body ownership only decline at latencies higher than 125 ms and deteriorate at latencies greater than 300 ms, but they do not break down completely even at the highest tested delay.
Abstract: Latency between a user's movement and visual feedback is inevitable in every Virtual Reality application, as signal transmission and processing take time. Unfortunately, a high end-to-end latency impairs perception and motor performance. While it is possible to reduce feedback delay to tens of milliseconds, these delays will never completely vanish. Currently, there is a gap in literature regarding the impact of feedback delays on perception and motor performance as well as on their interplay in virtual environments employing full-body avatars. With the present study at hand, we address this gap by performing a systematic investigation of different levels of delay across a variety of perceptual and motor tasks during full-body action inside a Cave Automatic Virtual Environment. We presented participants with their virtual mirror image, which responded to their actions with feedback delays ranging from 45 to 350 ms. We measured the impact of these delays on motor performance, sense of agency, sense of body ownership and simultaneity perception by means of psychophysical procedures. Furthermore, we looked at interaction effects between these aspects to identify possible dependencies. The results show that motor performance and simultaneity perception are affected by latencies above 75 ms. Although sense of agency and body ownership only decline at a latency higher than 125 ms, and deteriorate for a latency greater than 300 ms, they do not break down completely even at the highest tested delay. Interestingly, participants perceptually infer the presence of delays more from their motor error in the task than from the actual level of delay. Whether or not participants notice a delay in a virtual environment might therefore depend on the motor task and their performance rather than on the actual delay.

Journal ArticleDOI
TL;DR: It is found that the experienced latency is significantly reduced, compared to using a single path, and multi-path transport is suitable for latency-sensitive traffic and mature enough to be widely deployed.

Proceedings ArticleDOI
10 Apr 2016
TL;DR: This paper provides the first systematic study of WiFi hop latency in the wild, based on the latency and WiFi factors collected from 47 APs on T university campus for two months, and trains a decision tree model to help understand, troubleshoot, and optimize WiFi hop latency for WiFi APs in general.
Abstract: As mobile Internet is now indispensable in our daily lives, WiFi's latency performance has become critical to mobile applications' quality of experience. Unfortunately, WiFi hop latency in the wild remains largely unknown. In this paper, we first propose an effective approach to break down the round trip network latency. Then we provide the first systematic study on WiFi hop latency in the wild based on the latency and WiFi factors collected from 47 APs on T university campus for two months. We observe that WiFi hop can be the weakest link in the round trip network latency: more than 50% (10%) of TCP packets suffer from WiFi hop latency larger than 20ms (100ms), and WiFi hop latency occupies more than 60% in more than half of the round trip network latency. To help understand, troubleshoot, and optimize WiFi hop latency for WiFi APs in general, we train a decision tree model. Based on the model's output, we are able to reduce the median latency by 80% from 50ms to 10ms in one real case, and reduce the maximum latency from 250ms to 50ms in another real case.
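The modeling step might look like the sketch below. The feature set, thresholds, and toy data are invented for illustration; the paper derives its own factors from the collected measurements. A shallow decision tree separates high- from low-latency conditions, and its splits point to the factors worth optimizing.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# columns: [signal strength (dBm), channel utilization (%), clients, retry rate (%)]
X = [
    [-45, 20,  5,  2],
    [-70, 85, 40, 30],
    [-55, 40, 12,  8],
    [-80, 90, 55, 45],
    [-50, 30,  8,  5],
    [-75, 70, 35, 25],
]
y = [0, 1, 0, 1, 0, 1]   # 1 = median WiFi-hop latency above 20 ms (toy labels)

model = DecisionTreeClassifier(max_depth=2).fit(X, y)
# Print the learned splits; these are what an operator would act on.
print(export_text(model, feature_names=["rssi", "util", "clients", "retry"]))
```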

Journal ArticleDOI
TL;DR: This work introduces an adaptation algorithm for HTTP-based live streaming called LOLYPOP (short for low-latency prediction-based adaptation), which is designed to operate with a transport latency of a few seconds, and leverages Transmission Control Protocol throughput predictions on multiple time scales.
Abstract: Recently, Hypertext Transfer Protocol (HTTP)-based adaptive streaming has become the de facto standard for video streaming over the Internet. It allows clients to dynamically adapt media characteristics to the varying network conditions to ensure a high quality of experience (QoE)—that is, minimize playback interruptions while maximizing video quality at a reasonable level of quality changes. In the case of live streaming, this task becomes particularly challenging due to the latency constraints. The challenge further increases if a client uses a wireless access network, where the throughput is subject to considerable fluctuations. Consequently, live streams often exhibit latencies of up to 20 to 30 seconds. In the present work, we introduce an adaptation algorithm for HTTP-based live streaming called LOLYPOP (short for low-latency prediction-based adaptation), which is designed to operate with a transport latency of a few seconds. To reach this goal, LOLYPOP leverages Transmission Control Protocol throughput predictions on multiple time scales, from 1 to 10 seconds, along with estimations of the relative prediction error distributions. In addition to satisfying the latency constraint, the algorithm heuristically maximizes the QoE by maximizing the average video quality as a function of the number of skipped segments and quality transitions. To select an efficient prediction method, we studied the performance of several time series prediction methods in IEEE 802.11 wireless access networks. We evaluated LOLYPOP under a large set of experimental conditions, limiting the transport latency to 3 seconds, against a state-of-the-art adaptation algorithm called FESTIVE. We observed that the average selected video representation index is up to a factor of 3 higher than with the baseline approach. We also observed that LOLYPOP is able to reach points from a broader region in the QoE space, and thus it is better adjustable to the user profile or service provider requirements.
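A hypothetical sketch of prediction-based bitrate selection in the spirit of LOLYPOP (the function names, risk rule, and numbers are invented for illustration, not the paper's algorithm): pick the highest representation whose segment download is predicted to finish within the latency budget with high probability, using the empirical relative prediction error distribution.

```python
def pick_bitrate(bitrates_kbps, predicted_tput_kbps, rel_errors,
                 segment_s=1.0, budget_s=1.5, max_risk=0.05):
    """rel_errors: history of (actual - predicted) / predicted throughput."""
    for rate in sorted(bitrates_kbps, reverse=True):       # try best first
        # Fraction of observed errors under which the download would be late.
        risky = sum(
            rate * segment_s / (predicted_tput_kbps * (1 + e)) > budget_s
            for e in rel_errors if e > -1
        )
        if risky / len(rel_errors) <= max_risk:
            return rate
    return min(bitrates_kbps)                               # fallback: lowest

errors = [-0.3, -0.1, 0.0, 0.05, 0.1, -0.2, 0.02, -0.05, 0.15, -0.4]
print(pick_bitrate([400, 1000, 2500, 5000], predicted_tput_kbps=3000,
                   rel_errors=errors))   # 5000 is too risky -> picks 2500
```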

Journal ArticleDOI
TL;DR: Factors that could explain the difference in latency between initiating and adjusting a movement in response to target displacements are discussed.
Abstract: We can adjust an on-going movement to a change in the target’s position with a latency of about 100 ms, about half of the time that is needed to start a new movement in response to the same change in target position (reaction time). In this opinion paper, we discuss factors that could explain the difference in latency between initiating and adjusting a movement in response to target displacements. We consider the latency to be the sum of the durations of various stages in information processing. Many of these stages are identical for adjusting and initiating a movement, but for movement initiation it is essential to detect that something has changed in order to respond, whereas adjustments to movements can be based on updated position information without detecting that the position has changed. This explanation for the shorter latency for movement adjustments also explains why we can respond to changes that we do not detect.

Journal ArticleDOI
TL;DR: It is proved that Clu-DDAS has a latency bound of $4R' + 2\Delta - 2$, where $\Delta$ is the maximum degree and $R'$ is the inferior network radius, which is smaller than the network radius $R$.

Abstract: Data aggregation is an essential yet time-consuming task in wireless sensor networks. This paper studies the well-known minimum-latency aggregation schedule problem and proposes an energy-efficient distributed scheduling algorithm named Clu-DDAS based on a novel cluster-based aggregation tree. Our approach differs from all the previous schemes where connected dominating sets or maximal independent sets are employed. This paper proves that Clu-DDAS has a latency bound of $4R' + 2\Delta - 2$, where $\Delta$ is the maximum degree and $R'$ is the inferior network radius, which is smaller than the network radius $R$. Our experiments show that Clu-DDAS has an approximate latency upper bound of $4R' + 1.085\Delta - 2$ with increased $\Delta$. Clu-DDAS has comparable latency to the previously best centralized algorithm, E-PAS, but consumes 78% less energy as shown by the simulation results. Clu-DDAS outperforms the previously best distributed algorithm, DAS, whose latency bound is $16R' + \Delta - 14$, on both latency and energy consumption. On average, Clu-DDAS transmits 67% fewer total messages than DAS. The paper also proposes an adaptive strategy for updating the schedule to accommodate dynamic network topology.
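To make the bounds concrete, here is a quick worked comparison under assumed values ($R' = 10$, $\Delta = 8$; illustrative numbers, not from the paper):

$$\text{Clu-DDAS: } 4R' + 2\Delta - 2 = 40 + 16 - 2 = 54 \text{ slots}, \qquad \text{DAS: } 16R' + \Delta - 14 = 160 + 8 - 14 = 154 \text{ slots},$$

so the cluster-based schedule needs roughly a third of the slots of DAS in this setting.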

Journal ArticleDOI
TL;DR: This work describes an in vitro model that utilizes cultured central memory CD4+ T cells and replication-competent HIV-1 to generate latently infected cells that can be reactivated using latency-reversing agents.
Abstract: HIV-1 latently infected cells in vivo can be found in extremely low frequencies. Therefore, in vitro cell culture models have been used extensively for the study of HIV-1 latency. Often, these in vitro systems utilize defective viruses. Defective viruses allow for synchronized infections and circumvent the use of antiretrovirals. In addition, replication-defective viruses cause minimal cytopathicity because they fail to spread and usually do not encode env or accessory genes. On the other hand, replication-competent viruses encode all or most viral genes and better recapitulate the nuances of the viral replication cycle. The study of latency with replication-competent viruses requires the use of antiretroviral drugs in culture, and this mirrors the use of antiretroviral treatment (ART) in vivo. We describe a model that utilizes cultured central memory CD4+ T cells and replication-competent HIV-1. This method generates latently infected cells that can be reactivated using latency reversing agents ...

01 Jan 2016
TL;DR: Bone Tumors: General Aspects and Data on 8542 Cases is universally compatible with any device, and online access to it is set as public so you can get it instantly.

Abstract: Bone Tumors: General Aspects and Data on 8542 Cases is available in our digital library; online access to it is set as public so you can get it instantly. Our book servers host it in multiple countries, allowing you to get the lowest latency time to download any of our books, like this one. Merely said, Bone Tumors: General Aspects and Data on 8542 Cases is universally compatible with any device to read.

Proceedings ArticleDOI
01 Sep 2016
TL;DR: The presented results lead to the conclusion that support for scheduling with different TTI sizes is important for LLC and should be included in the future 5G standard.

Abstract: In this paper we study the downlink latency performance in a multi-user cellular network. We use a flexible 5G radio frame structure, where the TTI size is configurable on a per-user basis according to their specific service requirements. Results show that at low system loads using a short TTI (e.g. 0.25 ms) is an attractive solution to achieve low latency communications (LLC). The main benefits come from the low transmission delay required to transmit the payloads. However, as the load increases, longer TTI configurations with lower relative control overhead (and therefore higher spectral efficiency) provide better performance, as these better cope with the non-negligible queuing delay. The presented results lead to the conclusion that support for scheduling with different TTI sizes is important for LLC and should be included in the future 5G standard.
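A toy back-of-the-envelope sketch of the TTI trade-off described above (all numbers are illustrative assumptions): shorter TTIs cut alignment and transmission delay for small payloads, while longer TTIs carry less relative control overhead and so sustain higher loads before queuing dominates. Queuing itself is omitted here.

```python
def one_shot_delay_ms(payload_bits, tti_ms, rate_mbps, ctrl_overhead):
    """Frame-alignment plus transmission delay for one small payload."""
    usable = rate_mbps * 1e3 * tti_ms * (1 - ctrl_overhead)  # bits per TTI
    ttis = -(-payload_bits // int(usable))                   # ceil division
    alignment = tti_ms / 2                                   # avg wait for next TTI
    return alignment + ttis * tti_ms

# Assumed: 1000-bit payload, 10 Mbps, overhead shrinking with longer TTIs.
for tti, ovh in [(0.25, 0.30), (0.5, 0.20), (1.0, 0.10)]:
    print(f"TTI={tti} ms: {one_shot_delay_ms(1000, tti, 10, ovh):.2f} ms")
# At this light load the 0.25 ms TTI wins despite its higher overhead.
```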

Journal ArticleDOI
TL;DR: This dissertation provides a detailed analysis of DRAM latency by using both circuit-level simulation with a detailed DRAM model and FPGA-based profiling of real DRAM modules, and proposes a new technique, Architectural-Variation-Aware DRAM (AVA-DRAM), which reduces DRAM latency at low cost.

Abstract: In modern systems, DRAM-based main memory is significantly slower than the processor. Consequently, processors spend a long time waiting to access data from main memory, making the long main memory access latency one of the most critical bottlenecks to achieving high system performance. Unfortunately, the latency of DRAM has remained almost constant in the past decade. This is mainly because DRAM has been optimized for cost-per-bit, rather than access latency. As a result, DRAM latency is not reducing with technology scaling, and continues to be an important performance bottleneck in modern and future systems. This dissertation seeks to achieve low latency DRAM-based memory systems at low cost in three major directions. The key idea of these three major directions is to enable and exploit latency heterogeneity in DRAM architecture. First, based on the observation that long bitlines in DRAM are one of the dominant sources of DRAM latency, we propose a new DRAM architecture, Tiered-Latency DRAM (TL-DRAM), which divides the long bitline into two shorter segments using an isolation transistor, allowing one segment to be accessed with reduced latency. Second, we propose a fine-grained DRAM latency reduction mechanism, Adaptive-Latency DRAM, which optimizes DRAM latency for the common operating conditions of individual DRAM modules. We observe that DRAM manufacturers incorporate a very large timing margin as a provision against the worst-case operating conditions, which is accessing the slowest cell across all DRAM products with the worst latency at the highest temperature, even though such a slowest cell and such an operating condition are rare. Our mechanism dynamically optimizes DRAM latency to the current operating condition of the accessed DRAM module, thereby reliably improving system performance. Third, we observe that cells closer to the peripheral logic can be much faster than cells farther from the peripheral logic (a phenomenon we call architectural variation). Based on this observation, we propose a new technique, Architectural-Variation-Aware DRAM (AVA-DRAM), which reduces DRAM latency at low cost, by profiling and identifying only the inherently slower regions in DRAM to dynamically determine the lowest latency DRAM can operate at without causing failures. This dissertation provides a detailed analysis of DRAM latency by using both circuit-level simulation with a detailed DRAM model and FPGA-based profiling of real DRAM modules. Our latency analysis shows that our low latency DRAM mechanisms enable significant latency reductions, leading to large improvements in both system performance and energy efficiency across a variety of workloads in our evaluated systems, while ensuring reliable DRAM operation.

Proceedings ArticleDOI
27 Feb 2016
TL;DR: This paper designs a new adaptive work stealing policy, called tail-control, that reduces the number of requests that miss a target latency, implements this approach in the Intel Threading Building Blocks (TBB) library, and evaluates it on real-world and synthetic workloads.

Abstract: Interactive web services increasingly drive critical business workloads such as search, advertising, games, shopping, and finance. Whereas optimizing parallel programs and distributed server systems have historically focused on average latency and throughput, the primary metric for interactive applications is instead consistent responsiveness, i.e., minimizing the number of requests that miss a target latency. This paper is the first to show how to generalize work-stealing, which is traditionally used to minimize the makespan of a single parallel job, to optimize for a target latency in interactive services with multiple parallel requests. We design a new adaptive work stealing policy, called tail-control, that reduces the number of requests that miss a target latency. It uses instantaneous request progress, system load, and a target latency to choose when to parallelize requests with stealing, when to admit new requests, and when to limit parallelism of large requests. We implement this approach in the Intel Threading Building Blocks (TBB) library and evaluate it on real-world workloads and synthetic workloads. The tail-control policy substantially reduces the number of requests exceeding the desired target latency and delivers up to 58% relative improvement over various baseline policies. This generalization of work stealing for multiple requests effectively optimizes the number of requests that complete within a target latency, a key metric for interactive services.
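A hypothetical sketch of a tail-control-style parallelism decision (the thresholds and the unit-work progress model are invented for illustration, not the paper's policy): requests that cannot meet the target anyway are serialized under high load, so their steals do not delay the many short requests that still can meet it.

```python
def allowed_parallelism(work_done, total_work, elapsed_s,
                        target_s, system_load, max_workers):
    """Workers to grant a request, given its progress and the target latency."""
    remaining = total_work - work_done       # work in worker-seconds
    budget = target_s - elapsed_s
    if budget <= 0:
        return 1                  # already late: don't steal from others
    needed = remaining / budget   # processing rate needed to finish on time
    if system_load > 0.8 and needed > max_workers:
        return 1                  # hopeless large request: serialize it
    return min(max_workers, max(1, round(needed)))

# A request 40% done at 80 ms toward a 100 ms target on a loaded system
# would need 30 workers, more than exist -> serialized (returns 1).
print(allowed_parallelism(0.4, 1.0, 0.08, 0.1, system_load=0.9, max_workers=16))
```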

Journal ArticleDOI
TL;DR: The problem is formulated mathematically, and two heuristics, the Memetic Algorithm and the Recursive Granular Algorithm, are proposed; an extensive experimental study shows that both algorithms are able to produce promising results in reasonable time.

Patent
06 Oct 2016
TL;DR: In this article, a request for agent computation of sensor data is made, and agents to query are determined based on the required confidence and required latency for completion of the agent computation.
Abstract: Data is received characterizing a request for agent computation of sensor data. The request includes a required confidence and required latency for completion of the agent computation. Agents to query are determined based on the required confidence. Data is transmitted to query the determined agents to provide analysis of the sensor data. Related apparatus, systems, techniques, and articles are also described.

Proceedings ArticleDOI
18 Apr 2016
TL;DR: This paper proposes PSLO, a framework supporting the Xth percentile latency and throughput SLOs in a consolidated VM environment by precisely coordinating the level of IO concurrency and the arrival rate for each VM issue queue, and designs and implements a PSLO prototype in a real VM consolidation environment created by Xen.

Abstract: It is desirable but challenging to simultaneously support latency SLO at a pre-defined percentile, i.e., the Xth percentile latency SLO, and throughput SLO for consolidated VM storage. Ensuring the Xth percentile latency contributes to accurately differentiating service levels in the metric of the application-level latency SLO compliance, especially for applications built on multiple VMs. However, the Xth percentile latency SLO and throughput SLO enforcement are the opposite sides of the same coin due to the conflicting requirements for the level of IO concurrency. To address this challenge, this paper proposes PSLO, a framework supporting the Xth percentile latency and throughput SLOs in a consolidated VM environment by precisely coordinating the level of IO concurrency and the arrival rate for each VM issue queue. It is noted that PSLO can take full advantage of the available IO capacity allowed by SLO constraints to improve throughput or reduce latency with the best effort. We design and implement a PSLO prototype in a real VM consolidation environment created by Xen. Our extensive trace-driven prototype evaluation shows that our system is able to optimize the Xth percentile latency and throughput for consolidated VMs under SLO constraints.
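A minimal sketch of the coordination idea (the control rule, thresholds, and window handling are illustrative, not PSLO's actual algorithm): measure the Xth-percentile latency over a window and adjust the VM issue queue's allowed IO concurrency so the percentile tracks the SLO while leaving headroom for throughput.

```python
def adjust_concurrency(depth, window_latencies_ms, x, slo_ms,
                       min_depth=1, max_depth=64):
    """One control step: shrink or grow the allowed IO queue depth."""
    pxx = sorted(window_latencies_ms)[int(x / 100 * len(window_latencies_ms)) - 1]
    if pxx > slo_ms:
        depth = max(min_depth, depth // 2)   # back off: cut outstanding IOs
    elif pxx < 0.8 * slo_ms:
        depth = min(max_depth, depth + 1)    # headroom: raise throughput
    return depth

depth = 32
window = [2.1, 3.5, 2.8, 9.7, 3.0, 2.5, 12.4, 3.1, 2.9, 3.3]
depth = adjust_concurrency(depth, window, x=95, slo_ms=8.0)
print(depth)   # p95 = 9.7 ms > 8 ms SLO -> depth halved to 16
```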

Journal ArticleDOI
TL;DR: This paper proposes SVR-NoC, a network-on-chip (NoC) latency model using support vector regression (SVR), and a learning framework that relies on SVR to collect training data and predict traffic flow latency.

Abstract: In this paper, we propose SVR-NoC, a network-on-chip (NoC) latency model using support vector regression (SVR). More specifically, based on the application communication information and the NoC routing algorithm, the channel and source queue waiting times are first estimated using an analytical queuing model with two equivalent queues. To improve the prediction accuracy, the queuing theory-based delay estimations are included as features in the learning process. We then propose a learning framework that relies on SVR to collect training data and predict the traffic flow latency. The proposed learning methods can be used to analyze various traffic scenarios for the target NoC platform. Experimental results on both synthetic and real-application traffic demonstrate on average less than 12% prediction error in network saturation load, as well as a more than 100× speedup compared to cycle-accurate simulations.
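A hypothetical sketch of the learning step (the feature columns and toy data are invented for illustration; the paper's framework builds its features from the application traffic and its queuing-model estimates): fit an SVR that maps flow features, including an analytical delay estimate, to the simulated flow latency.

```python
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# columns: [injection rate, hop count, queuing-model delay estimate (cycles)]
X = [
    [0.05, 2, 14.0], [0.10, 3, 22.0], [0.15, 4, 35.0],
    [0.20, 3, 41.0], [0.25, 5, 68.0], [0.30, 4, 95.0],
]
y = [15.2, 24.1, 38.9, 46.5, 80.3, 120.7]   # simulated flow latency (cycles)

# Scale features, then fit an RBF-kernel SVR on the training flows.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
model.fit(X, y)
print(model.predict([[0.18, 4, 45.0]]))   # predict latency for a new flow
```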

Journal ArticleDOI
01 Jan 2016-Gene
TL;DR: This study demonstrates that HCMV expresses only a subset of its microRNAs during the latency phase, both in vivo and in vitro, which may indicate that they play an important role in the maintenance and reactivation of latency.

Journal ArticleDOI
TL;DR: BAFis constitute a promising family of molecules for inclusion in therapeutic combinatorial HIV-1 latency reversal; latency reversal was strongly induced by the BAFis Caffeic Acid Phenethyl Ester and Pyrimethamine, two molecules previously characterized for clinical application.