
Showing papers on "Latency (engineering)" published in 2017


Journal ArticleDOI
TL;DR: An online algorithm to learn the unknown dynamic environment and guarantee that the performance gap compared to the optimal strategy is bounded by a logarithmic function with time is proposed.
Abstract: With mobile devices increasingly able to connect to cloud servers from anywhere, resource-constrained devices can potentially perform offloading of computational tasks to either save local resource usage or improve performance. It is of interest to find optimal assignments of tasks to local and remote devices that can take into account the application-specific profile, availability of computational resources, and link connectivity, and find a balance between energy consumption costs of mobile devices and latency for delay-sensitive applications. We formulate an NP-hard problem to minimize the application latency while meeting prescribed resource utilization constraints. Unlike most existing works, which either rely on an integer programming solver or on heuristics that offer no theoretical performance guarantees, we propose Hermes, a novel fully polynomial time approximation scheme (FPTAS). We show that for a subset of problem instances, where the application task graphs can be described as serial trees, Hermes provides a solution with latency no more than $(1+\epsilon)$ times the minimum while incurring complexity that is polynomial in the problem size and $\frac{1}{\epsilon}$. We further propose an online algorithm to learn the unknown dynamic environment and guarantee that the performance gap compared to the optimal strategy is bounded by a logarithmic function with time. Evaluation using real data sets collected from several benchmarks shows that Hermes improves latency by 16 percent compared to a previously published heuristic while increasing CPU computing time by only 0.4 percent of overall latency.
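The latency-versus-device-energy tradeoff the abstract describes can be illustrated with a small dynamic program over a serial task chain, choosing local or remote execution per task. This is a hedged sketch, not the Hermes FPTAS: the task profile, transfer cost, and energy budget below are hypothetical, and energy is simply discretized rather than handled with the paper's approximation machinery.

```python
# Illustrative sketch (not the Hermes FPTAS): assign each task in a serial
# chain to local or remote execution, minimizing end-to-end latency subject
# to a device energy budget. All task profiles and costs are hypothetical.

# Per-task profile: (local_time, local_energy, remote_time)
tasks = [(4.0, 3.0, 1.5), (2.0, 1.5, 1.0), (6.0, 5.0, 2.0), (3.0, 2.5, 1.2)]
transfer_time = 1.0      # extra latency whenever the execution location changes
energy_budget = 6.0
energy_step = 0.5        # coarse discretization of the energy dimension
levels = int(energy_budget / energy_step) + 1
INF = float("inf")

# dp[loc][e]: minimum latency so far, given the previous task ran at `loc`
# (0 = local, 1 = remote) and e * energy_step device energy has been spent.
dp = [[INF] * levels for _ in range(2)]
dp[0][0] = 0.0           # input data starts on the device

for local_t, local_e, remote_t in tasks:
    new = [[INF] * levels for _ in range(2)]
    for prev_loc in range(2):
        for e in range(levels):
            if dp[prev_loc][e] == INF:
                continue
            # Option 1: run this task locally (consumes device energy).
            e2 = e + int(round(local_e / energy_step))
            if e2 < levels:
                cost = dp[prev_loc][e] + local_t + (transfer_time if prev_loc == 1 else 0.0)
                new[0][e2] = min(new[0][e2], cost)
            # Option 2: offload it (no device energy, transfer cost on a switch).
            cost = dp[prev_loc][e] + remote_t + (transfer_time if prev_loc == 0 else 0.0)
            new[1][e] = min(new[1][e], cost)
    dp = new

print("minimum latency within the energy budget:", min(min(row) for row in dp))
```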

233 citations


Proceedings ArticleDOI
02 Oct 2017
TL;DR: In this paper, the authors studied the power-delay tradeoff in the context of task offloading in a multi-user MEC scenario, and formulated the problem as a computation and transmit power minimization subject to latency and reliability constraints.
Abstract: While mobile edge computing (MEC) alleviates the computation and power limitations of mobile devices, additional latency is incurred when offloading tasks to remote MEC servers. In this work, the power-delay tradeoff in the context of task offloading is studied in a multi-user MEC scenario. In contrast with current system designs relying on average metrics (e.g., the average queue length and average latency), a novel network design is proposed in which latency and reliability constraints are taken into account. This is done by imposing a probabilistic constraint on users' task queue lengths and invoking results from extreme value theory to characterize the occurrence of low-probability events in terms of queue length (or queuing delay) violation. The problem is formulated as a computation and transmit power minimization subject to latency and reliability constraints, and solved using tools from Lyapunov stochastic optimization. Simulation results demonstrate the effectiveness of the proposed approach, while examining the power-delay tradeoff and required computational resources for various computation intensities.
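The Lyapunov-based formulation can be made concrete with a textbook drift-plus-penalty loop: each slot, pick the transmit power that minimizes V·power minus backlog-weighted service, then update the queue. This is a generic sketch of the optimization tool the abstract invokes, not the paper's algorithm; the arrival process, service model, and the tradeoff parameter V are all assumed for illustration.

```python
import random

# Generic drift-plus-penalty sketch (not the paper's algorithm): trade average
# transmit power against task queue backlog. All models and constants are assumed.
random.seed(0)

powers = [0.0, 0.5, 1.0, 2.0]                # candidate transmit power levels (W)
service = {0.0: 0, 0.5: 2, 1.0: 3, 2.0: 5}   # bits served per slot at each level
V = 20.0                                     # larger V -> lower power, longer queue
Q = 0                                        # task queue backlog (bits)
energy = 0.0

for slot in range(10000):
    arrivals = random.choice([0, 1, 2, 3])   # random task arrivals this slot
    # Drift-plus-penalty: minimize V*p - Q*service(p) for this slot only.
    p = min(powers, key=lambda x: V * x - Q * service[x])
    Q = max(Q + arrivals - service[p], 0)
    energy += p

print(f"average power {energy/10000:.3f} W, final queue backlog {Q} bits")
```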

160 citations


Journal ArticleDOI
TL;DR: It is revealed that with caching at both transmitter and receiver sides, the network can benefit simultaneously from traffic load reduction and transmission rate enhancement, thereby effectively reducing the content delivery latency.
Abstract: This paper studies the fundamental tradeoff between storage and latency in a general wireless interference network with caches equipped at all transmitters and receivers. The tradeoff is characterized by an information-theoretic metric, normalized delivery time (NDT), which is the worst case delivery time of the actual traffic load at a transmission rate specified by degrees of freedom of a given channel. We obtain both an achievable upper bound and a theoretical lower bound of the minimum NDT for any number of transmitters, any number of receivers, and any feasible cache size tuple. We show that the achievable NDT is exactly optimal in certain cache size regions, and is within a bounded multiplicative gap to the theoretical lower bound in other regions. In the achievability analysis, we first propose a novel cooperative transmitter/receiver coded caching strategy. It offers the freedom to adjust file splitting ratios for NDT minimization. We then propose a delivery strategy that transforms the considered interference network into a new class of cooperative X-multicast channels. It leverages local caching gain, coded multicasting gain, and transmitter cooperation gain (via interference alignment and interference neutralization) opportunistically. Finally, the achievable NDT is obtained by solving a linear programming problem. This paper reveals that with caching at both transmitter and receiver sides, the network can benefit simultaneously from traffic load reduction and transmission rate enhancement, thereby effectively reducing the content delivery latency.

153 citations


Proceedings ArticleDOI
14 Oct 2017
TL;DR: Drizzle is a system that decouples the processing interval from the coordination interval used for fault tolerance and adaptability and exhibits better adaptability, and can recover from failures 4x faster than Flink while having up to 13x lower latency during recovery.
Abstract: Large scale streaming systems aim to provide high throughput and low latency. They are often used to run mission-critical applications, and must be available 24x7. Thus such systems need to adapt to failures and inherent changes in workloads, with minimal impact on latency and throughput. Unfortunately, existing solutions require operators to choose between achieving low latency during normal operation and incurring minimal impact during adaptation. Continuous operator streaming systems, such as Naiad and Flink, provide low latency during normal execution but incur high overheads during adaptation (e.g., recovery), while micro-batch systems, such as Spark Streaming and FlumeJava, adapt rapidly at the cost of high latency during normal operations. Our key observation is that while streaming workloads require millisecond-level processing, workload and cluster properties change less frequently. Based on this, we develop Drizzle, a system that decouples the processing interval from the coordination interval used for fault tolerance and adaptability. Our experiments on a 128 node EC2 cluster show that on the Yahoo Streaming Benchmark, Drizzle can achieve end-to-end record processing latencies of less than 100ms and can get 2-3x lower latency than Spark. Drizzle also exhibits better adaptability, and can recover from failures 4x faster than Flink while having up to 13x lower latency during recovery.
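Drizzle's core idea, decoupling the frequent processing interval from the infrequent coordination interval, can be sketched as a micro-batch loop that only checkpoints and re-plans every k batches. This is an illustrative outline only; the batch-processing and coordination bodies are placeholders, and the interval lengths are assumed knobs rather than values from the paper.

```python
import time

# Illustrative sketch of decoupled intervals: process a micro-batch every
# `processing_interval`, but coordinate (checkpoint, adapt the plan) only
# once every `group_size` batches. Bodies and knobs are placeholders.

processing_interval = 0.01   # seconds per micro-batch (assumed)
group_size = 10              # coordinate once per 10 batches (assumed)

def process_batch(batch_id):
    pass                     # placeholder: run the records of this micro-batch

def coordinate(batch_id):
    pass                     # placeholder: checkpoint state, detect failures, re-plan tasks

for batch_id in range(30):
    start = time.time()
    process_batch(batch_id)
    if (batch_id + 1) % group_size == 0:
        coordinate(batch_id)             # coordination cost is amortized over the group
    # Sleep off the remainder of the processing interval, if any.
    time.sleep(max(0.0, processing_interval - (time.time() - start)))
```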

147 citations


Proceedings ArticleDOI
14 Oct 2017
TL;DR: ZYGOS is presented, a system optimized for μs-scale, in-memory computing on multicore servers that implements a work-conserving scheduler within a specialized operating system designed for high request rates and a large number of network connections.
Abstract: This paper focuses on the efficient scheduling on multicore systems of very fine-grain networked tasks, which are the typical building block of online data-intensive applications. The explicit goal is to deliver high throughput (millions of remote procedure calls per second) for tail latency service-level objectives that are a small multiple of the task size. We present ZYGOS, a system optimized for μs-scale, in-memory computing on multicore servers. It implements a work-conserving scheduler within a specialized operating system designed for high request rates and a large number of network connections. ZYGOS uses a combination of shared-memory data structures, multi-queue NICs, and inter-processor interrupts to rebalance work across cores. For an aggressive service-level objective expressed at the 99th percentile, ZYGOS achieves 75% of the maximum possible load determined by a theoretical, zero-overhead model (centralized queueing with FCFS) for 10μs tasks, and 88% for 25μs tasks. We evaluate ZYGOS with a networked version of Silo, a state-of-the-art in-memory transactional database, running TPC-C. For a service-level objective of 1000μs latency at the 99th percentile, ZYGOS can deliver a 1.63x speedup over Linux (because of its dataplane architecture) and a 1.26x speedup over IX, a state-of-the-art dataplane (because of its work-conserving scheduler).

144 citations


Proceedings ArticleDOI
01 Jun 2017
TL;DR: This work proposes Cachier, a system that uses the caching model along with novel optimizations to minimize latency by adaptively balancing load between the edge and the cloud, by leveraging spatiotemporal locality of requests, using offline analysis of applications, and online estimates of network conditions.
Abstract: Recognition and perception based mobile applications, such as image recognition, are on the rise. These applications recognize the user's surroundings and augment it with information and/or media. These applications are latency-sensitive. They have a soft-realtime nature - late results are potentially meaningless. On the one hand, given the compute-intensive nature of the tasks performed by such applications, execution is typically offloaded to the cloud. On the other hand, offloading such applications to the cloud incurs network latency, which can increase the user-perceived latency. Consequently, edge computing has been proposed to let devices offload intensive tasks to edge servers instead of the cloud, to reduce latency. In this paper, we propose a different model for using edge servers. We propose to use the edge as a specialized cache for recognition applications and formulate the expected latency for such a cache. We show that using an edge server like a typical web cache, for recognition applications, can lead to higher latencies. We propose Cachier, a system that uses the caching model along with novel optimizations to minimize latency by adaptively balancing load between the edge and the cloud, by leveraging spatiotemporal locality of requests, using offline analysis of applications, and online estimates of network conditions. We evaluate Cachier for image-recognition applications and show that our techniques yield 3x speedup in responsiveness, and perform accurately over a range of operating conditions. To the best of our knowledge, this is the first work that models edge servers as caches for compute-intensive recognition applications, and Cachier is the first system that uses this model to minimize latency for these applications.
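The expected-latency model the abstract alludes to can be written down in a few lines: with hit rate h(n) for a cache of size n, expected latency is roughly edge time plus (1 − h(n)) times the fallback cost of going to the cloud, and the cache size is chosen to minimize it. The sketch below uses a hypothetical Zipf-style hit-rate curve and made-up latency constants; it shows the shape of the tradeoff, not Cachier's actual estimator.

```python
# Hedged sketch of an edge-cache expected-latency model; every constant and
# the hit-rate curve are hypothetical, not Cachier's measured values.

t_network = 60.0    # ms: edge <-> cloud transfer (estimated online in Cachier)
t_cloud = 100.0     # ms: recognition at the cloud

def hit_rate(n, catalog=10000, s=0.8):
    """Hypothetical hit rate when caching the n most popular items of a
    Zipf(s) popularity distribution over `catalog` items."""
    harmonic = sum(1.0 / k ** s for k in range(1, catalog + 1))
    return sum(1.0 / k ** s for k in range(1, n + 1)) / harmonic

def edge_time(n):
    """Hypothetical edge recognition time that grows with cache size, since
    the edge classifier must discriminate among more cached objects."""
    return 20.0 + 0.01 * n   # ms

def expected_latency(n):
    # A miss still pays the edge lookup before falling back to the cloud.
    return edge_time(n) + (1.0 - hit_rate(n)) * (t_network + t_cloud)

# Scan candidate cache sizes: bigger caches raise the hit rate but slow the
# edge classifier, so expected latency is minimized at an intermediate size.
for n in (100, 500, 1000, 2000, 5000, 10000):
    print(n, round(expected_latency(n), 1), "ms")
```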

135 citations


Proceedings ArticleDOI
05 Jun 2017
TL;DR: This work empirically demonstrates a new form of variation that exists within a real DRAM chip, induced by the design and placement of different components in the DRAM chip: different regions in DRAM, based on their relative distances from the peripheral structures, require different minimum access latencies for reliable operation.
Abstract: Variation has been shown to exist across the cells within a modern DRAM chip. Prior work has studied and exploited several forms of variation, such as manufacturing-process- or temperature-induced variation. We empirically demonstrate a new form of variation that exists within a real DRAM chip, induced by the design and placement of different components in the DRAM chip: different regions in DRAM, based on their relative distances from the peripheral structures, require different minimum access latencies for reliable operation. In particular, we show that in most real DRAM chips, cells closer to the peripheral structures can be accessed much faster than cells that are farther. We call this phenomenon design-induced variation in DRAM. Our goals are to i) understand design-induced variation that exists in real, state-of-the-art DRAM chips, ii) exploit it to develop low-cost mechanisms that can dynamically find and use the lowest latency at which to operate a DRAM chip reliably, and, thus, iii) improve overall system performance while ensuring reliable system operation. To this end, we first experimentally demonstrate and analyze design-induced variation in modern DRAM devices by testing and characterizing 96 DIMMs (768 DRAM chips). Our experimental study shows that i) modern DRAM chips exhibit design-induced latency variation in both row and column directions, ii) access latency gradually increases in the row direction within a DRAM cell array (mat) and this pattern repeats in every mat, and iii) some columns require higher latency than others due to the internal hierarchical organization of the DRAM chip. Our characterization identifies DRAM regions that are vulnerable to errors, if operated at lower latency, and finds consistency in their locations across a given DRAM chip generation, due to design-induced variation. Variations in the vertical and horizontal dimensions, together, divide the cell array into heterogeneous-latency regions, where cells in some regions require longer access latencies for reliable operation. Reducing the latency uniformly across all regions in DRAM would improve performance, but can introduce failures in the inherently slower regions that require longer access latencies for correct operation. We refer to these inherently slower regions of DRAM as design-induced vulnerable regions. Based on our extensive experimental analysis, we develop two mechanisms that reliably reduce DRAM latency. First, DIVA Profiling uses runtime profiling to dynamically identify the lowest DRAM latency that does not introduce failures. DIVA Profiling exploits design-induced variation and periodically profiles only the vulnerable regions to determine the lowest DRAM latency at low cost. It is the first mechanism to dynamically determine the lowest latency that can be used to operate DRAM reliably. DIVA Profiling reduces the latency of read/write requests by 35.1%/57.8%, respectively, at 55°C. Our second mechanism, DIVA Shuffling, shuffles data such that values stored in vulnerable regions are mapped to multiple error-correcting code (ECC) codewords. As a result, DIVA Shuffling can correct 26% more multi-bit errors than conventional ECC. Combined, our two mechanisms reduce read/write latency by 40.0%/60.5%, which translates to an overall system performance improvement of 14.7%/13.7%/13.8% (in 2-/4-/8-core systems) over a variety of workloads, while ensuring reliable operation.
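The profiling idea, lowering the timing parameter step by step while testing only the design-induced vulnerable regions rather than the whole chip, can be sketched as a simple search loop. The region list, the test routine, and the candidate timing values below are stand-ins; real profiling would issue DRAM commands with reduced timing parameters (e.g., tRCD) and check for bit errors.

```python
# Hedged sketch of profiling only the "vulnerable" regions to pick the lowest
# reliable access latency. Regions, latencies, and the test routine are
# hypothetical stand-ins for real DRAM command sequences.

CANDIDATE_LATENCIES_NS = [13.125, 11.25, 10.0, 8.75, 7.5]   # e.g. candidate tRCD values
vulnerable_regions = ["mat_top_rows", "remote_columns"]      # from design-induced variation

def region_has_errors(region, latency_ns):
    """Placeholder: write a pattern to the region at the given latency,
    read it back, and report whether any bit flipped."""
    thresholds = {"mat_top_rows": 8.75, "remote_columns": 10.0}  # assumed failure model
    return latency_ns < thresholds[region]

def lowest_reliable_latency():
    chosen = CANDIDATE_LATENCIES_NS[0]                  # start at the standard value
    for latency in CANDIDATE_LATENCIES_NS[1:]:          # try progressively lower values
        if any(region_has_errors(r, latency) for r in vulnerable_regions):
            break                                       # first failing value: stop searching
        chosen = latency
    return chosen

print("operate DRAM at", lowest_reliable_latency(), "ns")
```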

122 citations


Journal ArticleDOI
26 Apr 2017
TL;DR: In this article, the authors compared different redundancy strategies in terms of the number of redundant tasks and the time when they are issued and canceled, and designed a general redundancy strategy that achieves a good latency-cost trade-off for an arbitrary service time distribution.
Abstract: In cloud computing systems, assigning a task to multiple servers and waiting for the earliest copy to finish is an effective method to combat the variability in response time of individual servers and reduce latency. But adding redundancy may result in a higher cost of computing resources, as well as an increase in queueing delay due to higher traffic load. This work helps in understanding when and how redundancy gives a cost-efficient reduction in latency. For a general task service time distribution, we compare different redundancy strategies in terms of the number of redundant tasks and the time when they are issued and canceled. We obtain the insight that the log-concavity of the task service time creates a dichotomy of when adding redundancy helps. If the service time distribution is log-convex (i.e., the log of the tail probability is convex), then adding maximum redundancy reduces both latency and cost. If instead it is log-concave (i.e., the log of the tail probability is concave), then using less redundancy and canceling redundant tasks early is more effective. Using these insights, we design a general redundancy strategy that achieves a good latency-cost trade-off for an arbitrary service time distribution. This work also generalizes and extends some results in the analysis of fork-join queues.
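The dichotomy can be probed with a small Monte Carlo experiment: replicate each task on two servers, take the earlier finish and cancel the other, and compare latency and cost (total server time) against no replication, once for a heavy-tailed service time and once for a shifted-exponential (log-concave) one. The distributions and parameters are chosen only for illustration, and this single-task setup ignores the queueing effects analyzed in the paper.

```python
import random

# Monte Carlo sketch of the replication dichotomy (parameters are illustrative).
random.seed(1)
N = 200000

def pareto(alpha=1.5, xm=1.0):
    return xm / ((1.0 - random.random()) ** (1.0 / alpha))   # heavy-tailed service time

def shifted_exp(shift=1.0, rate=1.0):
    return shift + random.expovariate(rate)                   # log-concave service time

def compare(draw):
    lat1 = cost1 = lat2 = cost2 = 0.0
    for _ in range(N):
        a, b = draw(), draw()
        lat1 += a;           cost1 += a                 # single copy
        lat2 += min(a, b);   cost2 += 2 * min(a, b)     # 2 replicas, cancel the loser at completion
    return lat1 / N, cost1 / N, lat2 / N, cost2 / N

for name, draw in [("Pareto (heavy tail)", pareto), ("shifted exponential", shifted_exp)]:
    l1, c1, l2, c2 = compare(draw)
    print(f"{name:22s} no replication: latency {l1:.2f}, cost {c1:.2f} | "
          f"2 replicas: latency {l2:.2f}, cost {c2:.2f}")
```

In this toy run the heavy-tailed case shows replication roughly halving mean latency at about the same total server time, while the shifted-exponential case shows a smaller latency gain paid for with noticeably more server time, which is the direction of the dichotomy the abstract describes.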

121 citations


Journal ArticleDOI
TL;DR: This letter investigates the problem of ultra-reliable and low-latency communication in millimeter wave-enabled massive multiple-input multiple-output networks using the Lyapunov technique, whereby a utility-delay control approach is proposed, which adapts to channel variations and queue dynamics.
Abstract: Ultra-reliability and low latency are two key components in 5G networks. In this letter, we investigate the problem of ultra-reliable and low-latency communication in millimeter wave-enabled massive multiple-input multiple-output networks. The problem is cast as a network utility maximization subject to probabilistic latency and reliability constraints. To solve this problem, we resort to the Lyapunov technique, whereby a utility-delay control approach is proposed, which adapts to channel variations and queue dynamics. Numerical results demonstrate that our proposed approach ensures reliable communication with a guaranteed probability of 99.99%, and reduces latency by 28.41% and 77.11% as compared to baselines with and without probabilistic latency constraints, respectively.

107 citations


Proceedings ArticleDOI
12 Jun 2017
TL;DR: In this article, a clustering method to group spatially proximate user devices with mutual task popularity interests and their serving cloudlets is proposed, and cloudlets can proactively cache the popular tasks' computations of their cluster members to minimize computing latency.
Abstract: In this paper, the fundamental problem of distribution and proactive caching of computing tasks in fog networks is studied under latency and reliability constraints. In the proposed scenario, computing can be executed either locally at the user device or offloaded to an edge cloudlet. Moreover, cloudlets exploit both their computing and storage capabilities by proactively caching popular task computation results to minimize computing latency. To this end, a clustering method to group spatially proximate user devices with mutual task popularity interests and their serving cloudlets is proposed. Then, cloudlets can proactively cache the popular tasks' computations of their cluster members to minimize computing latency. Additionally, the problem of distributing tasks to cloudlets is formulated as a matching game in which a cost function of computing delay is minimized under latency and reliability constraints. Simulation results show that the proposed scheme guarantees reliable computations with bounded latency and achieves up to 91% decrease in computing latency as compared to baseline schemes.
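The task-distribution step described as a matching game can be illustrated with a deferred-acceptance style loop: each user proposes to its lowest-delay cloudlet, and a cloudlet with more proposals than capacity keeps its cheapest proposers and rejects the rest. The delay matrix, capacities, and cost function are invented for the sketch, and the loop ignores the paper's reliability constraints.

```python
# Deferred-acceptance sketch of matching users' tasks to cloudlets by delay
# cost, under per-cloudlet capacity. All numbers are hypothetical.

delay = {                      # delay[user][cloudlet] in ms
    "u1": {"c1": 10, "c2": 25},
    "u2": {"c1": 12, "c2": 20},
    "u3": {"c1": 30, "c2": 15},
    "u4": {"c1": 14, "c2": 18},
}
capacity = {"c1": 2, "c2": 2}

# Each user's preference list: cloudlets ordered by increasing delay.
prefs = {u: sorted(d, key=d.get) for u, d in delay.items()}
matched = {c: [] for c in capacity}
free = list(delay)

while free:
    u = free.pop(0)
    if not prefs[u]:
        continue                          # user has exhausted all cloudlets
    c = prefs[u].pop(0)                   # propose to the best remaining cloudlet
    matched[c].append(u)
    if len(matched[c]) > capacity[c]:
        # Cloudlet keeps its lowest-delay proposers and rejects the worst one.
        worst = max(matched[c], key=lambda x: delay[x][c])
        matched[c].remove(worst)
        free.append(worst)

print(matched)   # e.g. {'c1': ['u1', 'u2'], 'c2': ['u3', 'u4']}
```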

104 citations


Posted Content
24 May 2017
TL;DR: A dense vehicular communication network where each vehicle broadcasts its safety information to its neighborhood in each transmission period is considered, and a novel rotation matching algorithm is developed, which converges to an $L$-rotation stable matching after a limited number of iterations.
Abstract: In this paper, we consider a dense vehicular communication network where each vehicle broadcasts its safety information to its neighborhood in each transmission period. Such applications require low latency and high reliability, and thus, we exploit non-orthogonal multiple access to reduce the access latency and to improve the packet reception probability. In the proposed two-fold scheme, the BS performs semi-persistent scheduling and allocates time-frequency resources in a non-orthogonal manner while the vehicles autonomously perform distributed power control with iterative signaling control. We formulate the centralized scheduling and resource allocation problem as equivalent to a multi-dimensional stable roommate matching problem, in which the users and time/frequency resources are considered as disjoint sets of objects to be matched with each other. We then develop a novel rotation matching algorithm, which converges to an $L$-rotation stable matching after a limited number of iterations. Simulation results show that the proposed scheme outperforms the traditional orthogonal multiple access scheme in terms of the access latency and reliability.

01 Feb 2017
TL;DR: This document presents a lightweight active queue management design called PIE (Proportional Integral controller Enhanced) that can effectively control the average queuing latency to a target value and is simple enough to implement in both hardware and software.
Abstract: Bufferbloat is a phenomenon in which excess buffers in the network cause high latency and latency variation. As more and more interactive applications (e.g., voice over IP, real-time video streaming, and financial transactions) run in the Internet, high latency and latency variation degrade application performance. There is a pressing need to design intelligent queue management schemes that can control latency and latency variation, and hence provide desirable quality of service to users. This document presents a lightweight active queue management design called "PIE" (Proportional Integral controller Enhanced) that can effectively control the average queuing latency to a target value. Simulation results, theoretical analysis, and Linux testbed results have shown that PIE can ensure low latency and achieve high link utilization under various congestion situations. The design does not require per-packet timestamps, so it incurs very little overhead and is simple enough to implement in both hardware and software.
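PIE's central mechanism is a periodic update of a drop probability driven by how far the estimated queuing delay is from a target, plus a term for whether the delay is trending up or down. The sketch below follows that proportional-integral idea in spirit only; the gains, target, and departure-rate estimate are simplified stand-ins rather than the recommended parameters and auto-tuning rules from the document itself.

```python
import random

# Simplified PIE-style drop-probability controller. The gains, target, and
# update interval are illustrative stand-ins; see the document itself for
# the recommended parameters and auto-tuning rules.

TARGET_DELAY = 0.015        # seconds of queuing delay to aim for
ALPHA, BETA = 0.125, 1.25   # proportional / trend gains (assumed values)
UPDATE_INTERVAL = 0.015     # seconds between drop-probability updates

drop_prob = 0.0
old_delay = 0.0

def on_update(queue_bytes, depart_rate_bytes_per_s):
    """Called every UPDATE_INTERVAL with the current queue occupancy and an
    estimate of the dequeue rate."""
    global drop_prob, old_delay
    cur_delay = queue_bytes / max(depart_rate_bytes_per_s, 1.0)  # Little's-law style estimate
    drop_prob += ALPHA * (cur_delay - TARGET_DELAY) + BETA * (cur_delay - old_delay)
    drop_prob = min(max(drop_prob, 0.0), 1.0)
    old_delay = cur_delay
    return drop_prob

def on_enqueue():
    """Probabilistically drop an arriving packet (True means drop)."""
    return random.random() < drop_prob

# Example: a 60 KB standing queue draining at 1.25 MB/s implies ~48 ms of
# queuing delay, well above the 15 ms target, so successive updates raise
# the drop probability.
for _ in range(3):
    print(round(on_update(60_000, 1_250_000), 4))
```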

Proceedings ArticleDOI
03 Apr 2017
TL;DR: ParaBox is proposed, a novel hybrid packet processing architecture that, when possible, dynamically distributes packets to VNFs in parallel and merges their outputs intelligently to ensure the preservation of correct sequential processing semantics.
Abstract: Service Function Chains (SFCs) comprise a sequence of Network Functions (NFs) that are typically traversed in-order by data flows. Consequently, SFC delay grows linearly with the length of the SFC. Yet, for highly latency sensitive applications, this delay may be unacceptable---particularly when the constituent NFs are virtualized, running on commodity servers. In this paper, we investigate how SFC latency may be reduced by exploiting opportunities for parallel packet processing across NFs. We propose ParaBox, a novel hybrid packet processing architecture that, when possible, dynamically distributes packets to VNFs in parallel and merges their outputs intelligently to ensure the preservation of correct sequential processing semantics. To demonstrate the feasibility of our approach, we implement a ParaBox prototype on top of the DPDK-enabled Berkeley Extensible Software Switch. Our preliminary experiment results show that ParaBox can not only significantly reduce the service chaining latency, but also improve throughput.

Proceedings ArticleDOI
05 Jun 2017
TL;DR: A latency-driven cooperative task computing algorithm with a one-for-all concept is proposed for simultaneously selecting the F-RAN nodes to serve and properly allocating heterogeneous resources for multi-user services.
Abstract: Fog computing is emerging as one promising solution to meet the increasing demand for ultra-low latency services in wireless networks. Taking a forward-looking perspective, we propose a Fog-Radio Access Network (F-RAN) model, which utilizes the existing infrastructure, e.g., small cells and macro base stations, to achieve ultra-low latency by joint computing across multiple F-RAN nodes and near-range communications at the edge. We treat the low-latency design as an optimization problem, which characterizes the tradeoff between communication and computing across multiple F-RAN nodes. Since this problem is NP-hard, we propose a latency-driven cooperative task computing algorithm with a one-for-all concept for simultaneously selecting the F-RAN nodes to serve and properly allocating heterogeneous resources for multi-user services. Considering the limited heterogeneous resources shared among all users, we advocate a one-for-all strategy in which every user takes the others' situation into consideration and seeks a "win-win" solution. The numerical results show that low-latency services can be achieved by F-RAN via latency-driven cooperative task computing.

Journal ArticleDOI
TL;DR: The key conclusion of this dissertation is that augmenting DRAM architecture with simple and low-cost features, and developing a better understanding of manufactured DRAM chips together lead to significant memory latency reduction as well as energy efficiency improvement.
Abstract: Over the past two decades, the storage capacity and access bandwidth of main memory have improved tremendously, by 128x and 20x, respectively. These improvements are mainly due to the continuous technology scaling of DRAM (dynamic random-access memory), which has been used as the physical substrate for main memory. In stark contrast with capacity and bandwidth, DRAM latency has remained almost constant, reducing by only 1.3x in the same time frame. Therefore, long DRAM latency continues to be a critical performance bottleneck in modern systems. Increasing core counts, and the emergence of increasingly more data-intensive and latency-critical applications further stress the importance of providing low-latency memory accesses. In this dissertation, we identify three main problems that contribute significantly to long latency of DRAM accesses. To address these problems, we present a series of new techniques. Our new techniques significantly improve both system performance and energy efficiency. We also examine the critical relationship between supply voltage and latency in modern DRAM chips and develop new mechanisms that exploit this voltage-latency trade-off to improve energy efficiency. First, while bulk data movement is a key operation in many applications and operating systems, contemporary systems perform this movement inefficiently, by transferring data from DRAM to the processor, and then back to DRAM, across a narrow off-chip channel. The use of this narrow channel for bulk data movement results in high latency and high energy consumption. This dissertation introduces a new DRAM design, Low-cost Inter-linked SubArrays (LISA), which provides fast and energy-efficient bulk data movement across subarrays in a DRAM chip. We show that the LISA substrate is very powerful and versatile by demonstrating that it efficiently enables several new architectural mechanisms, including low-latency data copying, reduced DRAM access latency for frequently-accessed data, and reduced preparation latency for subsequent accesses to a DRAM bank. Second, DRAM needs to be periodically refreshed to prevent data loss due to leakage. Unfortunately, while DRAM is being refreshed, a part of it becomes unavailable to serve memory requests, which degrades system performance. To address this refresh interference problem, we propose two access-refresh parallelization techniques that enable more overlapping of accesses with refreshes inside DRAM, at the cost of very modest changes to the memory controllers and DRAM chips. These two techniques together achieve performance close to an idealized system that does not require refresh. Third, we find, for the first time, that there is significant latency variation in accessing different cells of a single DRAM chip due to the irregularity in the DRAM manufacturing process. As a result, some DRAM cells are inherently faster to access, while others are inherently slower. Unfortunately, existing systems do not exploit this variation and use a fixed latency value based on the slowest cell across all DRAM chips. To exploit latency variation within the DRAM chip, we experimentally characterize and understand the behavior of the variation that exists in real commodity DRAM chips. Based on our characterization, we propose Flexible-LatencY DRAM (FLY-DRAM), a mechanism to reduce DRAM latency by categorizing the DRAM cells into fast and slow regions, and accessing the fast regions with a reduced latency, thereby improving system performance significantly.
Our extensive experimental characterization and analysis of latency variation in DRAM chips can also enable the development of other new techniques to improve performance or reliability. Fourth, this dissertation, for the first time, develops an understanding of the latency behavior due to another important factor, supply voltage, which significantly impacts DRAM performance, energy consumption, and reliability. We take an experimental approach to understanding and exploiting the behavior of modern DRAM chips under different supply voltage values. Our detailed characterization of real commodity DRAM chips demonstrates that memory access latency reduces with increasing supply voltage. Based on our characterization, we propose Voltron, a new mechanism that improves system energy efficiency by dynamically adjusting the DRAM supply voltage based on a performance model. Our extensive experimental data on the relationship between DRAM supply voltage, latency, and reliability can further enable the development of other new mechanisms that improve latency, energy efficiency, or reliability. The key conclusion of this dissertation is that augmenting DRAM architecture with simple and low-cost features, and developing a better understanding of manufactured DRAM chips, together lead to significant memory latency reduction as well as energy efficiency improvement. We hope and believe that the proposed architectural techniques and detailed experimental data on real commodity DRAM chips presented in this dissertation will enable the development of other new mechanisms to improve the performance, energy efficiency, or reliability of future memory systems.

Journal ArticleDOI
TL;DR: This paper verifies and analyzes the latency of cellular-based V2X with shortened TTI, which is one of the most efficient latency reduction schemes, and proposes cellular-based V2X system design principles in terms of a shortened TTI with only one OFDM symbol while sustaining the radio resource control connection.
Abstract: Vehicle-to-everything (V2X) is a form of wireless communication that is extremely sensitive to latency, because the latency is directly related to driving safety. The V2X systems developed so far have been based on the LTE system. However, the conventional LTE system is not able to support the latency requirements of latency-aware V2X. Fortunately, the state-of-the-art cellular technology standard includes the development of latency reduction schemes, such as shortened transmission time intervals (TTI) and self-contained subframes. This paper verifies and analyzes the latency of cellular-based V2X with shortened TTI, which is one of the most efficient latency reduction schemes. To verify the feasibility of V2X service, we divide the V2X latency into two types: TTI-independent latency and TTI-proportional latency. Moreover, using system-level simulations that consider the additional overhead from shortened TTI, we evaluate the latency of cellular-based V2X systems. Based on this feasibility verification, we then propose cellular-based V2X system design principles in terms of a shortened TTI with only one OFDM symbol while sustaining the radio resource control connection.
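The paper's split of V2X latency into a TTI-independent part and a TTI-proportional part lends itself to a back-of-the-envelope calculation: total latency is roughly a fixed term plus k·TTI for some number k of TTI-scaled steps (alignment, grant, transmission, retransmission). The numbers below are hypothetical placeholders, used only to show how shrinking the TTI from 14 OFDM symbols to one shrinks the second term.

```python
# Back-of-the-envelope latency split (all numbers are hypothetical):
#   total ≈ tti_independent + k * TTI
tti_independent_ms = 1.0   # processing, propagation, core-network delays
k = 4                      # TTI-scaled steps: alignment, grant, transmission, one HARQ retx

for symbols in (14, 7, 2, 1):           # legacy 1 ms subframe down to a one-symbol short TTI
    tti_ms = symbols / 14.0             # 14 OFDM symbols per 1 ms subframe (normal CP)
    total = tti_independent_ms + k * tti_ms
    print(f"{symbols:2d}-symbol TTI: {total:.2f} ms")
```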

Proceedings ArticleDOI
12 Jun 2017
TL;DR: A novel proximity and quality-of-service-aware resource allocation for V2V communication is proposed, which exploits the spatial-temporal aspects of vehicles in terms of their physical proximity and traffic demands, to minimize the total transmission power while considering queuing latency and reliability.
Abstract: Recently vehicle-to-vehicle (V2V) communication emerged as a key enabling technology to ensure traffic safety and other mission-critical applications. In this paper, a novel proximity and quality-of-service (QoS)-aware resource allocation for V2V communication is proposed. The proposed approach exploits the spatial-temporal aspects of vehicles in terms of their physical proximity and traffic demands, to minimize the total transmission power while considering queuing latency and reliability. Due to the overhead caused by frequent information exchange between vehicles and the roadside unit (RSU), the centralized problem is decoupled into two interrelated subproblems. First, a novel RSU-assisted virtual clustering mechanism is proposed to group vehicles in zones based on their physical proximity. Given the vehicles' traffic demands and their QoS requirements, resource blocks are assigned to each zone. Second, leveraging techniques from Lyapunov stochastic optimization, a power minimization solution is proposed for each V2V pair within each zone. Simulation results for a Manhattan model have shown that the proposed scheme outperforms the baseline in terms of average queuing latency reduction up to 97% and significant improvement in reliability.

Posted Content
TL;DR: In this paper, the authors investigate the various sources of end-to-end delay of current wireless networks by taking the 4G Long Term Evolution (LTE) as an example.
Abstract: The fifth-generation cellular mobile networks are expected to support mission critical ultra-reliable low latency communication (URLLC) services in addition to the enhanced mobile broadband applications. This article first introduces three emerging mission critical applications of URLLC and identifies their requirements on end-to-end latency and reliability. We then investigate the various sources of end-to-end delay of current wireless networks by taking the 4G Long Term Evolution (LTE) as an example. Subsequently, we propose and evaluate several techniques to reduce the end-to-end latency from the perspectives of error control coding, signal processing, and radio resource management. We also briefly discuss other network design approaches with the potential for further latency reduction.

Patent
Toufiqul Islam, Jianglei Ma, Kelvin Kar Kin Au, Zhang Jiayin, Mohamed Adel Salem
30 Aug 2017
TL;DR: In this paper, the coexistence of low latency and latency tolerant communications in shared time-frequency resources is discussed as a way to improve resource utilization; in some embodiments, a latency tolerant transmission is postponed to free resources to send a low latency transmission.
Abstract: Some user equipments (UEs) served by a base station may need to receive data from the base station and/or transmit data to the base station with lower latency than other UEs. It is desired to accommodate the presence of both low latency and latency tolerant communications in shared time-frequency resources to try to improve resource utilization. Embodiments are disclosed in which low latency and latency tolerant communications coexist in the same time-frequency resources. In some embodiments, a latency tolerant transmission is postponed to free resources to send a low latency transmission.

Journal ArticleDOI
11 Oct 2017
TL;DR: In this paper, the authors present a cost (pain) vs. latency (gain) analysis of using simple replication or erasure coding for straggler mitigation in executing jobs with many tasks.
Abstract: Redundancy for straggler mitigation, originally in data download and more recently in the distributed computing context, has been shown to be effective both in theory and practice. Analysis of systems with redundancy has drawn significant attention, and numerous papers have studied the pain and gain of redundancy under various service models and assumptions on the straggler characteristics. We here present a cost (pain) vs. latency (gain) analysis of using simple replication or erasure coding for straggler mitigation in executing jobs with many tasks. We quantify the effect of the tail of task execution times and discuss tail heaviness as a decisive parameter for the cost and latency of using redundancy. Specifically, we find that coded redundancy achieves a better cost vs. latency tradeoff than replication, allows for a greater achievable latency-and-cost tradeoff region, and can yield a reduction in both cost and latency under less heavy-tailed execution times. We show that delaying redundancy is not effective in reducing cost.

Journal ArticleDOI
TL;DR: ix is presented, a dataplane operating system that provides high I/O performance and high resource efficiency while maintaining the protection and isolation benefits of existing kernels.
Abstract: The conventional wisdom is that aggressive networking requirements, such as high packet rates for small messages and μs-scale tail latency, are best addressed outside the kernel, in a user-level networking stack. We present ix, a dataplane operating system that provides high I/O performance and high resource efficiency while maintaining the protection and isolation benefits of existing kernels. ix uses hardware virtualization to separate management and scheduling functions of the kernel (control plane) from network processing (dataplane). The dataplane architecture builds upon a native, zero-copy API and optimizes for both bandwidth and latency by dedicating hardware threads and networking queues to dataplane instances, processing bounded batches of packets to completion, and eliminating coherence traffic and multicore synchronization. The control plane dynamically adjusts core allocations and voltage/frequency settings to meet service-level objectives. We demonstrate that ix outperforms Linux and a user-space network stack significantly in both throughput and end-to-end latency. Moreover, ix improves the throughput of a widely deployed, key-value store by up to 6.4× and reduces tail latency by more than 2× . With three varying load patterns, the control plane saves 46%--54% of processor energy, and it allows background jobs to run at 35%--47% of their standalone throughput.

Journal ArticleDOI
Changhyun Lee, Chunjong Park, Keon Jang, Sue Moon, Dongsu Han
TL;DR: It is demonstrated that latency-based implicit feedback is accurate enough to signal a single packet’s queuing delay in 10 Gb/s networks, which enables a new congestion control algorithm, DX, that performs fine-grained control to adjust the congestion window just enough to achieve very low queuingdelay while attaining full utilization.
Abstract: Since the advent of datacenter networking, achieving low latency within the network has been a primary goal. Many congestion control schemes have been proposed in recent years to meet the datacenters’ unique performance requirement. The nature of congestion feedback largely governs the behavior of congestion control. In datacenter networks, where round trip times are in hundreds of microseconds, accurate feedback is crucial to achieve both high utilization and low queueing delay. Proposals for datacenter congestion control predominantly leverage explicit congestion notification (ECN) or even explicit in-network feedback to minimize the queuing delay. In this paper, we explore latency-based feedback as an alternative and show its advantages over ECN. Against the common belief that such implicit feedback is noisy and inaccurate, we demonstrate that latency-based implicit feedback is accurate enough to signal a single packet’s queuing delay in 10 Gb/s networks. Such high accuracy enables us to design a new congestion control algorithm, DX, that performs fine-grained control to adjust the congestion window just enough to achieve very low queuing delay while attaining full utilization. Our extensive evaluation shows that: 1) the latency measurement accurately reflects the one-way queuing delay in single packet level; 2) the latency feedback can be used to perform practical and fine-grained congestion control in high-speed datacenter networks; and 3) DX outperforms DCTCP with 5.33 times smaller median queueing delay at 1 Gb/s and 1.57 times at 10 Gb/s.
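The idea of latency-based feedback can be shown with a minimal window update rule: grow the window additively when the measured one-way queuing delay is essentially zero, and shrink it in proportion to how much of the RTT is spent queuing otherwise. This captures the flavor of delay-driven control; it is not claimed to be DX's exact update, and the threshold and proportionality are assumptions.

```python
# Minimal latency-feedback congestion window update (illustrative; not the
# exact DX rule). Delays are in microseconds.

DELAY_THRESHOLD_US = 5.0    # treat queuing delay below this as "empty queue"

def update_cwnd(cwnd_pkts, queuing_delay_us, rtt_us):
    if queuing_delay_us < DELAY_THRESHOLD_US:
        return cwnd_pkts + 1.0                            # no queue: additive increase
    # Queue building: back off in proportion to the queuing share of the RTT.
    return max(cwnd_pkts * (1.0 - queuing_delay_us / rtt_us), 1.0)

cwnd = 10.0
for delay in [0.0, 0.0, 2.0, 40.0, 10.0, 0.0]:            # sample one-way queuing delays
    cwnd = update_cwnd(cwnd, delay, rtt_us=200.0)
    print(round(cwnd, 2))
```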

Proceedings ArticleDOI
04 Oct 2017
TL;DR: DPCM is designed, which reduces data access latency through parallel processing approaches and exploiting device-side state replica and is implemented and validated with extensive evaluations.
Abstract: Control-plane operations are indispensable to providing data access to mobile devices in the 4G LTE networks. They provision necessary control states at the device and network nodes to enable data access. However, the current design may suffer from long data access latency even under good radio conditions. The fundamental problem is that, data-plane packet delivery cannot start or resume until all control-plane procedures are completed, and these control procedures run sequentially by design. We show both are more than necessary under popular use cases. We design DPCM, which reduces data access latency through parallel processing approaches and exploiting device-side state replica. We implement DPCM and validate its effectiveness with extensive evaluations.

Proceedings ArticleDOI
14 Oct 2017
TL;DR: This work introduces the Adaptive Slow-to-Fast scheduling framework, which matches the heterogeneity of the workload (a mix of short and long requests) to the heterogeneity of the hardware (cores running at different speeds).
Abstract: Interactive service providers have strict requirements on high-percentile (tail) latency to meet user expectations. If providers meet tail latency targets with less energy, they increase profits, because energy is a significant operating expense. Unfortunately, optimizing tail latency and energy are typically conflicting goals. Our work resolves this conflict by exploiting servers with per-core Dynamic Voltage and Frequency Scaling (DVFS) and Asymmetric Multicore Processors (AMPs). We introduce the Adaptive Slow-to-Fast scheduling framework, which matches the heterogeneity of the workload (a mix of short and long requests) to the heterogeneity of the hardware (cores running at different speeds). The scheduler prioritizes long requests to faster cores by exploiting the insight that long requests reveal themselves. We use control theory to design threshold-based scheduling policies that use individual request progress, load, competition, and latency targets to optimize performance and energy. We configure our framework to optimize Energy Efficiency for a given Tail Latency (EETL) for both DVFS and AMP. In this framework, each request self-schedules, starting on a slow core and then migrating itself to faster cores. At high load, when a desired AMP core speed s is not available for a request but a faster core is, the longest request on an s core type migrates early to make room for the other request. Compared to per-core DVFS systems, EETL for AMPs delivers the same tail latency, reduces energy by 18% to 50%, and improves capacity (throughput) by 32% to 82%. We demonstrate that our framework effectively exploits dynamic DVFS and static AMP heterogeneity to reduce provisioning and operational costs for interactive services.
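The "long requests reveal themselves" insight can be sketched as a threshold rule: every request starts on a slow core, and once its measured service so far crosses a threshold it migrates to a faster core. The thresholds, core speeds, and request mix below are invented; the paper derives its thresholds with control theory and also handles DVFS, load, and early migration at high load.

```python
import random

# Threshold-based slow-to-fast scheduling sketch. Core speeds, the migration
# threshold, and the request mix are invented; contention and queueing are ignored.
random.seed(2)
SLOW_SPEED, FAST_SPEED = 1.0, 3.0   # work units per ms
MIGRATE_AFTER = 1.0                 # ms of service on a slow core before migrating

def service_time(work):
    """Every request starts on a slow core; only requests still running after
    MIGRATE_AFTER ms (i.e., those that 'reveal themselves' as long) migrate
    to a fast core."""
    slow_phase = min(work / SLOW_SPEED, MIGRATE_AFTER)
    remaining = work - slow_phase * SLOW_SPEED
    return slow_phase + remaining / FAST_SPEED

demands = [random.choice([0.3, 0.5, 6.0]) for _ in range(9)]   # mix of short and long requests
times = sorted(service_time(w) for w in demands)
print("service times (ms):", [round(t, 2) for t in times])     # long requests finish in ~2.7 ms, not 6 ms
```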

Book ChapterDOI
09 Jul 2017
TL;DR: Empirical evidence suggests a need for updated guidelines for designing latency in HCI: latencies below 100 ms have seldom been considered by existing guidelines, even though such latencies have been shown to be perceivable to the user and to impact user performance negatively.
Abstract: Latency or system response time (i.e., the delay between user input and system response) is a fundamental factor affecting human-computer interaction (HCI). If latency exceeds a critical threshold, user performance and experience get impaired. Therefore, several design guidelines giving recommendations on maximum latencies for an optimal user experience have been developed within the last five decades. Concentrating on the lower boundary latencies, these guidelines are critically reviewed and contrasted with recent empirical findings. The review reveals that latencies below 100 ms were seldom considered in guidelines so far, even though smaller latencies have been shown to be perceivable to the user and to impact user performance negatively. Thus, empirical evidence suggests a need for updated guidelines for designing latency in HCI.

Journal ArticleDOI
TL;DR: This study proposes a novel method of GC based on reinforcement learning that significantly reduces the long-tail latency by 29--36% at the 99th percentile compared to state-of-the-art schemes.
Abstract: NAND flash memory is widely used in various systems, ranging from real-time embedded systems to enterprise server systems. Because the flash memory has erase-before-write characteristics, we need flash-memory management methods, i.e., address translation and garbage collection. In particular, garbage collection (GC) incurs long-tail latency, e.g., 100 times higher latency than the average latency at the 99th percentile. Thus, real-time and quality-critical systems fail to meet the given requirements such as deadline and QoS constraints. In this study, we propose a novel method of GC based on reinforcement learning. The objective is to reduce the long-tail latency by exploiting the idle time in the storage system. To improve the efficiency of the reinforcement learning-assisted GC scheme, we present new optimization methods that exploit fine-grained GC to further reduce the long-tail latency. The experimental results with real workloads show that our technique significantly reduces the long-tail latency by 29--36% at the 99.99th percentile compared to state-of-the-art schemes.
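The reinforcement-learning idea, deciding during idle periods whether (and how much) garbage collection to do so that it does not collide with bursts of I/O, can be sketched with a tiny tabular Q-learning agent. The state (a coarse idle-length bucket), the actions, the reward, and the simulated workload are all made up for the sketch.

```python
import random

# Tiny tabular Q-learning sketch for idle-time GC scheduling. The state,
# actions, reward, and workload model are invented for illustration.
random.seed(3)
ACTIONS = ["skip", "partial_gc", "full_gc"]
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1
Q = {(s, a): 0.0 for s in ("short_idle", "long_idle") for a in ACTIONS}

def reward(state, action):
    """Negative latency penalty: GC during a short idle window makes the next
    request wait; never collecting lets invalid pages pile up."""
    penalty = {("short_idle", "skip"): 0.0,  ("short_idle", "partial_gc"): -1.0,
               ("short_idle", "full_gc"): -5.0,
               ("long_idle", "skip"): -2.0,  ("long_idle", "partial_gc"): -0.5,
               ("long_idle", "full_gc"): 0.0}
    return penalty[(state, action)] + random.gauss(0, 0.1)

state = "short_idle"
for step in range(5000):
    if random.random() < EPS:
        action = random.choice(ACTIONS)                        # explore
    else:
        action = max(ACTIONS, key=lambda a: Q[(state, a)])     # exploit
    r = reward(state, action)
    next_state = random.choice(["short_idle", "long_idle"])
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (r + GAMMA * best_next - Q[(state, action)])
    state = next_state

for s in ("short_idle", "long_idle"):
    print(s, "->", max(ACTIONS, key=lambda a: Q[(s, a)]))
```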

Proceedings ArticleDOI
24 Jun 2017
TL;DR: PowerChief is presented, a runtime framework that provides joint design of service and query to monitor the latency statistics across service stages and adaptively chooses the boosting technique to accelerate the bottleneck service with improved responsiveness and dynamically reallocates the constrained power budget acrossservice stages to accommodate the chosen boosting technique.
Abstract: Modern user facing applications consist of multiple processing stages with a number of service instances in each stage. The latency profile of these multi-stage applications is intrinsically variable, making it challenging to provide satisfactory responsiveness. Given a limited power budget, improving the end-to-end latency requires intelligently boosting the bottleneck service across stages using multiple boosting techniques. However, prior work fail to acknowledge the multi-stage nature of user-facing applications and perform poorly in improving responsiveness on power constrained CMP, as they are unable to accurately identify bottleneck service and apply the boosting techniques adaptively.In this paper, we present PowerChief, a runtime framework that 1) provides joint design of service and query to monitor the latency statistics across service stages and accurately identifies the bottleneck service during runtime; 2) adaptively chooses the boosting technique to accelerate the bottleneck service with improved responsiveness; 3) dynamically reallocates the constrained power budget across service stages to accommodate the chosen boosting technique. Evaluated with real world multi-stage applications, PowerChief improves the average latency by 20.3x and 32.4x (99% tail latency by 13.3x and 19.4x) for Sirius and Natural Language Processing applications respectively compared to stage-agnostic power allocation. In addition, for the given QoS target, PowerChief reduces the power consumption of Sirius and Web Search applications by 23% and 33% respectively over prior work.

Journal ArticleDOI
TL;DR: This paper proposes a new class of low-rank matrix completion algorithms, which predicts the missing entries in an extracted “network feature matrix” by iteratively minimizing a weighted Schatten- $p$ norm to approximate the rank.
Abstract: Network latency prediction is important for server selection and quality-of-service estimation in real-time applications on the Internet. Traditional network latency prediction schemes attempt to estimate the latencies between all pairs of nodes in a network based on sampled round-trip times, through either Euclidean embedding or matrix factorization. However, these schemes become less effective in terms of estimating the latencies of personal devices, due to unstable and time-varying network conditions, triangle inequality violation and the unknown ranks of latency matrices. In this paper, we propose a matrix completion approach to network latency estimation. Specifically, we propose a new class of low-rank matrix completion algorithms, which predicts the missing entries in an extracted “network feature matrix” by iteratively minimizing a weighted Schatten-$p$ norm to approximate the rank. Simulations on true low-rank matrices show that our new algorithm achieves better and more robust performance than multiple state-of-the-art matrix completion algorithms in the presence of noise. We further enhance latency estimation based on multiple “frames” of latency matrices measured in the past, and extend the proposed matrix completion scheme to the case of 3-D tensor completion. Extensive performance evaluations driven by real-world latency measurements collected from the Seattle platform show that our proposed approaches significantly outperform various state-of-the-art network latency estimation techniques, especially for networks that contain personal devices.
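The flavor of low-rank latency-matrix completion can be shown with a short singular-value-thresholding loop in NumPy: repeatedly shrink the singular values of the current estimate, then re-impose the observed entries. This is a generic completion sketch, not the paper's weighted Schatten-p algorithm, and the synthetic matrix, noise level, and threshold are arbitrary.

```python
import numpy as np

# Generic singular-value-thresholding completion sketch (not the paper's
# weighted Schatten-p method). Synthetic data and parameters are arbitrary.
rng = np.random.default_rng(0)
n, rank = 60, 3
truth = rng.normal(size=(n, rank)) @ rng.normal(size=(rank, n))   # low-rank "latency" matrix
mask = rng.random((n, n)) < 0.3                                    # 30% of node pairs measured
observed = np.where(mask, truth + 0.01 * rng.normal(size=(n, n)), 0.0)

X = observed.copy()
tau = 5.0                      # singular-value shrinkage threshold
for _ in range(200):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X = (U * np.maximum(s - tau, 0.0)) @ Vt     # soft-threshold the spectrum
    X[mask] = observed[mask]                    # keep the measured latencies fixed

err = np.linalg.norm((X - truth)[~mask]) / np.linalg.norm(truth[~mask])
print(f"relative error on unmeasured entries: {err:.3f}")
```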

Journal ArticleDOI
TL;DR: A latency model based on human neurons differentiated in culture from an NIH-approved embryonic stem cell line that establishes a non-productive infection state resembling latency when infected at low viral doses in the presence of the antivirals acyclovir and interferon-α is developed.
Abstract: Herpes simplex virus 1 (HSV-1) uses latency in peripheral ganglia to persist in its human host, however, recurrent reactivation from this reservoir can cause debilitating and potentially life-threatening disease. Most studies of latency use live-animal infection models, but these are complex, multilayered systems and can be difficult to manipulate. Infection of cultured primary neurons provides a powerful alternative, yielding important insights into host signaling pathways controlling latency. However, small animal models do not recapitulate all aspects of HSV-1 infection in humans and are limited in terms of the available molecular tools. To address this, we have developed a latency model based on human neurons differentiated in culture from an NIH-approved embryonic stem cell line. The resulting neurons are highly permissive for replication of wild-type HSV-1, but establish a non-productive infection state resembling latency when infected at low viral doses in the presence of the antivirals acyclovir and interferon-α. In this state, viral replication and expression of a late viral gene marker are not detected but there is an accumulation of the viral latency-associated transcript (LAT) RNA. After a six-day establishment period, antivirals can be removed and the infected cultures maintained for several weeks. Subsequent treatment with sodium butyrate induces reactivation and production of new infectious virus. Human neurons derived from stem cells provide the appropriate species context to study this exclusively human virus with the potential for more extensive manipulation of the progenitors and access to a wide range of preexisting molecular tools.

Proceedings ArticleDOI
01 Dec 2017
TL;DR: Fog networking is incorporated into heterogeneous cellular networks that are composed of a high power node (HPN) and many low power nodes (LPNs), and the locations of the fog nodes are specified by modifying the unsupervised soft-clustering machine learning algorithm with the ultimate aim of reducing latency.
Abstract: This paper incorporates fog networking into heterogeneous cellular networks that are composed of a high power node (HPN) and many low power nodes (LPNs). The locations of the fog nodes that are upgraded from LPNs are specified by modifying the unsupervised soft-clustering machine learning algorithm with the ultimate aim of reducing latency. The clusters are constructed accordingly so that the leader of each cluster becomes a fog node. The proposed approach significantly reduces the latency with respect to the simple, but practical, Voronoi tessellation model; however, the improvement is bounded and saturates. Hence, closed-loop error control systems will be challenged in meeting the demanding latency requirement of 5G systems, so that open-loop communication may be required to meet the 1 ms latency requirement of 5G networks.
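The placement step, grouping low-power nodes into clusters and upgrading each cluster's leader to a fog node, can be illustrated with a small soft-assignment clustering over node coordinates, choosing as leader the node nearest each cluster centroid. The coordinates, cluster count, and softness parameter are made up, and this generic sketch does not include the latency-aware modification the paper applies to the clustering algorithm.

```python
import numpy as np

# Soft-clustering sketch for choosing which LPNs to upgrade to fog nodes.
# Node coordinates, the cluster count K, and the softness beta are made up.
rng = np.random.default_rng(1)
lpn_xy = rng.uniform(0, 1000, size=(40, 2))   # LPN positions in a 1 km x 1 km area (m)
K, beta = 4, 1e-3                             # number of clusters; softness of assignment

centroids = lpn_xy[rng.choice(len(lpn_xy), K, replace=False)]
for _ in range(50):
    d2 = ((lpn_xy[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)   # (40, K)
    d2 -= d2.min(axis=1, keepdims=True)        # subtract row minimum for numerical stability
    resp = np.exp(-beta * d2)
    resp /= resp.sum(axis=1, keepdims=True)    # soft membership of each LPN in each cluster
    centroids = (resp.T @ lpn_xy) / resp.sum(axis=0)[:, None]

# Upgrade the LPN closest to each final centroid into that cluster's fog node.
d2 = ((lpn_xy[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
fog_nodes = sorted(set(d2.argmin(axis=0).tolist()))
print("LPN indices upgraded to fog nodes:", fog_nodes)
```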