
Showing papers on "Latency (engineering)" published in 2013


Proceedings ArticleDOI
03 Nov 2013
TL;DR: It is demonstrated that a decentralized, randomized sampling approach provides near-optimal performance while avoiding the throughput and availability limitations of a centralized design.
Abstract: Large-scale data analytics frameworks are shifting towards shorter task durations and larger degrees of parallelism to provide low latency. Scheduling highly parallel jobs that complete in hundreds of milliseconds poses a major challenge for task schedulers, which will need to schedule millions of tasks per second on appropriate machines while offering millisecond-level latency and high availability. We demonstrate that a decentralized, randomized sampling approach provides near-optimal performance while avoiding the throughput and availability limitations of a centralized design. We implement and deploy our scheduler, Sparrow, on a 110-machine cluster and demonstrate that Sparrow performs within 12% of an ideal scheduler.
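
The core placement idea is small enough to sketch. The Python snippet below illustrates batch sampling in the spirit of the paper: probe a constant factor more workers than the job has tasks and place the tasks on the least-loaded workers probed. It is a simplification, not Sparrow's implementation (which additionally uses late binding rather than stale queue-length estimates), and the worker count, backlogs, and probe ratio are hypothetical.

```python
import random

def batch_sample_schedule(queue_lengths, num_tasks, probe_ratio=2):
    """Illustrative batch sampling: probe probe_ratio * num_tasks random
    workers and place the job's tasks on the least-loaded workers probed."""
    workers = list(range(len(queue_lengths)))
    probed = random.sample(workers, min(len(workers), probe_ratio * num_tasks))
    probed.sort(key=lambda w: queue_lengths[w])   # least-loaded first
    return probed[:num_tasks]

# Example: 110 workers with random backlogs, place a 10-task job.
random.seed(0)
queues = [random.randint(0, 5) for _ in range(110)]
print("chosen workers:", batch_sample_schedule(queues, num_tasks=10))
```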

597 citations


Proceedings ArticleDOI
08 Jul 2013
TL;DR: A lightweight design, PIE (Proportional Integral controller Enhanced), is presented that can effectively control the average queueing latency to a reference value and is robust and optimized for various network scenarios.
Abstract: Bufferbloat is a phenomenon where excess buffers in the network cause high latency and jitter. As more and more interactive applications (e.g. voice over IP, real time video conferencing and financial transactions) run in the Internet, high latency and jitter degrade application performance. There is a pressing need to design intelligent queue management schemes that can control latency and jitter; and hence provide desirable quality of service to users. We present here a lightweight design, PIE (Proportional Integral controller Enhanced), that can effectively control the average queueing latency to a reference value. The design does not require per-packet extra processing, so it incurs very small overhead and is simple to implement in both hardware and software. In addition, the design parameters are self-tuning, and hence PIE is robust and optimized for various network scenarios. Simulation results, theoretical analysis and Linux testbed results show that PIE can ensure low latency and achieve high link utilization under various congestion situations.
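
The controller at the heart of such a design can be sketched in a few lines, assuming the basic proportional-integral update PIE is built around: the drop probability is nudged by how far the current queueing delay sits from the reference and by its recent trend. The parameter values below are illustrative, not the self-tuning values the paper describes.

```python
def pie_update(drop_prob, qdelay, qdelay_old, target=0.015, alpha=0.125, beta=1.25):
    """One periodic update of the drop probability: a term proportional to the
    deviation from the reference delay plus a term on the delay's trend.
    Delays are in seconds; parameters are illustrative defaults."""
    drop_prob += alpha * (qdelay - target) + beta * (qdelay - qdelay_old)
    return min(max(drop_prob, 0.0), 1.0)

# Example: queueing delay rises above the 15 ms reference, then recovers.
p = 0.0
delays = [0.005, 0.020, 0.040, 0.035, 0.020, 0.015]
for old, cur in zip(delays, delays[1:]):
    p = pie_update(p, cur, old)
    print(f"qdelay = {cur * 1000:.0f} ms  drop_prob = {p:.4f}")
```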

280 citations


Proceedings ArticleDOI
23 Feb 2013
TL;DR: This work introduces Tiered-Latency DRAM (TL-DRAM), which achieves both low latency and low cost-per-bit, and proposes mechanisms that use the low-latency segment as a hardware-managed or software-managed cache.
Abstract: The capacity and cost-per-bit of DRAM have historically scaled to satisfy the needs of increasingly large and complex computer systems. However, DRAM latency has remained almost constant, making memory latency the performance bottleneck in today's systems. We observe that the high access latency is not intrinsic to DRAM, but a trade-off made to decrease cost-per-bit. To mitigate the high area overhead of DRAM sensing structures, commodity DRAMs connect many DRAM cells to each sense-amplifier through a wire called a bitline. These bitlines have a high parasitic capacitance due to their long length, and this bitline capacitance is the dominant source of DRAM latency. Specialized low-latency DRAMs use shorter bitlines with fewer cells, but have a higher cost-per-bit due to greater sense-amplifier area overhead. In this work, we introduce Tiered-Latency DRAM (TL-DRAM), which achieves both low latency and low cost-per-bit. In TL-DRAM, each long bitline is split into two shorter segments by an isolation transistor, allowing one segment to be accessed with the latency of a short-bitline DRAM without incurring high cost-per-bit. We propose mechanisms that use the low-latency segment as a hardware-managed or software-managed cache. Evaluations show that our proposed mechanisms improve both performance and energy-efficiency for both single-core and multi-programmed workloads.

269 citations


Journal ArticleDOI
TL;DR: A latency-aware learning formulation is used to train a logistic regression-based classifier that automatically determines distinctive canonical poses from data and uses these to robustly recognize actions in the presence of ambiguous poses.
Abstract: An important aspect in designing interactive, action-based interfaces is reliably recognizing actions with minimal latency. High latency causes the system's feedback to lag behind user actions and thus significantly degrades the interactivity of the user experience. This paper presents algorithms for reducing latency when recognizing actions. We use a latency-aware learning formulation to train a logistic regression-based classifier that automatically determines distinctive canonical poses from data and uses these to robustly recognize actions in the presence of ambiguous poses. We introduce a novel (publicly released) dataset for the purpose of our experiments. Comparisons of our method against both a Bag of Words and a Conditional Random Field (CRF) classifier show improved recognition performance for both pre-segmented and online classification tasks. Additionally, we employ GentleBoost to reduce our feature set and further improve our results. We then present experiments that explore the accuracy/latency trade-off over a varying number of actions. Finally, we evaluate our algorithm on two existing datasets.

262 citations


Journal ArticleDOI
TL;DR: Results indicated mean amplitude was the most robust against increases in background noise and the adaptive mean measure was more biased, but represented an efficient estimator of the true ERP signal particularly for individual-subject latency variability.
Abstract: There is considerable variability in the quantification of event-related potential (ERP) amplitudes and latencies. We examined susceptibility of ERP quantification measures to incremental increases in background noise through published ERP data and simulations. Measures included mean amplitude, adaptive mean, peak amplitude, peak latency, and centroid latency. Results indicated mean amplitude was the most robust against increases in background noise. The adaptive mean measure was more biased, but represented an efficient estimator of the true ERP signal, particularly for individual-subject latency variability. Strong evidence is provided against using peak amplitude. For latency measures, the peak latency measure was less biased and less efficient than the centroid latency measurement. Results emphasize the prudence of reporting the number of trials retained for averaging, as well as noise estimates for groups and conditions, when comparing ERPs.
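
A small numerical sketch of these quantification measures may help readers unfamiliar with them; the measurement window, the ±25 ms adaptive-mean window, and the simulated waveform are assumptions for illustration, not the study's procedures.

```python
import numpy as np

def erp_measures(signal, times, window):
    """Common ERP quantification measures over a measurement window.
    'Adaptive mean' here averages +/- 25 ms around the peak, one of several
    published variants."""
    mask = (times >= window[0]) & (times <= window[1])
    seg, t = signal[mask], times[mask]
    peak_idx = np.argmax(seg)
    adapt = (t >= t[peak_idx] - 0.025) & (t <= t[peak_idx] + 0.025)
    centroid = np.sum(t * np.abs(seg)) / np.sum(np.abs(seg))
    return {
        "mean_amplitude": seg.mean(),
        "adaptive_mean": seg[adapt].mean(),
        "peak_amplitude": seg[peak_idx],
        "peak_latency": t[peak_idx],
        "centroid_latency": centroid,
    }

# Example: a noisy positive component peaking near 300 ms.
rng = np.random.default_rng(1)
times = np.arange(0, 0.8, 0.002)
erp = 5 * np.exp(-((times - 0.3) ** 2) / (2 * 0.05 ** 2)) + rng.normal(0, 1, times.size)
print(erp_measures(erp, times, window=(0.2, 0.4)))
```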

190 citations


Proceedings ArticleDOI
09 Dec 2013
TL;DR: It is argued that the use of redundancy is an effective way to convert extra capacity into reduced latency: by initiating redundant operations across diverse resources and using the first result that completes, redundancy improves a system's latency even under exceptional conditions.
Abstract: Low latency is critical for interactive networked applications. But while we know how to scale systems to increase capacity, reducing latency --- especially the tail of the latency distribution --- can be much more difficult. In this paper, we argue that the use of redundancy is an effective way to convert extra capacity into reduced latency. By initiating redundant operations across diverse resources and using the first result which completes, redundancy improves a system's latency even under exceptional conditions. We study the tradeoff with added system utilization, characterizing the situations in which replicating all tasks reduces mean latency. We then demonstrate empirically that replicating all operations can result in significant mean and tail latency reduction in real-world systems including DNS queries, database servers, and packet forwarding within networks.
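
The operational pattern is simple: issue the same request to several replicas and keep whichever answer arrives first. The sketch below illustrates that pattern only; the simulated latency distribution and replica count are invented, and the real-world systems measured in the paper (DNS, databases, packet forwarding) are not reproduced here.

```python
import concurrent.futures as cf
import random, time

def query_replica(replica_id):
    """Stand-in for a replicated operation (DNS lookup, database read, ...):
    usually fast, but occasionally hits a slow tail."""
    latency = random.expovariate(1 / 0.02) + (0.2 if random.random() < 0.05 else 0.0)
    time.sleep(latency)
    return replica_id, latency

def first_result(redundancy=2):
    """Send the same request to several replicas and take the first answer."""
    pool = cf.ThreadPoolExecutor(max_workers=redundancy)
    futures = [pool.submit(query_replica, i) for i in range(redundancy)]
    done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
    pool.shutdown(wait=False)   # don't block on the laggard copies
    return next(iter(done)).result()

random.seed(2)
print("first replica to answer (id, latency):", first_result())
```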

186 citations


Posted Content
TL;DR: In this paper, the authors argue that the use of redundancy is an effective way to convert extra capacity into reduced latency, and demonstrate empirically that replicating all operations can result in significant mean and tail latency reduction in real-world systems including DNS queries, database servers, and packet forwarding within networks.
Abstract: Low latency is critical for interactive networked applications. But while we know how to scale systems to increase capacity, reducing latency --- especially the tail of the latency distribution --- can be much more difficult. In this paper, we argue that the use of redundancy is an effective way to convert extra capacity into reduced latency. By initiating redundant operations across diverse resources and using the first result which completes, redundancy improves a system's latency even under exceptional conditions. We study the tradeoff with added system utilization, characterizing the situations in which replicating all tasks reduces mean latency. We then demonstrate empirically that replicating all operations can result in significant mean and tail latency reduction in real-world systems including DNS queries, database servers, and packet forwarding within networks.

184 citations


Proceedings ArticleDOI
03 Nov 2013
TL;DR: It is shown that it is possible to obtain both serializable transactions and low latency, under two conditions: transactions are known ahead of time, permitting an a priori static analysis of conflicts, and transactions are structured as transaction chains consisting of a sequence of hops.
Abstract: Currently, users of geo-distributed storage systems face a hard choice between having serializable transactions with high latency, or limited or no transactions with low latency. We show that it is possible to obtain both serializable transactions and low latency, under two conditions. First, transactions are known ahead of time, permitting an a priori static analysis of conflicts. Second, transactions are structured as transaction chains consisting of a sequence of hops, each hop modifying data at one server. To demonstrate this idea, we built Lynx, a geo-distributed storage system that offers transaction chains, secondary indexes, materialized join views, and geo-replication. Lynx uses static analysis to determine if each hop can execute separately while preserving serializability---if so, a client needs wait only for the first hop to complete, which occurs quickly. To evaluate Lynx, we built three applications: an auction service, a Twitter-like microblogging site and a social networking site. These applications successfully use chains to achieve low latency operation and good throughput.

159 citations


Proceedings ArticleDOI
27 Apr 2013
TL;DR: This paper presents a detailed analysis of users' coping mechanisms for latency, and presents the results of a follow-up study demonstrating user perception of latency in the land-on phase of the dragging task.
Abstract: Although advances in touchscreen technology have provided us with more precise devices, touchscreens are still laden with latency issues. Common commercial devices present with latency up to 125ms. Although these levels have been shown to impact users' perception of the responsiveness of the system [16], relatively little is known about the impact of latency on the performance of tasks common to direct-touch interfaces, such as direct physical manipulation. In this paper, we study the effect of latency of a direct-touch pointing device on dragging tasks. Our tests show that user performance decreases as latency increases. We also find that user performance is more severely affected by latency when targets are smaller or farther away. We present a detailed analysis of users' coping mechanisms for latency, and present the results of a follow-up study demonstrating user perception of latency in the land-on phase of the dragging task.

148 citations


Proceedings ArticleDOI
27 Aug 2013
TL;DR: Kwiken, a framework that takes an end-to-end view of latency improvements and costs, decomposes the problem of minimizing latency over a general processing DAG into a manageable optimization over individual stages.
Abstract: We found that interactive services at Bing have highly variable datacenter-side processing latencies because their processing consists of many sequential stages, parallelization across 10s-1000s of servers and aggregation of responses across the network. To improve the tail latency of such services, we use a few building blocks: reissuing laggards elsewhere in the cluster, new policies to return incomplete results and speeding up laggards by giving them more resources. Combining these building blocks to reduce the overall latency is non-trivial because for the same amount of resource (e.g., number of reissues), different stages improve their latency by different amounts. We present Kwiken, a framework that takes an end-to-end view of latency improvements and costs. It decomposes the problem of minimizing latency over a general processing DAG into a manageable optimization over individual stages. Through simulations with production traces, we show sizable gains; the 99th percentile of latency improves by over 50% when just 0.1% of the responses are allowed to have partial results and by over 40% for 25% of the services when just 5% extra resources are used for reissues.

132 citations


Proceedings ArticleDOI
23 Jun 2013
TL;DR: A novel DRAM bank organization with center high-aspect-ratio mats called CHARM is introduced, which improves both the instructions per cycle and system-wide energy-delay product up to 21% and 32%, respectively, with only a 3% increase in die area.
Abstract: DRAM has been a de facto standard for main memory, and advances in process technology have led to a rapid increase in its capacity and bandwidth. In contrast, its random access latency has remained relatively stagnant, as it is still around 100 CPU clock cycles. Modern computer systems rely on caches or other latency tolerance techniques to lower the average access latency. However, not all applications have ample parallelism or locality that would help hide or reduce the latency. Moreover, applications' demands for memory space continue to grow, while the capacity gap between last-level caches and main memory is unlikely to shrink. Consequently, reducing the main-memory latency is important for application performance. Unfortunately, previous proposals have not adequately addressed this problem, as they have focused only on improving the bandwidth and capacity or reduced the latency at the cost of significant area overhead. We propose asymmetric DRAM bank organizations to reduce the average main-memory access latency. We first analyze the access and cycle times of a modern DRAM device to identify key delay components for latency reduction. Then we reorganize a subset of DRAM banks to reduce their access and cycle times by half with low area overhead. By synergistically combining these reorganized DRAM banks with support for non-uniform bank accesses, we introduce a novel DRAM bank organization with center high-aspect-ratio mats called CHARM. Experiments on a simulated chip-multiprocessor system show that CHARM improves both the instructions per cycle and system-wide energy-delay product up to 21% and 32%, respectively, with only a 3% increase in die area.

Posted Content
TL;DR: In this paper, an analytical study of the latency performance of redundant requests is presented, with the primary goals of characterizing under what scenarios sending redundant requests will help and under what scenario they will not help.
Abstract: Several systems possess the flexibility to serve requests in more than one way. For instance, a distributed storage system storing multiple replicas of the data can serve a request from any of the multiple servers that store the requested data, or a computational task may be performed in a compute-cluster by any one of multiple processors. In such systems, the latency of serving the requests may potentially be reduced by sending "redundant requests": a request may be sent to more servers than needed, and it is deemed served when the requisite number of servers complete service. Such a mechanism trades off the possibility of faster execution of at least one copy of the request with the increase in the delay due to an increased load on the system. Due to this tradeoff, it is unclear when redundant requests may actually help. Several recent works empirically evaluate the latency performance of redundant requests in diverse settings. This work aims at an analytical study of the latency performance of redundant requests, with the primary goals of characterizing under what scenarios sending redundant requests will help (and under what scenarios they will not help), as well as designing optimal redundant-requesting policies. We first present a model that captures the key features of such systems. We show that when service times are i.i.d. memoryless or "heavier", and when the additional copies of already-completed jobs can be removed instantly, redundant requests reduce the average latency. On the other hand, when service times are "lighter" or when service times are memoryless and removal of jobs is not instantaneous, then not having any redundancy in the requests is optimal under high loads. Our results hold for arbitrary arrival processes.
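
The memoryless finding can be sanity-checked with a toy Monte Carlo under the strong simplifying assumption of idle servers, i.e., ignoring the load-dependent queueing effects that produce the paper's opposite regime:

```python
import random

def mean_latency(copies, mu=1.0, trials=100_000):
    """Replicate a request to `copies` idle servers with i.i.d. Exp(mu) service
    times; the first completion wins and the other copies are cancelled at no
    cost. Queueing effects are deliberately ignored in this toy model."""
    return sum(min(random.expovariate(mu) for _ in range(copies))
               for _ in range(trials)) / trials

random.seed(3)
for k in (1, 2, 4):
    print(f"{k} copies: mean latency ~ {mean_latency(k):.3f} (theory {1.0 / k:.3f})")
```

With Exp(1) service, the minimum of k copies is Exp(k), so the mean drops to 1/k, matching the printout.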

Proceedings ArticleDOI
23 Feb 2013
TL;DR: This work proposes an on-chip network called SMART (Single-cycle Multi-hop Asynchronous Repeated Traversal) that aims to present a single-cycle data-path all the way from the source to the destination.
Abstract: As the number of on-chip cores increases, scalable on-chip topologies such as meshes inevitably add multiple hops in each network traversal. The best we can do right now is to design 1-cycle routers, such that the low-load network latency between a source and destination is equal to the number of routers + links (i.e. hops×2) between them. OS/compiler and cache coherence protocol designers often try to limit communication to within a few hops, since on-chip latency is critical for their scalability. In this work, we propose an on-chip network called SMART (Single-cycle Multi-hop Asynchronous Repeated Traversal) that aims to present a single-cycle data-path all the way from the source to the destination. We do not add any additional fast physical express links in the data-path; instead we drive the shared crossbars and links asynchronously up to multiple-hops within a single cycle. We design a router + link microarchitecture to achieve such a traversal, and a flow-control technique to arbitrate and setup multi-hop paths within a cycle. A place-and-routed design at 45nm achieves 11 hops within a 1GHz cycle for paths without turns (9 for paths with turns). We observe 5-8X reduction in low-load latencies across synthetic traffic patterns on an 8×8 CMP, compared to a baseline 1-cycle router. Full-system simulations with SPLASH-2 and PARSEC benchmarks demonstrate 27/52% and 20/59% reduction in runtime and EDP for Private/Shared L2 designs.

Journal ArticleDOI
TL;DR: This review discusses how latent infection can be established following infection of an activated CD4 T-cell that undergoes a transition to a resting memory state and also how direct infection of a resting CD4T-cell can lead to latency.
Abstract: Latently infected cells represent the major barrier to either a sterilizing or a functional HIV-1 cure. Multiple approaches to reactivation and depletion of the latent reservoir have been attempted clinically, but full depletion of this compartment remains a long-term goal. Compared to the mechanisms involved in the maintenance of HIV-1 latency and the pathways leading to viral reactivation, less is known about the establishment of latent infection. This review focuses on how HIV-1 latency is established at the cellular and molecular levels. We first discuss how latent infection can be established following infection of an activated CD4 T-cell that undergoes a transition to a resting memory state and also how direct infection of a resting CD4 T-cell can lead to latency. Various animal, primary cell, and cell line models also provide insights into this process and are discussed with respect to the routes of infection that result in latency. A number of molecular mechanisms that are active at both transcriptional and post-transcriptional levels have been associated with HIV-1 latency. Many, but not all of these, help to drive the establishment of latent infection, and we review the evidence in favor of or against each mechanism specifically with regard to the establishment of latency. We also discuss the role of immediate silent integration of viral DNA versus silencing of initially active infections. Finally, we discuss potential approaches aimed at limiting the establishment of latent infection.

Proceedings ArticleDOI
01 Oct 2013
TL;DR: A mechanism to measure link latencies from an OpenFlow controller with high accuracy and a low footprint is proposed, implemented, and evaluated.

Abstract: Software Defined Networking, especially through protocols like OpenFlow, is becoming more and more present in networks. It aims at separating the data plane from the control plane for more network programmability, serviceability, heterogeneity and maintainability. Even though mobile applications and multimedia are often cited to show the demise of current network architectures, there is currently no way to efficiently and dynamically obtain the latency in an OpenFlow network in order to apply QoS policies. In this paper, we propose a mechanism to measure link latencies from an OpenFlow controller with high accuracy and a low footprint. We implemented it and present the performance evaluation. A monitoring packet consumes only 24 Bytes, which is 81% less than the ping utility, for an average accuracy of 99.25% compared to the ping values.
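
The arithmetic behind such controller-driven measurements can be illustrated as follows: time a probe that is pushed out of one switch, crosses the link, and comes back from the neighbouring switch, then subtract the control-channel legs estimated from echo round trips. The numbers and the symmetric-channel assumption below are hypothetical; this is a simplification of the mechanism in the paper, not its implementation.

```python
def link_latency(probe_rtt, echo_rtt_a, echo_rtt_b):
    """Estimate the latency of link A -> B from controller-side timings (s):
      probe_rtt  - packet-out at switch A until the matching packet-in from B
      echo_rtt_a - control-channel echo round trip to switch A
      echo_rtt_b - control-channel echo round trip to switch B
    Assumes symmetric control-channel delays, so half of each echo RTT is
    subtracted."""
    return probe_rtt - echo_rtt_a / 2 - echo_rtt_b / 2

# Hypothetical numbers: 9 ms probe round trip, 4 ms and 6 ms control echoes.
print(f"estimated link latency: {link_latency(0.009, 0.004, 0.006) * 1000:.1f} ms")
```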

Journal ArticleDOI
TL;DR: Experimental results show that the proposed analytical model can predict the average packet latency more than four orders of magnitude faster than an accurate simulation, while the computation error is less than 10% in non-saturated networks for different system-on-chip platforms.
Abstract: We propose an analytical model based on queueing theory for delay analysis in a wormhole-switched network-on-chip (NoC). The proposed model takes as input an application communication graph, a topology graph, a mapping vector, and a routing matrix, and estimates average packet latency and router blocking time. It works for arbitrary network topology with deterministic routing under arbitrary traffic patterns. This model can estimate per-flow average latency accurately and quickly, thus enabling fast design space exploration of various design parameters in NoC designs. Experimental results show that the proposed analytical model can predict the average packet latency more than four orders of magnitude faster than an accurate simulation, while the computation error is less than 10% in non-saturated networks for different system-on-chip platforms.
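
To give a flavour of such queueing-theoretic models, the sketch below estimates a flow's latency by treating every link on its route as an M/M/1 queue and adding a fixed per-hop traversal delay. The topology, loads, and service rate are invented, and the paper's actual model analyzes wormhole switching and router blocking in far more detail.

```python
def flow_latency(route, link_load, service_rate=1.0, hop_delay=1.0):
    """Rough per-flow latency estimate: treat each link on the route as an
    M/M/1 queue (waiting time lambda / (mu * (mu - lambda))) and add a fixed
    router/link traversal delay per hop. Loads are in packets per cycle."""
    total = 0.0
    for link in route:
        lam = link_load[link]
        if lam >= service_rate:
            return float("inf")   # saturated link: the model no longer applies
        waiting = lam / (service_rate * (service_rate - lam))
        total += hop_delay + 1.0 / service_rate + waiting
    return total

# Hypothetical 3-hop route with per-link offered loads.
loads = {("A", "B"): 0.3, ("B", "C"): 0.5, ("C", "D"): 0.2}
route = [("A", "B"), ("B", "C"), ("C", "D")]
print(f"estimated latency: {flow_latency(route, loads):.2f} cycles")
```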

Journal ArticleDOI
TL;DR: In this article, asynchronous WSN MAC protocols are reviewed from a delay-efficiency perspective, and their latency is analyzed with respect to their duty-cycling schemes.

Abstract: Energy-efficiency is the main concern in most Wireless Sensor Network (WSN) applications. For this purpose, current WSN MAC (Medium Access Control) protocols use duty-cycling schemes, where they consciously switch a node's radio between active and sleep modes. However, a node needs to be aware of (or at least use some mechanism to meet) its neighbors' sleep/active schedules, since messages cannot be exchanged unless both the transmitter and the receiver are awake. Asynchronous duty-cycling schemes have the advantage over synchronous ones of eliminating the need for clock synchronization, and of being conceptually distributed and more dynamic. However, the communicating nodes are prone to spend more time waiting for the active period of each other, which inevitably influences the one-hop delay, and consequently the cumulative end-to-end delay. This paper reviews current asynchronous WSN MAC protocols. Its main contribution is to study these protocols from the delay-efficiency perspective and to investigate their latency. The asynchronous protocols are divided into six categories: static wake-up preamble, adaptive wake-up preamble, collaborative schedule setting, collisions resolution, receiver-initiated, and anticipation-based. Several state-of-the-art protocols are described following the proposed taxonomy, with comprehensive discussions and comparisons with respect to their latency.

Journal ArticleDOI
TL;DR: The silent/inducible phenotype appears to be associated with chromosomal position, but the molecular basis is not fully clarified and may differ among in vitro models of latency.
Abstract: Background: HIV infection can be treated effectively with antiretroviral agents, but the persistence of a latent reservoir of integrated proviruses prevents eradication of HIV from infected individuals. The chromosomal environment of integrated proviruses has been proposed to influence HIV latency, but the determinants of transcriptional repression have not been fully clarified, and it is unclear whether the same molecular mechanisms drive latency in different cell culture models. Results: Here we compare data from five different in vitro models of latency based on primary human T cells or a T cell line. Cells were infected in vitro and separated into fractions containing proviruses that were either expressed or silent/inducible, and integration site populations sequenced from each. We compared the locations of 6,252 expressed proviruses to those of 6,184 silent/inducible proviruses with respect to 140 forms of genomic annotation, many analyzed over chromosomal intervals of multiple lengths. A regularized logistic regression model linking proviral expression status to genomic features revealed no predictors of latency that performed better than chance, though several genomic features were significantly associated with proviral expression in individual models. Proviruses in the same chromosomal region did tend to share the same expressed or silent/inducible status if they were from the same cell culture model, but not if they were from different models.

Proceedings Article
R. Berner, Christian Brandli, Minhao Yang, Shih-Chii Liu, Tobi Delbruck
12 Jun 2013
TL;DR: A 0.18um CMOS vision sensor that combines event-driven asynchronous readout of temporal contrast with synchronous frame-based active pixel sensor read out of intensity is proposed, suitable for mobile applications because it allows low latency at low data rate and therefore, low system-level power consumption.
Abstract: This paper proposes a 0.18um CMOS vision sensor that combines event-driven asynchronous readout of temporal contrast with synchronous frame-based active pixel sensor readout of intensity. The sensor is suitable for mobile applications because it allows low latency at low data rate and therefore, low system-level power consumption. The image frames can be used for scene analysis and the temporal contrast events with 12us latency can be used to track fast moving objects.

Proceedings ArticleDOI
01 Oct 2013
TL;DR: This work shows that when service times are i.i.d. memoryless or “heavy”, and when the additional copies of already-completed jobs can be removed with negligible costs, redundant requests reduce the average latency.
Abstract: Several systems possess the flexibility to serve requests in more than one way. For instance, a distributed storage system storing multiple replicas of the data can serve a request from any of the multiple servers that store the requested data, or a computational task may be performed in a compute-cluster by any one of multiple processors. In such systems, the latency of serving the requests may potentially be reduced by sending redundant requests: a request may be sent to an excess number of servers, and it is deemed served when the requisite number of servers complete service. Such a mechanism trades off the possibility of faster execution of at least one copy of the request with the increase in the delay due to an increased load on the system. Due to this tradeoff, it is unclear when redundant requests may actually help. Several recent works empirically evaluate the latency performance of redundant requests in diverse settings. This work aims at a rigorous analytical study of the latency performance of redundant requests, with the primary goals of characterizing the situations when sending redundant requests will help (and when not), and designing optimal redundant-requesting policies. We first present a model that captures the key features of such systems. We show that when service times are i.i.d. memoryless or “heavy”, and when the additional copies of already-completed jobs can be removed with negligible costs, redundant requests reduce the average latency. On the other hand, when service times are “light” or when service times are memoryless and removal of jobs results in a non-negligible penalty, not having any redundancy in the request is optimal under high loads. Our results hold for arbitrary arrival processes.

Journal ArticleDOI
TL;DR: This paper presents several simple design techniques that can reduce the latency penalty caused by soft-decision ECCs, and suggests that the latency can be reduced by up to 85.3%.

Abstract: With the aggressive technology scaling and use of multi-bit per cell storage, NAND flash memory is subject to continuous degradation of raw storage reliability and demands more and more powerful error correction codes (ECC). This inevitable trend makes conventional BCH codes increasingly inadequate, and iterative coding solutions such as LDPC codes become very natural alternative options. However, these powerful coding solutions demand soft-decision memory sensing, which results in longer on-chip memory sensing latency and memory-to-controller data transfer latency. Leveraging well-established lossless data compression theories, this paper presents several simple design techniques that can reduce the latency penalty caused by soft-decision ECCs. Their effectiveness has been well demonstrated through extensive simulations, and the results suggest that the latency can be reduced by up to 85.3%.

Patent
02 Jul 2013
TL;DR: In this article, the average latency is calculated using an electronic device that receives data packets moving across a network data point and compares their time of arrival with a timestamp stored within each data packet.

Abstract: Systems and methods are disclosed for accurately calculating the latency of a data network, by providing an electronic device that receives data packets moving across a network data point and compares their time of arrival with a timestamp stored within a data packet. The electronic device may calculate the average latency by comparing N data packets. Further systems and methods are disclosed for comparing the latencies at N electronic devices placed at unique network data points and calculating the latencies between each device.
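
The underlying computation is straightforward: subtract each packet's embedded timestamp from its observed arrival time and average over N packets. The sketch below assumes the sender and the observing device share a synchronized clock and uses hypothetical capture data; it illustrates the calculation, not the patented apparatus.

```python
def average_one_way_latency(packets):
    """Average one-way latency over N observed packets, each carrying the
    sender's timestamp: compare arrival time against the embedded timestamp.
    Assumes the sender and the observing device share a synchronized clock."""
    return sum(arrival - sent for sent, arrival in packets) / len(packets)

# Hypothetical capture: (timestamp carried in packet, time of arrival) in seconds.
capture = [(10.000, 10.012), (10.050, 10.061), (10.100, 10.115)]
print(f"average latency: {average_one_way_latency(capture) * 1000:.1f} ms")
```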

Proceedings ArticleDOI
23 Oct 2013
TL;DR: For accurate delay measurement, this paper proposes to replace the ping tool with an adaptation of paris-traceroute which supports delay and jitter estimation, without being biased by per-flow network load balancing.
Abstract: Monitoring Internet performance and measuring user quality of experience are drawing increased attention from both research and industry. To match this interest, large-scale measurement infrastructures have been constructed. We believe that this effort must be combined with a critical review and calibration of the tools being used to measure performance. In this paper, we analyze the suitability of ping for delay measurement. By performing several experiments on different source and destination pairs, we found cases in which ping gave very poor estimates of delay and jitter as they might be experienced by an application. In those cases, delay was heavily dependent on the flow identifier, even if only one IP path was used. For accurate delay measurement we propose to replace the ping tool with an adaptation of paris-traceroute which supports delay and jitter estimation, without being biased by per-flow network load balancing.

Patent
07 Aug 2013
TL;DR: In this paper, the authors present a solution for Network on Chip (NoC) interconnects that automatically and dynamically determines the position of various hosts in a NoC topology based on the connectivity, bandwidth and latency requirements of the system traffic flows and certain performance optimization metrics such as system interconnect latency and interconnect cost.
Abstract: Systems and methods described herein are directed to solutions for Network on Chip (NoC) interconnects that automatically and dynamically determines the position of various hosts in a NoC topology based on the connectivity, bandwidth and latency requirements of the system traffic flows and certain performance optimization metrics such as system interconnect latency and interconnect cost. The example implementations selects hosts for relocation consideration and determines a new possible position for them in the NoC based on the system traffic specification, and using probabilistic functions to decide if the relocation is carried out or not. The procedure is repeated over new sets of hosts until certain optimization targets are satisfied or repetition count is exceeded.
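
The patent describes the accept/reject decision only as a probabilistic function. A common concrete choice for this kind of iterative placement search is a Metropolis-style (simulated-annealing) acceptance rule, sketched below with invented cost values; this specific rule is an illustrative assumption, not the claimed formula.

```python
import math
import random

def accept_relocation(cost_before, cost_after, temperature):
    """Metropolis-style acceptance: always keep a move that lowers the
    interconnect cost metric; otherwise keep it with a probability that
    decays with how much worse it is."""
    if cost_after <= cost_before:
        return True
    return random.random() < math.exp(-(cost_after - cost_before) / temperature)

random.seed(4)
# Hypothetical cost values (e.g., weighted latency x bandwidth over all flows).
print(accept_relocation(120.0, 110.0, temperature=5.0))  # improvement: always kept
print(accept_relocation(120.0, 126.0, temperature=5.0))  # worse: kept only sometimes
```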

Proceedings ArticleDOI
01 Oct 2013
TL;DR: This work introduces a new host-centric solution for improving latency in virtualized cloud environments by extending a classic scheduling principle---Shortest Remaining Time First---from the virtualization layer, through the host network stack, to the network switches.
Abstract: Public clouds have become a popular platform for building Internet-scale applications. Using virtualization, public cloud services grant customers full control of guest operating systems and applications, while service providers still retain the management of their host infrastructure. Because applications built with public clouds are often highly sensitive to response time, infrastructure builders strive to reduce the latency of their data center's internal network. However, most existing solutions require modification to the software stack controlled by guests. We introduce a new host-centric solution for improving latency in virtualized cloud environments. In this approach, we extend a classic scheduling principle---Shortest Remaining Time First---from the virtualization layer, through the host network stack, to the network switches. Experimental and simulation results show that our solution can reduce median latency of small flows by 40%, with improvements in the tail of almost 90%, while reducing throughput of large flows by less than 3%.
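
The scheduling principle itself is easy to state in code: among the pending flows, serve the one with the least data remaining. The toy sketch below shows only that ordering, with hypothetical flow names and sizes; the paper's contribution is propagating this priority from the hypervisor through the host network stack to the switches.

```python
def srtf_order(flows):
    """Shortest Remaining Time First applied to flows: transmit the flow with
    the fewest remaining bytes next."""
    return sorted(flows, key=lambda f: f["remaining_bytes"])

pending = [
    {"name": "bulk-backup", "remaining_bytes": 50_000_000},
    {"name": "rpc-reply", "remaining_bytes": 2_000},
    {"name": "page-load", "remaining_bytes": 120_000},
]
for flow in srtf_order(pending):
    print(flow["name"], flow["remaining_bytes"])
```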

Journal ArticleDOI
TL;DR: Present findings indicate that although there is a loss of the relationship between white matter structure and auditory cortex function in autism spectrum disorders, and although auditory responses are delayed in individuals with autism compared with age-matched controls, M50 latency nevertheless decreases as a function of age in autism, parallel to the observation in typically developing controls.

Proceedings Article
26 Jun 2013
TL;DR: This work presents a low latency network interface design suitable for request-response based applications and investigates latency-power tradeoffs between using interrupts and polling, as well as the effects of processor's low power states.
Abstract: Ethernet network interfaces in commodity systems are designed with a focus on achieving high bandwidth at low CPU utilization, while often sacrificing latency. This approach is viable only if the high interface latency is still overwhelmingly dominated by software request processing times. However, recent efforts to lower software latency in request-response based systems, such as memcached and RAMCloud, have promoted the network interface to a significant contributor to the overall latency. We present a low latency network interface design suitable for request-response based applications. Evaluation on a prototype FPGA implementation has demonstrated that our design achieves more than a twofold latency improvement without a meaningful negative impact on either bandwidth or CPU power. We also investigate latency-power tradeoffs between using interrupts and polling, as well as the effects of the processor's low-power states.

01 Jan 2013
TL;DR: The main conclusions are: (i) LITMUS introduces only minor overhead itself, but (ii) it also inherits mainline Linux’s severe limitations in the presence of I/O-bound background tasks.
Abstract: Scheduling latency under Linux and its principal real-time variant, the PREEMPT RT patch, is typically measured using cyclictest, a tracing tool that treats the kernel as a black box and directly reports scheduling latency. LITMUS, a real-time extension of Linux focused on algorithmic improvements, is typically evaluated using Feather-Trace, a fine-grained tracing mechanism that produces a comprehensive overhead profile suitable for overhead-aware schedulability analysis. This difference in tracing tools and output has to date prevented a direct comparison. This paper reports on a port of cyclictest to LITMUS and a case study comparing scheduling latency on a 16-core Intel platform. The main conclusions are: (i) LITMUS introduces only minor overhead itself, but (ii) it also inherits mainline Linux's severe limitations in the presence of I/O-bound background tasks.
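
cyclictest's black-box method can be mimicked in a few lines: repeatedly sleep for a fixed interval and record how late each wake-up actually was. The Python sketch below only illustrates the measurement idea; the real tool is a C program using high-resolution timers and reports far lower latencies than an interpreter can.

```python
import time

def worst_wakeup_latency_us(interval_us=1000, loops=1000):
    """Sleep for a fixed interval and record how late each wake-up is,
    keeping the worst case, as cyclictest does. Python's own overhead
    dominates here, so the numbers only illustrate the method."""
    worst = 0.0
    for _ in range(loops):
        start = time.monotonic()
        time.sleep(interval_us / 1e6)
        lateness = (time.monotonic() - start) * 1e6 - interval_us
        worst = max(worst, lateness)
    return worst

print(f"worst observed wake-up latency: {worst_wakeup_latency_us():.0f} us")
```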

Proceedings ArticleDOI
28 Mar 2013
TL;DR: A Non-Linear SVM (NLSVM)-based seizure detection SoC is presented which ensures >95% detection accuracy, <1% false alarm and <2s latency.

Abstract: To treat seizure-affected patients, SoCs [1-3] have been developed 1) to detect the electrical onset of a seizure seconds before the clinical onset, and 2) to combine the SoC with neurostimulation. In particular, a short detection delay is essential. This paper presents a Non-Linear SVM (NLSVM)-based seizure detection SoC which ensures >95% detection accuracy, <1% false alarm and <2s latency.

Proceedings Article
27 May 2013
TL;DR: It is shown that UDP and TCP have different effects depending on the available bandwidth on the control link, while latency drives the overall behavior of the network, that is the time to reach its full capacity.
Abstract: In the OpenFlow framework, packet forwarding (data plane) and routing decisions (control plane) run on different devices. OpenFlow switches are in charge of packet forwarding, whereas a Controller, which can be situated very far from a networking point of view from the switches its manages, sets up switch forwarding tables on a per-flow basis. The connection between a switch and its Controller is thus of primary importance for the performances of the network. In this paper, we study the impact of the latency between an OpenFlow switch and its Controller. We show that UDP and TCP have different effects depending on the available bandwidth on the control link. Bandwidth arbitrates how many flows the Controller can process, as well as the loss rate if the system is under heavy load, while latency drives the overall behavior of the network, that is the time to reach its full capacity. Finally, we propose solutions to mitigate the phenomenons we outline.