
Showing papers on "Latency (engineering) published in 2013"


Proceedings Article
02 Apr 2013
TL;DR: The evaluation shows that the Eiger system achieves low latency, has throughput competitive with eventually-consistent and non-transactional Cassandra, and scales out to large clusters almost linearly (averaging 96% increases up to 128 server clusters).
Abstract: We present the first scalable, geo-replicated storage system that guarantees low latency, offers a rich data model, and provides "stronger" semantics. Namely, all client requests are satisfied in the local datacenter in which they arise; the system efficiently supports useful data model abstractions such as column families and counter columns; and clients can access data in a causally-consistent fashion with read-only and write-only transactional support, even for keys spread across many servers. The primary contributions of this work are enabling scalable causal consistency for the complex column-family data model, as well as novel, non-blocking algorithms for both read-only and write-only transactions. Our evaluation shows that our system, Eiger, achieves low latency (single-ms), has throughput competitive with eventually-consistent and non-transactional Cassandra (less than 7% overhead for one of Facebook's real-world workloads), and scales out to large clusters almost linearly (averaging 96% increases up to 128 server clusters).

284 citations


Proceedings ArticleDOI
15 Apr 2013
TL;DR: This work advocates a powerful new abstraction called resilient substitution that caters to the specific needs in this new computation model to handle failure recovery and dynamic reconfiguration in response to load changes.
Abstract: TimeStream is a distributed system designed specifically for low-latency continuous processing of big streaming data on a large cluster of commodity machines. The unique characteristics of this emerging application domain have led to a significantly different design from the popular MapReduce-style batch data processing. In particular, we advocate a powerful new abstraction called resilient substitution that caters to the specific needs in this new computation model to handle failure recovery and dynamic reconfiguration in response to load changes. Several real-world applications running on our prototype have been shown to scale robustly with low latency while at the same time maintaining the simple and concise declarative programming model. TimeStream handles an on-line advertising aggregation pipeline at a rate of 700,000 URLs per second with a 2-second delay, while performing sentiment analysis of Twitter data at a peak rate close to 10,000 tweets per second, with approximately 2-second delay.

262 citations


Posted Content
TL;DR: In this paper, the authors argue that the use of redundancy is an effective way to convert extra capacity into reduced latency, and demonstrate empirically that replicating all operations can result in significant mean and tail latency reduction in real-world systems including DNS queries, database servers, and packet forwarding within networks.
Abstract: Low latency is critical for interactive networked applications. But while we know how to scale systems to increase capacity, reducing latency --- especially the tail of the latency distribution --- can be much more difficult. In this paper, we argue that the use of redundancy is an effective way to convert extra capacity into reduced latency. By initiating redundant operations across diverse resources and using the first result which completes, redundancy improves a system's latency even under exceptional conditions. We study the tradeoff with added system utilization, characterizing the situations in which replicating all tasks reduces mean latency. We then demonstrate empirically that replicating all operations can result in significant mean and tail latency reduction in real-world systems including DNS queries, database servers, and packet forwarding within networks.

184 citations
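The paper's core claim can be illustrated with a toy model: replicate every operation, keep the first completion, and the tail's contribution to mean latency collapses. The latency distribution, outlier probability, and trial counts below are invented for illustration only:

```python
import random

def request_latency(rng):
    """Simulated per-request latency: usually 1 unit, occasionally a 50-unit outlier."""
    return 1.0 if rng.random() < 0.95 else 50.0

def latency_with_redundancy(rng, copies):
    """Issue `copies` redundant requests across diverse resources and
    use the first result that completes."""
    return min(request_latency(rng) for _ in range(copies))

def mean_latency(copies, trials=100_000, seed=1):
    rng = random.Random(seed)
    return sum(latency_with_redundancy(rng, copies) for _ in range(trials)) / trials
```

With duplication, both copies must hit the outlier (probability 0.05² = 0.0025) for the request to be slow, so the mean drops from about 3.45 to about 1.12 units, at the cost of doubled utilization — exactly the tradeoff the paper characterizes.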


Proceedings ArticleDOI
09 Dec 2013
TL;DR: Digit-Reversal Bouncing achieves perfect packet interleaving and results in smaller and bounded queues even when traffic load approaches 100%, and it uses smaller re-sequencing buffer for absorbing out-of-order packet arrivals.
Abstract: Clos-based networks including Fat-tree and VL2 are being built in data centers, but existing per-flow based routing causes low network utilization and long latency tail. In this paper, by studying the structural properties of Fat-tree and VL2, we propose a per-packet round-robin based routing algorithm called Digit-Reversal Bouncing (DRB). DRB achieves perfect packet interleaving. Our analysis and simulations show that, compared with random-based load-balancing algorithms, DRB results in smaller and bounded queues even when traffic load approaches 100%, and it uses smaller re-sequencing buffer for absorbing out-of-order packet arrivals. Our implementation demonstrates that our design can be readily implemented with commodity switches. Experiments on our testbed, a Fat-tree with 54 servers, confirm our analysis and simulations, and further show that our design handles network failures in 1-2 seconds and has the desirable graceful performance degradation property.

159 citations
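The digit-reversal permutation at the heart of DRB fits in a few lines. This sketch shows only the permutation itself (inferred from the algorithm's name and the structure of Clos topologies), not the full per-destination bouncing and re-sequencing logic:

```python
import math

def digit_reverse(k, radix, digits):
    """Reverse the base-`radix` digits of k, e.g. 6 = 110 (base 2) -> 011 = 3."""
    out = 0
    for _ in range(digits):
        out = out * radix + k % radix
        k //= radix
    return out

def drb_core_order(num_cores, radix):
    """Order in which one source spreads successive packets over core switches:
    the digit-reversal permutation visits maximally separated cores back to back."""
    digits = round(math.log(num_cores, radix))
    return [digit_reverse(k, radix, digits) for k in range(num_cores)]
```

For example, `drb_core_order(8, 2)` gives `[0, 4, 2, 6, 1, 5, 3, 7]`: consecutive packets never land on neighboring cores, which is the interleaving property behind the small, bounded queues.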


Proceedings ArticleDOI
01 Nov 2013
TL;DR: A method for low-latency pose tracking using a DVS and Active LED Markers (ALMs), LEDs blinking at high frequency (>1 kHz), compared against traditional pose tracking based on a CMOS camera.
Abstract: At the current state of the art, the agility of an autonomous flying robot is limited by its sensing pipeline, because the relatively high latency and low sampling frequency limit the aggressiveness of the control strategies that can be implemented. To obtain more agile robots, we need faster sensing pipelines. A Dynamic Vision Sensor (DVS) is a very different sensor than a normal CMOS camera: rather than providing discrete frames like a CMOS camera, the sensor output is a sequence of asynchronous timestamped events each describing a change in the perceived brightness at a single pixel. The latency of such sensors is measured in microseconds, thus offering the theoretical possibility of creating a sensing pipeline whose latency is negligible compared to the dynamics of the platform. However, to use these sensors we must rethink the way we interpret visual data. This paper presents a method for low-latency pose tracking using a DVS and Active LED Markers (ALMs), which are LEDs blinking at high frequency (>1 kHz). The sensor's time resolution allows distinguishing different frequencies, thus avoiding the need for data association. This approach is compared to traditional pose tracking based on a CMOS camera. The DVS performance is not affected by fast motion, unlike the CMOS camera, which suffers from motion blur.

93 citations
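The data-association-free idea can be sketched minimally: estimate a blink frequency from event timestamps at one image location and match it to the nearest known marker frequency. This assumes, for simplicity, one DVS event per blink period; the real pipeline also handles event polarity and spatial clustering:

```python
def classify_marker(event_times_us, marker_freqs_hz):
    """Estimate a blink frequency from event timestamps (in microseconds) at one
    image location, then return the nearest known Active LED Marker frequency."""
    intervals = [b - a for a, b in zip(event_times_us, event_times_us[1:])]
    mean_period_us = sum(intervals) / len(intervals)
    estimated_hz = 1e6 / mean_period_us
    return min(marker_freqs_hz, key=lambda f: abs(f - estimated_hz))
```

Events arriving roughly every 500 µs match a 2 kHz marker even with timing jitter, so each marker identifies itself by frequency and no data-association step is needed.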


Proceedings Article
01 Jan 2013
TL;DR: Zoolander is presented, a key value store that meets strict, low latency service level objectives (SLOs), and scales out using replication for predictability, an old but seldom-used approach that uses redundant accesses to mask outlier response times.
Abstract: Internet services access networked storage many times while processing a request. Just a few slow storage accesses per request can raise response times a lot, making the whole service less usable and hurting profits. This paper presents Zoolander, a key value store that meets strict, low latency service level objectives (SLOs). Zoolander scales out using replication for predictability, an old but seldom-used approach that uses redundant accesses to mask outlier response times. Zoolander also scales out using traditional replication and partitioning. It uses an analytic model to efficiently combine these competing approaches based on systems data and workload conditions. For example, when workloads under utilize system resources, Zoolander’s model often suggests replication for predictability, strengthening service levels by reducing outlier response times. When workloads use system resources heavily, causing large queuing delays, Zoolander’s model suggests scaling out via traditional approaches. We used a diurnal trace to test Zoolander at scale (up to 40M accesses per hour). Zoolander reduced SLO violations by 32%.

73 citations


Proceedings ArticleDOI
27 Aug 2013
TL;DR: Aqua, a high-bandwidth anonymity system that resists traffic analysis, is presented, and it is shown that Aqua achieves latency low enough for efficient bulk TCP flows, bandwidth sufficient to carry BitTorrent traffic with reasonable efficiency, and resistance to traffic analysis within anonymity sets of hundreds of clients.
Abstract: Existing IP anonymity systems tend to sacrifice one of low latency, high bandwidth, or resistance to traffic-analysis. High-latency mix-nets like Mixminion batch messages to resist traffic-analysis at the expense of low latency. Onion routing schemes like Tor deliver low latency and high bandwidth, but are not designed to withstand traffic analysis. Designs based on DC-nets or broadcast channels resist traffic analysis and provide low latency, but are limited to low bandwidth communication. In this paper, we present the design, implementation, and evaluation of Aqua, a high-bandwidth anonymity system that resists traffic analysis. We focus on providing strong anonymity for BitTorrent, and evaluate the performance of Aqua using traces from hundreds of thousands of actual BitTorrent users. We show that Aqua achieves latency low enough for efficient bulk TCP flows, bandwidth sufficient to carry BitTorrent traffic with reasonable efficiency, and resistance to traffic analysis within anonymity sets of hundreds of clients. We conclude that Aqua represents an interesting new point in the space of anonymity network designs.

72 citations


01 Jan 2013
TL;DR: In this article, the first low-latency search for gravitational-waves from binary inspirals in LIGO and Virgo data was conducted, and the resulting triggers were sent to electromagnetic observatories for followup.
Abstract: Aims: The detection and measurement of gravitational-waves from coalescing neutron-star binary systems is an important science goal for ground-based gravitational-wave detectors. In addition to emitting gravitational-waves at frequencies that span the most sensitive bands of the LIGO and Virgo detectors, these sources are also amongst the most likely to produce an electromagnetic counterpart to the gravitational-wave emission. A joint detection of the gravitational-wave and electromagnetic signals would provide a powerful new probe for astronomy. Methods: During the period between September 19 and October 20, 2010, the first low-latency search for gravitational-waves from binary inspirals in LIGO and Virgo data was conducted. The resulting triggers were sent to electromagnetic observatories for followup. We describe the generation and processing of the low-latency gravitational-wave triggers. The results of the electromagnetic image analysis will be described elsewhere. Results: Over the course of the science run, three gravitational-wave triggers passed all of the low-latency selection cuts. Of these, one was followed up by several of our observational partners. Analysis of the gravitational-wave data leads to an estimated false alarm rate of once every 6.4 days, falling far short of the requirement for a detection based solely on gravitational-wave data.

64 citations


Patent
04 Oct 2013
TL;DR: In this article, a system for processing user input includes an input device, an input processing unit, a high latency subsystem, a low-latency subsystem, and an output device.
Abstract: A system for processing user input includes an input device, an input processing unit, a high-latency subsystem, a low-latency subsystem, input processing unit software for generating signals in response to user inputs, and an output device. The low-latency subsystem receives the signals and generates low-latency output, and the high-latency subsystem processes the signals and generates high-latency output.

58 citations


Book ChapterDOI
08 Apr 2013
TL;DR: LOLA (LOw LAtency audio visual streaming system), a system for distributed performing arts interaction over advanced packet networks, demonstrated its effectiveness and suitability for distance musical interaction, even when professional players are involved and very "tempo sensitive" classical baroque music repertoire is concerned.
Abstract: We present LOLA (LOw LAtency audio visual streaming system), a system for distributed performing arts interaction over advanced packet networks. It is intended to operate on high performance networking infrastructures, and is based on low latency audio/video acquisition hardware and on the integration and optimization of audio/video data acquisition, presentation and transmission. The extremely low round trip delay of the transmitted data makes the system suitable for remote musical education, real time distributed musical performance and performing arts activities, but in general also for any human-human interactive distributed activity in which timing and responsiveness are critical factors for the quality of the interaction. The experimentation conducted so far with professional music performers and skilled music students, on geographical distances up to 3500 Km, demonstrated its effectiveness and suitability for distance musical interaction, even when professional players are involved and very "tempo sensitive" classical baroque music repertoire is concerned.

48 citations


Proceedings ArticleDOI
21 Apr 2013
TL;DR: This paper proposes centralized elastic bubble router - a router micro-architecture based on the use of centralized buffers with elastic buffered links that enables end-to-end latency reduction via high radix switches with low overall buffer requirements.
Abstract: While router buffers have been used as performance multipliers, they are also major consumers of area and power in on-chip networks. In this paper, we propose centralized elastic bubble router - a router micro-architecture based on the use of centralized buffers (CB) with elastic buffered (EB) links. At low loads, the CB is power gated, bypassed, and optimized to produce single cycle operation. A novel extension to bubble flow control enables routing deadlock and message dependent deadlock to be avoided with the same mechanism having constant buffer size per router independent of the number of message types. This solution enables end-to-end latency reduction via high radix switches with low overall buffer requirements. Comparisons made with other low latency routers across different topologies show consistent performance improvement, for example 26% improvement in no load latency of a 2D Mesh and 4X improvement in saturation throughput in a 2D-Generalized Hypercube.

Journal ArticleDOI
01 Sep 2013
TL;DR: A mechanism which utilizes path diversity so that broadcast messages can be disseminated with a short delay and a high reliability compared with the acknowledgment based retransmission approach and the message overhead is low.
Abstract: In vehicular ad hoc networks, many applications require a low latency and high reliability especially the safety applications. Reliable multi-hop broadcast protocols have been widely discussed recently. However, most of them use explicit acknowledgments and timeout retransmissions to provide reliability. The retransmission method incurs delays when a packet loss cannot be detected on time. Acknowledgment messages also increase the MAC layer contention time at each node. In order to provide a high reliability and low latency with a low overhead, we propose a mechanism which utilizes path diversity. In the proposed mechanism, a message is delivered through two different paths. By cooperation of these different paths, broadcast messages can be disseminated with a short delay and a high reliability compared with the acknowledgment based retransmission approach. Since the proposed mechanism does not use any explicit acknowledgment message, the message overhead is low. We evaluate the proposed mechanism using both theoretical analysis and computer simulations.

Posted Content
TL;DR: A taxonomy to categorize existing work based on four main techniques, reducing queue length, accelerating retransmissions, prioritizing mice flows, and exploiting multi-path is proposed.
Abstract: Datacenters are the cornerstone of the big data infrastructure supporting numerous online services. The demand for interactivity, which significantly impacts user experience and provider revenue, is translated into stringent timing requirements for flows in datacenter networks. Thus low latency networking is becoming a major concern of both industry and academia. We provide a short survey of recent progress made by the networking community for low latency datacenter networks. We propose a taxonomy to categorize existing work based on four main techniques: reducing queue length, accelerating retransmissions, prioritizing mice flows, and exploiting multi-path. Then we review select papers, highlight the principal ideas, and discuss their pros and cons. We also present our perspectives on the research challenges and opportunities, hoping to inspire more future work in this space.

Journal ArticleDOI
TL;DR: NaNet is an FPGA-based PCIe X8 Gen2 NIC supporting 1/10 GbE links and the custom 34 Gbps APElink channel, making it suitable for building low-latency, real-time GPU-based computing systems.
Abstract: NaNet is an FPGA-based PCIe X8 Gen2 NIC supporting 1/10 GbE links and the custom 34 Gbps APElink channel. The design has GPUDirect RDMA capabilities and features a network stack protocol offloading module, making it suitable for building low-latency, real-time GPU-based computing systems. We provide a detailed description of the NaNet hardware modular architecture. Benchmarks for latency and bandwidth for GbE and APElink channels are presented, followed by a performance analysis on the case study of the GPU-based low level trigger for the RICH detector in the NA62 CERN experiment, using either the GbE or the APElink channel. Finally, we give an outline of the project's future activities.

Proceedings ArticleDOI
01 Dec 2013
TL;DR: A memory efficient architecture for single-pass connected components analysis suited for high throughput embedded image processing systems is proposed which achieves a high throughput by partitioning the image into several vertical slices processed in parallel.
Abstract: A memory efficient architecture for single-pass connected components analysis suited for high throughput embedded image processing systems is proposed which achieves a high throughput by partitioning the image into several vertical slices processed in parallel. The low latency of the architecture allows reuse of labels associated with the image objects. This reduces the amount of memory by a factor of more than 5 compared to previous work. This is significant, since memory is a critical resource in embedded image processing on FPGAs.

Journal ArticleDOI
01 May 2013
TL;DR: Expected transmission delay (ETD), a metric that simultaneously considers sleep latency and wireless link quality, is formulated and it is shown that the metric is left-monotonic and left-isotonic, proving that its use in distributed algorithms such as the distributed Bellman-Ford yields consistent, loop-free and optimal paths.
Abstract: In environmentally-powered wireless sensor networks (EPWSNs), low latency wakeup scheduling and packet forwarding is challenging due to dynamic duty cycling, posing time-varying sleep latencies and necessitating the use of dynamic wakeup schedules. We show that the variance of the intervals between receiving wakeup slots affects the expected sleep latency: when the variance of the intervals is low (high), the expected latency is low (high). We therefore propose a novel scheduling scheme that uses the bit-reversal permutation sequence (BRPS) - a finite integer sequence that positions receiving wakeup slots as evenly as possible to reduce the expected sleep latency. At the same time, the sequence serves as a compact representation of wakeup schedules thereby reducing storage and communication overhead. But while low latency wakeup schedule can reduce per-hop delay in ideal conditions, it does not necessarily lead to low latency end-to-end paths because wireless link quality also plays a significant role in the performance of packet forwarding. We therefore formulate expected transmission delay (ETD), a metric that simultaneously considers sleep latency and wireless link quality. We show that the metric is left-monotonic and left-isotonic, proving that its use in distributed algorithms such as the distributed Bellman-Ford yields consistent, loop-free and optimal paths. We perform extensive simulations using real-world energy harvesting traces to evaluate the performance of the scheduling and forwarding scheme.
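The bit-reversal permutation sequence (BRPS) construction can be sketched directly. This toy version (a guess at the construction from the abstract alone) takes the first n entries of the bit-reversal permutation as receiving slots, which spreads them near-evenly and keeps the variance of inter-slot gaps low:

```python
def bit_reverse(i, bits):
    """Reverse the low `bits` bits of i, e.g. 3 = 011 -> 110 = 6 for bits=3."""
    out = 0
    for _ in range(bits):
        out = (out << 1) | (i & 1)
        i >>= 1
    return out

def brps_wakeup_slots(num_active, frame_bits):
    """Choose `num_active` receiving wakeup slots in a frame of 2**frame_bits
    slots: the first entries of the bit-reversal permutation, sorted. Any prefix
    is near-evenly spaced, so the schedule is a compact representation -- only
    the count needs to change as harvested energy varies."""
    return sorted(bit_reverse(i, frame_bits) for i in range(num_active))
```

With an 8-slot frame, 2 active slots land at [0, 4] and 4 at [0, 2, 4, 6]: raising the duty cycle halves every gap instead of clustering slots, which is what keeps the expected sleep latency low.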

Journal ArticleDOI
TL;DR: The evaluation results on up to 55,296 nodes of the K computer show the new implementation of MPI collective communication outperforms the existing one for long messages by a factor of 4 to 11 times and shows the short-message algorithms complement the long-message ones.
Abstract: This paper proposes the design of ultra scalable MPI collective communication for the K computer, which consists of 82,944 computing nodes and is the world's first system over 10 PFLOPS. The nodes are connected by a Tofu interconnect that introduces six dimensional mesh/torus topology. Existing MPI libraries, however, perform poorly on such a direct network system since they assume typical cluster environments. Thus, we design collective algorithms optimized for the K computer. On the design of the algorithms, we place importance on collision-freeness for long messages and low latency for short messages. The long-message algorithms use multiple RDMA network interfaces and consist of neighbor communication in order to gain high bandwidth and avoid message collisions. On the other hand, the short-message algorithms are designed to reduce software overhead, which comes from the number of relaying nodes. The evaluation results on up to 55,296 nodes of the K computer show the new implementation outperforms the existing one for long messages by a factor of 4 to 11 times. It also shows the short-message algorithms complement the long-message ones.

Journal ArticleDOI
TL;DR: This paper proposes a versatile Shack–Hartmann WFS based on an industrial smart camera for high-performance measurements of wavefront deformations, using a low-cost field-programmable gate array as the parallel processing platform.
Abstract: Wavefront sensing is important in various optical measurement systems, particularly in the field of adaptive optics (AO). For AO systems, the sampling rate, as well as the latency time, of the wavefront sensors (WFSs) imposes a restriction on the overall achievable temporal resolution. In this paper, we propose a versatile Shack–Hartmann WFS based on an industrial smart camera for high-performance measurements of wavefront deformations, using a low-cost field-programmable gate array as the parallel processing platform. The proposed wavefront reconstruction adds a processing latency of only 740 ns for calculating wavefront characteristics from the pixel stream of the image sensor, providing great potential for demanding AO system designs.
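The computation a Shack–Hartmann WFS pipeline performs is per-lenslet spot centroiding followed by slope extraction. A plain software sketch of that math (the paper's FPGA does it directly on the pixel stream, which is how the 740 ns processing latency is achieved):

```python
def spot_centroid(subimage):
    """Intensity-weighted centroid (x, y), in pixels, of one lenslet's
    subaperture image, given as a list of rows."""
    total = sx = sy = 0.0
    for y, row in enumerate(subimage):
        for x, v in enumerate(row):
            total += v
            sx += v * x
            sy += v * y
    return sx / total, sy / total

def wavefront_slopes(subimages, references):
    """Local wavefront slope per subaperture: the spot's centroid shift from its
    flat-wavefront reference position (in pixels; scale by lenslet focal length
    and pixel pitch for physical units)."""
    return [(cx - rx, cy - ry)
            for (cx, cy), (rx, ry) in
            zip(map(spot_centroid, subimages), references)]
```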

Patent
29 Apr 2013
TL;DR: In this article, the authors propose variable bandwidth allocations such that smaller frequency sub-bands are allocated to users, as their number increases, but the individual users/nodes insert more data-carrying signals in order to compensate for the loss of operating bandwidth arising from the accommodation of more users.
Abstract: Cost, electronic circuitry limitations, and communication channel behaviour yield communication systems with strict bandwidth constraints. Hence, maximally utilizing available bandwidth is crucial, for example in wireless networks, to supporting ever increasing numbers of users and their demands for increased data volumes, low latency, and high download speeds. Accordingly, it would be beneficial for such networks to support variable bandwidth allocations such that smaller frequency sub-bands are allocated to users, as their number increases, but the individual users/nodes insert more data-carrying signals in order to compensate for the loss of operating bandwidth arising from the accommodation of more users. It would further be beneficial for transmitters and receivers according to embodiments of such a network architecture to be based upon low cost design methodologies allowing their deployment within a wide range of applications including high volume, low cost consumer electronics for example.

Journal ArticleDOI
TL;DR: The Low Latency Fault Tolerance (LLFT) system provides fault tolerance for distributed applications, using the leader-follower replication technique, and achieves low latency message delivery during normal operation and low latency reconfiguration and recovery when a fault occurs.
Abstract: The low latency fault tolerance (LLFT) system provides fault tolerance for distributed applications within a local-area network, using a leader–follower replication strategy. LLFT provides application-transparent replication, with strong replica consistency, for applications that involve multiple interacting processes or threads. Its novel system model enables LLFT to maintain a single consistent infinite computation, despite faults and asynchronous communication. The LLFT messaging protocol provides reliable, totally ordered message delivery by employing a group multicast, where the message ordering is determined by the primary replica in the destination group. The leader-determined membership protocol provides reconfiguration and recovery when a replica becomes faulty and when a replica joins or leaves a group, where the membership of the group is determined by the primary replica. The virtual determinizer framework captures the ordering information at the primary replica and enforces the same ordering of non-deterministic operations at the backup replicas. LLFT does not employ a majority-based, multiple-round consensus algorithm and, thus, it can operate in the common industrial case where there is a primary replica and only one backup replica. The LLFT system achieves low latency message delivery during normal operation and low latency reconfiguration and recovery when a fault occurs.

Journal ArticleDOI
TL;DR: Simulation and analysis results show that the proposed architectures can be considered a viable solution for future NoCs, yielding high scalability, high bandwidth, low latency, and low power consumption.

Patent
23 Jan 2013
TL;DR: In this paper, a low-latency touch-input device receives writing as input to the device and temporarily displays the writing on a physical layer that overlays a touchscreen display of the device.
Abstract: This document describes embodiments of a low-latency touch-input device. The low-latency touch-input device receives writing as input to the device and temporarily displays the writing on a physical layer that overlays a touchscreen display of the device. The writing is displayed instantaneously on the physical layer before the touch-input device processes the input. The low-latency touch-input device then processes the input to generate a digital representation of the writing and renders the digital representation of the writing on the touchscreen display to replace the writing displayed on the physical layer.

Journal ArticleDOI
TL;DR: A hardware image rectification engine, which supports the processing of stereo high-definition serial digital interfaces video streams with up to 1080p30 video with a latency below 1 ms.
Abstract: The emerging market of digital 3-D film productions in HD resolution leads to the need for high-quality equipment in the production chain. The incoming video streams of the two cameras require an image rectification due to unavoidable misalignments within the stereoscopic camera setup. This rectification can either take place in postprocessing of the recorded material or it can be applied in real time during the shooting. Especially in the case of streaming and recording of live events, real-time processing is necessary and, additionally, the system has to provide a very low latency. We present a hardware image rectification engine, which supports the processing of stereo high-definition serial digital interface video streams with up to 1080p30 video at a latency below 1 ms. The image rectification engines for the two channels are implemented on two Altera Stratix III EP3SL340 FPGAs running at 74.25 MHz. They can be controlled by the stereoscopy analysis software, which calculates the parameters required for the image rectification at runtime.

DOI
16 Jun 2013
TL;DR: A CMOS vision sensor that combines event-driven asynchronous readout of temporal contrast with synchronous frame-based active pixel sensor (APS) readout of intensity, allowing low latency at low data rate and low system-level power consumption, is proposed.
Abstract: This paper proposes a CMOS vision sensor that combines event-driven asynchronous readout of temporal contrast with synchronous frame-based active pixel sensor (APS) readout of intensity. The image frames can be used for scene content analysis, and the temporal contrast events can be used to track fast moving objects, to adjust the frame rate, or to guide a region-of-interest readout. The sensor is therefore suitable for mobile applications because it allows low latency at low data rate and low system-level power consumption. Sharing the photodiode for both readout types allows a compact pixel design that is 60% smaller than a comparable design. The 240x180 sensor has a power consumption of 10 mW. It is built in 0.18 μm technology with 18.5 μm pixels. The temporal contrast pathway has a minimum latency of 12 μs, a dynamic range of 120 dB, a 12% contrast detection threshold, and 3.5% contrast matching. The APS readout has 55 dB dynamic range with 1% FPN.

Journal ArticleDOI
TL;DR: This work designs a Two-Phased Service-Oriented Broker (2SOB) for replica selection, showing that Data Grid jobs can be executed faster by reducing transfer time.

Proceedings ArticleDOI
19 May 2013
TL;DR: A fully parallel 64K point radix-4⁴ FFT processor that shows a significant reduction in intermediate memory but with increased hardware complexity, and reduced latency with comparable throughput and area, is proposed.
Abstract: In this paper we propose a fully parallel 64K point radix-4⁴ FFT processor. The radix-4⁴ parallel unrolled architecture uses a novel radix-4 butterfly unit which takes all four inputs in parallel and can selectively produce one out of the four outputs. The radix-4⁴ block can take all 256 inputs in parallel and can use the select control signals to generate one out of the 256 outputs. The resultant 64K point FFT processor shows a significant reduction in intermediate memory but with increased hardware complexity. Compared to the state-of-the-art implementation [5], our architecture shows reduced latency with comparable throughput and area. The 64K point FFT architecture was synthesized using a 130nm CMOS technology which resulted in a throughput of 1.4 GSPS and latency of 47.7μs with a maximum clock frequency of 350MHz. When compared to [5], the latency is reduced by 303μs with 50.8% reduction in area.
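Functionally, the distinctive butterfly described above — all four inputs consumed in parallel, only one selected output produced — is one row of a 4-point DFT. A floating-point sketch of that behavior (the hardware would use fixed-point arithmetic and hardwired twiddle factors):

```python
import cmath

def radix4_butterfly(x, select):
    """Radix-4 DFT butterfly: take four complex inputs x[0..3] in parallel and
    return only the selected output k in {0, 1, 2, 3}:
        X[k] = sum_n x[n] * exp(-2j*pi*k*n/4)."""
    return sum(xn * cmath.exp(-2j * cmath.pi * select * n / 4)
               for n, xn in enumerate(x))
```

Producing one output instead of four is what lets a fully parallel radix-4⁴ block (256 inputs, one selected output) avoid storing intermediate results, trading memory for combinational hardware.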

Proceedings Article
27 Jun 2013
TL;DR: This work presents new cooperative schemes including software and hardware to address performance issues with deploying storage-class memory technologies as a storage device, including a new polling scheme called dynamic interval polling and a pipelined execution between storage device and host OS called pipelined post I/O processing.
Abstract: Emerging non-volatile memory technologies as a disk drive replacement raise some issues of software stack and interfaces, which have not been considered in disk-based storage systems. In this work, we present new cooperative schemes including software and hardware to address performance issues with deploying storage-class memory technologies as a storage device. First, we propose a new polling scheme called dynamic interval polling to avoid the unnecessary polls and reduce the burden on storage system bus. Second, we propose a pipelined execution between storage device and host OS called pipelined post I/O processing. By extending vendor-specific I/O interfaces between software and hardware, we can improve the responsiveness of I/O requests with no sacrifice of throughput.
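The dynamic-interval-polling idea can be sketched as follows. This is a hedged illustration of the general technique, with an assumed policy (predict the service time from recent history, sleep through most of it, then poll at a short interval near completion); the paper's exact algorithm and names may differ.

```python
import time

class DynamicIntervalPoller:
    """Illustrative poller: defer the first poll until near predicted completion."""

    def __init__(self, alpha=0.25):
        self.alpha = alpha       # EWMA smoothing factor
        self.predicted = 0.0     # predicted I/O service time (seconds)

    def wait_for(self, is_complete, poll_interval=1e-5):
        start = time.monotonic()
        if self.predicted > 0:
            # Skip the polls that cannot possibly succeed yet.
            time.sleep(self.predicted * 0.9)
        while not is_complete():          # tight polling only near completion
            time.sleep(poll_interval)
        elapsed = time.monotonic() - start
        # Fold the observed latency into the prediction (EWMA update).
        self.predicted += self.alpha * (elapsed - self.predicted)
        return elapsed

poller = DynamicIntervalPoller()
deadline = time.monotonic() + 0.02        # fake device: completes after ~20 ms
elapsed = poller.wait_for(lambda: time.monotonic() >= deadline)
print(f"I/O completed after {elapsed * 1e3:.1f} ms")
```

The point of the scheme is visible in the loop structure: wasted polls (and bus traffic) scale with the polling window, not with the full I/O latency.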

Proceedings ArticleDOI
09 Jun 2013
TL;DR: RACS is proposed, a data center transport protocol that minimizes flow completion times by approximating the Shortest Remaining Processing Time (SRPT) scheduling policy, which is known to be optimal, in a distributed manner.
Abstract: Today's data centers face extreme challenges in providing low latency for online services such as web search, social networking, and recommendation systems. Achieving low latency is important as it impacts user experience, which in turn impacts operator revenue. However, most current congestion control protocols approximate Processor Sharing (PS), which is known to be sub-optimal for minimizing latency. In this paper, we propose Router Assisted Capacity Sharing (RACS), a data center transport protocol that minimizes flow completion times by approximating the Shortest Remaining Processing Time (SRPT) scheduling policy, which is known to be optimal, in a distributed manner. With RACS, flows are assigned weights which determine their relative priority and thus the rate assigned to them. By changing these weights, RACS can approximate a range of scheduling disciplines. Through extensive ns-2 simulations, we demonstrate that RACS outperforms TCP, DCTCP, and RCP in data center environments. In particular, it improves completion times by up to 95% over TCP, 88% over DCTCP, and 80% over RCP. Our results also show that RACS can outperform deadline-aware transport protocols for typical data center workloads.
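The weight mechanism described above can be illustrated with a toy rate allocator. This is a hedged sketch of the general idea (weights inversely related to remaining flow size, capacity shared in proportion to weight), not the RACS protocol itself; the `skew` exponent is an invented knob showing how sharpening the weights moves the allocation from PS-like sharing toward strict SRPT.

```python
def allocate_rates(remaining_bytes, capacity, skew=4):
    """Share link capacity in proportion to (1/remaining_size)^skew."""
    weights = {f: (1.0 / size) ** skew for f, size in remaining_bytes.items()}
    total = sum(weights.values())
    return {f: capacity * w / total for f, w in weights.items()}

flows = {"short": 10_000, "medium": 100_000, "long": 1_000_000}
rates = allocate_rates(flows, capacity=10e9)  # 10 Gbps link
for name, rate in rates.items():
    print(f"{name}: {rate / 1e9:.4f} Gbps")
```

With a large skew the flow with the smallest remaining size captures nearly the whole link, which is exactly the SRPT behavior that minimizes mean flow completion time.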

Patent
16 Dec 2013
TL;DR: In this paper, the authors describe a system and methods for transmitting data over physical channels to provide a high speed, low latency interface such as between a memory controller and memory devices.
Abstract: Systems and methods are described for transmitting data over physical channels to provide a high speed, low latency interface such as between a memory controller and memory devices. Controller-side and memory-side embodiments of such channel interfaces are disclosed which require a low pin count and have low power utilization. In some embodiments of the invention, different voltage, current, etc. levels are used for signaling and more than two levels may be used, such as a vector signaling code wherein each wire signal may take on one of four signal values.
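Four-level signaling, as mentioned in the abstract, can be illustrated with a minimal PAM-4-style codec: each wire symbol takes one of four levels and therefore carries two bits, halving the symbol rate needed for a given bit rate. This is a generic sketch with an assumed Gray-coded level map; the patent's vector signaling codes, which spread code words across multiple wires, are more elaborate than this single-wire example.

```python
# Gray-coded mapping of bit pairs to four signal levels (illustrative values).
LEVELS = {0b00: -3, 0b01: -1, 0b11: +1, 0b10: +3}
SYMBOLS = {level: bits for bits, level in LEVELS.items()}

def encode(bits):
    """Pack a bit string (even length) into a sequence of 4-level symbols."""
    assert len(bits) % 2 == 0
    return [LEVELS[int(bits[i:i + 2], 2)] for i in range(0, len(bits), 2)]

def decode(symbols):
    """Recover the bit string from a sequence of 4-level symbols."""
    return "".join(f"{SYMBOLS[s]:02b}" for s in symbols)

tx = encode("0110001011")
print(tx, "->", decode(tx))
```

Gray coding is the conventional choice here because a one-level receiver error then corrupts only a single bit of the pair.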

Patent
14 Aug 2013
TL;DR: A low-latency storage method for massive small files based on HBase is proposed, in which a small-file table comprising a row key and two column families is established on top of Hadoop and HBase, a storage environment suited to small files is set up, and an application process covering small-file writing, inserting, and reading realizes reasonable storage and low-latency reading and writing of the massive small files.
Abstract: The invention provides a low-latency storage method for massive small files based on HBase. A small-file table comprising a row key and two column families is established on top of Hadoop and HBase; a storage environment suitable for small files is established; and an application process including small-file writing, small-file inserting, and small-file reading is provided. Reasonable storage and low-latency reading and writing of the massive small files are thereby realized, meeting practical requirements.
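The table layout described above can be sketched with an in-memory model. All column and family names below are invented for illustration; the patent only specifies a row key plus two column families. One family might hold file metadata and the other the file bytes, so many small files become rows of one HBase table instead of individual HDFS files, avoiding per-file NameNode overhead.

```python
import hashlib
import time

def make_row_key(path):
    # A short hash prefix spreads writes across regions instead of
    # hot-spotting on lexically adjacent paths (a common HBase idiom).
    return hashlib.md5(path.encode()).hexdigest()[:8] + "|" + path

def put_small_file(table, path, data):
    table[make_row_key(path)] = {
        "meta": {"path": path, "size": len(data), "mtime": time.time()},
        "data": {"content": data},
    }

def get_small_file(table, path):
    return table[make_row_key(path)]["data"]["content"]

table = {}  # a plain dict stands in for the HBase table here
put_small_file(table, "/logs/a.txt", b"hello")
print(get_small_file(table, "/logs/a.txt"))
```

Reads then cost a single row lookup by key, which is what gives the scheme its low-latency access path for small files.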