scispace - formally typeset

Showing papers on "Latency (engineering)" published in 2012


Posted Content
01 Jan 2012
TL;DR: This paper presents a block cipher that is optimized with respect to latency when implemented in hardware; decryption for one key corresponds to encryption with a related key, a property (α-reflection) that is of independent interest and whose soundness against generic attacks is proven.
Abstract: This paper presents a block cipher that is optimized with respect to latency when implemented in hardware. Such ciphers are desirable for many future pervasive applications with real-time security needs. Our cipher, named PRINCE, allows encryption of data within one clock cycle with a very competitive chip area compared to known solutions. The fully unrolled fashion in which such algorithms need to be implemented calls for innovative design choices. The number of rounds must be moderate and rounds must have short delays in hardware. At the same time, the traditional need that a cipher has to be iterative with very similar round functions disappears, an observation that increases the design space for the algorithm. An important further requirement is that realizing decryption and encryption results in minimum additional costs. PRINCE is designed in such a way that the overhead for decryption on top of encryption is negligible. More precisely for our cipher it holds that decryption for one key corresponds to encryption with a related key. This property we refer to as α-reflection is of independent interest and we prove its soundness against generic attacks.
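The α-reflection property admits a compact illustration. The toy construction below is not PRINCE itself: the involution, key mixing, and reflection constant are invented for illustration. It builds a cipher E_k(x) = I(x ⊕ k) ⊕ (k ⊕ α) from an involution I and checks that decryption under k coincides with encryption under the related key k ⊕ α:

```python
ALPHA = 0xC5  # arbitrary reflection constant (illustrative, not PRINCE's)

def involution(x):
    # Nibble swap on a byte: applying it twice restores x.
    return ((x << 4) | (x >> 4)) & 0xFF

def encrypt(key, x):
    # Toy cipher with the alpha-reflection structure:
    # E_k(x) = I(x ^ k) ^ (k ^ ALPHA)
    return involution(x ^ key) ^ key ^ ALPHA

def decrypt(key, y):
    # Direct inversion of encrypt; it equals encrypt(key ^ ALPHA, y).
    return involution(y ^ key ^ ALPHA) ^ key

# alpha-reflection: decryption under k == encryption under k ^ ALPHA
for key in range(256):
    for x in (0x00, 0x3A, 0xFF):
        assert decrypt(key, encrypt(key, x)) == x
        assert decrypt(key, x) == encrypt(key ^ ALPHA, x)
```

The practical payoff sketched here is the one the abstract claims: a single datapath plus a key XOR serves for both directions, so decryption adds almost no area on top of encryption.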

439 citations


Proceedings Article
28 May 2012
TL;DR: In this article, the authors discuss large scale data analysis using different MapReduce implementations and then present a performance analysis of high performance parallel applications on virtualized resources, including MPI and CGL-MapReduce.
Abstract: Infrastructure services (Infrastructure-as-a-service), provided by cloud vendors, allow any user to provision a large number of compute instances fairly easily. Whether leased from public clouds or allocated from private clouds, utilizing these virtual resources to perform data/compute intensive analyses requires employing different parallel runtimes to implement such applications. Among many parallelizable problems, most “pleasingly parallel” applications can be performed using MapReduce technologies such as Hadoop, CGL-MapReduce, and Dryad, in a fairly easy manner. However, many scientific applications, which have complex communication patterns, still require low latency communication mechanisms and rich set of communication constructs offered by runtimes such as MPI. In this paper, we first discuss large scale data analysis using different MapReduce implementations and then, we present a performance analysis of high performance parallel applications on virtualized resources.

214 citations


Proceedings ArticleDOI
07 Oct 2012
TL;DR: This work proposes a hybrid system that provides low-fidelity feedback immediately, followed by high-fidelity visuals at standard levels of latency, and shows that users greatly prefer lower latencies, with noticeable improvement continuing well below 10 ms.
Abstract: Software designed for direct-touch interfaces often utilizes a metaphor of direct physical manipulation of pseudo "real-world" objects. However, current touch systems typically take 50-200ms to update the display in response to a physical touch action. Utilizing a high performance touch demonstrator, subjects were able to experience touch latencies ranging from current levels down to about 1ms. Our tests show that users greatly prefer lower latencies, and noticeable improvement continued well below 10ms. This level of performance is difficult to achieve in commercial computing systems using current technologies. As an alternative, we propose a hybrid system that provides low-fidelity visual feedback immediately, followed by high-fidelity visuals at standard levels of latency.

178 citations


Patent
04 Oct 2012
TL;DR: In this paper, the authors propose decoding at least a portion of the media data of a second segment relative to a first segment in order to achieve a Low Latency Live profile for dynamic adaptive streaming over HTTP (DASH).
Abstract: In one example, a device includes one or more processors configured to receive a first segment of media data, wherein the media data of the first segment comprises a stream access point, receive a second segment of media data, wherein the media data of the second segment lacks a stream access point at the beginning of the second segment, and decode at least a portion of the media data of the second segment relative to at least a portion of data for the first segment. In this manner, the techniques of this disclosure may be used to achieve a Low Latency Live profile for, e.g., dynamic adaptive streaming over HTTP (DASH).

97 citations


Journal ArticleDOI
TL;DR: This article proposes a virtual infrastructure and a data dissemination protocol exploiting this infrastructure, which considers dynamic conditions of multiple sinks and sources and is fault-tolerant, meaning it can bypass routing holes created by imperfect conditions of wireless communication in the network.
Abstract: A new category of intelligent sensor network applications emerges where motion is a fundamental characteristic of the system under consideration. In such applications, sensors are attached to vehicles or people that move around large geographic areas. For instance, in mission critical applications of wireless sensor networks (WSNs), sinks can be associated with first responders. In such scenarios, reliable data dissemination of events is very important, as is efficiency in handling the mobility of both sinks and event sources. For this kind of application, reliability means real-time data delivery with a high data delivery ratio. In this article, we propose a virtual infrastructure and a data dissemination protocol exploiting this infrastructure, which considers dynamic conditions of multiple sinks and sources. The architecture consists of 'highways' in a honeycomb tessellation, which are the three main diagonals of the honeycomb where the data flow is directed and event data is cached. The highways act as rendezvous regions for events and queries. Our protocol, namely hexagonal cell-based data dissemination (HexDD), is fault-tolerant, meaning it can bypass routing holes created by imperfect conditions of wireless communication in the network. We analytically evaluate the communication cost and hot region traffic cost of HexDD and compare it with other approaches. Additionally, with extensive simulations, we evaluate the performance of HexDD in terms of data delivery ratio, latency, and energy consumption. We also analyze the hot spot zones of HexDD and other virtual infrastructure based protocols. To overcome the hot region problem in HexDD, we propose to resize the hot regions and evaluate the performance of this method. Simulation results show that our approach significantly reduces overall energy consumption while maintaining a comparably high data delivery ratio and low latency.
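A basic building block of any honeycomb-tessellation protocol is mapping a node's position to the hexagonal cell containing it. The sketch below uses standard hex-grid geometry (axial coordinates with cube rounding); it is generic textbook math, not code from the paper, and the pointy-top orientation and cell radius are assumptions:

```python
import math

def cube_round(q, r):
    # Round fractional axial coords via cube coordinates (x + y + z = 0),
    # fixing the component with the largest rounding error.
    x, z = q, r
    y = -x - z
    rx, ry, rz = round(x), round(y), round(z)
    dx, dy, dz = abs(rx - x), abs(ry - y), abs(rz - z)
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    return rx, rz

def point_to_hex_cell(x, y, cell_radius):
    """Map a sensor position (x, y) to the axial (q, r) index of the
    pointy-top hexagonal cell that contains it."""
    q = (math.sqrt(3) / 3 * x - y / 3) / cell_radius
    r = (2 * y / 3) / cell_radius
    return cube_round(q, r)

# A node at the origin falls in the central cell.
assert point_to_hex_cell(0.0, 0.0, 10.0) == (0, 0)
```

In a HexDD-like scheme, a cell index computed this way would determine whether a node sits on one of the diagonal "highways" that cache and forward event data.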

80 citations


Patent
Heng Zhang, Mehdi Khanpour, Jun Cao, Chang Liu, Afshin Momtaz
07 Nov 2012
TL;DR: In this paper, a transceiver includes a high latency communication channel and a low latency communication channel that is configured to bypass the high latency channel in low latency applications.
Abstract: Methods, systems, and apparatuses are described for reducing the latency in a transceiver. A transceiver includes a high latency communication channel and a low latency communication channel that is configured to be a bypass channel for the high latency communication channel. The low latency communication channel may be utilized when the transceiver is used in low latency applications. By bypassing the high latency communication channel, the high latency that is introduced therein (due to the many stages of de-serialization used to reduce the data rate for digital processing) can be avoided. An increase in data rate is realized when the low latency communication channel is used to pass data. A delay-locked loop (DLL) may be used to phase align the transmitter clock of the transceiver with the receiver clock of the transceiver to compensate for a limited tolerance of phase offset between these clocks.

72 citations


Journal ArticleDOI
TL;DR: A new wake-up receiver is proposed to reduce energy consumption and latency through the adoption of two different data rates for the transmission of wake-up packets; it achieves a sensitivity of -73 dBm while dissipating an average power of 8.5 μW from a 1.8 V supply.
Abstract: A new wake-up receiver is proposed to reduce energy consumption and latency through adoption of two different data rates for the transmission of wake-up packets. To reduce the energy consumption, the start frame bits (SFBs) of a wake-up packet are transmitted at a low data rate of 1 kbps, and a bit-level duty cycle is employed for detection of SFBs. To reduce both energy consumption and latency, duty cycling is halted upon detection of the SFB sequence, and the rest of the wake-up packet is transmitted at a higher data rate of 200 kbps. The proposed wake-up receiver is designed and fabricated in a 0.18 μm CMOS technology with a core size of 1850×1560 μm for the target frequency range of 902-928 MHz. The measured results show that the proposed design achieves a sensitivity of -73 dBm, while dissipating an average power of 8.5 μW from a 1.8 V supply.
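The latency benefit of the two-rate packet format can be seen with back-of-envelope arithmetic. Only the two data rates (1 kbps and 200 kbps) come from the abstract; the bit counts below are assumed for illustration:

```python
# Illustrative timing for a two-rate wake-up packet: start-frame bits at
# 1 kbps so a heavily duty-cycled receiver can catch them, then the rest
# at 200 kbps once the receiver is fully awake. Bit counts are assumed.

SFB_BITS = 8          # start-frame bits at the low rate (assumed)
PAYLOAD_BITS = 32     # remaining wake-up bits at the high rate (assumed)
LOW_RATE = 1e3        # 1 kbps
HIGH_RATE = 200e3     # 200 kbps

def packet_time(sfb_bits, payload_bits):
    return sfb_bits / LOW_RATE + payload_bits / HIGH_RATE

two_rate = packet_time(SFB_BITS, PAYLOAD_BITS)
single_low = (SFB_BITS + PAYLOAD_BITS) / LOW_RATE  # everything at 1 kbps

print(f"two-rate packet:   {two_rate * 1e3:.2f} ms")
print(f"all at 1 kbps:     {single_low * 1e3:.2f} ms")
```

Under these assumed bit counts the on-air time drops from 40 ms to about 8.2 ms, which is the kind of latency reduction the paper attributes to switching rates after the start-frame bits are detected.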

48 citations



Patent
05 Apr 2012
TL;DR: In this article, various techniques for distributing data, particularly real-time data such as financial market data, to data consumers at low latency are described, including adaptive data distribution techniques and a multi-class distribution engine.
Abstract: Various techniques are disclosed for distributing data, particularly real-time data such as financial market data, to data consumers at low latency. Exemplary embodiments include embodiments that employ adaptive data distribution techniques and embodiments that employ a multi-class distribution engine.

35 citations


Proceedings ArticleDOI
09 May 2012
TL;DR: This paper assesses network partitioning options and bandwidth scalability techniques with deep technology and layout awareness; the main contribution lies in the characterization and precise quantification of the interaction effects between the technology platform, the layout constraints, and the network-level quality metrics of a passive optical NoC.
Abstract: The performance of future chip multi-processors will only scale with the number of integrated cores if there is a corresponding increase in memory access efficiency. The focus of this paper on a 3D-stacked wavelength-routed optical layer for high bandwidth and low latency processor-memory communication goes in this direction and complements ongoing efforts on photonically integrated bandwidth-rich DRAM devices. This target environment dictates layout constraints that make the difference in discriminating between alternative design choices of the optical layer. This paper assesses network partitioning options and bandwidth scalability techniques with deep technology and layout awareness, the main contribution lying in the characterization and precise quantification of such interaction effects between the technology platform, the layout constraints and the network-level quality metrics of a passive optical NoC.

31 citations


Proceedings ArticleDOI
04 Mar 2012
TL;DR: This work demonstrates for the first time 40 Gb/s operation of a modular large port count optical packet switch with highly distributed control, with 25 ns latency and record low energy consumption.
Abstract: We demonstrate for the first time 40 Gb/s operation of a modular large port count optical packet switch with highly distributed control. The switch shows 25 ns latency and record low energy consumption of 76.5 pJ/bit.

Proceedings ArticleDOI
26 Mar 2012
TL;DR: An energy efficient multi-token based MAC protocol is presented that not only extends network lifetime and maintains network connectivity but also achieves congestion-free, fault-tolerant, and reliable data transmission.
Abstract: Wireless sensor networks (WSNs) have accelerated tremendous research efforts aimed at maximizing the lifetime of battery-powered sensor nodes and, by extension, the overall network lifetime. With the objective of prolonging the lifetime of a WSN, reducing energy consumption turns out to be the most crucial factor for almost all WSN protocols, particularly for the MAC protocol, which directly controls the state of the main energy-consuming component, i.e., the radio module. In order to minimize energy consumption, the RMAC and HEMAC protocols allow a node to transmit data packets over a multi-hop WSN in a single duty cycle. At the same time, each node remains in low power sleep mode and wakes up periodically to sense for channel activity, i.e., data transmission. In a token-based MAC protocol, however, depending on token availability, end-to-end communication between source and sink occurs one at a time, so latency remains high. Hence, MAC protocols for WSNs face significant challenges in conserving energy, maintaining low latency, and tolerating node failures. To overcome these problems, we present an energy efficient multi-token based MAC protocol that not only extends the network lifetime and maintains network connectivity but also achieves congestion-free, fault-tolerant, and reliable data transmission. Simulation studies of the proposed MAC protocol have been carried out using the Castalia simulator, and its performance has been compared with that of SMAC, RMAC, and a token based MAC protocol. Simulation results show that the proposed approach has lower energy consumption and a higher delivery ratio.

Proceedings ArticleDOI
16 Sep 2012
TL;DR: A novel datacenter network architecture utilizing OFDM and parallel signal detection technologies and efficient subcarrier allocation algorithms is proposed and fast, low latency, fine granularity, bandwidth flexible, and low power consumption MIMO switching is demonstrated experimentally.
Abstract: We propose a novel datacenter network architecture utilizing OFDM and parallel signal detection technologies and efficient subcarrier allocation algorithms. Fast, low latency, fine granularity, bandwidth flexible, and low power consumption MIMO switching is demonstrated experimentally.

Proceedings ArticleDOI
13 Aug 2012
TL;DR: This work designed and constructed a 24x24-port optical circuit switch (OCS) prototype with a programming time of 68.5 μs, a switching time of 2.8 μs, and a receiver electronics initialization time of 8.7 μs, and demonstrates the operation of this prototype switch in a data center testbed under various workloads.
Abstract: We designed and constructed a 24x24-port optical circuit switch (OCS) prototype with a programming time of 68.5 μs, a switching time of 2.8 μs, and a receiver electronics initialization time of 8.7 μs [1]. We demonstrate the operation of this prototype switch in a data center testbed under various workloads.

Patent
Philip L. Northcott1
22 May 2012
TL;DR: In this article, the authors present an approach and methods that provide relatively low uncorrectable bit error rates, low write amplification, long life, fast and efficient retrieval, and efficient storage density such that a solid-state drive (SSD) can be implemented using relatively inexpensive MLC Flash for an enterprise storage application.
Abstract: Apparatus and methods provide relatively low uncorrectable bit error rates, low write amplification, long life, fast and efficient retrieval, and efficient storage density such that a solid-state drive (SSD) can be implemented using relatively inexpensive MLC Flash for an enterprise storage application.

Proceedings ArticleDOI
20 May 2012
TL;DR: A novel and fast 4-2 compressor is proposed that needs no extra buffers in low latency paths to equalize delays; as a result, power dissipation is decreased and the output waveforms are free of glitches.
Abstract: This paper discusses the design of a novel and fast 4-2 compressor. To enhance the speed performance, some changes are made to the truth table of the conventional 4-2 compressor, which led to a reduction of the gate level delay to 2 XOR logic gates plus 1 transistor for all parameters. Because of similar paths, there is no need for extra buffers in low latency paths to equalize the delays. Therefore, power dissipation is decreased and the output waveforms are free of any glitch. The delay of the proposed architecture, simulated in HSPICE using TSMC 0.35µm CMOS technology, is 340 ps.
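For reference, the arithmetic contract that any 4-2 compressor must satisfy can be checked against a behavioral model of the conventional design (two cascaded full adders), which is the baseline such papers optimize; this is not the proposed circuit:

```python
from itertools import product

def full_adder(a, b, c):
    # One-bit full adder: sum and carry of three input bits.
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)

def compressor_4_2(x1, x2, x3, x4, cin):
    """Behavioral model of a conventional 4-2 compressor built from two
    cascaded full adders. cout depends only on x1..x3, so it can ripple
    to the neighboring compressor without waiting for cin."""
    s1, cout = full_adder(x1, x2, x3)       # first CSA stage
    total, carry = full_adder(s1, x4, cin)  # second stage folds in x4, cin
    return total, carry, cout

# Defining identity: x1+x2+x3+x4+cin == sum + 2*(carry + cout),
# checked exhaustively over all 32 input combinations.
for bits in product((0, 1), repeat=5):
    s, c, co = compressor_4_2(*bits)
    assert sum(bits) == s + 2 * (c + co)
```

Any modified truth table, such as the one the paper proposes, must preserve exactly this identity; only the gate-level realization (and hence delay and power) changes.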

Journal ArticleDOI
23 May 2012
TL;DR: HartOS is a hardware-implemented, micro-kernel-structured RTOS targeted for hard real-time embedded applications running on FPGA based platforms and has up to 3 orders of magnitude less mean error in generating the correct period for a periodic task, while having up to 100% less overhead depending on the tick frequency.
Abstract: This paper introduces HartOS, a hardware-implemented, micro-kernel-structured RTOS targeted for hard real-time embedded applications running on FPGA based platforms. Historically hardware RTOSs have been too inflexible and have had limited features and resources. HartOS is designed to be flexible and supports most of the features normally found in a software-based RTOS. To ensure fast, low latency and jitter-free communication between the CPU and RTOS, HartOS uses the ARM AXI4-Stream bus recently supported by the MicroBlaze softcore processor. Compared to μC/OS-II, HartOS has up to 3 orders of magnitude less mean error in generating the correct period for a periodic task, and around 1 order of magnitude less jitter, while having up to 100% less overhead depending on the tick frequency.

Book ChapterDOI
05 Nov 2012
TL;DR: A close-to-sensor low latency visual processing system that shows that by adaptively sampling visual information, low level tracking can be achieved at high temporal frequencies with no increase in bandwidth and using very little memory.
Abstract: In this paper we describe a close-to-sensor low latency visual processing system. We show that by adaptively sampling visual information, low level tracking can be achieved at high temporal frequencies with no increase in bandwidth and using very little memory. By having close-to-sensor processing, image regions can be captured and processed at millisecond sub-frame rates. If spatiotemporal regions have little useful information in them they can be discarded without further processing. Spatiotemporal regions that contain 'interesting' changes are further processed to determine what the interesting changes are. Close-to-sensor processing enables low latency programming of the image sensor such that interesting parts of a scene are sampled more often than less interesting parts. Using a small set of low level rules to define what is interesting, early visual processing proceeds autonomously. We demonstrate system performance with two applications. Firstly, to test the absolute performance of the system, we show low level visual tracking at millisecond rates and secondly, a more general recursive Bayesian tracker.

Journal ArticleDOI
TL;DR: A novel high speed 4-2 compressor using static and pass-transistor logic has been designed in a 0.35µm CMOS technology in order to reduce gate level delay and increase speed.
Abstract: A novel high speed 4-2 compressor using static and pass-transistor logic has been designed in a 0.35µm CMOS technology. In order to reduce gate level delay and increase speed, some changes are made to the truth table of the conventional 4-2 compressor, which led to the simplification of the logic function for all parameters. Therefore, power dissipation is decreased. In addition, because of similar paths from all inputs to the outputs, the delays are the same, so there is no need for extra buffers in low latency paths to equalize the delays.

Journal ArticleDOI
TL;DR: A field-programmable gate array (FPGA)-based label processor for in-band optical labels with a processing time independent of the number of label bits is presented, which allows for implementing an optical packet switching architecture that scales to a large port count without compromising the latency.
Abstract: We present a field-programmable gate array (FPGA)-based label processor for in-band optical labels with a processing time independent of the number of label bits. This allows for implementing an optical packet switching architecture that scales to a large port count without compromising the latency. As a proof of concept, we have employed an FPGA board with a 100 MHz clock to validate the operation of the label processor in a 160 Gb/s optical packet switching system. Experimental results show successful processing of three label bits and 160 Gb/s packet switching with 1 dB power penalty and 470 ns of latency. Projections on the label processor performance by using more powerful FPGAs indicate that 60 label bits (optical addresses) can be processed within 31 ns.

Journal ArticleDOI
TL;DR: In this article, the authors describe how Facebook analyzes big data.
Abstract: How Facebook is analyzing big data.

Proceedings ArticleDOI
29 Nov 2012
TL;DR: LA-MAC is a low-latency asynchronous access method for efficient forwarding in wireless sensor networks suitable for current and future sensor networks that increasingly provide support for multiple applications, handle heterogeneous traffic, and become organized according to some complex structure.
Abstract: The paper presents LA-MAC, a low-latency asynchronous access method for efficient forwarding in wireless sensor networks. It is suitable for current and future sensor networks that increasingly provide support for multiple applications, handle heterogeneous traffic, and become organized according to some complex structure (tree, DAG, partial mesh). It takes advantage of the network structure so that a parent of some nodes becomes a coordinator that schedules transmissions in a localized region. Allowing burst transmissions improves the network capacity so that the network can handle load fluctuations. At the same time, the method reduces energy consumption by decreasing the overhead of node coordination per frame. The paper reports on the results of extensive simulations that compare LA-MAC with B-MAC and X-MAC, two representative methods based on preamble sampling. They show excellent performance of LA-MAC with respect to latency, delivery ratio, and consumed energy.

Patent
11 Apr 2012
TL;DR: In this paper, a network of processing devices includes a medium for low-latency interfaces for providing point-to-point connections between each of the processing devices, and a switch within each processing device is arranged to facilitate communications in any combination between the processing resources and the local point to point interfaces within each processor device.
Abstract: A network of processing devices includes a medium for low-latency interfaces for providing point-to-point connections between each of the processing devices. A switch within each processing device is arranged to facilitate communications in any combination between the processing resources and the local point-to-point interfaces within each processing device. A networking layer is provided above the low-latency interface stack, which facilitates re-use of software and exploits existing protocols for providing the point-to-point connections. Higher speeds are achieved for switching between the relatively low numbers of processor resources within each processing device, while low-latency point-to-point communications are achieved using the low-latency interfaces for accessing processor resources that are external to a processing device.

Patent
10 Jul 2012
TL;DR: In this article, a technique for securing transmit opening helps enhance the operation of a station that employs the technique, which may facilitate low latency response to a protocol data requester, for instance.
Abstract: A technique for securing transmit opening helps enhance the operation of a station that employs the technique. The technique may facilitate low latency response to a protocol data requester, for instance. In one aspect, the technique provides a way for the protocol data responder to hold its transmit opening to transmit the protocol response data to the protocol data requester. The technique may allow the protocol data responder to hold the transmit opening until the protocol response data is ready and available for the protocol data responder to send.

Journal ArticleDOI
Yoonho Park, Richard Pervin King, Senthil Nathan, Wesley Most, Henrique Andrade
TL;DR: This work determines the effectiveness of each system optimization that the hardware and software infrastructure makes available and shows that a stock market data processing system can be built with general‐purpose middleware and run on commodity hardware.
Abstract: A stock market data processing system that can handle high data volumes at low latencies is critical to market makers. Such systems play a critical role in algorithmic trading, risk analysis, market surveillance, and many other related areas. The current systems tend to use specialized software and custom processors. We show that such a system can be built with general-purpose middleware and run on commodity hardware. The middleware we use is IBM System S which includes transport technology from IBM WebSphere MQ Low Latency Messaging (LLM). Our performance evaluation consists of two parts. First, we determined the effectiveness of each system optimization that the hardware and software infrastructure makes available. These optimizations were implemented at all software levels: application, middleware, and operating system. Second, we evaluated our system on different hardware platforms.

Proceedings ArticleDOI
25 Jun 2012
TL;DR: This paper presents a novel set of collective operations implemented using point to point messages, shared memory and accelerator hardware to exploit the hierarchical organization of the P7IH for providing low latency, high bandwidth operations.
Abstract: The Power7 IH (P7IH) is one of IBM's latest generation of supercomputers. Like most modern parallel machines, it has a hierarchical organization consisting of simultaneous multithreading (SMT) within a core, multiple cores per processor, multiple processors per node (SMP), and multiple SMPs per cluster. A low latency/high bandwidth network with specialized accelerators is used to interconnect the SMP nodes. System software is tuned to exploit the hierarchical organization of the machine. In this paper we present a novel set of collective operations that take advantage of the P7IH hardware. We discuss non blocking collective operations implemented using point to point messages, shared memory and accelerator hardware. We show how collectives can be composed to exploit the hierarchical organization of the P7IH for providing low latency, high bandwidth operations. We demonstrate the scalability of the collectives we designed by including experimental results on a P7IH system with up to 4096 cores.
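The idea of composing collectives along the machine hierarchy can be sketched in a few lines: reduce within each node first (shared memory on the real machine), then across one leader per node (the network), then broadcast back down. This is purely illustrative; it assumes a sum reduction over plain Python lists rather than real MPI ranks:

```python
def hierarchical_allreduce(values, ranks_per_node):
    """Two-level sum allreduce mirroring a hierarchical collective:
    node-local step, inter-node step over leaders, then broadcast.
    Illustrative only; a real implementation would overlap these
    phases and use shared memory plus the interconnect."""
    # Step 1: node-local reduction (shared memory on an SMP node)
    node_sums = [
        sum(values[i:i + ranks_per_node])
        for i in range(0, len(values), ranks_per_node)
    ]
    # Step 2: inter-node reduction over one leader per node (network)
    global_sum = sum(node_sums)
    # Step 3: broadcast the result back to every rank
    return [global_sum] * len(values)

result = hierarchical_allreduce([1, 2, 3, 4, 5, 6, 7, 8], ranks_per_node=4)
assert result == [36] * 8
```

The point of the composition is that only one message per node crosses the network, while the cheap intra-node traffic stays in shared memory, which is what makes the hierarchy-aware version low latency at scale.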

Patent
14 Mar 2012
TL;DR: In this paper, a dynamically reconfigurable asynchronous arbitration node for use in an adaptive asynchronous interconnection network is provided, which includes a circuit, an output channel and two input channels.
Abstract: A dynamically reconfigurable asynchronous arbitration node for use in an adaptive asynchronous interconnection network is provided. The arbitration node includes a circuit, an output channel and two input channels—a first input channel and a second input channel. The circuit supports a default-arbitration mode and a biased-input mode. The circuit is configured to generate data for the output channel by mediating between input traffic including data received at the first and second input channels, if the arbitration node is operating in the default-arbitration mode, or by providing a direct path to the output channel for one of the first input channel and the second input channel that is biased, if the arbitration node is operating in the biased-input mode. The circuit is further configured to monitor the input traffic and implement a mode change based on a history of the observed input traffic in accordance with a mode-change policy.
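The two-mode behavior described above can be sketched as a small state machine. The bias threshold and the specific mode-change policy below are invented for illustration; the patent leaves these configurable:

```python
from collections import deque

class ArbitrationNode:
    """Behavioral sketch of a two-mode arbiter: 'default' mediates
    fairly between two inputs, 'biased:N' gives input N a direct
    low-latency path. The policy details here are assumptions."""

    def __init__(self, bias_threshold=8):
        self.mode = "default"              # or "biased:0" / "biased:1"
        self.history = deque(maxlen=bias_threshold)
        self.rr = 0                        # round-robin pointer

    def forward(self, in0, in1):
        """Pick the flit for the output channel this cycle.
        in0/in1 are flits, or None when a channel is idle."""
        if self.mode.startswith("biased"):
            ch = int(self.mode.split(":")[1])
            chosen = (in0, in1)[ch]        # direct path, no arbitration
        elif in0 is not None and in1 is not None:
            chosen = (in0, in1)[self.rr]   # mediate: alternate fairly
            self.rr ^= 1
        else:
            chosen = in0 if in0 is not None else in1
        self._observe(in0, in1)
        return chosen

    def _observe(self, in0, in1):
        # Assumed mode-change policy: if the recent window saw traffic
        # on only one input, bias toward it; mixed traffic restores
        # default arbitration.
        if (in0 is None) != (in1 is None):
            self.history.append(0 if in0 is not None else 1)
        if len(self.history) == self.history.maxlen:
            if len(set(self.history)) == 1:
                self.mode = f"biased:{self.history[0]}"
            else:
                self.mode = "default"

node = ArbitrationNode(bias_threshold=4)
for _ in range(4):
    node.forward("flit", None)             # all traffic on input 0
assert node.mode == "biased:0"
```

The latency win in the biased mode comes from skipping the mediation step entirely on the hot path, at the cost of monitoring traffic to know when arbitration must be restored.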

Posted Content
TL;DR: The DNP provides inter-tile services for both on-chip and off-chip communications with a uniform RDMA style API, over a multi-dimensional direct network with a (possibly) hybrid topology.
Abstract: One of the most demanding challenges for the designers of parallel computing architectures is to deliver an efficient network infrastructure providing low latency, high bandwidth communications while preserving scalability. Besides off-chip communications between processors, recent multi-tile (i.e. multi-core) architectures face the challenge of providing an efficient on-chip interconnection network between processor tiles. In this paper, we present a configurable and scalable architecture, based on our Distributed Network Processor (DNP) IP Library, targeting systems ranging from single MPSoCs to massive HPC platforms. The DNP provides inter-tile services for both on-chip and off-chip communications with a uniform RDMA style API, over a multi-dimensional direct network with a (possibly) hybrid topology.

Patent
07 Dec 2012
TL;DR: In this paper, a first portion of the packet is written into a first cell of a plurality of cells of a buffer in the network device, each of the cells has a size that is less than a minimum size of packets received by the device.
Abstract: Buffer designs and write/read configurations for a buffer in a network device are provided. According to one aspect, a first portion of the packet is written into a first cell of a plurality of cells of a buffer in the network device. Each of the cells has a size that is less than a minimum size of packets received by the network device. The first portion of the packet can be read from the first cell while concurrently writing a second portion of the packet to a second cell.
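The cut-through behavior enabled by sub-packet cells can be sketched as follows; the cell size and the read-while-write simulation are illustrative, not taken from the patent:

```python
class CellBuffer:
    """Sketch of a cell-based packet buffer: each cell is smaller than
    the minimum packet size, so a packet always spans multiple cells
    and its head cells can be read out while the tail is still being
    written (modeled here with a generator)."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = []

    def write_packet(self, packet):
        # Split the packet across cells; yield after each cell write so
        # a reader can start draining before the packet is fully stored.
        for off in range(0, len(packet), self.cell_size):
            self.cells.append(packet[off:off + self.cell_size])
            yield len(self.cells) - 1      # index of the cell just written

    def read_cell(self, idx):
        return self.cells[idx]

buf = CellBuffer(cell_size=16)
packet = bytes(range(64))
out = bytearray()
for idx in buf.write_packet(packet):
    # "Concurrent" read: drain each cell as soon as it is written.
    out += buf.read_cell(idx)
assert bytes(out) == packet
```

Because a reader never has to wait for the full packet to land in the buffer, per-packet store-and-forward latency shrinks to roughly one cell time, which is the motivation for sizing cells below the minimum packet size.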

Journal Article
TL;DR: This paper provides a design and implementation of an on-chip router architecture that allows a routing function for each input port and distributed arbiters, which gives a high level of parallelism.
Abstract: Technology scaling continuously increases the number of components and the complexity of System on Chip designs [1]. For effective global on-chip communication, on-chip routers provide essential routing functionality with low complexity and relatively high performance [1]. Low latency and high speed are achieved by allowing a routing function for each input port and distributed arbiters, which gives a high level of parallelism [4]. This paper provides a design and implementation of an on-chip router architecture.