
Showing papers on "Latency (engineering)" published in 2002


Proceedings ArticleDOI
07 Aug 2002
TL;DR: A new active queue management (AQM) algorithm called GREEN is introduced that provides high link utilization whilst maintaining low delay and packet loss, and enables low latency interactive applications such as telephony and network games.
Abstract: In this paper we introduce a new active queue management (AQM) algorithm called GREEN. GREEN provides high link utilization whilst maintaining low delay and packet loss. GREEN enables low latency interactive applications such as telephony and network games. GREEN is shown to outperform the current AQM algorithms. Certain performance problems with current AQMs are discussed.
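The control idea behind rate-based AQM schemes like GREEN can be pictured with a toy marking-probability update. The sketch below is a generic illustration, not GREEN's published update rule; the function name, step size, and comparison are assumptions.

```python
def update_mark_prob(p, arrival_rate, target_rate, step=0.01):
    """Adjust the packet-marking probability toward a target link rate.

    Generic rate-based AQM control loop (illustrative only): mark more
    aggressively when the measured arrival rate exceeds the target,
    back off otherwise. The result is clamped to [0, 1].
    """
    if arrival_rate > target_rate:
        p = min(1.0, p + step)
    else:
        p = max(0.0, p - step)
    return p
```

Run repeatedly per measurement interval, such a loop settles near the marking rate that keeps the queue short, which is what enables the low-delay operation the paper targets.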

86 citations


Patent
22 Nov 2002
TL;DR: In this paper, an apparatus and method for low latency power management on a serial data link are described; the method includes detecting an electrical idle exit condition during receiver operation in an electrical idle state, after which data synchronization is performed according to one or more received data synchronization training patterns.
Abstract: An apparatus and method for low latency power management on a serial data link are described. In one embodiment, the method includes the detection of an electrical idle exit condition during receiver operation in an electrical idle state. Once detected, data synchronization is performed according to one or more received data synchronization training patterns. Finally, when the synchronization is performed within a determined synchronization re-establishment period, the receiver will resume operation according to a normal power state. Accordingly, the embodiment described illustrates an open loop, low latency power resumption operation for power management within 3GIO links.

84 citations


Journal ArticleDOI
TL;DR: This work demonstrates a novel optical time division multiplexing packet-level system-synchronization and address-comparison technique, which relies on cascaded semiconductor-based optical logic gates operating at 50-Gb/s line rates.
Abstract: We demonstrate a novel optical time division multiplexing packet-level system-synchronization and address-comparison technique, which relies on cascaded semiconductor-based optical logic gates operating at 50-Gb/s line rates. Synchronous global clock distribution is used to achieve fixed length packet-synchronization that is resistant to channel-induced timing delays, and straightforward to achieve using a single optical logic gate. Four-bit address processing is achieved using a pulse-position modulated header input to a single optical logic gate, which provides Boolean XOR functionality, low latency, and stability over >1 h time periods with low switching energy <100 fJ.
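At the bit level, the XOR address comparison reduces to checking that the packet header and the local address differ in no bit position, so a match produces no '1' pulses at the gate output. A hypothetical software analogue of that check:

```python
def address_match(header_bits, local_bits):
    """Boolean XOR address comparison (bit-level analogue, illustrative).

    Mirrors the optical-logic-gate idea: XOR each header bit with the
    local address bit; an all-zero result means the addresses match.
    """
    assert len(header_bits) == len(local_bits)
    xor_out = [h ^ l for h, l in zip(header_bits, local_bits)]
    return not any(xor_out)
```

In the paper this comparison is done in a single optical gate at line rate, which is where the low latency comes from.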

58 citations


Proceedings Article
01 Jan 2002
TL;DR: A low latency real-time Broadcast News recognition system capable of transcribing live television newscasts with reasonable accuracy is presented, along with recent modeling and efficiency improvements that yield a 22% word error rate on the Hub4e98 test set while running faster than real-time.
Abstract: In this paper, we present a low latency real-time Broadcast News recognition system capable of transcribing live television newscasts with reasonable accuracy. We describe our recent modeling and efficiency improvements that yield a 22% word error rate on the Hub4e98 test set while running faster than real-time. These include the discriminative training of a feature transform and the acoustic model, and the optimization of the likelihood computation. We give experimental results that show the accuracy of the system at different speeds. We also explain how we achieved low latency, presenting measurements that show the typical system latency is less than 1 second.

55 citations


Proceedings ArticleDOI
07 Aug 2002
TL;DR: This paper introduces NeighborCasting, a fast handoff mechanism for wireless IP networks that utilizes neighboring foreign agent (FA) information, and demonstrates that handoff latency is substantially reduced while the typical overhead is minimally increased.
Abstract: This paper introduces a fast handoff mechanism, NeighborCasting, for use in wireless IP networks that utilize neighboring foreign agent (FA) information. NeighborCasting is based on the policy of utilizing, or perhaps even wasting, wired bandwidth between foreign agents, while minimizing RF (radio frequency) bandwidth exchanges, so that handoff latency is minimized. We demonstrate that the handoff latency is substantially reduced, while the typical overhead is minimally increased. Handoff latency is minimized by initiating data forwarding to the possible new foreign agent candidates (i.e., the neighbor foreign agents) at the time that the mobile node initiates the link-layer handoff procedure. NeighborCasting builds upon the Mobile IP handoff procedure by adding a small number of additional message types. The handoff mechanism is a unified procedure for inter-domain, intra-domain and inter-technology (e.g., LAN to WAN or TDMA to CDMA) handoffs and provides flexible choices to the network, while maintaining transparency to the mobile node. The neighbor FA discovery process is a distributed and dynamic mechanism, and the fast handoff schemes are scalable and reliable.
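The latency argument can be sketched with a toy additive model: without pre-forwarding, data arrives at the new FA only after Mobile IP registration completes; with NeighborCasting, forwarding starts at link-layer handoff initiation, so data is already waiting. The parameter names and the additive model below are illustrative assumptions, not the paper's measurements.

```python
def handoff_latency(link_layer_ms, registration_ms, pre_forwarded):
    """Toy handoff-latency model with and without neighbor pre-forwarding.

    pre_forwarded=True models NeighborCasting: data was forwarded to
    the candidate FAs when the link-layer handoff began, so only the
    link-layer handoff itself is on the critical path (illustrative).
    """
    if pre_forwarded:
        return link_layer_ms                  # data waits at the new FA
    return link_layer_ms + registration_ms    # data arrives after registration
```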

50 citations


Journal ArticleDOI
TL;DR: The simulation results show that HIPIQS can deliver performance close to that of output queuing approaches over a range of message sizes, system sizes, and traffic, and can be used to build high performance switches that are useful both for parallel system interconnects and for building computer networks.
Abstract: Switch-based interconnects are used in a number of application domains, including parallel system interconnects, local area networks, and wide area networks. However, very few switches have been designed that are suitable for more than one of these application domains. Such a switch must offer both extremely low latency and very high throughput for a variety of different message sizes. While some architectures with output queuing have been shown to perform extremely well in terms of throughput, their performance can suffer when used in systems where a significant portion of the packets are extremely small. On the other hand, architectures with input queuing offer limited throughput or require fairly complex and centralized arbitration that increases latency. In this paper, we present a new input queue-based switch architecture called HIPIQS (HIgh-Performance Input-Queued Switch). It offers low latency for a range of message sizes and provides throughput comparable to that of output queuing approaches. Furthermore, it allows simple and distributed arbitration. HIPIQS uses a dynamically allocated multiqueue organization, pipelined access to multibank input buffers, and small cross-point buffers to deliver high performance. Our simulation results show that HIPIQS can deliver performance close to that of output queuing approaches over a range of message sizes, system sizes, and traffic. The switch architecture can therefore be used to build high performance switches that are useful for both parallel system interconnects and for building computer networks.

44 citations


Journal ArticleDOI
TL;DR: Unlike the original TPC decoder, which performs row and column decoding in a serial fashion, a parallel decoder structure is proposed; simulations show that the decoding latency of TPCs can be halved while maintaining virtually the same performance level.
Abstract: There has been intensive focus on turbo product codes (TPCs), which have low decoding complexity and achieve near-optimum performance at low signal-to-noise ratios. Unlike the original TPC decoder, which performs row and column decoding in a serial fashion, we propose a parallel decoder structure. Simulation results show that with this approach, the decoding latency of TPCs can be halved while maintaining virtually the same performance level.
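The halving claim follows directly from the scheduling: serial decoding pays for a row pass plus a column pass per iteration, while the parallel structure runs them concurrently and pays only the longer of the two. A toy cost model (names and unit times are illustrative):

```python
def decoding_latency(iterations, row_time, col_time, parallel=False):
    """Compare serial vs. parallel row/column TPC decoding latency.

    Serial decoding alternates full row and column half-iterations,
    costing row_time + col_time per iteration; the parallel structure
    runs them concurrently, costing max(row_time, col_time). Toy model.
    """
    per_iter = max(row_time, col_time) if parallel else row_time + col_time
    return iterations * per_iter
```

With equal row and column decoding times, the parallel schedule is exactly half the serial latency, matching the paper's headline result.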

42 citations


Proceedings ArticleDOI
07 Aug 2002
TL;DR: This paper presents an implementation of a convolutional turbo codec core based on innovative solutions for broadband turbo coding; implemented in a CMOS 0.18 μm technology, the core yields a final throughput of up to 80.7 Mb/s.
Abstract: Turbo coding has reached the step in which its astonishing coding gain is already being proven in real applications. Moreover, its applicability to future broadband communications systems is starting to be investigated. In order to be useful in this domain, special turbo codec architectures that cope with low latency, high throughput, low power consumption and high flexibility are needed. This paper presents an implementation of a convolutional turbo codec core based on innovative solutions for those requirements. The combination of a systematic data storage and transfer optimization with high and low level architectural solutions yields a final throughput up to 80.7 Mb/s, a decoding latency of 10 μs and a power consumption of less than 50 nJ/bit. The 14.7 mm² full-duplex full-parallel core, implemented in a CMOS 0.18 μm technology, is a complete flexible solution for broadband turbo coding.

33 citations


Patent
20 Feb 2002
TL;DR: In this article, the frame switch point is placed at the completion of frame decoding and the bottom border of the scaled image is synchronized therewith while maintaining low latency of decoded data; high latency operation is provided only when necessitated by minimal spill buffer capacity and in combination with fractional image size reduction in the decoding path.
Abstract: Loss of decoding time prior to the vertical synchronization signal when motion video is arbitrarily scaled and positioned is avoided by placing the frame switch point at the completion of frame decoding and synchronizing the bottom border of the scaled image therewith, while maintaining low latency of decoded data. High latency operation is provided only when necessitated by minimal spill buffer capacity and in combination with fractional image size reduction in the decoding path in order to maintain image resolution without requiring additional memory.

28 citations


Patent
25 Feb 2002
TL;DR: In this paper, a global interrupt and barrier network is presented that implements logic for generating global interrupt and barrier signals for controlling global asynchronous operations performed by processing elements at selected processing nodes (12) of a computing structure in accordance with a processing algorithm.
Abstract: A system and method for generating global asynchronous signals in a computing structure. Particularly, a global interrupt and barrier network is implemented that implements logic for generating global interrupt and barrier signals for controlling global asynchronous operations performed by processing elements at selected processing nodes (12) of a computing structure in accordance with a processing algorithm; and includes the physical interconnecting of the processing nodes (12) for communicating the global interrupt and barrier signals to the elements via low latency paths. The global asynchronous signals respectively initiate interrupt and barrier operations at the processing nodes (12) at times selected for optimizing performance of the processing algorithms. In one embodiment, the global interrupt and barrier network is implemented in a scalable, massively parallel supercomputing device structure comprising a plurality of processing nodes interconnected by multiple independent networks.

28 citations


Journal ArticleDOI
TL;DR: A new class of low-cost, bounded-delay multicast heuristics for WDM networks that decouple the cost of establishing the multicast tree from the delay incurred by data transmission due to lightwave conversion and processing at intermediate nodes along the transmission path are presented.

Book ChapterDOI
25 Aug 2002
TL;DR: A color segmentation algorithm for embedded real-time systems with a special focus on latencies is presented, part of a Hardware-Software-System that realizes fast reactions on visual stimuli in highly dynamic environments.
Abstract: This paper presents a color segmentation algorithm for embedded real-time systems with a special focus on latencies. The algorithm is part of a Hardware-Software-System that realizes fast reactions on visual stimuli in highly dynamic environments. There is furthermore the constraint to use low-cost hardware to build the system. Our system is implemented on a RoboCup middle size league prototype robot.

Proceedings ArticleDOI
Hyeong-Ju Kang, In-Cheol Park
13 May 2002
TL;DR: In this article, the authors proposed a new decoding structure of Reed-Solomon codes that can operate as fast as the serial structure and has as short latency as the parallel structure.
Abstract: This paper presents a new decoding structure of Reed-Solomon (RS) codes that are widely used for channel coding. Although many decoding structures have been developed, the serial structures have long latency and the parallel structures are not fast enough to deal with the demands of high-speed decoding. To achieve both short latency and fast operation, the summation of the products of syndromes is eliminated and the difference used to calculate the error locator polynomial is incrementally updated. The proposed structure, called a dual-line structure, can operate as fast as the serial structure and has as short a latency as the parallel structure. In addition, the dual-line structure is regular and easy to implement. Experimental results confirm these advantages at the cost of a small hardware increase.

Proceedings ArticleDOI
12 May 2002
TL;DR: Experiences with Early Cancellation --- an optimization for Time-Warp that cancels messages in place upon early discovery of a rollback --- are presented, and it is believed that there is a large scope for additional optimizations using this model.
Abstract: Parallel Discrete Event Simulation (PDES) on a cluster of workstations is a fine grained application where the communication performance can dictate the efficiency of the simulation. The high performance Local/System Area Networks used in high-end clusters are capable of delivering data with high bandwidth and low latency. Unfortunately, the communication rate far out-paces the capabilities of workstation nodes to handle it (I/O bus, memory bus, CPU resources). For this reason, many vendors are offering a programmable processor on the NIC to allow application specific optimization of the communication path. This invites a new implementation model for distributed applications where: (i) application specific communication optimizations can be implemented on the NIC; (ii) portions of the application that are most heavily communicating can be migrated to the NIC; (iii) some messages can be filtered out at the NIC without burdening the primary processor resources; and (iv) critical events are detected and handled early. The aim of our research is to investigate the utility of this model for PDES and to gain initial experiences in the implementation challenges and potential performance improvement. In this paper, we present our experiences with Early Cancellation --- an optimization for Time-Warp that cancels messages in place upon early discovery of a rollback. We believe that there is a large scope for additional optimizations using this model.
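The core of early cancellation is simple: when a rollback to time t is discovered, any message still queued on the NIC with a timestamp at or after t can be dropped in place, avoiding both the send and the later anti-message. A toy sketch of that queue filter (message representation and field names are illustrative):

```python
def cancel_in_place(send_queue, rollback_time):
    """Drop queued messages invalidated by a Time-Warp rollback.

    Toy model of early cancellation: messages timestamped at or after
    the rollback point are removed before they ever leave the NIC, so
    no anti-message round trip is needed for them. Returns the
    surviving queue and the number of cancelled messages.
    """
    kept = [m for m in send_queue if m["ts"] < rollback_time]
    cancelled = len(send_queue) - len(kept)
    return kept, cancelled
```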

Journal Article
TL;DR: In this article, a 20-data-channel transceiver with a control channel allows uncoded data transfer with 13ns latency; a digital DLL tracks phase with 20ps resolution, and the effective full-duplex bandwidth reaches 10GB/s.
Abstract: A 20-data-channel transceiver with a control channel allows uncoded data transfer with 13ns latency. A digital DLL (Delay Locked Loop) with a ring-interpolator tracks phase with 20ps resolution. A pre-emphasis driver enables 2Gbps transmission per channel over a 7m cable at 1.5V supply. The effective full-duplex bandwidth reaches 10GB/s.

Proceedings Article
22 Sep 2002
TL;DR: The round-trip time for AOTF on this incompletely tuned DIMMnet-1 is 7.5 times faster than Myrinet2000, and the barrier synchronization time is 4 times faster than that of an SR8000 supercomputer, showing that DIMMnet-1 holds promise for applications in which scalable performance with traditional approaches is difficult because of frequent data exchange.
Abstract: DIMMnet-1 is a high performance network interface for PC clusters that can be directly plugged into the DIMM slot of a PC. By using both low latency AOTF (Atomic On-The-Fly) sending and high bandwidth BOTF (Block On-The-Fly) sending, it can overcome the overhead caused by standard I/O such as the PCI bus. Two types of DIMMnet-1 prototype boards (providing optical and electrical network interfaces) containing a Martini network interface controller chip are currently available. They can be plugged into a 100MHz DIMM slot of a PC with a Pentium-3, Pentium-4 or Athlon processor. The round-trip time for AOTF on this incompletely tuned DIMMnet-1 is 7.5 times faster than Myrinet2000. The barrier synchronization time for AOTF is 4 times faster than that of an SR8000 supercomputer. The inter-two-node floating sum operation time is 1903 ns. This shows that DIMMnet-1 holds promise for applications in which scalable performance with traditional approaches is difficult because of frequent data exchange.

Patent
29 Mar 2002
TL;DR: In this paper, a storage processor particularly suited to RAID systems provides high throughput for applications such as streaming video data and is configured as an ASIC with a high degree of parallelism in its interconnections.
Abstract: A storage processor particularly suited to RAID systems provides high throughput for applications such as streaming video data. An embodiment is configured as an ASIC with a high degree of parallelism in its interconnections. The communications architecture provides saturation of user data pathways with low complexity and low latency by employing multiple memory channels under software control, an efficient parity calculation mechanism and other features.

Patent
06 May 2002
TL;DR: The IEEE floating-point adder (FP-adder) as discussed by the authors achieves low latency by combining various optimization techniques, including a non-standard separation into two paths, a simple rounding algorithm, unifying rounding cases for addition and subtraction, sign-magnitude computation of a difference based on one's complement subtraction and compound adders, and fast circuits for approximate counting of leading zeros from borrow-save representation.
Abstract: An IEEE floating-point adder (FP-adder) design. The adder accepts normalized numbers, supports all four IEEE rounding modes, and outputs the correctly normalized rounded sum/difference in the format required by the IEEE Standard. The latency of the design for double precision is roughly 24 logic levels, not including delays of latches between pipeline stages. Moreover, the design can be easily partitioned into two stages comprised of twelve logic levels each, and hence, can be used with clock periods that allow for twelve logic levels between latches. The FP-adder design achieves a low latency by combining various optimization techniques, including a non-standard separation into two paths, a simple rounding algorithm, unifying rounding cases for addition and subtraction, sign-magnitude computation of a difference based on one's complement subtraction, compound adders, and fast circuits for approximate counting of leading zeros from borrow-save representation. A comparison of the design with other implementations suggests a reduction in the latency by at least two logic levels as well as simplified rounding implementation. A reduced precision version of the FP adder has been verified by exhaustive testing.
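The "separation into two paths" the patent mentions is a variant of the classic two-path FP-adder design. The textbook criterion (shown below as an illustration; the patent's separation is explicitly non-standard) is that massive cancellation, which needs a large normalization shift, can only occur on effective subtraction with exponent difference at most one; every other case takes the far path, where at most a one-position normalization is needed.

```python
def fp_add_path(exp_a, exp_b, effective_subtract):
    """Classify an FP add/subtract into the near or far path.

    Classic two-path rule (illustrative, not the patent's exact
    partition): only effective subtraction of operands whose exponents
    differ by <= 1 can cancel many leading bits, so only that case
    needs the leading-zero counter and big left shifter.
    """
    if effective_subtract and abs(exp_a - exp_b) <= 1:
        return "near"
    return "far"
```

Because each path can then omit hardware the other needs, both become shorter than a single general-purpose path, which is how such designs cut logic levels.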

Patent
25 Apr 2002
TL;DR: In this paper, a memory controller (MC) includes a buffer control circuit (BCC) used to enable/disable buffers coupled to a terminated bus; the BCC can detect transactions and speculatively enable the buffers before the transaction is completely decoded.
Abstract: A memory controller (MC) includes a buffer control circuit (BCC) to enable/disable buffers coupled to a terminated bus. The BCC can detect transactions and speculatively enable the buffers before the transaction is completely decoded. If the transaction is targeted for the terminated bus, the buffers will be ready to drive signals onto the terminated bus by the time the transaction is ready to be performed, thereby eliminating the “enable buffer” delay incurred in some conventional MCs. If the transaction is not targeted for the terminated bus, the BCC disables the buffers to save power. In MCs that queue transactions, the BCC can snoop the queue to find transactions targeted for the terminated bus and begin enabling the buffers before these particular transactions are fully decoded.
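The speculation is a small state machine: enable on detection, then keep or revert the enable once decoding resolves the target. A toy behavioral sketch (class and method names are illustrative, not from the patent):

```python
class BufferControl:
    """Toy model of speculative bus-buffer enabling.

    Buffers are enabled as soon as a transaction is detected, hiding
    the turn-on delay; if decoding later shows the transaction does not
    target the terminated bus, the enable is rolled back to save power.
    """

    def __init__(self):
        self.enabled = False

    def on_transaction_detected(self):
        self.enabled = True       # speculative enable before decode finishes

    def on_decode_complete(self, targets_terminated_bus):
        if not targets_terminated_bus:
            self.enabled = False  # mis-speculation: disable to save power
```

The win is that on a correct speculation the buffer turn-on delay overlaps decode instead of adding to it; a mis-speculation costs only a brief period of unnecessary enable.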

Book ChapterDOI
16 Sep 2002
TL;DR: This paper focuses on nonlinear isotropic diffusion filtering, which is discretized by means of an additive operator splitting (AOS), and develops an algorithmic implementation with excellent scaling properties on massively connected low latency networks.
Abstract: This paper deals with parallelization and implementation aspects of PDE based image processing models for large cluster environments with distributed memory. As an example we focus on nonlinear isotropic diffusion filtering which we discretize by means of an additive operator splitting (AOS). We start by decomposing the algorithm into small modules that shall be parallelized separately. For this purpose image partitioning strategies are discussed and their impact on the communication pattern and volume is analyzed. Based on the results we develop an algorithmic implementation with excellent scaling properties on massively connected low latency networks. Test runs on a high-end Myrinet cluster yield almost linear speedup factors up to 209 for 256 processors. This results in typical denoising times of 0.5 seconds for five iterations on a 256 × 256 × 128 data cube.
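The building block of an AOS step is a tridiagonal solve along each axis (solved with the Thomas algorithm), with the axis results averaged. A minimal pure-Python sketch of the 1D implicit diffusion solve with reflecting boundaries, as an illustration of the discretization rather than the paper's parallel code:

```python
def implicit_diffusion_1d(u, tau):
    """One implicit 1D diffusion step: solve (I - tau*A) x = u.

    A is the standard 1D Laplacian stencil [1, -2, 1] with reflecting
    (Neumann) boundaries, so the scheme conserves total mass. Solved
    with the Thomas algorithm in O(n). Illustrative 1D AOS building
    block for linear diffusion; the paper's filter is nonlinear.
    """
    n = len(u)
    a = [-tau] * n              # sub-diagonal (a[0] unused)
    c = [-tau] * n              # super-diagonal (c[-1] unused)
    b = [1.0 + 2.0 * tau] * n   # main diagonal
    b[0] = b[-1] = 1.0 + tau    # reflecting boundary rows
    # forward elimination
    cp = [0.0] * n
    dp = [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = u[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (u[i] - a[i] * dp[i - 1]) / m
    # back substitution
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

Because each row (or column) solve is independent, the image can be partitioned across cluster nodes with communication only at partition borders, which is what makes the AOS scheme parallelize so well.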


Proceedings ArticleDOI
25 Sep 2002
TL;DR: In this article, the authors demonstrate a simple scheme to overcome the limitations of global interconnects due to their latency and power consumption, based on the utilization of upper-level metals and reduced voltage swing.
Abstract: Global interconnects have been identified as a serious limitation to chip scaling, due to their latency and power consumption. We demonstrate a simple scheme to overcome these limitations, based on the utilization of upper-level metals and reduced voltage swing. The upper-level metal allows velocity-of-light delay if properly dimensioned, and power is optimized by an appropriate choice of voltage swing and receiver amplifier.

Journal ArticleDOI
TL;DR: The architecture of a synchronized event-based control and data acquisition system that aims to significantly improve the performance of current systems is presented; it explores recent developments in data transport, signal processing and system synchronization.

Patent
25 Feb 2002
TL;DR: In this article, a low latency memory system access is provided in association with a weakly-ordered multiprocessor system, where each processor (12-1, 12-2) shares resources, and each shared resource has an associated lock within a locking device (10) that provides support for synchronization between the multiple processors (12-1, 12-2) in the multiprocessor and the orderly sharing of the resources.
Abstract: A low latency memory system access is provided in association with a weakly-ordered multiprocessor system (Fig. 1). Each processor (12-1, 12-2) in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device (10) that provides support for synchronization between the multiple processors (12-1, 12-2) in the multiprocessor and the orderly sharing of the resources. A processor (12-1, 12-2) only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor (12-1, 12-2) to own a lock requires only a single load operation, rather than a traditional atomic load followed by store, such that the processor (12-1, 12-2) only performs a read operation and the hardware locking device (10) performs a subsequent write operation rather than the processor (12-1, 12-2).
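The single-load acquire can be modeled behaviorally: the processor issues only a read, and the lock device itself performs the write side-effect that records ownership. A toy sketch (class and method names are illustrative, not from the patent):

```python
class LockingDevice:
    """Toy hardware lock box: one read both tests and takes a lock.

    Models the patent's idea that the processor performs only a load;
    the device performs the subsequent ownership write itself, so no
    atomic load-then-store pair is needed on the processor side.
    """

    def __init__(self, num_locks):
        self.owner = [None] * num_locks

    def load(self, lock_id, processor_id):
        """A read that returns 1 (acquired) or 0 (busy)."""
        if self.owner[lock_id] is None:
            self.owner[lock_id] = processor_id  # device-side write
            return 1
        return 0

    def release(self, lock_id, processor_id):
        if self.owner[lock_id] == processor_id:
            self.owner[lock_id] = None
```

Collapsing acquire to one bus read is what makes the scheme low latency in a weakly-ordered system, since no ordering fence around a read-modify-write pair is required.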

Patent
Vipin S. Boyanapalli
30 Jun 2002
TL;DR: In this paper, a method and apparatus for low latency Forward Error Correction (FEC) is described; the low latency FEC can be implemented utilizing shift registers, at least one Linear Feedback Shift Register (LFSR), and a local reference table.
Abstract: A method and apparatus for low latency Forward Error Correction (FEC) is described. The low latency FEC can be implemented utilizing shift registers, at least one Linear Feedback Shift Register (LFSR), and a local reference table.
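An LFSR is the workhorse of many FEC encoders (CRC and cyclic codes, for instance): each clock shifts the state and feeds back the XOR of selected tap bits. A minimal Fibonacci-LFSR sketch; the seed and taps are illustrative, not the patent's configuration:

```python
def lfsr_stream(seed, taps, nbits):
    """Generate nbits output bits from a Fibonacci LFSR.

    State is an integer; each step emits the LSB, XORs the tap bits
    to form the feedback, shifts right, and inserts the feedback as
    the new MSB. Register width is max(taps) + 1.
    """
    state = seed
    width = max(taps) + 1
    out = []
    for _ in range(nbits):
        out.append(state & 1)
        fb = 0
        for t in taps:
            fb ^= (state >> t) & 1
        state = (state >> 1) | (fb << (width - 1))
    return out
```

With taps [0, 1] the 2-bit register cycles through its maximal period of 3 nonzero states, which is why well-chosen (primitive) tap sets matter in practice.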

Proceedings ArticleDOI
10 Jan 2002
TL;DR: In this article, a high performance network interface prototype for PC clusters called DIMMnet-1 is presented, which can be directly plugged into the slot of a PC and uses both low latency AOTF (atomic on-the-fly) sending and high bandwidth BOTF sending to overcome the overhead caused by standard I/O bus such as a PCI bus.
Abstract: A high performance network interface prototype for PC clusters called DIMMnet-1 that can be directly plugged into a DIMM slot of a PC is presented. By using both a low latency AOTF (atomic on-the-fly) sending and a high bandwidth BOTF (block on-the-fly) sending, it can overcome the overhead caused by standard I/O bus such as a PCI bus. Currently, two types of DIMMnet-1 prototype boards (providing optical and electrical network interface) equipped with a network interface controller chip Martini are available. They can be plugged into a 100 MHz DIMM slot of a PC with Pentium 3, Pentium 4 or Athlon. Experimental evaluation results of communication performance with the AOTF sending on a real system are shown. Estimated bandwidth with the BOTF sending is also shown.

Patent
12 Jun 2002
TL;DR: The distributed data handling and processing resources system of the present invention includes a) a number of data handling and processing resource nodes that collectively perform a desired data handling and processing function, and b) a low latency, shared bandwidth databus for interconnecting the data handling and processing resource nodes, as discussed by the authors.
Abstract: The distributed data handling and processing resources system of the present invention includes a) a number of data handling and processing resource nodes that collectively perform a desired data handling and processing function, each data handling and processing resource node for providing a data handling/processing subfunction; and, b) a low latency, shared bandwidth databus for interconnecting the data handling and processing resource nodes. In the least, among the data handling and processing resource nodes, is a processing unit (PU) node for providing a control and data handling/processing subfunction; and, an input/output (I/O) node for providing a data handling/processing subfunction for data collection/distribution to an external environment. The present invention preferably uses the IEEE-1394b databus due to its unique and specialized low latency, shared bandwidth characteristics.

Proceedings ArticleDOI
27 Oct 2002
TL;DR: Simulation results show the potency of PCSMA for implementing low latency, high throughput and efficient connectivity, and a new and better data link that could replace CSMA with relative ease is tested.
Abstract: While the results of this paper are similar to those of previous research, the technical difficulties present previously are eliminated here, producing better results and enabling one to more readily see the benefits of Prioritized CSMA (PCSMA). A new analysis section also helps to generalize this research so that it is not limited to exploration of the new concept of PCSMA. Simulations using commercially available network simulation software (OPNET version 7.0) are presented, involving an important application of the Aeronautical Telecommunications Network (ATN): Controller Pilot Data Link Communications (CPDLC) over the Very High Frequency Data Link Mode 2 (VDL-2). Communication is modeled for essentially all incoming and outgoing nonstop air-traffic for just three United States cities: Cleveland, Cincinnati, and Detroit. Collision-less PCSMA is successfully tested and compared with the traditional CSMA typically associated with VDL-2. The performance measures include latency, throughput, and packet loss. As expected, PCSMA is much quicker and more efficient than traditional CSMA. These simulation results show the potency of PCSMA for implementing low latency, high throughput and efficient connectivity. We are also testing a new and better data link that could replace CSMA with relative ease.
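The collision-free property of a prioritized access scheme can be pictured with a toy slot arbiter: each ready station defers by a delay determined by its priority, so with distinct priorities exactly one station seizes the channel. This is an illustrative sketch of the general idea, not the paper's VDL-2 protocol details:

```python
def next_transmitter(ready_stations, priorities):
    """Pick the winner of a prioritized-CSMA contention slot (toy model).

    priorities maps station -> rank (lower rank defers less, so it
    transmits first). With distinct ranks among ready stations, only
    one station sees the channel idle long enough to send, so the
    contention resolves without a collision.
    """
    if not ready_stations:
        return None
    return min(ready_stations, key=lambda s: priorities[s])
```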

Proceedings ArticleDOI
23 Oct 2002
TL;DR: A replication algorithm is presented that can be embedded in the storage APIs provided by cluster storage systems, improves the availability of storage, and implements three data-consistency criteria for developers to select for their applications.
Abstract: This paper presents a replication algorithm, which can be embedded in storage APIs provided by cluster storage systems and can improve the availability of storage. Compared with previous methods, this orthogonal algorithm is independent of data types and upper-level applications. It implements three data-consistency criteria for developers to select for their applications. Availability is object based and can be dynamically adjusted. Moreover, low latency of commands is obtained by reducing the communication among replicas as much as possible. We implemented this method partially in TODS, which is a distributed object persistent system running on COCs.

Proceedings ArticleDOI
21 May 2002
TL;DR: Experimental evaluation illustrates that when using enhanced communication features such as DMA transfers, memory-mapped interfaces and zero-copy mechanisms, overall performance is considerably improved compared to using conventional, CPU and kernel bounded, communication primitives.
Abstract: This paper describes the performance benefits attained using enhanced network interfaces to achieve low latency communication. We make use of DMA communication mode to send data to other nodes while the CPU performs useful calculations. Zero-copy communication is achieved through pinned-down physical memory regions, provided by the NIC's driver modules. Our testbed concerns the parallel execution of tiled nested loops onto a Linux PC cluster with PCI-SCI NICs (Dolphin D330). Tiles are essentially exchanging data and should also have a large computational grain, so that their parallel execution becomes beneficial. We schedule tiles much more efficiently by exploiting the inherent overlapping between communication and computation phases among successive, atomic tile executions. The applied nonblocking schedule resembles a pipelined data-path where computation phases are overlapped with communication ones, instead of being interleaved with them. Experimental evaluation illustrates that when using enhanced communication features such as DMA transfers, memory-mapped interfaces and zero-copy mechanisms, overall performance is considerably improved compared to using conventional, CPU and kernel bounded, communication primitives.
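The overlap in the nonblocking schedule can be sketched with threads standing in for the DMA engine: the transfer of the previous tile's results runs in the background while the CPU computes the current tile, with a join before the next tile begins. This is a behavioral analogue only; the paper uses PCI-SCI DMA, not Python threads, and the names here are illustrative.

```python
import threading

def run_tile(compute, send_prev):
    """Overlap one tile's computation with sending the previous result.

    send_prev plays the role of the DMA transfer of tile i-1; it runs
    concurrently while compute() produces tile i on the CPU. The join
    models waiting for the transfer to finish before the next tile.
    """
    sender = threading.Thread(target=send_prev)
    sender.start()      # "DMA" transfer proceeds in the background
    result = compute()  # CPU keeps computing meanwhile
    sender.join()       # transfer must complete before the next tile
    return result
```

When communication and computation per tile take comparable time, this pipelining roughly halves the per-tile cost relative to the interleaved (blocking) schedule.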