
Showing papers on "Latency (engineering)" published in 2002


Proceedings ArticleDOI
07 Aug 2002
TL;DR: A new active queue management (AQM) algorithm called GREEN is introduced that provides high link utilization whilst maintaining low delay and packet loss, and enables low latency interactive applications such as telephony and network games.
Abstract: In this paper we introduce a new active queue management (AQM) algorithm called GREEN. GREEN provides high link utilization whilst maintaining low delay and packet loss. GREEN enables low latency interactive applications such as telephony and network games. GREEN is shown to outperform the current AQM algorithms. Certain performance problems with current AQMs are discussed.
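The control idea behind rate-based AQM schemes like GREEN can be pictured with a toy marking-probability update. The sketch below is a generic illustration, not GREEN's published update rule; the function name, step size, and comparison are assumptions.

```python
def update_mark_prob(p, arrival_rate, target_rate, step=0.01):
    """Adjust the packet-marking probability toward a target link rate.

    Generic rate-based AQM control loop (illustrative only): mark more
    aggressively when the measured arrival rate exceeds the target,
    back off otherwise. The result is clamped to [0, 1].
    """
    if arrival_rate > target_rate:
        p = min(1.0, p + step)
    else:
        p = max(0.0, p - step)
    return p
```

Run repeatedly per measurement interval, such a loop settles near the marking rate that keeps the queue short, which is what enables the low-delay operation the paper targets.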

86 citations


Patent
22 Nov 2002
TL;DR: In this paper, an apparatus and method for low latency power management on a serial data link are described; the method includes detecting an electrical idle exit condition during receiver operation in an electrical idle state, after which data synchronization is performed according to one or more received data synchronization training patterns.
Abstract: An apparatus and method for low latency power management on a serial data link are described. In one embodiment, the method includes the detection of an electrical idle exit condition during receiver operation in an electrical idle state. Once detected, data synchronization is performed according to one or more received data synchronization training patterns. Finally, when the synchronization is performed within a determined synchronization re-establishment period, the receiver will resume operation according to a normal power state. Accordingly, the embodiment described illustrates an open loop, low latency power resumption operation for power management within 3GIO links.

84 citations


Journal ArticleDOI
TL;DR: This work demonstrates a novel optical time division multiplexing packet-level system-synchronization and address-comparison technique, which relies on cascaded semiconductor-based optical logic gates operating at 50-Gb/s line rates.
Abstract: We demonstrate a novel optical time division multiplexing packet-level system-synchronization and address-comparison technique, which relies on cascaded semiconductor-based optical logic gates operating at 50-Gb/s line rates. Synchronous global clock distribution is used to achieve fixed length packet-synchronization that is resistant to channel-induced timing delays, and straightforward to achieve using a single optical logic gate. Four-bit address processing is achieved using a pulse-position modulated header input to a single optical logic gate, which provides Boolean XOR functionality, low latency, and stability over >1 h time periods with low switching energy <100 fJ.
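At the bit level, the XOR address comparison reduces to checking that the packet header and the local address differ in no bit position, so a match produces no '1' pulses at the gate output. A hypothetical software analogue of that check:

```python
def address_match(header_bits, local_bits):
    """Boolean XOR address comparison (bit-level analogue, illustrative).

    Mirrors the optical-logic-gate idea: XOR each header bit with the
    local address bit; an all-zero result means the addresses match.
    """
    assert len(header_bits) == len(local_bits)
    xor_out = [h ^ l for h, l in zip(header_bits, local_bits)]
    return not any(xor_out)
```

In the paper this comparison is done in a single optical gate at line rate, which is where the low latency comes from.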

58 citations


Proceedings Article
01 Jan 2002
TL;DR: A low latency real-time Broadcast News recognition system capable of transcribing live television newscasts with reasonable accuracy is presented, along with recent modeling and efficiency improvements that yield a 22% word error rate on the Hub4e98 test set while running faster than real-time.
Abstract: In this paper, we present a low latency real-time Broadcast News recognition system capable of transcribing live television newscasts with reasonable accuracy. We describe our recent modeling and efficiency improvements that yield a 22% word error rate on the Hub4e98 test set while running faster than real-time. These include the discriminative training of a feature transform and the acoustic model, and the optimization of the likelihood computation. We give experimental results that show the accuracy of the system at different speeds. We also explain how we achieved low latency, presenting measurements that show the typical system latency is less than 1 second.

55 citations


Proceedings ArticleDOI
07 Aug 2002
TL;DR: This paper introduces NeighborCasting, a fast handoff mechanism for wireless IP networks that utilizes neighboring foreign agent (FA) information, and demonstrates that handoff latency is substantially reduced while the typical overhead is minimally increased.
Abstract: This paper introduces a fast handoff mechanism, NeighborCasting, for use in wireless IP networks that utilize neighboring foreign agent (FA) information. NeighborCasting is based on the policy of utilizing, or perhaps even wasting, wired bandwidth between foreign agents, while minimizing RF (radio frequency) bandwidth exchanges, so that handoff latency is minimized. We demonstrate that the handoff latency is substantially reduced, while the typical overhead is minimally increased. Handoff latency is minimized by initiating data forwarding to the possible new foreign agent candidates (i.e., the neighbor foreign agents) at the time that the mobile node initiates the link-layer handoff procedure. NeighborCasting builds upon the Mobile IP handoff procedure by adding a small number of additional message types. The handoff mechanism is a unified procedure for inter-domain, intra-domain and inter-technology (e.g., LAN to WAN or TDMA to CDMA) handoffs and provides flexible choices to the network, while maintaining transparency to the mobile node. The neighbor FA discovery process is a distributed and dynamic mechanism, and the fast handoff schemes are scalable and reliable.
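The latency argument can be sketched with a toy additive model: without pre-forwarding, data arrives at the new FA only after Mobile IP registration completes; with NeighborCasting, forwarding starts at link-layer handoff initiation, so data is already waiting. The parameter names and the additive model below are illustrative assumptions, not the paper's measurements.

```python
def handoff_latency(link_layer_ms, registration_ms, pre_forwarded):
    """Toy handoff-latency model with and without neighbor pre-forwarding.

    pre_forwarded=True models NeighborCasting: data was forwarded to
    the candidate FAs when the link-layer handoff began, so only the
    link-layer handoff itself is on the critical path (illustrative).
    """
    if pre_forwarded:
        return link_layer_ms                  # data waits at the new FA
    return link_layer_ms + registration_ms    # data arrives after registration
```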

50 citations


Journal ArticleDOI
TL;DR: The simulation results show that HIPIQS can deliver performance close to that of output queuing approaches over a range of message sizes, system sizes, and traffic, and can be used to build high performance switches that are useful both for parallel system interconnects and for building computer networks.
Abstract: Switch-based interconnects are used in a number of application domains, including parallel system interconnects, local area networks, and wide area networks. However, very few switches have been designed that are suitable for more than one of these application domains. Such a switch must offer both extremely low latency and very high throughput for a variety of different message sizes. While some architectures with output queuing have been shown to perform extremely well in terms of throughput, their performance can suffer when used in systems where a significant portion of the packets are extremely small. On the other hand, architectures with input queuing offer limited throughput or require fairly complex and centralized arbitration that increases latency. In this paper, we present a new input queue-based switch architecture called HIPIQS (HIgh-Performance Input-Queued Switch). It offers low latency for a range of message sizes and provides throughput comparable to that of output queuing approaches. Furthermore, it allows simple and distributed arbitration. HIPIQS uses a dynamically allocated multiqueue organization, pipelined access to multibank input buffers, and small cross-point buffers to deliver high performance. Our simulation results show that HIPIQS can deliver performance close to that of output queuing approaches over a range of message sizes, system sizes, and traffic. The switch architecture can therefore be used to build high performance switches that are useful for both parallel system interconnects and for building computer networks.

44 citations


Journal ArticleDOI
TL;DR: Unlike the original TPC decoder, which performs row and column decoding in a serial fashion, a parallel decoder structure is proposed; simulations show that the decoding latency of TPCs can be halved while maintaining virtually the same performance level.
Abstract: There has been intensive focus on turbo product codes (TPCs), which have low decoding complexity and achieve near-optimum performance at low signal-to-noise ratios. Unlike the original TPC decoder, which performs row and column decoding in a serial fashion, we propose a parallel decoder structure. Simulation results show that with this approach, the decoding latency of TPCs can be halved while maintaining virtually the same performance level.
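The halving claim follows directly from the scheduling: serial decoding pays for a row pass plus a column pass per iteration, while the parallel structure runs them concurrently and pays only the longer of the two. A toy cost model (names and unit times are illustrative):

```python
def decoding_latency(iterations, row_time, col_time, parallel=False):
    """Compare serial vs. parallel row/column TPC decoding latency.

    Serial decoding alternates full row and column half-iterations,
    costing row_time + col_time per iteration; the parallel structure
    runs them concurrently, costing max(row_time, col_time). Toy model.
    """
    per_iter = max(row_time, col_time) if parallel else row_time + col_time
    return iterations * per_iter
```

With equal row and column decoding times, the parallel schedule is exactly half the serial latency, matching the paper's headline result.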

42 citations


Proceedings ArticleDOI
07 Aug 2002
TL;DR: This paper presents an implementation of a convolutional turbo codec core based on innovative solutions for broadband turbo coding; implemented in a CMOS 0.18 μm technology, the core yields a final throughput of up to 80.7 Mb/s.
Abstract: Turbo coding has reached the step in which its astonishing coding gain is already being proven in real applications. Moreover, its applicability to future broadband communications systems is starting to be investigated. In order to be useful in this domain, special turbo codec architectures that cope with low latency, high throughput, low power consumption and high flexibility are needed. This paper presents an implementation of a convolutional turbo codec core based on innovative solutions for those requirements. The combination of a systematic data storage and transfer optimization with high and low level architectural solutions yields a final throughput up to 80.7 Mb/s, a decoding latency of 10 μs and a power consumption of less than 50 nJ/bit. The 14.7 mm² full-duplex full-parallel core, implemented in a CMOS 0.18 μm technology, is a complete flexible solution for broadband turbo coding.

33 citations


Patent
20 Feb 2002
TL;DR: In this article, the frame switch point is placed at the completion of frame decoding and the bottom border of the scaled image is synchronized therewith while maintaining low latency of decoded data; high latency operation is provided only when necessitated by minimal spill buffer capacity and in combination with fractional image size reduction in the decoding path.
Abstract: Loss of decoding time prior to the vertical synchronization signal when motion video is arbitrarily scaled and positioned is avoided by placing the frame switch point at the completion of frame decoding and synchronizing the bottom border of the scaled image therewith, while maintaining low latency of decoded data. High latency operation is provided only when necessitated by minimal spill buffer capacity and in combination with fractional image size reduction in the decoding path in order to maintain image resolution without requiring additional memory.

28 citations


Patent
25 Feb 2002
TL;DR: In this paper, a global interrupt and barrier network is presented that implements logic for generating global interrupt and barrier signals for controlling global asynchronous operations performed by processing elements at selected processing nodes (12) of a computing structure in accordance with a processing algorithm.
Abstract: A system and method for generating global asynchronous signals in a computing structure. Particularly, a global interrupt and barrier network is implemented that implements logic for generating global interrupt and barrier signals for controlling global asynchronous operations performed by processing elements at selected processing nodes (12) of a computing structure in accordance with a processing algorithm; and includes the physical interconnecting of the processing nodes (12) for communicating the global interrupt and barrier signals to the elements via low latency paths. The global asynchronous signals respectively initiate interrupt and barrier operations at the processing nodes (12) at times selected for optimizing performance of the processing algorithms. In one embodiment, the global interrupt and barrier network is implemented in a scalable, massively parallel supercomputing device structure comprising a plurality of processing nodes interconnected by multiple independent networks.

28 citations


Journal ArticleDOI
TL;DR: A new class of low-cost, bounded-delay multicast heuristics for WDM networks that decouple the cost of establishing the multicast tree from the delay incurred by data transmission due to lightwave conversion and processing at intermediate nodes along the transmission path are presented.

Book ChapterDOI
25 Aug 2002
TL;DR: A color segmentation algorithm for embedded real-time systems with a special focus on latencies is presented, part of a Hardware-Software-System that realizes fast reactions on visual stimuli in highly dynamic environments.
Abstract: This paper presents a color segmentation algorithm for embedded real-time systems with a special focus on latencies. The algorithm is part of a Hardware-Software-System that realizes fast reactions on visual stimuli in highly dynamic environments. There is furthermore the constraint to use low-cost hardware to build the system. Our system is implemented on a RoboCup middle size league prototype robot.

Proceedings ArticleDOI
Hyeong-Ju Kang, In-Cheol Park
13 May 2002
TL;DR: In this article, the authors proposed a new decoding structure of Reed-Solomon codes that can operate as fast as the serial structure and has as short latency as the parallel structure.
Abstract: This paper presents a new decoding structure of Reed-Solomon (RS) codes that are widely used for channel coding. Although many decoding structures have been developed, the serial structures have long latency and the parallel structures are not fast enough to deal with the demands of high-speed decoding. To achieve both short latency and fast operation, the summation of the products of syndromes is eliminated and the difference used to calculate the error locator polynomial is incrementally updated. The proposed structure, called a dual-line structure, can operate as fast as the serial structure and has as short a latency as the parallel structure. In addition, the dual-line structure is regular and easy to implement. Experimental results confirm these advantages at the cost of a small hardware increase.

Proceedings ArticleDOI
12 May 2002
TL;DR: Experiences with Early Cancellation --- an optimization for Time-Warp that cancels messages in place upon early discovery of a rollback --- are presented, and it is believed that there is a large scope for additional optimizations using this model.
Abstract: Parallel Discrete Event Simulation (PDES) on a cluster of workstations is a fine grained application where the communication performance can dictate the efficiency of the simulation. The high performance Local/System Area Networks used in high-end clusters are capable of delivering data with high bandwidth and low latency. Unfortunately, the communication rate far out-paces the capabilities of workstation nodes to handle it (I/O bus, memory bus, CPU resources). For this reason, many vendors are offering a programmable processor on the NIC to allow application specific optimization of the communication path. This invites a new implementation model for distributed applications where: (i) application specific communication optimizations can be implemented on the NIC; (ii) portions of the application that are most heavily communicating can be migrated to the NIC; (iii) some messages can be filtered out at the NIC without burdening the primary processor resources; and (iv) critical events are detected and handled early. The aim of our research is to investigate the utility of this model for PDES and to gain initial experiences in the implementation challenges and potential performance improvement. In this paper, we present our experiences with Early Cancellation --- an optimization for Time-Warp that cancels messages in place upon early discovery of a rollback. We believe that there is a large scope for additional optimizations using this model.
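The core of early cancellation is simple: when a rollback to time t is discovered, any message still queued on the NIC with a timestamp at or after t can be dropped in place, avoiding both the send and the later anti-message. A toy sketch of that queue filter (message representation and field names are illustrative):

```python
def cancel_in_place(send_queue, rollback_time):
    """Drop queued messages invalidated by a Time-Warp rollback.

    Toy model of early cancellation: messages timestamped at or after
    the rollback point are removed before they ever leave the NIC, so
    no anti-message round trip is needed for them. Returns the
    surviving queue and the number of cancelled messages.
    """
    kept = [m for m in send_queue if m["ts"] < rollback_time]
    cancelled = len(send_queue) - len(kept)
    return kept, cancelled
```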

Journal Article
TL;DR: In this article, a 20-data-channel transceiver with a control channel allows uncoded data transfer with 13ns latency; a digital DLL tracks phase with 20ps resolution, and the effective full-duplex bandwidth reaches 10GB/s.
Abstract: A 20-data-channel transceiver with a control channel allows uncoded data transfer with 13ns latency. A digital DLL (Delay Locked Loop) with a ring-interpolator tracks phase with 20ps resolution. A pre-emphasis driver enables 2Gbps transmission per channel over a 7m cable at 1.5V supply. The effective full-duplex bandwidth reaches 10GB/s.

Proceedings Article
22 Sep 2002
TL;DR: The round-trip time for AOTF on this incompletely tuned DIMMnet-1 is 7.5 times faster than Myrinet2000, and the barrier synchronization time is 4 times faster than that of an SR8000 supercomputer, showing that DIMMnet-1 holds promise for applications in which scalable performance with traditional approaches is difficult because of frequent data exchange.
Abstract: DIMMnet-1 is a high performance network interface for PC clusters that can be directly plugged into the DIMM slot of a PC. By using both low latency AOTF (Atomic On-The-Fly) sending and high bandwidth BOTF (Block On-The-Fly) sending, it can overcome the overhead caused by standard I/O such as the PCI bus. Two types of DIMMnet-1 prototype boards (providing optical and electrical network interfaces) containing a Martini network interface controller chip are currently available. They can be plugged into a 100MHz DIMM slot of a PC with a Pentium-3, Pentium-4 or Athlon processor. The round-trip time for AOTF on this incompletely tuned DIMMnet-1 is 7.5 times faster than Myrinet2000. The barrier synchronization time for AOTF is 4 times faster than that of an SR8000 supercomputer. The inter-two-node floating sum operation time is 1903 ns. This shows that DIMMnet-1 holds promise for applications in which scalable performance with traditional approaches is difficult because of frequent data exchange.

Patent
29 Mar 2002
TL;DR: In this paper, a storage processor particularly suited to RAID systems provides high throughput for applications such as streaming video data and is configured as an ASIC with a high degree of parallelism in its interconnections.
Abstract: A storage processor particularly suited to RAID systems provides high throughput for applications such as streaming video data. An embodiment is configured as an ASIC with a high degree of parallelism in its interconnections. The communications architecture provides saturation of user data pathways with low complexity and low latency by employing multiple memory channels under software control, an efficient parity calculation mechanism and other features.

Patent
06 May 2002
TL;DR: The IEEE floating-point adder (FP-adder) as discussed by the authors achieves low latency by combining various optimization techniques, including a non-standard separation into two paths, a simple rounding algorithm, unifying rounding cases for addition and subtraction, sign-magnitude computation of a difference based on one's complement subtraction and compound adders, and fast circuits for approximate counting of leading zeros from borrow-save representation.
Abstract: An IEEE floating-point adder (FP-adder) design. The adder accepts normalized numbers, supports all four IEEE rounding modes, and outputs the correctly normalized rounded sum/difference in the format required by the IEEE Standard. The latency of the design for double precision is roughly 24 logic levels, not including delays of latches between pipeline stages. Moreover, the design can be easily partitioned into two stages comprised of twelve logic levels each, and hence, can be used with clock periods that allow for twelve logic levels between latches. The FP-adder design achieves a low latency by combining various optimization techniques, including a non-standard separation into two paths, a simple rounding algorithm, unifying rounding cases for addition and subtraction, sign-magnitude computation of a difference based on one's complement subtraction, compound adders, and fast circuits for approximate counting of leading zeros from borrow-save representation. A comparison of the design with other implementations suggests a reduction in the latency by at least two logic levels as well as simplified rounding implementation. A reduced precision version of the FP adder has been verified by exhaustive testing.
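The "separation into two paths" the patent mentions is a variant of the classic two-path FP-adder design. The textbook criterion (shown below as an illustration; the patent's separation is explicitly non-standard) is that massive cancellation, which needs a large normalization shift, can only occur on effective subtraction with exponent difference at most one; every other case takes the far path, where at most a one-position normalization is needed.

```python
def fp_add_path(exp_a, exp_b, effective_subtract):
    """Classify an FP add/subtract into the near or far path.

    Classic two-path rule (illustrative, not the patent's exact
    partition): only effective subtraction of operands whose exponents
    differ by <= 1 can cancel many leading bits, so only that case
    needs the leading-zero counter and big left shifter.
    """
    if effective_subtract and abs(exp_a - exp_b) <= 1:
        return "near"
    return "far"
```

Because each path can then omit hardware the other needs, both become shorter than a single general-purpose path, which is how such designs cut logic levels.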

Patent
25 Apr 2002
TL;DR: In this paper, a memory controller (MC) includes a buffer control circuit (BCC) used to enable/disable buffers coupled to a terminated bus; the BCC can detect transactions and speculatively enable the buffers before the transaction is completely decoded.
Abstract: A memory controller (MC) includes a buffer control circuit (BCC) to enable/disable buffers coupled to a terminated bus. The BCC can detect transactions and speculatively enable the buffers before the transaction is completely decoded. If the transaction is targeted for the terminated bus, the buffers will be ready to drive signals onto the terminated bus by the time the transaction is ready to be performed, thereby eliminating the “enable buffer” delay incurred in some conventional MCs. If the transaction is not targeted for the terminated bus, the BCC disables the buffers to save power. In MCs that queue transactions, the BCC can snoop the queue to find transactions targeted for the terminated bus and begin enabling the buffers before these particular transactions are fully decoded.
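The speculation is a small state machine: enable on detection, then keep or revert the enable once decoding resolves the target. A toy behavioral sketch (class and method names are illustrative, not from the patent):

```python
class BufferControl:
    """Toy model of speculative bus-buffer enabling.

    Buffers are enabled as soon as a transaction is detected, hiding
    the turn-on delay; if decoding later shows the transaction does not
    target the terminated bus, the enable is rolled back to save power.
    """

    def __init__(self):
        self.enabled = False

    def on_transaction_detected(self):
        self.enabled = True       # speculative enable before decode finishes

    def on_decode_complete(self, targets_terminated_bus):
        if not targets_terminated_bus:
            self.enabled = False  # mis-speculation: disable to save power
```

The win is that on a correct speculation the buffer turn-on delay overlaps decode instead of adding to it; a mis-speculation costs only a brief period of unnecessary enable.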

Book ChapterDOI
16 Sep 2002
TL;DR: This paper focuses on nonlinear isotropic diffusion filtering, which is discretized by means of an additive operator splitting (AOS), and develops an algorithmic implementation with excellent scaling properties on massively connected low latency networks.
Abstract: This paper deals with parallelization and implementation aspects of PDE based image processing models for large cluster environments with distributed memory. As an example we focus on nonlinear isotropic diffusion filtering which we discretize by means of an additive operator splitting (AOS). We start by decomposing the algorithm into small modules that shall be parallelized separately. For this purpose image partitioning strategies are discussed and their impact on the communication pattern and volume is analyzed. Based on the results we develop an algorithmic implementation with excellent scaling properties on massively connected low latency networks. Test runs on a high-end Myrinet cluster yield almost linear speedup factors up to 209 for 256 processors. This results in typical denoising times of 0.5 seconds for five iterations on a 256 × 256 × 128 data cube.
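The building block of an AOS step is a tridiagonal solve along each axis (solved with the Thomas algorithm), with the axis results averaged. A minimal pure-Python sketch of the 1D implicit diffusion solve with reflecting boundaries, as an illustration of the discretization rather than the paper's parallel code:

```python
def implicit_diffusion_1d(u, tau):
    """One implicit 1D diffusion step: solve (I - tau*A) x = u.

    A is the standard 1D Laplacian stencil [1, -2, 1] with reflecting
    (Neumann) boundaries, so the scheme conserves total mass. Solved
    with the Thomas algorithm in O(n). Illustrative 1D AOS building
    block for linear diffusion; the paper's filter is nonlinear.
    """
    n = len(u)
    a = [-tau] * n              # sub-diagonal (a[0] unused)
    c = [-tau] * n              # super-diagonal (c[-1] unused)
    b = [1.0 + 2.0 * tau] * n   # main diagonal
    b[0] = b[-1] = 1.0 + tau    # reflecting boundary rows
    # forward elimination
    cp = [0.0] * n
    dp = [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = u[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (u[i] - a[i] * dp[i - 1]) / m
    # back substitution
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

Because each row (or column) solve is independent, the image can be partitioned across cluster nodes with communication only at partition borders, which is what makes the AOS scheme parallelize so well.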


Proceedings ArticleDOI
25 Sep 2002
TL;DR: In this article, the authors demonstrate a simple scheme to overcome the limitations of global interconnects due to their latency and power consumption, based on the utilization of upper-level metals and reduced voltage swing.
Abstract: Global interconnects have been identified as a serious limitation to chip scaling, due to their latency and power consumption. We demonstrate a simple scheme to overcome these limitations, based on the utilization of upper-level metals and reduced voltage swing. The upper-level metal allows velocity-of-light delay if properly dimensioned, and power is optimized by an appropriate choice of voltage swing and receiver amplifier.

Journal ArticleDOI
TL;DR: The architecture of a synchronized event-based control and data acquisition system that aims to significantly improve the performance of current systems is presented; it explores recent developments in data transport, signal processing and system synchronization.

Patent
25 Feb 2002
TL;DR: In this article, a low latency memory system access is provided in association with a weakly-ordered multiprocessor system, where each processor (12-1, 12-2) shares resources, and each shared resource has an associated lock within a locking device (10) that provides support for synchronization between the multiple processors (12-1, 12-2) in the multiprocessor and the orderly sharing of the resources.
Abstract: A low latency memory system access is provided in association with a weakly-ordered multiprocessor system (Fig. 1). Each processor (12-1, 12-2) in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device (10) that provides support for synchronization between the multiple processors (12-1, 12-2) in the multiprocessor and the orderly sharing of the resources. A processor (12-1, 12-2) only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor (12-1, 12-2) to own a lock requires only a single load operation, rather than a traditional atomic load followed by store, such that the processor (12-1, 12-2) only performs a read operation and the hardware locking device (10) performs a subsequent write operation rather than the processor (12-1, 12-2).
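The single-load acquire can be modeled behaviorally: the processor issues only a read, and the lock device itself performs the write side-effect that records ownership. A toy sketch (class and method names are illustrative, not from the patent):

```python
class LockingDevice:
    """Toy hardware lock box: one read both tests and takes a lock.

    Models the patent's idea that the processor performs only a load;
    the device performs the subsequent ownership write itself, so no
    atomic load-then-store pair is needed on the processor side.
    """

    def __init__(self, num_locks):
        self.owner = [None] * num_locks

    def load(self, lock_id, processor_id):
        """A read that returns 1 (acquired) or 0 (busy)."""
        if self.owner[lock_id] is None:
            self.owner[lock_id] = processor_id  # device-side write
            return 1
        return 0

    def release(self, lock_id, processor_id):
        if self.owner[lock_id] == processor_id:
            self.owner[lock_id] = None
```

Collapsing acquire to one bus read is what makes the scheme low latency in a weakly-ordered system, since no ordering fence around a read-modify-write pair is required.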

Patent
Vipin S. Boyanapalli
30 Jun 2002
TL;DR: In this paper, a method and apparatus for low latency Forward Error Correction (FEC) is described; the low latency FEC can be implemented utilizing shift registers, at least one Linear Feedback Shift Register (LFSR), and a local reference table.
Abstract: A method and apparatus for low latency Forward Error Correction (FEC) is described. The low latency FEC can be implemented utilizing shift registers, at least one Linear Feedback Shift Register (LFSR), and a local reference table.
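An LFSR is the workhorse of many FEC encoders (CRC and cyclic codes, for instance): each clock shifts the state and feeds back the XOR of selected tap bits. A minimal Fibonacci-LFSR sketch; the seed and taps are illustrative, not the patent's configuration:

```python
def lfsr_stream(seed, taps, nbits):
    """Generate nbits output bits from a Fibonacci LFSR.

    State is an integer; each step emits the LSB, XORs the tap bits
    to form the feedback, shifts right, and inserts the feedback as
    the new MSB. Register width is max(taps) + 1.
    """
    state = seed
    width = max(taps) + 1
    out = []
    for _ in range(nbits):
        out.append(state & 1)
        fb = 0
        for t in taps:
            fb ^= (state >> t) & 1
        state = (state >> 1) | (fb << (width - 1))
    return out
```

With taps [0, 1] the 2-bit register cycles through its maximal period of 3 nonzero states, which is why well-chosen (primitive) tap sets matter in practice.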

Proceedings ArticleDOI
10 Jan 2002
TL;DR: In this article, a high performance network interface prototype for PC clusters called DIMMnet-1 is presented, which can be directly plugged into the slot of a PC and uses both low latency AOTF (atomic on-the-fly) sending and high bandwidth BOTF sending to overcome the overhead caused by standard I/O bus such as a PCI bus.
Abstract: A high performance network interface prototype for PC clusters called DIMMnet-1 that can be directly plugged into a DIMM slot of a PC is presented. By using both a low latency AOTF (atomic on-the-fly) sending and a high bandwidth BOTF (block on-the-fly) sending, it can overcome the overhead caused by standard I/O bus such as a PCI bus. Currently, two types of DIMMnet-1 prototype boards (providing optical and electrical network interface) equipped with a network interface controller chip Martini are available. They can be plugged into a 100 MHz DIMM slot of a PC with Pentium 3, Pentium 4 or Athlon. Experimental evaluation results of communication performance with the AOTF sending on a real system are shown. Estimated bandwidth with the BOTF sending is also shown.

Patent
12 Jun 2002
TL;DR: The distributed data handling and processing resources system of the present invention includes a) a number of data handling and processing resource nodes that collectively perform a desired data handling and processing function, and b) a low latency, shared bandwidth databus for interconnecting the data handling and processing resource nodes, as discussed by the authors.
Abstract: The distributed data handling and processing resources system of the present invention includes a) a number of data handling and processing resource nodes that collectively perform a desired data handling and processing function, each data handling and processing resource node for providing a data handling/processing subfunction; and, b) a low latency, shared bandwidth databus for interconnecting the data handling and processing resource nodes. In the least, among the data handling and processing resource nodes, is a processing unit (PU) node for providing a control and data handling/processing subfunction; and, an input/output (I/O) node for providing a data handling/processing subfunction for data collection/distribution to an external environment. The present invention preferably uses the IEEE-1394b databus due to its unique and specialized low latency, shared bandwidth characteristics.

Proceedings ArticleDOI
27 Oct 2002
TL;DR: Simulation results show the potency of PCSMA for implementing low latency, high throughput and efficient connectivity, and a new and better data link that could replace CSMA with relative ease is tested.
Abstract: While the results of this paper are similar to those of previous research, the technical difficulties present previously are eliminated here, producing better results and enabling one to more readily see the benefits of Prioritized CSMA (PCSMA). A new analysis section also helps to generalize this research so that it is not limited to exploration of the new concept of PCSMA. Simulations using commercially available network simulation software (OPNET version 7.0) are presented, involving an important application of the Aeronautical Telecommunications Network (ATN): Controller Pilot Data Link Communications (CPDLC) over the Very High Frequency Data Link Mode 2 (VDL-2). Communication is modeled for essentially all incoming and outgoing nonstop air-traffic for just three United States cities: Cleveland, Cincinnati, and Detroit. Collision-less PCSMA is successfully tested and compared with the traditional CSMA typically associated with VDL-2. The performance measures include latency, throughput, and packet loss. As expected, PCSMA is much quicker and more efficient than traditional CSMA. These simulation results show the potency of PCSMA for implementing low latency, high throughput and efficient connectivity. We are also testing a new and better data link that could replace CSMA with relative ease.
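The collision-free property of a prioritized access scheme can be pictured with a toy slot arbiter: each ready station defers by a delay determined by its priority, so with distinct priorities exactly one station seizes the channel. This is an illustrative sketch of the general idea, not the paper's VDL-2 protocol details:

```python
def next_transmitter(ready_stations, priorities):
    """Pick the winner of a prioritized-CSMA contention slot (toy model).

    priorities maps station -> rank (lower rank defers less, so it
    transmits first). With distinct ranks among ready stations, only
    one station sees the channel idle long enough to send, so the
    contention resolves without a collision.
    """
    if not ready_stations:
        return None
    return min(ready_stations, key=lambda s: priorities[s])
```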

Proceedings ArticleDOI
23 Oct 2002
TL;DR: A replication algorithm is presented that can be embedded in the storage APIs provided by cluster storage systems, improves the availability of storage, and implements three data-consistency criteria for developers to select for their applications.
Abstract: This paper presents a replication algorithm, which can be embedded in storage APIs provided by cluster storage systems and can improve the availability of storage. Compared with previous methods, this orthogonal algorithm is independent of data types and upper-level applications. It implements three data-consistency criteria for developers to select for their applications. Availability is object based and can be dynamically adjusted. Moreover, low latency of commands is obtained by reducing the communication among replicas as much as possible. We implemented this method partially in TODS, which is a distributed object persistent system running on COCs.

Proceedings ArticleDOI
21 May 2002
TL;DR: Experimental evaluation illustrates that when using enhanced communication features such as DMA transfers, memory-mapped interfaces and zero-copy mechanisms, overall performance is considerably improved compared to using conventional, CPU and kernel bounded, communication primitives.
Abstract: This paper describes the performance benefits attained using enhanced network interfaces to achieve low latency communication. We make use of DMA communication mode to send data to other nodes while the CPU performs useful calculations. Zero-copy communication is achieved through pinned-down physical memory regions, provided by the NIC's driver modules. Our testbed concerns the parallel execution of tiled nested loops onto a Linux PC cluster with PCI-SCI NICs (Dolphin D330). Tiles are essentially exchanging data and should also have a large computational grain, so that their parallel execution becomes beneficial. We schedule tiles much more efficiently by exploiting the inherent overlapping between communication and computation phases among successive, atomic tile executions. The applied nonblocking schedule resembles a pipelined data-path where computation phases are overlapped with communication ones, instead of being interleaved with them. Experimental evaluation illustrates that when using enhanced communication features such as DMA transfers, memory-mapped interfaces and zero-copy mechanisms, overall performance is considerably improved compared to using conventional, CPU and kernel bounded, communication primitives.
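The overlap in the nonblocking schedule can be sketched with threads standing in for the DMA engine: the transfer of the previous tile's results runs in the background while the CPU computes the current tile, with a join before the next tile begins. This is a behavioral analogue only; the paper uses PCI-SCI DMA, not Python threads, and the names here are illustrative.

```python
import threading

def run_tile(compute, send_prev):
    """Overlap one tile's computation with sending the previous result.

    send_prev plays the role of the DMA transfer of tile i-1; it runs
    concurrently while compute() produces tile i on the CPU. The join
    models waiting for the transfer to finish before the next tile.
    """
    sender = threading.Thread(target=send_prev)
    sender.start()      # "DMA" transfer proceeds in the background
    result = compute()  # CPU keeps computing meanwhile
    sender.join()       # transfer must complete before the next tile
    return result
```

When communication and computation per tile take comparable time, this pipelining roughly halves the per-tile cost relative to the interleaved (blocking) schedule.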