
Showing papers on "Latency (engineering)" published in 1995


Journal Article
TL;DR: The differences in communication characteristics between workstation clusters built from standard hardware and software components and state-of-the-art multiprocessors are discussed, and a prototype implementation of an active message communication layer is evaluated.
Abstract: Today's communication architectures for parallel machines reduce communication overheads and latencies by over an order of magnitude. However, carrying these techniques over to workstation clusters connected by an ATM network presents major design challenges. We discuss the differences in communication characteristics between workstation clusters built from standard hardware and software components and state-of-the-art multiprocessors, and then evaluate a prototype implementation of an active message communication layer. Application round-trip latencies of about 50 microseconds for small messages are roughly comparable to those of a similar implementation on the Thinking Machines CM-5 multiprocessor.

79 citations
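The core idea behind an active message layer is that each message names a handler to run on arrival, so the receiver dispatches the payload directly instead of buffering and parsing it. The sketch below shows that dispatch in a single process. The classes `ActiveMessage` and `Endpoint` are hypothetical, and the "network" is only a list append. This is not the authors' ATM implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ActiveMessage:
    handler_id: int  # index of the handler to run at the receiver
    args: tuple      # small payload passed straight to the handler


class Endpoint:
    """Toy endpoint: a handler table plus a queue of delivered messages."""

    def __init__(self) -> None:
        self.handlers: Dict[int, Callable[..., None]] = {}
        self.inbox: List[ActiveMessage] = []

    def register(self, handler_id: int, fn: Callable[..., None]) -> None:
        self.handlers[handler_id] = fn

    def deliver(self, msg: ActiveMessage) -> None:
        # Stands in for the network: the message lands in the receiver's queue.
        self.inbox.append(msg)

    def poll(self) -> int:
        """Run the handler named in each pending message; return how many ran."""
        count = 0
        while self.inbox:
            msg = self.inbox.pop(0)
            self.handlers[msg.handler_id](*msg.args)
            count += 1
        return count


# Round trip: A sends a request to B; B's handler replies directly to A.
REQUEST, REPLY = 0, 1
a, b = Endpoint(), Endpoint()
replies: List[int] = []
b.register(REQUEST, lambda x: a.deliver(ActiveMessage(REPLY, (x + 1,))))
a.register(REPLY, replies.append)
b.deliver(ActiveMessage(REQUEST, (41,)))
b.poll()
a.poll()
print(replies)  # [42]
```

The round trip needs no intermediate copy or message parsing. Avoiding that work is what lets such layers approach the small-message latencies quoted above.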


Proceedings Article
01 May 1995
TL;DR: OTDM technology is discussed only to the extent necessary to understand its characteristics and capabilities, and it is expected that cost-reduced OTDM systems will become competitive with the next generation of interconnect systems.
Abstract: Crossbar switches are rarely considered for large, scalable multiprocessor interconnect systems because they require O(n²) switching elements, are difficult to control efficiently, and are hard to implement once their size becomes too large to fit on one integrated circuit. However, these problems are technology dependent, and a recent innovation in fiber-optic devices has led to a new implementation of crossbar switches that does not share these problems while retaining the full advantages of a crossbar switch: low latency, high throughput, complete connectivity and multicast capability. Moreover, this new technology has several characteristics that allow a distributed control system which scales linearly in the number of attached nodes. The innovation that led to this research is an optical AND gate that can be used to demultiplex multiple high-speed data streams carried on one common optical medium. Optical time-domain multiplexing (OTDM) can combine the data from many nodes and broadcast the result back to all nodes. This paper discusses OTDM technology only to the extent necessary to understand its characteristics and capabilities. The main contribution lies in the description and analysis of interconnect architectures that use OTDM to achieve a level of performance beyond electronic means. It is expected that cost-reduced OTDM systems will become competitive with the next generation of interconnect systems.

56 citations
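Time-division broadcast-and-select replaces the O(n²) crosspoints. Each of n sources owns one time slot per frame, and the combined stream is broadcast to every node. A receiver that wants source j opens its gate only during slot j. The sketch below is a hypothetical, purely digital model of that selection; it does not describe the optical AND-gate hardware or the control scheme analyzed in the paper.

```python
from typing import List, Sequence


def multiplex(streams: Sequence[Sequence[int]]) -> List[int]:
    """Interleave n equal-length streams: frame t carries slots 0..n-1 in order."""
    length = len(streams[0])
    if any(len(s) != length for s in streams):
        raise ValueError("streams must have equal length")
    return [streams[src][t] for t in range(length) for src in range(len(streams))]


def demultiplex(line: Sequence[int], n: int, slot: int) -> List[int]:
    """Keep only the samples in `slot` of every n-slot frame (gate opens once per frame)."""
    return list(line[slot::n])


streams = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
line = multiplex(streams)        # broadcast to every node
print(line)                      # [1, 10, 100, 2, 20, 200, 3, 30, 300]
print(demultiplex(line, 3, 1))   # [10, 20, 30]
```

Choosing a different `slot` is all a receiver needs to connect to a different source. The selection step therefore stays the same size as the system grows, instead of requiring a new row of crosspoints for each added node.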


Patent
Ronald Mraz, Michael M. Tsao
10 Jan 1995
TL;DR: The invention is high-performance, standard I/O interconnect "bridge" hardware for a parallel machine that already has a packet-switching network; combining new hardware and new software, the bridge connects the parallel processors to the external world.
Abstract: This invention is high-performance, standard I/O interconnect "bridge" hardware for a parallel machine with a packet-switching network in place. Combining new hardware and new software, this bridge connects parallel processors to the external world. The hardware is a "bridge" connecting an internal inter-processor switch to external asynchronous transfer mode networks. The software is a "mirror" for making the connections. The invention provides high bandwidth, low latency and deterministic performance, and is inexpensive to build.

24 citations


Patent
21 Mar 1995
TL;DR: A receiver device for serially transmitted digital data, operating asynchronously with respect to the transmitter, includes a low-latency recovery apparatus with a metastability-proof latch that synchronizes to each incoming message in one clock time.
Abstract: A receiver device is provided with a low-latency recovery apparatus for recovering serially transmitted digital data. The receiver device operates asynchronously with respect to a transmitting device. The low-latency recovery apparatus synchronizes the receiver device in one clock time to support throughput of high-speed transmission messages received from interconnection networks or interface cables. A metastability-proof latch is provided. A synchronization method provides individual alignment for each incoming message. There is instantaneous response to back-to-back messages from different sources. Synchronization is accomplished in the receiving device by implementing a clocking system capable of generating N phase-shifted clocks, all operating at the same frequency as the incoming data. The N clocks are shifted by approximately equal amounts relative to each other. The data recovery apparatus selects the one of the N clocks that is best synchronized with the incoming serial data and uses it to receive the message correctly. The apparatus has a two-wire interface for serial data and a bracketing control signal. Serial data is synchronized first to the selected clock and then to a local clock. The bracketing control signal indicates when each message recovery is complete and triggers the start of another message recovery in as little as one clock time.

19 citations
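Choosing "the one of the N clocks that is best synchronized" can be pictured as picking the clock whose sampling edge lies farthest from the data transitions. The sketch below measures phase in units of one bit period and assumes the best sampling point is the centre of the data eye, half a bit after the observed transition. That criterion and the function names are assumptions for illustration. The patent does not state that its selection logic works this way.

```python
def circular_distance(a: float, b: float) -> float:
    """Distance between two phases on a unit circle (1.0 = one bit period)."""
    d = abs(a - b) % 1.0
    return min(d, 1.0 - d)


def select_clock(n_clocks: int, transition_phase: float) -> int:
    """Pick the clock whose sampling edge is farthest from data transitions.

    Clock i samples at phase i / n_clocks. The target (an assumption) is the
    centre of the data eye, half a bit period after the observed transition.
    """
    target = (transition_phase + 0.5) % 1.0
    return min(range(n_clocks),
               key=lambda i: circular_distance(i / n_clocks, target))


print(select_clock(4, 0.1))  # 2: samples at phase 0.5, nearest to the eye centre 0.6
print(select_clock(8, 0.9))  # 3: samples at phase 0.375, nearest to 0.4
```

With more clocks (larger N), the chosen edge can sit closer to the eye centre. The cost is generating and comparing more phases.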


Proceedings Article
30 May 1995
TL;DR: A parallel asynchronous implementation of a FIFO buffer is described and compared with the conventional alternative asynchronous implementation, Sutherland's micropipeline, and a high-throughput multiple-burst signalling scheme is supported, in which a second burst of data is transmitted at the same time as the previous burst is acknowledged, effectively increasing the overall throughput.
Abstract: A parallel asynchronous implementation of a FIFO buffer is described and compared with the conventional alternative asynchronous implementation, Sutherland's micropipeline. The parallel design has the potential for significant reductions in propagation delay at the cost of insignificant increases in cycle time (i.e. reduced throughput) and area. Although in certain applications, e.g. DSP, only high throughput may be important, in others, e.g. packet switching, throughput and propagation delay both matter. We consider the parallel design to be most useful as part of the interface circuitry required by devices that asynchronously exchange data in bursts over inter-chip communication wires and use a single acknowledge signal for each burst of data. In particular, a high-throughput multiple-burst signalling scheme is supported, in which a second burst of data is transmitted at the same time as the previous burst is acknowledged, effectively increasing the overall throughput.

18 citations


Proceedings Article
02 Oct 1995
TL;DR: A new class of direct interconnection networks called MDCE (Multidimensional Directed Cycles Ensemble extension) has many desirable features for RWC-1, including small degree, low latency and high throughput, and MDCE is therefore adopted for the RWC-1 network.
Abstract: The RWC-1 is a massively parallel computer based on a multi-threaded architecture. This architecture requires extremely high communication performance at reasonable hardware cost. In this paper, we first introduce a new class of direct interconnection networks called MDCE (Multidimensional Directed Cycles Ensemble extension). MDCE has many desirable features for RWC-1, including small degree, low latency and high throughput, and is therefore adopted for the RWC-1 network. We have designed an MDCE router and fabricated an experimental VLSI chip, whose design details we explain in this paper. The chip provides operating system support features as well as communication functions, and enables advanced resource management. A prototype chip with about 125,000 gates has been fabricated using 0.6-μm CMOS gate array technology. Its clock runs at 50 MHz, and a transmission rate of 300 Mbytes per second per communication port is achieved.

12 citations
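The quoted port rate and clock frequency can be related by simple arithmetic. Assuming 1 Mbyte = 10^6 bytes, each port moves an average of six bytes per 50 MHz clock cycle. The abstract does not state the port width, so this figure is an average rate, not a datapath width.

```latex
\frac{300 \times 10^{6}\ \text{bytes/s}}{50 \times 10^{6}\ \text{cycles/s}}
  = 6\ \text{bytes per cycle per port}
```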


Proceedings Article
23 Oct 1995
TL;DR: This paper proposes an optical interconnect architecture for over a hundred processors, which contains a dedicated channel for each processor to eliminate global arbitration and to provide bandwidth that scales with the number of processors in the machine.
Abstract: Low latency, high bandwidth interconnection networks that directly link arbitrary pairs of processing elements without contention are very desirable for parallel computers. Most communication networks in parallel machines have made compromises due to the limitations of electronics. Many of the optical interconnection schemes proposed have simply replaced the point-to-point copper wiring with fiber optics and have not made use of the unique properties of optics. This paper proposes an optical interconnect architecture for over a hundred processors, which contains a dedicated channel for each processor to eliminate global arbitration and to provide bandwidth that scales with the number of processors in the machine. Unlike electrical buses, this architecture is not limited by the medium (fiber optics) used to connect the transmitters and receivers. Each processor has an array of receivers, one receiver for each processor channel. The architecture of the receiver array permits a variety of different parallel programming models to be efficiently supported.

8 citations
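A dedicated channel per processor removes global arbitration because no two senders ever share a channel. The receiver array lets one destination accept traffic from many sources in the same cycle. The toy model below shows that property: three processors send to processor 3 simultaneously without contention. The class `ReceiverArrayNetwork` and its one-message-per-cycle rule are hypothetical simplifications, not the paper's design.

```python
from typing import Dict, List, Optional, Tuple


class ReceiverArrayNetwork:
    """Toy model: processor p alone drives channel p, so sending needs no
    global arbitration; every processor has one receiver per channel."""

    def __init__(self, n: int) -> None:
        self.n = n
        # What each source put on its own channel this cycle: (dst, payload).
        self.channel: List[Optional[Tuple[int, str]]] = [None] * n
        # inbox[dst][src] models receiver `src` in destination `dst`'s array.
        self.inbox: List[Dict[int, List[str]]] = [
            {src: [] for src in range(n)} for _ in range(n)]

    def send(self, src: int, dst: int, payload: str) -> None:
        # One message per source per cycle; a second send overwrites the first.
        self.channel[src] = (dst, payload)

    def cycle(self) -> None:
        # Every channel reaches every node; only the addressed node keeps the data.
        for src, item in enumerate(self.channel):
            if item is not None:
                dst, payload = item
                self.inbox[dst][src].append(payload)
        self.channel = [None] * self.n


net = ReceiverArrayNetwork(4)
for src, msg in [(0, "a"), (1, "b"), (2, "c")]:
    net.send(src, 3, msg)        # three senders, one destination, same cycle
net.cycle()
print(net.inbox[3])              # {0: ['a'], 1: ['b'], 2: ['c'], 3: []}
```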


Journal Article
TL;DR: An analytic performance model of pipelined communication in k-ary n-cubes is presented, intended to capture and study key performance issues, and the modeling of throughput and latency is addressed.
Abstract: Pipelined communication using virtual channels can realize low-latency, high-throughput interprocessor communication. This paper presents an analytic performance model of pipelined communication in k-ary n-cubes. The model contains elements intended to capture and study key performance issues. In addition to modeling throughput and latency, the following issues are addressed using this model: (1) the tradeoff between full-duplex and half-duplex links, (2) the effects of intranode delay, and (3) the effects of buffer size for each virtual channel. Detailed simulation experiments under a variety of conditions establish the viability of this model.

6 citations
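A standard zero-load approximation shows why pipelining keeps latency low in a k-ary n-cube. With pipelined flits, only the message header pays the per-hop router delay, while the body streams behind it. With store-and-forward, the whole message is re-received at every hop. The sketch below is a textbook estimate, not the paper's analytic model, which also accounts for contention, virtual-channel buffer size, intranode delay and link duplexing. The router delay (20 ns), message length (256 bytes) and channel rate (1 byte/ns) are assumed values.

```python
def avg_hops_unidirectional(k: int, n: int) -> float:
    """Mean hop count in a unidirectional k-ary n-cube under uniform traffic
    (destination drawn from all k**n nodes): (k - 1) / 2 per dimension."""
    return n * (k - 1) / 2


def pipelined_latency(hops: float, t_router: float, msg_bytes: float,
                      bytes_per_ns: float) -> float:
    """Zero-load latency when flits are pipelined (wormhole / cut-through):
    only the header pays the per-hop delay; the body streams behind it."""
    return hops * t_router + msg_bytes / bytes_per_ns


def store_and_forward_latency(hops: float, t_router: float, msg_bytes: float,
                              bytes_per_ns: float) -> float:
    """Zero-load latency when each node receives the whole message first."""
    return hops * (t_router + msg_bytes / bytes_per_ns)


d = avg_hops_unidirectional(8, 2)                      # 7.0 hops
print(pipelined_latency(d, 20.0, 256.0, 1.0))          # 396.0 ns
print(store_and_forward_latency(d, 20.0, 256.0, 1.0))  # 1932.0 ns
```

The message-length term appears once in the pipelined case but is multiplied by the hop count in store-and-forward. This explains the roughly fivefold gap in this example.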


Proceedings Article
24 Oct 1995
TL;DR: A novel low-latency, high-throughput programmable motion estimator architecture is proposed that can efficiently implement both full-search and hierarchical-search motion estimation algorithms in VLSI.
Abstract: In this paper, a novel low-latency, high-throughput programmable motion estimator architecture is proposed, which can efficiently implement both full-search and hierarchical-search algorithms for motion estimation in VLSI.

4 citations
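Full-search motion estimation compares a block of the current frame against every candidate position within a search window of the reference frame. It keeps the displacement with the smallest sum of absolute differences (SAD). The sketch below is a plain software version of that algorithm, using hypothetical function names and a synthetic test frame. It says nothing about the paper's VLSI architecture. The hierarchical search, which first runs a coarse search on downsampled frames, is not shown.

```python
from typing import List, Optional, Tuple

Frame = List[List[int]]


def sad(block: Frame, ref: Frame, y: int, x: int) -> int:
    """Sum of absolute differences between `block` and ref[y:, x:]."""
    return sum(abs(block[i][j] - ref[y + i][x + j])
               for i in range(len(block)) for j in range(len(block[0])))


def full_search(block: Frame, ref: Frame, top: int, left: int,
                search_range: int) -> Optional[Tuple[int, int, int]]:
    """Exhaustive block matching: test every displacement within
    +/- search_range that stays inside `ref`; return (dy, dx, best SAD)."""
    bh, bw = len(block), len(block[0])
    best: Optional[Tuple[int, int, int]] = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + bh > len(ref) or x + bw > len(ref[0]):
                continue  # candidate would fall outside the reference frame
            cost = sad(block, ref, y, x)
            if best is None or cost < best[2]:
                best = (dy, dx, cost)
    return best


ref = [[10 * r + c for c in range(10)] for r in range(10)]
block = [row[5:9] for row in ref[4:8]]    # content that moved to (4, 5)
print(full_search(block, ref, 3, 3, 3))   # (1, 2, 0)
```

The cost grows with (2 × range + 1)² candidates per block. That is why dedicated hardware, or a hierarchical search that narrows the window first, is attractive.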


Proceedings Article
20 Sep 1995
TL;DR: The drawbacks of the demand-priority protocol are revealed and the advantages of using service strategies different from those included in the current version of the draft standard are clearly shown.
Abstract: The demand-priority protocol currently in the process of standardization by IEEE 802.12 aims at supporting interactive multimedia applications by providing a low-latency service for high-priority traffic. This goal may be achieved in the case of large frames and/or small distances; otherwise, improvements are required. This paper starts with a brief description of the basic characteristics of the network topology and the medium access control protocol. Then, it presents simulation results for normal- and high-priority traffic in scenarios with variable-bit-rate high-priority loads for networks of different sizes. It reveals the drawbacks of the demand-priority protocol and clearly shows the advantages of using service strategies different from those included in the current version of the draft standard.

4 citations
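In demand priority, the hub grants the shared medium to one requesting port at a time. Pending high-priority requests are served before normal-priority ones. The sketch below orders grants under a simplified model: all frames are queued up front, and each priority level is served round-robin across ports from where that level last stopped. Timers and the other refinements in the draft standard are not modeled, and the function name is hypothetical. The sketch shows the baseline service strategy only, not the alternative strategies the paper proposes.

```python
from collections import deque
from typing import Deque, List


def demand_priority_order(high: List[Deque[str]],
                          normal: List[Deque[str]]) -> List[str]:
    """Order in which a hub grants frames, one grant at a time.

    high[p] / normal[p] hold port p's pending frames at each priority.
    Any pending high-priority frame is served before normal ones; within a
    level, ports are visited round-robin from where that level last stopped.
    """
    n = len(high)
    next_port = {"high": 0, "normal": 0}
    order: List[str] = []
    while any(high) or any(normal):
        level, queues = ("high", high) if any(high) else ("normal", normal)
        for offset in range(n):
            port = (next_port[level] + offset) % n
            if queues[port]:
                order.append(queues[port].popleft())
                next_port[level] = (port + 1) % n
                break
    return order


high = [deque(), deque(["H1"]), deque(), deque(["H3"])]
normal = [deque(["N0a", "N0b"]), deque(), deque(["N2"]), deque()]
print(demand_priority_order(high, normal))  # ['H1', 'H3', 'N0a', 'N2', 'N0b']
```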


Proceedings Article
S. Johnson, S. Scott
11 Sep 1995
TL;DR: Cray Research is developing a new system channel to meet the need for a new supercomputer system interconnect; the channel integrates control and data on a single physical path while providing low latency and low variance for control messages.
Abstract: The evolution of system architectures and system configurations has created the need for a new supercomputer system interconnect. Attributes required of the new interconnect include commonality among system and subsystem types, scalability, low latency, high bandwidth, a high level of resiliency, and flexibility. Cray Research Inc. is developing a new system channel to meet these interconnect requirements in future systems. The channel has a ring-based architecture, but can also function as a point-to-point link. It integrates control and data on a single physical path while providing low latency and low variance for control messages. Extensive features for client isolation, diagnostic capabilities, and fault tolerance have been incorporated into the design. The attributes and features of this channel are discussed along with implementation and protocol specifics.

Book Chapter
02 Jan 1995
TL;DR: Neural networks are an effective approach to non-standard or non-algorithmic problems such as system control, classification and pattern recognition, but these capabilities come at the cost of both design effort and silicon implementation.
Abstract: Neural networks are an effective approach to solving non-standard or non-algorithmic problems such as system control, classification and pattern recognition. These important capabilities are balanced by the costs of both design and silicon implementation.


Journal Article
TL;DR: The dedicated network interface hardware designed for the Cenju-3 system achieves low latency and high throughput and is presented in this paper.
Abstract: Cenju-3 is a parallel computer in which up to 256 processing elements (PEs) are connected by a high-speed multistage interconnection network. The architecture is tuned for systems of up to 256 processors. A VR4400 with 1 MB of secondary cache memory is implemented on a multi-chip module to realize a compact, high-performance PE. The multistage network is implemented very compactly: the number of cables equals the number of processors. The dedicated network interface hardware designed for the system achieves low latency and high throughput. This paper presents the machine architecture and its evaluation.

01 Jan 1995
TL;DR: Simulation results for normal and high priority traffic in scenarios with variable bit rate high priority loads for networks of different sizes are presented and the drawbacks of the demand-priority protocol are revealed.
Abstract: The demand-priority protocol currently in the process of standardization by IEEE 802.12 aims at supporting interactive multimedia applications by providing a low-latency service for high-priority traffic. This goal may be achieved in the case of large frames and/or small distances; otherwise, improvements are required. This paper starts with a brief description of the basic characteristics of the network topology and the medium access control protocol. Then, it presents simulation results for normal- and high-priority traffic in scenarios with variable-bit-rate high-priority loads for networks of different sizes. It reveals the drawbacks of the demand-priority protocol and clearly shows the advantages of using service strategies different from those included in the current version of the draft standard.

Journal Article
TL;DR: The common third-level trigger and fourth-level reconstruction farm for the future HERA-B experiment will have to perform full on-line event reconstruction and calibration for an expected input rate of 2000 events/s.
Abstract: The common third-level trigger and fourth-level reconstruction farm for the future HERA-B experiment will have to perform full on-line event reconstruction and calibration for an expected input rate of 2000 events/s. More than a hundred powerful RISC processors, connected by a network capable of distributing several hundred MB/s with low latency, are likely to be necessary for this task. Proper simulation of the real-time multiprocessor system is central to an optimal design (hardware and software protocol) of a scalable and flexible parallel data processing architecture. A discrete-event, process-oriented simulation developed in concurrent μC++ is used as a framework for modelling and evaluating different farm architectures. An object-oriented graphical interface to the simulation allows the monitoring of various features and makes the system easier to optimize.
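A discrete-event simulation advances time by popping the next scheduled event from a future event list rather than stepping a fixed clock. The sketch below applies that idea to a processor farm. Events arrive at the 2000 events/s rate quoted above, and each arrival occupies one free processor for a fixed service time. The 40 ms service time, the processor counts and FIFO dispatch are assumptions. This is a minimal event-list model in Python, not the authors' process-oriented μC++ framework.

```python
import heapq
from collections import deque
from typing import Deque, Dict, List, Tuple


def simulate_farm(n_proc: int, rate_hz: float, service_s: float,
                  n_events: int) -> Tuple[float, int]:
    """Discrete-event simulation of a processor farm.

    Events arrive every 1/rate_hz seconds; each occupies one free processor
    for service_s seconds (FIFO dispatch). Returns (mean latency in seconds,
    maximum number of events left waiting after any simulation step).
    """
    # Future event list ordered by (time, sequence number) to break ties.
    fel: List[Tuple[float, int, str, int]] = []
    seq = 0
    for i in range(n_events):
        heapq.heappush(fel, (i / rate_hz, seq, "arrive", i))
        seq += 1
    idle = n_proc
    waiting: Deque[int] = deque()
    arrival_time: Dict[int, float] = {}
    total_latency = 0.0
    max_waiting = 0
    while fel:
        now, _, kind, ev = heapq.heappop(fel)
        if kind == "arrive":
            arrival_time[ev] = now
            waiting.append(ev)
        else:  # "done": this event's processing has finished
            idle += 1
            total_latency += now - arrival_time[ev]
        # Dispatch waiting events to any idle processors.
        while idle and waiting:
            nxt = waiting.popleft()
            idle -= 1
            heapq.heappush(fel, (now + service_s, seq, "done", nxt))
            seq += 1
        max_waiting = max(max_waiting, len(waiting))
    return total_latency / n_events, max_waiting


# 2000 events/s from the abstract; 40 ms per event is an assumed figure.
print(simulate_farm(100, 2000.0, 0.040, 4000))  # mean latency about 0.04 s, nothing waits
print(simulate_farm(60, 2000.0, 0.040, 4000))   # capacity 1500 events/s: a queue builds up
```

Under these assumptions, 2000 events/s at 40 ms each keeps about 80 processors busy. A 100-processor farm never queues, while a 60-processor farm falls steadily behind.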

01 Nov 1995
TL;DR: This paper develops and analyzes a dilated high performance fault tolerant fast packet multistage interconnection network (MIN) and shows that the new design has considerably higher performance in the presence of a faulty switching element or link in comparison to dilated networks.
Abstract: We develop and analyze a dilated, high-performance, fault-tolerant fast packet multistage interconnection network (MIN) in this paper. In this new design, the links at the input and output stages of a dilated banyan-based MIN are rearranged to create multiple routes for each source-destination pair in the network after removing one stage in the network. These multiple paths are link- and node-disjoint. Fault tolerance at low latency is achieved by sending multiple copies of each input packet simultaneously using different routes and different priorities. This guarantees that high throughput is maintained even in the presence of faults. Throughput is evaluated by both simulation and analysis, and we show that the new design has considerably higher performance in the presence of a faulty switching element (SE) or link in comparison to dilated networks. We also analyze the reliability and show that the new design has superior reliability in comparison to competing proposals.
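Because the redundant paths are link- and node-disjoint, a simple reliability estimate can treat their failures as independent. Each copy succeeds if every switching element on its path works, and the packet is delivered if at least one copy succeeds. The sketch below computes this. The element reliability (0.99), the three-stage path and independent failures are all assumptions; this is not the reliability analysis carried out in the paper.

```python
def path_reliability(element_reliability: float, stages: int) -> float:
    """Probability that a path through `stages` switching elements is fault-free,
    assuming each element fails independently."""
    return element_reliability ** stages


def delivery_probability(path_rel: float, copies: int) -> float:
    """Probability that at least one of `copies` packets, each sent on its own
    link- and node-disjoint path, gets through (independent failures assumed)."""
    return 1.0 - (1.0 - path_rel) ** copies


p = path_reliability(0.99, 3)
print(round(p, 6))                           # 0.970299
print(round(delivery_probability(p, 2), 6))  # 0.999118
```

Sending a second copy on a disjoint path cuts the probability of loss from about 3% to under 0.1% in this example. The price is the extra load that the priority scheme described above is meant to contain.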