
Showing papers on "Latency (engineering)" published in 1994


Proceedings ArticleDOI
01 Apr 1994
TL;DR: A low-latency, high-bandwidth, virtual memory-mapped network interface for the SHRIMP multicomputer project at Princeton University is described, demonstrating that the approach can reduce the message passing overhead to a few user-level instructions.
Abstract: The network interfaces of existing multicomputers require a significant amount of software overhead to provide protection and to implement message passing protocols. This paper describes the design of a low-latency, high-bandwidth, virtual memory-mapped network interface for the SHRIMP multicomputer project at Princeton University. Without sacrificing protection, the network interface achieves low latency by using virtual memory mapping and write-latency hiding techniques, and obtains high bandwidth by providing a user-level block data transfer mechanism. We have implemented several message passing primitives in an experimental environment, demonstrating that our approach can reduce the message passing overhead to a few user-level instructions.
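To make the mechanism concrete, here is a minimal C sketch of user-level, memory-mapped sending in the spirit of the abstract. The send buffer and doorbell word are assumptions for illustration, and the remote mapping is simulated with ordinary local memory so the file compiles on its own; it is not the actual SHRIMP interface.

```c
/* Minimal sketch of user-level, memory-mapped sending in the spirit of the
 * abstract.  The send buffer and doorbell are hypothetical names, and the
 * "remote" memory is simulated with local arrays so this compiles stand-alone;
 * in the real design the network interface would propagate these stores to a
 * buffer on the destination node. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BUF_SIZE 4096

static volatile uint8_t  send_buf[BUF_SIZE];  /* stands in for a mapped remote buffer */
static volatile uint32_t doorbell;            /* hypothetical "message ready" word */

static void send_message(const void *msg, uint32_t len)
{
    memcpy((void *)send_buf, msg, len);       /* user-level stores, no system call */
    doorbell = len;                           /* one final store signals completion */
}

int main(void)
{
    const char msg[] = "hello over a mapped page";
    send_message(msg, sizeof msg);
    printf("doorbell=%u, first byte=%c\n", (unsigned)doorbell, send_buf[0]);
    return 0;
}
```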

389 citations


Proceedings ArticleDOI
14 Nov 1994
TL;DR: This work describes the locking architecture of a new operating system, HURRICANE, designed for large scale shared-memory multiprocessors, which uses a hybrid coarse-grain/fine-grain locking strategy and a clustered kernel that bounds the number of processors that can compete for a lock to reduce second order effects.
Abstract: We describe the locking architecture of a new operating system, HURRICANE, designed for large scale shared-memory multiprocessors. Many papers already describe kernel locking techniques, and some of the techniques we use have been previously described by others. However, our work is novel in the particular combination of techniques used, as well as several of the individual techniques themselves. Moreover, it is the way the techniques work together that is the source of our performance advantages and scalability. Briefly, we use:
• a hybrid coarse-grain/fine-grain locking strategy that has the low latency and space overhead of a coarse-grain locking strategy while having the high concurrency of a fine-grain locking strategy;
• replication of data structures to increase access bandwidth and improve concurrency;
• a clustered kernel that bounds the number of processors that can compete for a lock so as to reduce second order effects such as memory and interconnect contention;
• Distributed Locks to further reduce second order effects, with modifications that reduce the uncontended latency of these locks to close to that of spin locks.
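As a rough illustration of the clustering point (not HURRICANE's code), the C sketch below replicates a kernel counter per cluster and guards each replica with its own spin lock, so at most a cluster's worth of processors ever compete for a single lock word.

```c
/* Sketch (not the HURRICANE code) of how clustering bounds lock contention:
 * kernel data is replicated per cluster and each replica has its own spin
 * lock, so at most CLUSTER_SIZE processors ever compete for one lock word. */
#include <stdatomic.h>
#include <stdio.h>

#define NPROCS        16
#define CLUSTER_SIZE   4
#define NCLUSTERS     (NPROCS / CLUSTER_SIZE)

typedef struct {
    atomic_int lock;     /* simple test-and-set spin lock */
    long       counter;  /* per-cluster replica of some kernel counter */
} cluster_data_t;

static cluster_data_t clusters[NCLUSTERS];

static void cluster_lock(int cpu)
{
    while (atomic_exchange(&clusters[cpu / CLUSTER_SIZE].lock, 1))
        ;                /* spin; only CLUSTER_SIZE cpus can reach this lock */
}

static void cluster_unlock(int cpu)
{
    atomic_store(&clusters[cpu / CLUSTER_SIZE].lock, 0);
}

/* A kernel operation touches only its own cluster's replica. */
static void kernel_op(int cpu)
{
    cluster_lock(cpu);
    clusters[cpu / CLUSTER_SIZE].counter++;
    cluster_unlock(cpu);
}

int main(void)
{
    for (int cpu = 0; cpu < NPROCS; cpu++)   /* sequential stand-in for 16 cpus */
        kernel_op(cpu);
    for (int c = 0; c < NCLUSTERS; c++)
        printf("cluster %d counter = %ld\n", c, clusters[c].counter);
    return 0;
}
```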

30 citations


Journal ArticleDOI
TL;DR: An asynchronous circuit for an arbiter cell that can be used to construct cascaded multiway arbitration circuits and has a short response delay at the input request-grant handshake link.
Abstract: We present an asynchronous circuit for an arbiter cell that can be used to construct cascaded multiway arbitration circuits. The circuit is completely speed-independent. It has a short response delay at the input request-grant handshake link due to both a) the propagation of requests in parallel with starting arbitration and b) the concurrent resetting of request-grant handshakes in different cascades of a request-grant propagation chain.

20 citations


Proceedings ArticleDOI
26 Apr 1994
TL;DR: The results seem to indicate that efficient performance in the shared and distributed memory modes of operation stems from the operational concurrency of the WDM channels and the high data rate and low latency possible from the use of the new fiber channel optical receivers and transmitters.
Abstract: We present the design of a wavelength division multiplexed fiber optic bus for multiprocessors that allows both the shared memory model and the distributed memory model to be supported efficiently. We establish this by presenting the results of a fairly detailed trace-driven simulation of the performance of a system using the proposed interconnection. We also discuss some of the engineering issues involved in the design of such a WDM fiber bus and briefly mention approaches for cutting down on the cost of the optoelectronic interface components. Our results seem to indicate that efficient performance in the shared and distributed memory modes of operation stems from the operational concurrency of the WDM channels and the high data rate and low latency made possible by the use of the new fiber channel optical receivers and transmitters.

12 citations


Proceedings ArticleDOI
31 Oct 1994
TL;DR: An architecture is described which accommodates the difficult physical constraints in a WDM optical network, especially the inevitable wavelength drifts of many times the source linewidth, by using an adaptive receiver system.
Abstract: An architecture is described which accommodates the difficult physical constraints in a WDM optical network, especially the inevitable wavelength drifts of many times the source linewidth, by using an adaptive receiver system. The equally difficult and conflicting requirements of a multi-media, multi-Gb/s network (fixed and low latency, guaranteed bandwidth, and multicast) have also been met by adapting the DTM protocol to mixed WDM and TDM optical formats.

11 citations


Proceedings ArticleDOI
02 May 1994
TL;DR: A high-quality, fast dithering and interpolation algorithm for converting YCbCr directly into 8-bit palettized images is presented, and a new method, Collaborative Compression, is proposed for dealing with compression and decompression tasks at very low cost to achieve 30 fps SIF performance for desktop applications.
Abstract: We present a high performance implementation of an MPEG decoder, written entirely in a high-level language. The decoder implementation fully complies with the MPEG-I standard, decodes all (I, P, B) frame types in MPEG video bitstreams, and is portable. Versions of this decoder are implemented on Windows 3.1 and on Windows NT (X86, MIPS, ALPHA). A comparison of the decoder's performance across the various platforms is made. We present a high quality, fast dithering and interpolation algorithm used to convert YCbCr directly into 8 bit palettized images. We propose a new method called Collaborative Compression for dealing with compression and decompression tasks at a very low cost to achieve 30 fps SIF performance for desktop applications. Collaborative Compression is a systems approach to partitioning the functionality between CPU-centric (i.e. software) and hardware-assist (VLSI) elements in order to achieve the optimal cost solution. The CPU provides glue programmability to tie the accelerated and non-accelerated parts of the algorithm together. The advent of high-bandwidth, low-latency busses (VL Bus and PCI) enables a high-speed data pathway between the distributed computational elements.
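For readers unfamiliar with palettized output, the following C sketch shows one generic way to turn YCbCr samples into 8-bit palette indices with an ordered dither. The 3-3-2 palette, the 4x4 Bayer matrix, and the BT.601-style integer conversion are illustrative assumptions, not the algorithm described in the paper.

```c
/* Generic sketch of dithered YCbCr -> 8-bit palettized conversion, using a
 * 3-3-2 RGB palette and a 4x4 ordered-dither matrix.  These choices are
 * assumptions for illustration, not the paper's actual algorithm. */
#include <stdint.h>
#include <stdio.h>

static const int bayer4[4][4] = {           /* 4x4 ordered-dither thresholds */
    {  0,  8,  2, 10 },
    { 12,  4, 14,  6 },
    {  3, 11,  1,  9 },
    { 15,  7, 13,  5 },
};

static uint8_t clamp255(int v) { return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v; }

/* BT.601-style integer YCbCr -> RGB, then quantize to a 3-3-2 palette index. */
static uint8_t ycbcr_to_index(int y, int cb, int cr, int px, int py)
{
    int c = y - 16, d = cb - 128, e = cr - 128;
    int r = clamp255((298 * c + 409 * e + 128) >> 8);
    int g = clamp255((298 * c - 100 * d - 208 * e + 128) >> 8);
    int b = clamp255((298 * c + 516 * d + 128) >> 8);

    int t = bayer4[py & 3][px & 3];          /* 0..15 dither threshold */
    r = clamp255(r + (t - 8) * 2);           /* spread the quantization error */
    g = clamp255(g + (t - 8) * 2);
    b = clamp255(b + (t - 8) * 4);           /* blue keeps only 2 bits */

    return (uint8_t)((r & 0xE0) | ((g & 0xE0) >> 3) | (b >> 6));  /* RRRGGGBB */
}

int main(void)
{
    printf("index = %u\n", ycbcr_to_index(128, 128, 128, 0, 0));  /* mid grey */
    return 0;
}
```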

10 citations


Proceedings ArticleDOI
15 May 1994
TL;DR: The Tactus system as discussed by the authors offers a systematic approach to prefetching, precomputation, choice points, and synchronous cuts for interactive media presentations where there are a small number of choices.
Abstract: Multimedia streams usually require prefetching and buffering to ensure steady, glitch-free delivery to audio and video displays, but buffering causes undesirable latency. This latency may be manifested as startup delays, glitches, dropouts, and loss of synchronization. In interactive media presentations where there are a small number of choices, alternative streams can be prefetched to reduce latency. This technique is supported by the Tactus system, which manages the computation and synchronization of multimedia data. Tactus offers a systematic approach to prefetching, precomputation, choice points, and synchronous cuts. Tactus consists of an object-oriented client toolkit for media generation and a synchronization server for media presentation.
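The choice-point idea can be sketched in a few lines of C: prefetch the head of every alternative stream, then keep only the one the user selects. The stream and prefetch names below are hypothetical and do not correspond to the Tactus API.

```c
/* Sketch of "prefetch every alternative at a choice point"; the stream and
 * prefetch names are hypothetical, not the Tactus interface. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PREFETCH_BYTES 4096   /* enough media to cover the switch latency */

typedef struct {
    const char *name;
    char       *buf;          /* prefetched head of the stream */
} stream_t;

/* Stand-in for pulling the first PREFETCH_BYTES of a stream off disk/network. */
static void prefetch(stream_t *s)
{
    s->buf = malloc(PREFETCH_BYTES);
    memset(s->buf, 0, PREFETCH_BYTES);
    printf("prefetched head of %s\n", s->name);
}

int main(void)
{
    /* A choice point with a small number of alternatives. */
    stream_t choices[] = { { "intro_continue", NULL }, { "intro_skip", NULL } };
    int n = 2;

    for (int i = 0; i < n; i++)            /* prefetch *all* alternatives ... */
        prefetch(&choices[i]);

    int picked = 1;                        /* ... user picks one at the cut */
    printf("playing %s with no startup delay\n", choices[picked].name);

    for (int i = 0; i < n; i++)            /* discard the unused streams */
        free(choices[i].buf);
    return 0;
}
```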

9 citations


Journal ArticleDOI
E.H. Kristiansen, J.W. Bothner, T.I. Hulaas, E. Rongved, T.B. Skaali
TL;DR: Detailed simulations of processor networks based on the Scalable Coherent Interface show that SCI is suitable as a data carrier in data acquisition systems where the total bandwidth need is in the multi GBytes/s range and a low latency is required.
Abstract: Detailed simulations of processor networks based on the Scalable Coherent Interface (SCI) show that SCI is suitable as a data carrier in data acquisition systems where the total bandwidth need is in the multi GBytes/s range and a low latency is required. The objective of these simulations was to find topologies with low latency and high bandwidth, but also with the cost of implementation in mind. A ring-to-ring bridge has been used as the building element for the networks. The simulations have been performed on regular k-ary n-cube type topologies from a few tens of nodes and up to about 500 nodes under different load conditions. Among the parameters which have been manipulated in the simulations are the number of nodes, topology structure, number of outstanding requests and load in the system.
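As a back-of-the-envelope companion to these simulations, the sketch below estimates average hop counts for k-ary n-cube topologies built from unidirectional rings, assuming uniform traffic and ignoring contention; it is a simplification, not a reproduction of the paper's simulator.

```c
/* Rough estimate (not the paper's simulator): average hop count in a k-ary
 * n-cube built from unidirectional rings, assuming uniform traffic and no
 * contention.  Each dimension contributes (k-1)/2 hops on a one-way ring. */
#include <stdio.h>
#include <math.h>

static double avg_hops(int k, int n) { return n * (k - 1) / 2.0; }

int main(void)
{
    int configs[][2] = { { 4, 2 }, { 8, 2 }, { 8, 3 } };   /* (k, n) examples */
    for (size_t i = 0; i < sizeof configs / sizeof configs[0]; i++) {
        int k = configs[i][0], n = configs[i][1];
        printf("k=%d n=%d  nodes=%.0f  avg hops=%.1f\n",
               k, n, pow(k, n), avg_hops(k, n));
    }
    return 0;
}
```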

6 citations


Proceedings ArticleDOI
01 Aug 1994
TL;DR: This work shows how to use barriers, in particular Integrated Network Barriers, to achieve bandwidth utilization arbitrarily close to 100%, while also providing low latency and fairness to processors.
Abstract: In bandwidth-limited computers, such as meshes and tori, it is important to achieve high bandwidth across the bisection. Traditional techniques achieve bandwidth in the range of 30–70%. We show how to use barriers, in particular Integrated Network Barriers, to achieve bandwidth utilization that is arbitrarily close to 100%. This technique also provides low latency and fairness to processors. Moreover, it works globally and therefore does not depend on local approximations of network traffic.
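One way to picture barrier-phased communication is the sketch below, where every phase is a fixed permutation and a barrier keeps all nodes on the same permutation so no link is oversubscribed. The send and barrier calls are simulated stubs; the Integrated Network Barrier hardware itself is not modelled.

```c
/* Sketch of barrier-phased all-to-all traffic: in phase p, node i sends to
 * node (i+p) mod N, and a barrier between phases keeps every node on the same
 * permutation.  Sends and barriers are simulated with prints; the Integrated
 * Network Barrier is a hardware mechanism not shown here. */
#include <stdio.h>

#define N 4   /* number of nodes (kept tiny so the output is readable) */

static void net_send(int src, int dst)  { printf("  node %d -> node %d\n", src, dst); }
static void net_barrier(int phase)      { printf("barrier after phase %d\n", phase); }

int main(void)
{
    for (int p = 1; p < N; p++) {          /* N-1 permutation phases */
        printf("phase %d:\n", p);
        for (int i = 0; i < N; i++)        /* in hardware, nodes do this in parallel */
            net_send(i, (i + p) % N);
        net_barrier(p);
    }
    return 0;
}
```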

6 citations


Proceedings Article
01 Jan 1994
TL;DR: In interactive media presentations where there are a small number of choices, alternative streams can be prefetched to reduce latency, a technique supported by the Tactus system, which manages the computation and synchronization of multimedia data.
Abstract: Multimedia streams usually require prefetching and buffering to ensure steady, glitch-free delivery to audio and video displays, but buffering causes undesirable latency. This latency may be manifested as startup delays, glitches, dropouts, and loss of synchronization. In interactive media presentations where there are a small number of choices, alternative streams can be prefetched to reduce latency. This technique is supported by the Tactus system, which manages the computation and synchronization of multimedia data. Tactus offers a systematic approach to prefetching, precomputation, choice points, and synchronous cuts. Tactus consists of an object-oriented client toolkit for media generation and a synchronization server for media presentation.

5 citations


Journal ArticleDOI
TL;DR: A hierarchical, all-optical wavelength division multiplexed (WDM) network being built to support the communication requirements of a large distributed shared memory system is described; its time-multiplexed access protocol with static slot assignment provides excellent throughput but long latencies due to cycle synchronization, so the actual implementation will use a hybrid WDMA strategy.
Abstract: The above paper by L. Bhuyan and D.P. Agrawal (1984) described a hierarchical, all-optical wavelength division multiplexed (WDM) network that is being built to support the communication requirements of a large distributed shared memory system. Dynamically adaptable bandwidth allocation is supported, both within and between levels of the hierarchy, and is highly scalable through wavelength re-use at each hierarchical level. The mixed radix scheme introduced was used for processor numbering, but was not specified. The system was described in terms of a time multiplexed access protocol, generalized to the multichannel WDM environment, which has static slot assignment that provides excellent throughput but long latencies due to cycle synchronization. The actual system implementation will use a hybrid WDMA strategy, which is collisionless and provides low-latency support, dynamic bandwidth allocation within a hierarchical level, and fast, reliable broadcast.

Proceedings ArticleDOI
14 Dec 1994
TL;DR: A 4-dimensional network is discussed which achieves a significant reduction in density with only a small increase in data transfer delays; such low-latency transfers of small data items are useful in the exploitation of small-grain parallelism.
Abstract: The Melbourne University Optoelectronic Multicomputer Project is investigating dense optical interconnection networks capable of providing low latency data transfers of small data items. Such capabilities are useful in the exploitation of small grain parallelism. In many cases, reducing the grain size of tasks increases the amount of parallelism which can be found in the program. Our networks use an organization of data transfers called PANDORA (PArallel Newscasts on a Dense Optical Reconfigurable Array). The communication patterns on a PANDORA network are pre-determined, removing the overhead of sending and decoding addressing information. Instead the data is recognized by the time of arrival and the channel on which it arrives. Previous efforts have focused on 2-dimensional multiple broadcasting networks where each node may broadcast a different data item on the rows and columns of the network. For large processor arrays, we have to reduce the density of the interconnection network as full interconnection on each row becomes too expensive. This paper discusses a 4-dimensional network which achieves a significant reduction in density with only a small increase in data transfer delays.
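A short C sketch can illustrate address-free reception under a pre-determined communication pattern: a table maps (time slot, channel) to the sending node, so arriving data carries no header. The table contents below are invented for illustration and are not taken from the PANDORA design.

```c
/* Sketch of address-free reception: because the communication pattern is
 * fixed, a pre-computed table maps (time slot, channel) to the sending node,
 * so incoming data needs no address header.  Table contents are made up. */
#include <stdio.h>

#define SLOTS    4
#define CHANNELS 4

/* schedule[slot][channel] = id of the node broadcasting in that slot/channel */
static const int schedule[SLOTS][CHANNELS] = {
    { 0, 1, 2, 3 },
    { 1, 2, 3, 0 },
    { 2, 3, 0, 1 },
    { 3, 0, 1, 2 },
};

/* On arrival, data is identified purely by when and where it arrived. */
static int source_of(int slot, int channel) { return schedule[slot % SLOTS][channel]; }

int main(void)
{
    printf("slot 2, channel 1: data came from node %d\n", source_of(2, 1));
    return 0;
}
```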

01 Jan 1994
TL;DR: The best algorithm measured achieves its efficiency through a variation of copy-on-write, which allows the most time-consuming operations of the checkpoint to be overlapped with the running of the program being checkpointed.
Abstract: This short note presents the results of an implementation of several algorithms for checkpointing and restarting parallel programs on shared-memory multiprocessors. The algorithms are compared according to the metrics of overall checkpointing time, overhead imposed by the checkpointer on the target program, and amount of time during which the checkpointer interrupts the target program. The best algorithm measured achieves its efficiency through a variation of copy-on-write, which allows the most time-consuming operations of the checkpoint to be overlapped with the running of the program being checkpointed.
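On Unix systems, one common way to obtain such a copy-on-write snapshot is fork(): the child saves the checkpoint while the parent keeps running. The sketch below illustrates that general idea; it is not necessarily the variation the paper implemented.

```c
/* One common way to get copy-on-write checkpointing on Unix (not necessarily
 * the paper's implementation): fork() gives the child a copy-on-write snapshot
 * of the address space, so the child can write the checkpoint to disk while
 * the parent keeps running; the parent is interrupted only for the fork. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static char state[1 << 20];   /* stand-in for the program's data to checkpoint */

static void checkpoint(const char *path)
{
    pid_t pid = fork();
    if (pid == 0) {                            /* child: owns a COW snapshot */
        FILE *f = fopen(path, "wb");
        if (!f) _exit(1);
        fwrite(state, 1, sizeof state, f);     /* slow I/O happens off the critical path */
        fclose(f);
        _exit(0);
    }
    /* Parent returns immediately and keeps computing; its later writes to
     * `state` no longer affect the snapshot the child is saving. */
}

int main(void)
{
    checkpoint("ckpt.img");
    state[0] = 42;                              /* parent continues mutating */
    wait(NULL);                                 /* only to keep the demo tidy */
    printf("checkpoint written, parent state[0]=%d\n", state[0]);
    return 0;
}
```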