
Showing papers on "Overhead (computing) published in 1987"


Journal ArticleDOI
TL;DR: Depending on the types and number of tolerated faults, this paper presents upper bounds on the achievable synchronization accuracy for external and internal synchronization in a distributed real-time system.
Abstract: The generation of a fault-tolerant global time base with known accuracy of synchronization is one of the important operating system functions in a distributed real-time system. Depending on the types and number of tolerated faults, this paper presents upper bounds on the achievable synchronization accuracy for external and internal synchronization in a distributed real-time system. The concept of continuous versus instantaneous synchronization is introduced in order to generate a uniform common time base for local, global, and external time measurements. In the last section, the functions of a VLSI clock synchronization unit, which improves the synchronization accuracy and reduces the CPU load, are described. With this unit, the CPU overhead and the network traffic for clock synchronization in state-of-the-art distributed real-time systems can be reduced to less than 1 percent.

625 citations
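The paper's continuous-versus-instantaneous distinction can be illustrated with a small sketch (Python, with illustrative names, not the paper's notation); amortizing the correction keeps the adjusted time base monotonic, which matters for local interval measurements:

```python
def instantaneous_correction(local_time, offset):
    """Apply the full correction at once; the local time base jumps,
    which can break monotonicity for local interval measurements."""
    return local_time + offset

def continuous_correction(local_time, offset, amortization_interval, elapsed):
    """Spread the correction linearly over an amortization interval,
    keeping the adjusted time base monotonic and usable as a uniform
    common base for local, global, and external time measurements."""
    applied = offset * min(elapsed / amortization_interval, 1.0)
    return local_time + applied
```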


Journal ArticleDOI
01 Nov 1987
TL;DR: The packet filter is described, a kernel-resident, protocol-independent packet demultiplexer, which performs quite well, and has been in production use for several years.
Abstract: Code to implement network protocols can be either inside the kernel of an operating system or in user-level processes. Kernel-resident code is hard to develop, debug, and maintain, but user-level implementations typically incur significant overhead and perform poorly.The performance of user-level network code depends on the mechanism used to demultiplex received packets. Demultiplexing in a user-level process increases the rate of context switches and system calls, resulting in poor performance. Demultiplexing in the kernel eliminates unnecessary overhead.This paper describes the packet filter, a kernel-resident, protocol-independent packet demultiplexer. Individual user processes have great flexibility in selecting which packets they will receive. Protocol implementations using the packet filter perform quite well, and have been in production use for several years.

338 citations
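The kernel-resident demultiplexing idea can be sketched as a registry of per-process packet predicates; this is an illustrative model only, not the packet filter's actual filter language:

```python
class PacketFilterDemux:
    """Kernel-side demultiplexer sketch: each user process registers a
    predicate over raw packet bytes, and an arriving packet is appended
    only to the queues whose predicates match, avoiding context switches
    into processes that would only discard it."""
    def __init__(self):
        self.filters = []  # (predicate, queue) pairs

    def register(self, predicate):
        queue = []
        self.filters.append((predicate, queue))
        return queue

    def deliver(self, packet):
        for predicate, queue in self.filters:
            if predicate(packet):
                queue.append(packet)

demux = PacketFilterDemux()
# Illustrative filter: accept packets whose first byte marks "protocol 7".
q = demux.register(lambda pkt: pkt[0] == 7)
demux.deliver(bytes([7, 1, 2]))
demux.deliver(bytes([9, 0]))
```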


Journal ArticleDOI
TL;DR: A broadcast primitive that provides properties of authenticated broadcasts is presented that gives a methodology for deriving non-authenticated algorithms and is applied to various problems and obtained simpler and more efficient solutions than those previously known.
Abstract: Fault-tolerant algorithms for distributed systems with arbitrary failures are simpler to develop and prove correct if messages can be authenticated. However, using digital signatures for message authentication usually incurs substantial overhead in communication and computation. To exploit the simplicity provided by authentication without this overhead, we present a broadcast primitive that provides properties of authenticated broadcasts. This gives a methodology for deriving non-authenticated algorithms. Starting with an authenticated algorithm, we replace signed communication with the broadcast primitive to obtain an equivalent non-authenticated algorithm. We have applied this approach to various problems and in each case obtained simpler and more efficient solutions than those previously known.

240 citations
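A minimal sketch of simulating authenticated broadcast without signatures, assuming at most f arbitrary faults: a receiver accepts a message only after echoes from f+1 distinct processes, so at least one echo comes from a correct process. Names and structure here are illustrative, not the paper's exact primitive:

```python
def accept_threshold(n_faulty):
    """With at most f faulty processes, f+1 distinct echoes guarantee
    at least one echo originated from a correct process."""
    return n_faulty + 1

class EchoBroadcast:
    """Sketch: receivers relay (echo) what they hear, and a process
    accepts (sender, msg) only once f+1 distinct processes have
    echoed it, mimicking the unforgeability a signature would give."""
    def __init__(self, f):
        self.f = f
        self.echoes = {}  # (sender, msg) -> set of echoing process ids

    def on_echo(self, sender, msg, echoer):
        key = (sender, msg)
        self.echoes.setdefault(key, set()).add(echoer)
        return len(self.echoes[key]) >= accept_threshold(self.f)
```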


Journal ArticleDOI
TL;DR: In this paper, a trace-driven simulation study of dynamic load balancing in homogeneous distributed systems supporting broadcasting is presented, where information about job CPU and input/output (I/O) demands collected from production systems is used as input to a simulation model that includes a representative CPU scheduling policy and considers the message exchange and job transfer cost explicitly.

201 citations


Proceedings ArticleDOI
01 Jun 1987
TL;DR: This paper presents a design for instruction issue logic that resolves dependencies dynamically and, at the same time, guarantees a precise state of the machine, without a significant hardware overhead.
Abstract: The performance of pipelined processors is severely limited by data dependencies. In order to achieve high performance, a mechanism to alleviate the effects of data dependencies must exist. If a pipelined CPU with multiple functional units is to be used in the presence of a virtual memory hierarchy, a mechanism must also exist for determining the state of the machine precisely. In this paper, we combine the issues of dependency-resolution and preciseness of state. We present a design for instruction issue logic that resolves dependencies dynamically and, at the same time, guarantees a precise state of the machine, without a significant hardware overhead. Detailed simulation studies for the proposed mechanism, using the Lawrence Livermore loops as a benchmark, are presented.

153 citations
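One standard way to guarantee a precise state, in the spirit of this design, is to let instructions complete out of order but update architectural state strictly in program order; a minimal sketch (illustrative, not the paper's exact issue logic):

```python
from collections import OrderedDict

class ReorderBuffer:
    """Sketch of in-order retirement: instructions issue (and may
    complete) out of order, but results retire only in program order,
    so the machine state is precise at every retired instruction."""
    def __init__(self):
        self.entries = OrderedDict()  # seq -> result, or None if pending

    def issue(self, seq):
        self.entries[seq] = None

    def complete(self, seq, result):
        self.entries[seq] = result

    def retire(self):
        """Retire the longest in-order prefix of completed instructions."""
        retired = []
        for seq in list(self.entries):
            if self.entries[seq] is None:
                break  # oldest instruction still pending blocks retirement
            retired.append((seq, self.entries.pop(seq)))
        return retired
```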


Proceedings ArticleDOI
01 Jun 1987
TL;DR: A processing-node architecture is proposed that includes a novel memory organization permitting both indexed and associative accesses and incorporating an instruction buffer and message queue; simulation results suggest that it reduces message reception overhead by more than an order of magnitude.
Abstract: We propose a machine architecture for a high-performance processing node for a message-passing, MIMD concurrent computer. The principal mechanisms for attaining this goal are the direct execution and buffering of messages and a memory-based architecture that permits very fast context switches. Our architecture also includes a novel memory organization that permits both indexed and associative accesses and that incorporates an instruction buffer and message queue. Simulation results suggest that this architecture reduces message reception overhead by more than an order of magnitude.

140 citations


Proceedings Article
01 Jan 1987
TL;DR: Two very useful extensions to Prolog's computation model, dif and freeze, introduced with Prolog II, are presented, together with a method for their incorporation into the Warren Abstract Machine.
Abstract: Two very useful extensions to Prolog's computation model, dif and freeze, were introduced with Prolog II. A method for their incorporation into the Warren Abstract Machine is presented. Under reasonable assumptions, the method does not incur any overhead on programs not using these extensions. The clause indexing mechanism is also discussed, as it is not unrelated to the freeze mechanism.

98 citations


Patent
10 Aug 1987
TL;DR: A windstrap secured to a lightweight, pliable, roll-type overhead door across the width thereof reinforces the door structure, maintaining it in position across a doorway when the door is in the lowered, extended position and isolating the two areas on respective sides of the door when there is a pressure differential therebetween, such as due to wind.
Abstract: A windstrap secured to a lightweight, pliable, roll-type overhead door across the width thereof reinforces the door structure maintaining it in position across a doorway when the door is in the lowered, extended position and isolating the two areas on respective sides of the door when there is a pressure differential therebetween such as due to wind.

53 citations


Journal ArticleDOI
TL;DR: The authors outline a method to design easily testable sequential circuits that achieve scan designs using standard (unmodified) flip-flops.
Abstract: Classical scan designs require properly augmented flip-flops, often called scan flip-flops. Problems stem from the high area overhead implied by the need for these flip-flops or the inability to modify standard flip-flops. The authors outline a method to design easily testable sequential circuits that achieve scan designs using standard (unmodified) flip-flops.

51 citations


Journal ArticleDOI
TL;DR: Methods for reducing communication traffic and overhead on a multiprocessor and the results of testing these methods on the Intel iPSC Hypercube are reported.
Abstract: The efficient implementation of algorithms on multiprocessor machines requires that the effects of communication delays be minimized. The effects of these delays on the performance of a model problem on a hypercube multiprocessor architecture are investigated and methods are developed for increasing algorithm efficiency. The model problem under investigation is the solution by red-black Successive Over-Relaxation [YOUN71] of the heat equation; most of the techniques described here also apply equally well to the solution of elliptic partial differential equations by red-black or multicolor SOR methods. Methods for reducing communication traffic and overhead on a multiprocessor are identified and results of testing these methods on the Intel iPSC Hypercube are reported. These include methods for partitioning a problem's domain across processors, for reducing communication traffic during a global convergence check, for reducing the number of global convergence checks employed during an iteration, and for concurrently iterating on multiple time-steps in a time-dependent problem. Empirical results show that use of these methods can markedly reduce a numerical problem's execution time.

49 citations
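The strategy of checking convergence only every few sweeps can be sketched with a serial red-black SOR solver for the Laplace equation (illustrative parameter names; in the parallel version this check is the communication-heavy global step being amortized):

```python
def red_black_sor(u, omega=1.5, check_every=4, tol=1e-8, max_iters=10000):
    """Red-black SOR sketch for the 2D Laplace equation; `u` is a list
    of lists whose boundary rows/columns hold fixed values. The global
    convergence test runs only every `check_every` sweeps, mirroring
    the idea of reducing the number of global convergence checks.
    Returns the number of sweeps performed."""
    rows, cols = len(u), len(u[0])
    for it in range(max_iters):
        diff = 0.0
        for color in (0, 1):  # red points first, then black points
            for i in range(1, rows - 1):
                for j in range(1, cols - 1):
                    if (i + j) % 2 != color:
                        continue
                    new = 0.25 * (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1])
                    update = omega * (new - u[i][j])
                    diff = max(diff, abs(update))
                    u[i][j] += update
        if (it + 1) % check_every == 0 and diff < tol:
            return it + 1
    return max_iters
```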


Proceedings Article
01 Dec 1987
TL;DR: The numerical solution of an elliptic partial differential equation is examined in order to study the relationship between problem size and architecture and identifies the smallest grid size which fully benefits from using all available processors.
Abstract: The communication and synchronization overhead inherent in parallel processing can lead to situations where adding processors to the solution method actually increases execution time. Problem type, problem size, and architecture type all affect the optimal number of processors to employ. The numerical solution of an elliptic partial differential equation is examined in order to study the relationship between problem size and architecture. The equation's domain is discretized into n^2 grid points which are divided into partitions and mapped onto the individual processor memories. The relationships between grid size, stencil type, partitioning strategy, processor execution time, and communication network type are analytically quantified. In so doing, the optimal number of processors to assign to the solution is determined, and the analysis identifies (1) the smallest grid size which fully benefits from using all available processors, (2) the leverage on performance given by increasing processor speed or communication network speed, and (3) the suitability of various architectures for large numerical problems.
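The qualitative shape of such a model can be sketched as computation time shrinking like n^2/p while communication overhead grows with p, giving an interior optimum; the constants below are arbitrary stand-ins, not the paper's measurements:

```python
def execution_time(p, n, t_calc=1.0, t_comm=50.0):
    """Illustrative per-sweep time on p processors for an n x n grid:
    n^2/p units of computation plus a communication/synchronization
    term that grows with p. Constants are arbitrary stand-ins."""
    return (n * n / p) * t_calc + p * t_comm

def optimal_processors(n, p_max, **kw):
    """Processor count minimizing the model time over 1..p_max; beyond
    this optimum, adding processors increases execution time."""
    return min(range(1, p_max + 1), key=lambda p: execution_time(p, n, **kw))
```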

Journal ArticleDOI
TL;DR: This paper presents the parallel version of the Sieve, a straightforward algorithm for finding all prime numbers in a given range that serves as a test of some of the capabilities of a parallel machine.
Abstract: The Sieve of Eratosthenes for finding prime numbers in recent years has seen much use as a benchmark algorithm for serial computers while its intrinsically parallel nature has gone largely unnoticed. The implementation of a parallel version of this algorithm for a real parallel computer, the Flex/32, is described and its performance discussed. It is shown that the algorithm is sensitive to several fundamental performance parameters of parallel machines, such as spawning time, signaling time, memory access, and overhead of process switching. Because of the nature of the algorithm, it is impossible to get any speedup beyond 4 or 5 processors unless some form of dynamic load balancing is employed. We describe the performance of our algorithm with and without load balancing and compare it with theoretical lower bounds and simulated results. It is straightforward to understand this algorithm and to check the final results. However, its efficient implementation on a real parallel machine requires thoughtful design, especially if dynamic load balancing is desired. The fundamental operations required by the algorithm are very simple: this means that the slightest overhead appears prominently in performance data. The Sieve thus serves not only as a very severe test of the capabilities of a parallel processor but is also an interesting challenge for the programmer.
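A serial sketch of the parallel structure: base primes up to sqrt(n) are found sequentially (the serial fraction that caps speedup), and each worker's segment is then sieved independently. Static equal-sized segments leave later segments with more striking work, which is why the paper finds dynamic load balancing necessary; the worker loop below stands in for actual processes:

```python
import math

def base_primes(limit):
    """Primes up to `limit`, computed serially before workers start."""
    flags = [True] * (limit + 1)
    for p in range(2, int(math.isqrt(limit)) + 1):
        if flags[p]:
            for m in range(p * p, limit + 1, p):
                flags[m] = False
    return [p for p in range(2, limit + 1) if flags[p]]

def sieve_segment(lo, hi, primes):
    """Strike multiples of the base primes in [lo, hi); each parallel
    worker would own one such segment."""
    flags = [True] * (hi - lo)
    for p in primes:
        start = max(p * p, ((lo + p - 1) // p) * p)
        for m in range(start, hi, p):
            flags[m - lo] = False
    return [lo + i for i, f in enumerate(flags) if f]

def parallel_sieve(n, n_workers=4):
    """Segmented-sieve sketch of the parallel decomposition, with the
    workers simulated by a sequential loop."""
    limit = int(math.isqrt(n))
    base = base_primes(limit)
    step = max(1, (n - limit) // n_workers + 1)
    primes = list(base)
    for s in range(limit + 1, n + 1, step):
        primes += sieve_segment(s, min(s + step, n + 1), base)
    return primes
```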

Proceedings ArticleDOI
01 Jun 1987
TL;DR: This research partitioned an actual message-based operating system into communication and computation parts interacting through shared queues and measured its performance on a multiprocessor, designed hardware support in the form of a special-purpose smart bus and smart shared memory and demonstrated the benefits of these components through analytical modeling using Generalized Timed Petri Nets.
Abstract: In recent years there has been increasing interest in message-based operating systems, particularly in distributed environments. Such systems consist of a small message-passing kernel supporting a collection of system server processes that provide such services as resource management, file service, and global communications. For such an architecture to be practical, it is essential that basic messages be fast, since they often replace what would be a simple procedure call or “kernel call” in a more traditional system. Careful study of several operating systems shows that the limiting factor, especially for small messages, is typically not network bandwidth but processing overhead. Therefore, we propose using a special-purpose coprocessor to support message passing. Our research has two parts: First, we partitioned an actual message-based operating system into communication and computation parts interacting through shared queues and measured its performance on a multiprocessor. Second, we designed hardware support in the form of a special-purpose smart bus and smart shared memory and demonstrated the benefits of these components through analytical modeling using Generalized Timed Petri Nets. Our analysis shows good agreement with the experimental results and indicates that substantial benefits may be obtained from both the partitioning of the software and the addition of a small amount of special-purpose hardware.

Journal ArticleDOI
TL;DR: An algorithm for reducing the number of synchronized memory references to shared data elements in multiprocessed loops is presented and the correctness of this algorithm is proved.
Abstract: In this correspondence we present and prove the correctness of an algorithm for reducing the number of synchronized memory references to shared data elements in multiprocessed loops. Optimizing compilers for shared memory multiprocessors can use this algorithm to reduce synchronization overhead. The algorithm has been implemented as a new module in the multiprocessor version of Parafrase, the restructuring system of the University of Illinois. We present a brief discussion of experiments we performed to assess the effectiveness of this algorithm in reducing the synchronization overhead for the 61 subroutines of EISPACK, a package for computing matrix eigenvectors and eigenvalues.

Patent
03 Jun 1987
TL;DR: In this article, a process for uniformly measuring the performance characteristics of a computer peripheral by accommodating for variations in the clock rate of the host computer system is disclosed: after connecting the target to the host and initializing, the system automatically calibrates itself to the clock rate of the host, and the user may then define a selected test, a set of tests, or a continuous set of tests to be run on the target.
Abstract: A process for uniformly measuring the performance characteristics of a computer peripheral by accommodating for variations in the clock rate of the host computer system is disclosed, where, after connecting the target to the host and initializing, the system automatically calibrates itself to the clock rate of the host and determines the parameters of the target. The user may then define a selected test, a set of tests, or a continuous set of tests to be run on the target. In performing the selected test or tests, the system determines the amount of overhead time associated with the host and target, and the data transfer time, before determining the various base access times of the target. Upon the determination of a base access time, the host overhead time is then removed to yield an accurate access time measurement that is independent of variable characteristics of the host computer system.
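The calibrate-and-subtract idea can be sketched as timing a null transaction to estimate host overhead and removing it from the measured access time (illustrative function names, not the patent's apparatus):

```python
import time

def measure(op, reps=100):
    """Average wall-clock time of one call to op over reps repetitions."""
    start = time.perf_counter()
    for _ in range(reps):
        op()
    return (time.perf_counter() - start) / reps

def calibrated_access_time(target_op, null_op, reps=100):
    """Time a null transaction to estimate host/driver overhead, then
    subtract it from the measured access time, so the result is
    (approximately) independent of the host machine's speed."""
    overhead = measure(null_op, reps)
    raw = measure(target_op, reps)
    return max(raw - overhead, 0.0)
```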

Journal ArticleDOI
TL;DR: This paper shows how multiversion time-stamping protocols for atomicity can be extended to induce fewer delays and restarts by exploiting semantic information about objects such as queues, directories, or counters.
Abstract: Atomic transactions are a widely accepted approach to implementing and reasoning about fault-tolerant distributed programs. This paper shows how multiversion time-stamping protocols for atomicity can be extended to induce fewer delays and restarts by exploiting semantic information about objects such as queues, directories, or counters. This technique relies on static preanalysis of conflicts between operations, and incurs no additional runtime overhead. This technique is deadlock-free, and it is applicable to objects of arbitrary type.

Patent
Titolo Andrea
29 Sep 1987
Abstract: The subject of the invention consists of an overhead valve control for internal combustion engines provided with at least three valves per cylinder controlled through at least one overhead camshaft (1). The control is designed simultaneously to control two mated and parallel valves (3), and is so shaped as to be actuated by a variable profile cam (2).

Patent
21 May 1987
TL;DR: In this paper, a sensing member for detecting the condition of the overhead power transmission or distribution lines or the ground wire is secured to an anchor and clamp member for anchoring and clamping the power line or overhead ground wire so as to prevent undesired movement of the sensing member.
Abstract: In an overhead power transmission system, a sensing member for detecting the condition of the overhead power transmission or distribution lines or the ground wire is secured to an anchor and clamp member for anchoring and clamping the power line or overhead ground wire so as to prevent undesired movement of the sensing member.

Patent
29 Jun 1987
TL;DR: In this paper, horizontal lines are projected onto a person from a plurality of overhead projectors, each projecting at a 45-degree angle; all of the projectors have parallel optical axes, and the person is photographed with the projected raster superimposed.
Abstract: Horizontal lines are projected onto a person from a plurality of overhead projectors, each projecting at a 45-degree angle; all of the projectors have parallel optical axes, and the person is photographed with the projected raster superimposed.

Journal ArticleDOI
TL;DR: In this article, a model for execution time as a function of the number of processes used in a computation is developed for shared memory multiprocessor systems, where the main focus is on the effect of sequential code, code which can be executed by only a limited number of processes.
Abstract: This paper discusses execution time versus number of simultaneous operations in parallel computing systems. The main focus is on shared memory multiprocessors. A model for execution time as a function of the number of processes used in a computation is developed. The model addresses the effect of sequential code, code which can be executed by only a limited number of processes, hardware limits to speedup, critical section synchronization overhead and the influence of task granularity. The model is shown to correspond very closely to experimental measurements of execution time on the HEP pipelined, shared memory multiprocessor. Use of the model as an analysis tool in complex parallel programs is indicated.
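A hedged sketch of such a model: a serial fraction, a parallel remainder limited by hardware, and a synchronization term that grows with the number of processes, so execution time eventually rises as processes are added. The form and names below are illustrative, not the paper's exact model:

```python
def model_time(p, serial_frac, work=1.0, sync_overhead=0.0, hw_limit=None):
    """Execution time with p processes: serial part, parallel part
    divided over min(p, hw_limit) processes, plus a critical-section
    synchronization cost growing with p."""
    effective = min(p, hw_limit) if hw_limit else p
    return (serial_frac * work
            + (1.0 - serial_frac) * work / effective
            + sync_overhead * p)

def speedup(p, **kw):
    """Speedup relative to one process under the same model parameters."""
    return model_time(1, **kw) / model_time(p, **kw)
```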


01 Jan 1987
TL;DR: The problems that are considered include: finding representations of entities and single valued properties, selecting a set of indices to support access to groups of entities occurring as class extensions or as values of many-valued properties, mapping transactions to forms that automatically maintain indices, and compiling queries.
Abstract: In this thesis, we consider some of the problems of physical design for the more recently proposed data models. These newer models, called semantic data models, adopt concepts developed by artificial intelligence researchers investigating the general problem of knowledge representation. Our results apply to a particular choice of model, called LDM, that is also developed in the thesis. LDM incorporates the most common features of other semantic data models including a capability for a generalization hierarchy that supports multiple inheritance, support for many-valued properties and a non-procedural query language. This has the advantage that implementors of these other models can then apply our techniques for physical design to solve similar implementation problems. The performance issues we address are based on the assumption that all encoding of information is memory resident. With this assumption, some problems, such as the choice of representation for entities and simple property values, become important issues. Other issues relating to access strategies for implementing queries or to the choice of index types and their selection, are fundamentally changed. The assumption also permits us to ignore clustering problems (problems concerning the judicious placement of data in order to reduce retrieval overhead), since they then have much less relative significance to overall performance. The problems that are considered include: finding representations of entities and single valued properties, selecting a set of indices to support access to groups of entities occurring as class extensions or as values of many-valued properties, mapping transactions to forms that automatically maintain indices, and compiling queries.

01 Oct 1987
TL;DR: The rollback chip is proposed to manage the state of a processor and provide an efficient rollback mechanism within a node of a parallel computer, and it may also be used in other applications using the Time Warp mechanism, notably distributed database concurrency control.
Abstract: Distributed simulation offers an attractive means of meeting the high computational demands of discrete event simulation programs. The Time Warp mechanism has been proposed to ensure correct sequencing of events in distributed simulation programs without blocking processes unnecessarily. However, the overhead of state saving and rollback in Time Warp is one obstacle that may severely degrade performance. A special purpose hardware component, the rollback chip (RBC), is proposed to manage the state of a processor and provide an efficient rollback mechanism within a node of a parallel computer. The chip may be viewed as a special purpose memory management unit that lies on the data path between processor and memory. The algorithm implemented by the rollback chip is described, as well as extensions to the basic design. Implementation of the chip is briefly discussed. In addition to distributed simulation, the rollback chip may be used in other applications using the Time Warp mechanism, notably distributed database concurrency control.
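The rollback chip's function can be mimicked in software by checkpointing state before events and restoring the latest checkpoint at or before the rollback time; a sketch of that behavior (the real chip does this in hardware on the processor-memory data path):

```python
import copy

class RollbackState:
    """Software sketch of Time Warp state saving: checkpoint the state
    at each virtual time, and on receiving a straggler event restore
    the latest checkpoint at or before the rollback time."""
    def __init__(self, state):
        self.state = state
        self.checkpoints = []  # (virtual_time, saved_state), ascending

    def checkpoint(self, vtime):
        self.checkpoints.append((vtime, copy.deepcopy(self.state)))

    def rollback(self, vtime):
        # Discard checkpoints from the rolled-back (undone) future.
        while self.checkpoints and self.checkpoints[-1][0] > vtime:
            self.checkpoints.pop()
        if self.checkpoints:
            self.state = copy.deepcopy(self.checkpoints[-1][1])
        return self.state
```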

Patent
15 Jul 1987
TL;DR: In this paper, a tandem sawmill assembly with two linearly spaced sawing stations is described, where a conveyor system moves the log through the first sawing station and up an inclined ramp for rotating the log ninety degrees prior to moving the log to the second sawing stage.
Abstract: A tandem sawmill assembly is disclosed having two linearly spaced sawing stations. A conveyor system moves the log through the first sawing station and up an inclined ramp for rotating the log ninety degrees prior to moving the log through the second sawing station. An overhead roller engages the re-oriented log and advances it to a centering platform where centering arms engage the log to orient the log axially with respect to the second sawing station. The conveyor then moves the aligned log through the second sawing station while a second overhead roller applies pressure to the upper surface of the log.

Journal ArticleDOI
TL;DR: It is shown that a reasonably flexible interprocess communication can be supported with only a small increase in complexity and overhead.


01 Jul 1987
TL;DR: This thesis presents efficient algorithms applicable to the simulation of special classes of systems such that almost no overhead messages are required, and develops a new sequential simulation algorithm based on a distributed one.
Abstract: In this thesis we present efficient algorithms for distributed simulation, and for the related problems of termination detection and sequential simulation. We present distributed simulation algorithms applicable to the simulation of special classes of systems such that almost no overhead messages are required. By contrast, previous distributed simulation algorithms, although applicable to the general class of any discrete event system, usually require too many overhead messages. First, we define a simple distributed simulation algorithm with nearly zero overhead messages for simulating feedforward systems. We develop an approximate method to predict its performance in simulating a class of feedforward queuing networks. We evaluate the performance of the scheme in simulating specific subclasses of these queuing networks. We show that the scheme offers a high performance for serial-parallel networks. Next we define another distributed simulation scheme for a class of distributed systems whose topologies may have cycles. One important problem in devising distributed simulation algorithms is that of efficient detection of termination. With this in mind, we devise a class of termination detection algorithms using markers. Finally, we develop a new sequential simulation algorithm based on a distributed one. This algorithm often reduces the event list manipulations of traditional event list driven simulation.


Book ChapterDOI
01 Jan 1987
TL;DR: The steps of distributed recovery using distributed system checkpoints are described, and by measurement of the runtime overhead of a realistic application (2D-Poisson-multigrid) its efficiency is discussed in comparison to recovery techniques using central system checkpoints.
Abstract: This paper describes a technique for distributed recovery in multiprocessor ring configurations, which has been developed and implemented for the multiprocessor system DIRMU 25 — a 25 processor system which is operational at the University of Erlangen-Nuremberg. First a short overview of the DIRMU hardware architecture and the distributed operating system DIRMOS is given. The steps of distributed recovery using distributed system checkpoints are described. By measurement of the runtime overhead of a realistic application (2D-Poisson-multigrid) its efficiency is discussed in comparison to recovery techniques using central system checkpoints.