Open Access

Workload, Performance, and Reliability of Digital Computing Systems.

TLDR
A new modeling methodology to characterize failure processes in time-sharing systems due to hardware transients and software errors is summarized, which gives quantitative relationships between performance, workload, and (lack of) reliability for digital computing systems.
Abstract
In this paper a new modeling methodology to characterize failure processes in time-sharing systems due to hardware transients and software errors is summarized. The basic assumption made is that the instantaneous failure rate of a system resource can be approximated by a deterministic function of time plus a zero-mean stationary Gaussian process, both depending on the usage of the resource considered. The probability density function of the time to failure obtained under this assumption has a decreasing hazard function, partially explaining why other decreasing-hazard densities such as the Weibull fit experimental data so well. Furthermore, by considering the kernel of the operating system as a system resource, this methodology sets the basis for independent methods of evaluating the contribution of software to system unreliability, and gives some non-obvious hints about how system reliability could be improved. A real system has been characterized according to this methodology, and an extremely good fit between predicted and observed behavior has been found. The predicted system behavior is also compared with the predictions of other models such as the exponential, Weibull, and periodic failure-rate models.

The work presented in this paper describes a new modeling methodology that gives quantitative relationships between performance, workload, and (lack of) reliability for digital computing systems. Current methodologies for reliability assessment may provide good models for explaining and predicting the behavior of systems in the presence of hard (recurrent) faults, but the effect and characterization of transient (non-recurrent) faults and software errors (whether of design or implementation) remain elusive. Moreover, current reliability measures do not give individual users a sense of the impact of unreliability on performance in general-purpose systems operating under a variety of workloads; that is, there are no general methods for quantitatively assessing the benefits of fault tolerance.
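For orientation, here is a minimal sketch of the kind of calculation the abstract alludes to, written in our own notation and under simplifying assumptions (constant deterministic rate d and stationary noise with nonnegative autocovariance C(τ)); it is not the paper's exact derivation.

```latex
% Sketch: failure rate = deterministic part + zero-mean stationary Gaussian noise.
\lambda(t) = d + x(t), \qquad
\int_0^t x(s)\,ds \;\sim\; \mathcal{N}\!\big(0,\, v(t)\big), \qquad
v(t) = 2\int_0^t (t-\tau)\, C(\tau)\, d\tau .

% Averaging the conditional survival function over the Gaussian process
% (moment-generating function of a zero-mean Gaussian variable):
R(t) = \mathbb{E}\!\left[ e^{-\int_0^t \lambda(s)\, ds} \right]
     = e^{-d\,t + \frac{1}{2} v(t)}, \qquad
h(t) = -\frac{d}{dt} \ln R(t) = d - \int_0^t C(\tau)\, d\tau .
```

Since the integral of C(τ) grows toward its limiting value, the observed hazard h(t) decreases from d toward a lower asymptote, which is the decreasing-hazard behavior the abstract reports. This sketch ignores the fact that a Gaussian term can momentarily drive λ(t) negative; it is only meant to show where the decreasing hazard comes from.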


Citations
Proceedings ArticleDOI

A large-scale study of failures in high-performance computing systems

TL;DR: Analysis of failure data collected at two large high-performance computing sites finds that average failure rates differ wildly across systems, ranging from 20-1000 failures per year, and that time between failures is modeled well by a Weibull distribution with decreasing hazard rate.
Journal ArticleDOI

A Large-Scale Study of Failures in High-Performance Computing Systems

TL;DR: Analysis of failure data collected at two large high-performance computing sites finds that average failure rates differ wildly across systems, ranging from 20-1000 failures per year, and that time between failures is modeled well by a Weibull distribution with decreasing hazard rate.
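As an aside, the "decreasing hazard rate" property mentioned in this TL;DR corresponds to the shape-parameter-below-one regime of the Weibull; the snippet below is an illustrative sketch with made-up parameters, not values fitted in the cited study.

```python
import numpy as np

def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

# Hypothetical parameters: shape < 1 yields a hazard that falls with time
# since the last failure, i.e. the longer a node has survived, the lower
# its instantaneous failure rate.
t_hours = np.array([1.0, 10.0, 100.0, 1000.0])
print(weibull_hazard(t_hours, shape=0.7, scale=500.0))
```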
Proceedings ArticleDOI

Software defects and their impact on system availability-a study of field failures in operating systems

TL;DR: It is shown that the impact of an overlay defect is, on average, much higher than that of a regular defect, that boundary conditions and allocation management are the major causes of overlay defects, not timing, and that most overlays are small and corrupt data near the data that the programmer meant to update.
Journal ArticleDOI

Closed-Form Solutions of Performability

TL;DR: This paper considers the modeling of a degradable buffer/multiprocessor system whose performance Y is the (normalized) average throughput rate realized during a bounded interval of time and shows that a closed-form solution of performability can indeed be obtained.
References
Journal ArticleDOI

Topics in the Theory of Random Noise

Journal ArticleDOI

The CRAY-1 computer system

TL;DR: The CRAY-1 is the only computer built to date that satisfies ERDA's Class VI requirement (a computer capable of processing from 20 to 60 million floating point operations per second), and its Fortran compiler (CFT) is designed to give the scientific user immediate access to the benefits of the CRAY-1's vector processing architecture.
Journal ArticleDOI

A theory of software reliability and its application

TL;DR: The reliability model that has been developed can be used in making system tradeoffs involving software or software and hardware components and provides a soundly based unit of measure for the comparative evaluation of various programming techniques that are expected to enhance reliability.
Journal ArticleDOI

Characterization of cyclostationary random signal processes

TL;DR: This paper examines two methods for representing nonstationary processes that reveal the special properties possessed by CS processes, and shows that the HSR is particularly appropriate for characterizing the structural properties of CS processes and that the TSR provides natural models for many types of communication signal formats.
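For readers unfamiliar with the term, here is the standard wide-sense definition of a cyclostationary (CS) process (our summary, not a quotation from the cited paper): the mean and autocorrelation are periodic in time with some period T.

```latex
% Wide-sense cyclostationarity with period T:
m_x(t + T) = m_x(t), \qquad
R_x(t_1 + T,\, t_2 + T) = R_x(t_1,\, t_2) \quad \text{for all } t_1, t_2 .
```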