Open Access

Workload, Performance, and Reliability of Digital Computing Systems.

TLDR
A new modeling methodology to characterize failure processes in time-sharing systems due to hardware transients and software errors is summarized, which gives quantitative relationships between performance, workload, and (lack of) reliability for digital computing systems.
Abstract
In this paper a new modeling methodology to characterize failure processes in time-sharing systems due to hardware transients and software errors is summarized. The basic assumption made is that the instantaneous failure rate of a system resource can be approximated by a deterministic function of time plus a zero-mean stationary Gaussian process, both depending on the usage of the resource considered. The probability density function of the time to failure obtained under this assumption has a decreasing hazard function, partially explaining why other decreasing-hazard densities such as the Weibull fit experimental data so well. Furthermore, by considering the kernel of the operating system as a system resource, this methodology sets the basis for independent methods of evaluating the contribution of software to system unreliability, and gives some non-obvious hints about how system reliability could be improved. A real system has been characterized according to this methodology, and an extremely good fit between predicted and observed behavior has been found. The predicted system behavior is also compared with the predictions of other models such as the exponential, Weibull, and periodic failure-rate models.

The work presented in this paper describes a new modeling methodology that gives quantitative relationships between performance, workload, and (lack of) reliability for digital computing systems. Current methodologies for reliability assessment may provide good models for explaining and predicting the behavior of systems in the presence of hard (recurrent) faults, but the effect and characterization of transient (non-recurrent) faults and software errors (whether of design or implementation) remain elusive. Moreover, current reliability measures do not give individual users a sense of the impact of unreliability on performance in general-purpose systems operating under a variety of workloads; that is, there are no general methods for quantitatively assessing the benefits of fault tolerance.
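For orientation, here is a minimal sketch of the kind of calculation the abstract alludes to, written in our own notation and under simplifying assumptions (constant deterministic rate d and stationary noise with nonnegative autocovariance C(τ)); it is not the paper's exact derivation.

```latex
% Sketch: failure rate = deterministic part + zero-mean stationary Gaussian noise.
\lambda(t) = d + x(t), \qquad
\int_0^t x(s)\,ds \;\sim\; \mathcal{N}\!\big(0,\, v(t)\big), \qquad
v(t) = 2\int_0^t (t-\tau)\, C(\tau)\, d\tau .

% Averaging the conditional survival function over the Gaussian process
% (moment-generating function of a zero-mean Gaussian variable):
R(t) = \mathbb{E}\!\left[ e^{-\int_0^t \lambda(s)\, ds} \right]
     = e^{-d\,t + \frac{1}{2} v(t)}, \qquad
h(t) = -\frac{d}{dt} \ln R(t) = d - \int_0^t C(\tau)\, d\tau .
```

Since the integral of C(τ) grows toward its limiting value, the observed hazard h(t) decreases from d toward a lower asymptote, which is the decreasing-hazard behavior the abstract reports. This sketch ignores the fact that a Gaussian term can momentarily drive λ(t) negative; it is only meant to show where the decreasing hazard comes from.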


Citations
Proceedings ArticleDOI

A large-scale study of failures in high-performance computing systems

TL;DR: Analysis of failure data collected at two large high-performance computing sites finds that average failure rates differ wildly across systems, ranging from 20-1000 failures per year, and that time between failures is modeled well by a Weibull distribution with decreasing hazard rate.
Journal ArticleDOI

A Large-Scale Study of Failures in High-Performance Computing Systems

TL;DR: Analysis of failure data collected at two large high-performance computing sites finds that average failure rates differ wildly across systems, ranging from 20-1000 failures per year, and that time between failures is modeled well by a Weibull distribution with decreasing hazard rate.
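As an aside, the "decreasing hazard rate" property mentioned in this TL;DR corresponds to the shape-parameter-below-one regime of the Weibull; the snippet below is an illustrative sketch with made-up parameters, not values fitted in the cited study.

```python
import numpy as np

def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

# Hypothetical parameters: shape < 1 yields a hazard that falls with time
# since the last failure, i.e. the longer a node has survived, the lower
# its instantaneous failure rate.
t_hours = np.array([1.0, 10.0, 100.0, 1000.0])
print(weibull_hazard(t_hours, shape=0.7, scale=500.0))
```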
Proceedings ArticleDOI

Software defects and their impact on system availability-a study of field failures in operating systems

TL;DR: It is shown that the impact of an overlay defect is, on average, much higher than that of a regular defect, that boundary conditions and allocation management are the major causes of overlay defects, not timing, and that most overlays are small and corrupt data near the data that the programmer meant to update.
Journal ArticleDOI

Closed-Form Solutions of Performability

TL;DR: This paper considers the modeling of a degradable buffer/multiprocessor system whose performance Y is the (normalized) average throughput rate realized during a bounded interval of time and shows that a closed-form solution of performability can indeed be obtained.
References
Journal ArticleDOI

Topics in the Theory of Random Noise

Journal ArticleDOI

The CRAY-1 computer system

TL;DR: The CRAY-1 is the only computer built to date that satisfies ERDA's Class VI requirement (a computer capable of processing from 20 to 60 million floating point operations per second), and its Fortran compiler (CFT) is designed to give the scientific user immediate access to the benefits of the CRAY-1's vector processing architecture.
Journal ArticleDOI

A theory of software reliability and its application

TL;DR: The reliability model that has been developed can be used in making system tradeoffs involving software or software and hardware components and provides a soundly based unit of measure for the comparative evaluation of various programming techniques that are expected to enhance reliability.
Journal ArticleDOI

Characterization of cyclostationary random signal processes

TL;DR: This paper examines two methods for representing nonstationary processes that reveal the special properties possessed by CS processes, and shows that the HSR is particularly appropriate for characterizing the structural properties of CS processes and that the TSR provides natural models for many types of communication signal formats.
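For readers unfamiliar with the term, here is the standard wide-sense definition of a cyclostationary (CS) process (our summary, not a quotation from the cited paper): the mean and autocorrelation are periodic in time with some period T.

```latex
% Wide-sense cyclostationarity with period T:
m_x(t + T) = m_x(t), \qquad
R_x(t_1 + T,\, t_2 + T) = R_x(t_1,\, t_2) \quad \text{for all } t_1, t_2 .
```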