
Showing papers in "IEEE Transactions on Reliability in 1990"


Journal ArticleDOI
TL;DR: A census of customer outages reported to Tandem indicates that software is now the major source of reported outages, followed by system operations, a dramatic shift from the statistics for 1985.
Abstract: A census of customer outages reported to Tandem has been taken; it shows a clear improvement in the reliability of hardware and maintenance. It indicates that software is now the major source of reported outages (62%), followed by system operations (15%). This is a dramatic shift from the statistics for 1985. Even after discounting systematic underreporting of operations and environmental outages, the conclusion is clear: hardware faults and hardware maintenance are no longer a major source of outages. As the other components of the system become increasingly reliable, software necessarily becomes the dominant cause of outages. Achieving higher availability requires improvement in software quality and software fault tolerance, simpler operations, and tolerance of operational faults.

347 citations


Journal ArticleDOI
TL;DR: The authors present the results of an analysis that demonstrates that the log is composed of at least two error processes: transient and intermittent, and it is shown that the DFT can extract intermittent errors from the error log and uses only one fifth of the error-log entry points required for failure prediction.
Abstract: Most error-log analysis studies perform a statistical fit to the data assuming a single underlying error process. The authors present the results of an analysis that demonstrates that the log is composed of at least two error processes: transient and intermittent. The mixing of data from multiple processes requires many more events to verify a hypothesis using traditional statistical analysis. Based on the shape of the interarrival time function of the intermittent errors observed from actual error logs, a failure-prediction heuristic, the dispersion frame technique (DFT), is developed. The DFT was implemented in a distributed system for the campus-wide Andrew file system at Carnegie Mellon University. Data collected from 13 file servers over a 22-month period were analyzed using both the DFT and conventional statistical methods. It is shown that the DFT can extract intermittent errors from the error log and uses only one fifth of the error-log entry points required by statistical methods for failure prediction. The DFT achieved a 93.7% success rate in predicting failures in both electromechanical and electronic devices.

216 citations


Journal ArticleDOI
TL;DR: The authors derive reliability functions and mean time to failure of four different memory systems subject to transient errors at exponentially distributed arrival times and derive easy-to-use expressions for MTTF of memories.
Abstract: The authors analyze the problem of transient-error recovery in fault-tolerant memory systems, using a scrubbing technique. This technique is based on single-error-correction and double-error-detection (SEC-DED) codes. When a single error is detected in a memory word, the error is corrected and the word is rewritten in its original location. Two models are discussed: (1) exponentially distributed scrubbing, where a memory word is assumed to be checked in an exponentially distributed time period, and (2) deterministic scrubbing, where a memory word is checked periodically. Reliability and mean-time-to-failure (MTTF) equations are derived and estimated. The results of the scrubbing techniques are compared with those of memory systems without redundancies and with only SEC-DED codes. A major contribution of the analysis is easy-to-use expressions for MTTF of memories. The authors derive reliability functions and mean time to failure of four different memory systems subject to transient errors at exponentially distributed arrival times.

202 citations
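
As an illustration of the deterministic-scrubbing idea, the following Monte Carlo sketch (not the paper's closed-form MTTF expressions) estimates memory MTTF with and without periodic scrubbing; the word count, per-word error rate, and scrub periods are assumed values chosen only for demonstration.

```python
import math
import random

def memory_ttf(n_words, word_error_rate, scrub_period=None):
    """One sample of the time until any SEC-DED word holds two errors at once.
    Each word sees Poisson error arrivals; deterministic scrubbing clears a
    single pending error at every multiple of scrub_period (None = no scrubbing),
    so a word fails only if two errors land in the same scrub interval."""
    t_fail = math.inf
    for _ in range(n_words):
        t, last_interval = 0.0, None
        while True:
            t += random.expovariate(word_error_rate)
            if t >= t_fail:                    # another word already failed sooner
                break
            interval = int(t // scrub_period) if scrub_period is not None else 0
            if last_interval == interval:      # second error before the next scrub
                t_fail = t
                break
            last_interval = interval           # the earlier error will be scrubbed away
    return t_fail

random.seed(1)
TRIALS, WORDS, RATE = 100, 200, 1e-3           # illustrative parameters only
for period in (None, 50.0, 5.0):
    mttf = sum(memory_ttf(WORDS, RATE, period) for _ in range(TRIALS)) / TRIALS
    print(f"scrub period {period}: estimated MTTF ~ {mttf:.0f} time units")
```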


Journal ArticleDOI
TL;DR: In this paper, the relationship between the consecutive-k-out-of-n:F system and the consecutive-k-out-of-n:G system is studied, theorems for such systems are developed, and available results for one type of system are applied to the other; a case study illustrates reliability analysis and optimal design of a train operation system.
Abstract: A consecutive-k-out-of-n:F (consecutive-k-out-of-n:G) system consists of an ordered sequence of n components such that the system is failed (good) if and only if at least k consecutive components in the system are failed (good). In the present work, the relationship between the consecutive-k-out-of-n:F system and the consecutive-k-out-of-n:G system is studied, theorems for such systems are developed, and available results for one type of system are applied to the other. The topics include system reliability, reliability bounds, component reliability importance, and optimal system design. A case study illustrates reliability analysis and optimal design of a train operation system. An optimal configuration rule is suggested by use of the Birnbaum importance index.

141 citations
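
One form of the F/G relationship can be checked numerically. The sketch below assumes s-independent, identically reliable components; the F-system value comes from a standard run-length dynamic program, and the G-system value follows from the duality between "k consecutive failed" and "k consecutive good" events.

```python
def consec_k_F_reliability(n, k, p):
    """Reliability of a linear consecutive-k-out-of-n:F system with i.i.d.
    components of reliability p: the system works iff no k consecutive failures.
    state[r] = P(no failed run of length k yet, trailing failed run has length r)."""
    q = 1.0 - p
    state = [0.0] * k
    state[0] = 1.0
    for _ in range(n):
        new = [0.0] * k
        new[0] = sum(state) * p          # a working component resets the run
        for r in range(1, k):
            new[r] = state[r - 1] * q    # one more consecutive failure
        state = new                      # runs reaching length k are dropped (system failed)
    return sum(state)

def consec_k_G_reliability(n, k, p):
    """Consecutive-k-out-of-n:G system: good iff some k consecutive components are good.
    By duality this equals the failure probability of an F system whose components
    fail with probability p (i.e., have reliability 1 - p)."""
    return 1.0 - consec_k_F_reliability(n, k, 1.0 - p)

if __name__ == "__main__":
    n, k, p = 10, 3, 0.9
    print("F-system reliability:", consec_k_F_reliability(n, k, p))
    print("G-system reliability:", consec_k_G_reliability(n, k, p))
```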


Journal ArticleDOI
TL;DR: In a preliminary effort to understand and catalog how networks behave under various conditions, two cases of anomalous behavior are analyzed in detail.
Abstract: Fault detection and diagnosis depend critically on good fault definitions, but the dynamic, noisy, and nonstationary character of networks makes it hard to define what a fault is in a network environment. The authors take the position that a fault or failure is a violation of expectations. In accordance with empirically based expectations, operating behaviors of networks (and other devices) can be classified as being either normal or anomalous. Because network failures most frequently manifest themselves as performance degradations or deviations from expected behavior, periods of anomalous performance can be attributed to causes assignable as network faults. The half-year case study presented used a system in which observations of distributed-computing network behavior were automatically and systematically classified as normal or anomalous. Anomalous behaviors were traced to faulty conditions. In a preliminary effort to understand and catalog how networks behave under various conditions, two cases of anomalous behavior are analyzed in detail. Examples are taken from the distributed file-system network at Carnegie Mellon University.

114 citations


Journal ArticleDOI
TL;DR: A software reliability growth model based on a nonhomogeneous Poisson process is introduced that describes the time-dependent behavior of software errors detected and testing-resource expenditures spent during the testing.
Abstract: Two kinds of software-testing management problems are considered: a testing-resource allocation problem, how to best use specified testing resources during module testing, and a testing-resource control problem, how to spend the allocated testing-resource expenditures during module testing. A software reliability growth model based on a nonhomogeneous Poisson process is introduced. The model describes the time-dependent behavior of software errors detected and testing-resource expenditures spent during the testing. The optimal allocation and control of testing resources among software modules can improve reliability and shorten the testing stage. Based on the model, numerical examples of these two software-testing management problems are presented.

108 citations
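
The paper's own allocation model is not reproduced here; the following hedged sketch illustrates the generic version of the problem, minimizing the expected number of remaining faults sum_i a_i*exp(-b_i*x_i) across modules subject to a fixed testing-resource budget, with the Lagrange multiplier found by bisection. The module parameters a_i, b_i and the budget are made-up values.

```python
import math

def allocate_testing_resource(a, b, total):
    """Split a total testing-resource budget across modules so that the expected
    number of remaining faults sum_i a[i]*exp(-b[i]*x[i]) is minimized.
    Classical Lagrangian solution: x_i = max(0, ln(a_i*b_i/lam)/b_i), with the
    multiplier lam found by bisection so that the budget is (almost) exactly used."""
    def spent(lam):
        return sum(max(0.0, math.log(a_i * b_i / lam) / b_i) for a_i, b_i in zip(a, b))

    lo, hi = 1e-12, max(a_i * b_i for a_i, b_i in zip(a, b))   # spent(hi) == 0
    for _ in range(200):                                        # geometric bisection on lam
        mid = math.sqrt(lo * hi)
        if spent(mid) > total:
            lo = mid
        else:
            hi = mid
    lam = hi
    return [max(0.0, math.log(a_i * b_i / lam) / b_i) for a_i, b_i in zip(a, b)]

# three modules: initial fault content a_i and detectability b_i are assumed numbers
a, b = [80.0, 50.0, 20.0], [0.010, 0.020, 0.015]
x = allocate_testing_resource(a, b, total=300.0)
print("allocation:", [round(v, 1) for v in x])
print("expected remaining faults:",
      sum(ai * math.exp(-bi * xi) for ai, bi, xi in zip(a, b, x)))
```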


Journal ArticleDOI
TL;DR: The author's method requires considerably less computer time to obtain results comparable to those of the other methods, and it has a low degree of programming difficulty.
Abstract: A computationally simple approach for reliability-redundancy optimization problems is proposed. It is compared by means of a simulation study with two existing approaches: (1) the LMBB method, which incorporates the Lagrange multiplier technique in conjunction with the Kuhn-Tucker condition and the branch-and-bound method, and (2) sequential search techniques in combination with heuristic redundancy allocation methods, including an extension of combinations of four heuristics and two search techniques. Using 100 sets of randomly generated test problems with nonlinear constraints for both series systems and a complex system, the authors measured and evaluated the performance of these approaches in terms of optimality rate, error rate, and execution time. In general, the author's method requires considerably less computer time to obtain results comparable to those of the other methods, and it has a low degree of programming difficulty.

98 citations


Journal ArticleDOI
TL;DR: In this article, a two-dimensional version of the consecutive k-out-of-n:F model is considered and bounds on system failure probabilities are determined by comparison with the usual one-dimensional model.
Abstract: A two-dimensional version of the consecutive-k-out-of-n:F model is considered. Bounds on system failure probabilities are determined by comparison with the usual one-dimensional model. Failure probabilities are determined by simulation for a variety of values of k and n.

96 citations
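
A small Monte Carlo sketch of such a simulation is shown below, assuming the square-grid variant in which the system fails if and only if some k-by-k block of components has all failed; the grid size, k, and component failure probabilities are illustrative only.

```python
import numpy as np

def two_dim_consec_k_failure_prob(n, k, q, trials=10000, seed=0):
    """Monte Carlo estimate of the failure probability of a two-dimensional
    consecutive-k-out-of-n:F system: an n-by-n grid of i.i.d. components, each
    failed with probability q, where the system fails iff some k-by-k square
    consists entirely of failed components."""
    rng = np.random.default_rng(seed)
    failures = 0
    for _ in range(trials):
        failed = rng.random((n, n)) < q
        # sum every k-row window, then every k-column window, to get k*k block sums
        rows = np.lib.stride_tricks.sliding_window_view(failed, k, axis=0).sum(axis=-1)
        blocks = np.lib.stride_tricks.sliding_window_view(rows, k, axis=1).sum(axis=-1)
        failures += bool((blocks == k * k).any())
    return failures / trials

for q in (0.05, 0.10, 0.20):
    print(q, two_dim_consec_k_failure_prob(n=10, k=2, q=q))
```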


Journal ArticleDOI
TL;DR: A minor modification of the ALR algorithm called the Abraham-Locks-Wilson (ALW) method is described, an alternative method of ordering paths and terms that obtains a shorter disjoint system formula on a test example than any previous SDP method and allows small computational savings in processing large paths of complex networks.
Abstract: The Abraham-Locks-revised (ALR) sum-of-disjoint products (SDP) algorithm is an efficient method for obtaining a system reliability formula. The author describes a minor modification of the ALR algorithm called the Abraham-Locks-Wilson (ALW) method. The new feature is an alternative method of ordering paths and terms. ALW obtains a shorter disjoint system formula on a test example than any previous SDP method and allows small computational savings in processing large paths of complex networks. As there are different ways to obtain a reliability formula, it is useful to use an approach which yields the smallest formula relative to the computational effort expended. The extra effort in ordering the terms should be reasonably small and usually leads to improved efficiency in the later stages of the algorithm. ALW allows the analyst to operate in a more efficient way on many problems, particularly if the overlap ordering is used in the early stages of processing but is probably ignored for terms that contain a majority of the Boolean variables.

81 citations


Journal ArticleDOI
A. Gandini
TL;DR: In this paper, the authors proposed a method for sensitivity analysis based on generalized perturbation theory (GPT) methodology, widely adopted in reactor-physics studies, and the concept of importance of a state in the Markov model representation of systems is introduced.
Abstract: After reviewing various importance concepts adopted in reliability, the authors propose a method for sensitivity analysis. The method uses the heuristically based generalized perturbation theory (GPT) methodology, widely adopted in reactor-physics studies. The concept of importance of a state in the Markov model representation of systems is introduced. The resulting formulations apply to any response of interest in reliability analysis. The relationship between the GPT method and Birnbaum importance is given.

77 citations


Journal ArticleDOI
TL;DR: In this article, the failure probability of m-consecutive-k-out-of-n:F systems was studied and three theorems concerning such systems were proved.
Abstract: An m-consecutive-k-out-of-n:F system consists of n components ordered on a line; the system fails if and only if there are at least m nonoverlapping runs of k consecutive failed components. Three theorems concerning such systems are stated and proved. Theorem one is a recursive formula to compute the failure probability of such a system. Theorem two is an exact formula for the failure probability. Theorem three is a limit theorem for the failure probability.
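
For small n, the failure probability can be checked by brute-force enumeration (a sanity check under assumed parameters, not the paper's recursive or exact formulas), taking s-independent components with a common failure probability q.

```python
from itertools import product

def m_consec_k_out_of_n_failure_prob(m, k, n, q):
    """Exact (brute-force) failure probability of an m-consecutive-k-out-of-n:F
    system with i.i.d. components failing with probability q: the system fails iff
    the line of n components contains at least m nonoverlapping runs of k failures.
    Only practical for small n; a recursive formula scales far better."""
    total = 0.0
    for states in product((0, 1), repeat=n):      # 1 = failed component
        # greedy left-to-right count of nonoverlapping failed runs of length k
        runs = run = 0
        for s in states:
            run = run + 1 if s else 0
            if run == k:
                runs += 1
                run = 0                           # counted runs must not overlap
        if runs >= m:
            f = sum(states)
            total += q ** f * (1 - q) ** (n - f)
    return total

print(m_consec_k_out_of_n_failure_prob(m=2, k=2, n=8, q=0.3))
```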

Journal ArticleDOI
TL;DR: In this article, the authors present a simple, easy-to-understand approximation to the renewal function that is easy to implement on a personal computer and works very well with one term if not too much accuracy is required.
Abstract: The authors present a simple, easy-to-understand approximation to the renewal function that is easy to implement on a personal computer. The key idea is that, for small values of time, the renewal function is almost equal to the cumulative distribution function of the interrenewal time, whereas for larger values of time an asymptotic expansion depending only on the first and second moments of the interrenewal time can be used. The relative error is typically smaller than a few percent for Weibull interrenewal times. The simple approximation method works very well with one term if not too much accuracy is required (e.g. in the block replacement problem) or if the interrenewal (failure) distribution is not exactly known (e.g. only the first two moments are known). Although the accuracy of the simple approximation can be improved by increasing the number of terms, this strategy is not advocated since speed and simplicity are lost. If high accuracy is required, it is better to use another approximating method (e.g. power series expansion or cubic splines method).
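
The two regimes can be illustrated for Weibull interrenewal times (a sketch under the stated assumptions, not the paper's exact switching rule): the cdf approximation for small t and the classical asymptotic expansion M(t) ≈ t/μ + σ²/(2μ²) − 1/2 for large t, compared against a Monte Carlo estimate.

```python
import math
import random

def weibull_moments(shape, scale):
    mu = scale * math.gamma(1.0 + 1.0 / shape)
    m2 = scale ** 2 * math.gamma(1.0 + 2.0 / shape)
    return mu, m2 - mu ** 2                      # mean and variance

def renewal_small_t(t, shape, scale):
    """Small-t regime: M(t) is close to the interrenewal cdf F(t)."""
    return 1.0 - math.exp(-((t / scale) ** shape))

def renewal_large_t(t, shape, scale):
    """Large-t regime: asymptotic expansion M(t) ~ t/mu + var/(2 mu^2) - 1/2."""
    mu, var = weibull_moments(shape, scale)
    return t / mu + var / (2.0 * mu ** 2) - 0.5

def renewal_simulated(t, shape, scale, trials=20000, seed=2):
    """Monte Carlo reference value: expected number of renewals in (0, t]."""
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        s = rng.weibullvariate(scale, shape)
        while s <= t:
            count += 1
            s += rng.weibullvariate(scale, shape)
    return count / trials

shape, scale = 2.0, 1.0                          # illustrative Weibull parameters
for t in (0.3, 1.0, 5.0):
    print(t, renewal_small_t(t, shape, scale),
          renewal_large_t(t, shape, scale),
          renewal_simulated(t, shape, scale))
```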

Journal ArticleDOI
TL;DR: A general theory of software reliability that proposes that software failure rates are the product of the software average error size, apparent error density, and workload is developed and models of these factors that are consistent with the assumptions of classical software-reliability models are developed.
Abstract: A general theory of software reliability that proposes that software failure rates are the product of the software average error size, apparent error density, and workload is developed. Models of these factors that are consistent with the assumptions of classical software-reliability models are developed. The linear, geometric, and Rayleigh models are special cases of the general theory. Linear reliability models result from assumptions that the average size of remaining errors and the workload are constant and that the apparent error density equals the real error density. Geometric reliability models differ from linear models in assuming that the average error size decreases geometrically as errors are corrected, whereas the Rayleigh model assumes that the average size of remaining errors increases linearly with time. The theory shows that the abstract proportionality constants of classical models are composed of more fundamental and more intuitively meaningful factors, namely, the initial values of average size of remaining errors, real error density, workload, and error content. It is shown how the assumed behavior of the reliability primitives of software (average error size, error density, and workload) is modeled to accommodate diverse reliability factors.

Journal ArticleDOI
TL;DR: The author presents an overview of a methodology for the automated generation of fault trees for electrical/electronic circuits from a representation of a schematic diagram that uses backtracking.
Abstract: The author presents an overview of a methodology for the automated generation of fault trees for electrical/electronic circuits from a representation of a schematic diagram. Existing computer programs for the generation of fault trees are briefly discussed, and their deficiencies are indicated. The approach presented here is quantitative and uses backtracking. It is illustrated by an example. A prototype computer program has been written to implement the methodology for DC circuits.

Journal ArticleDOI
TL;DR: It is concluded that the fault-injection test sequence has evidenced the limited performance of the self-checking mechanisms implemented on the tested NAC (network attachment controller) and justified the need for the improved self- checking mechanisms implemented in an enhanced NAC architecture using duplicated circuitry.
Abstract: The authors present a study of the validation of a dependable local area network providing multipoint communication services based on an atomic multicast protocol. This protocol is implemented in specialized communication servers that exhibit the fail-silent property, i.e. a kind of halt-on-failure behavior enforced by self-checking hardware. The tests that have been carried out utilize physical fault injection and have two objectives: (1) to estimate the coverage of the self-checking mechanisms of the communication servers, and (2) to test the properties that characterize the service provided by the atomic multicast protocol in the presence of faults. The testbed that has been developed to carry out the fault-injection experiments is described, and the major results are presented and analyzed. It is concluded that the fault-injection test sequence has evidenced the limited performance of the self-checking mechanisms implemented on the tested NAC (network attachment controller) and justified (especially for the main board) the need for the improved self-checking mechanisms implemented in an enhanced NAC architecture using duplicated circuitry.

Journal ArticleDOI
TL;DR: For censored Weibull regression data arising from typical accelerated life tests (ALTs), the performance of small-sample normal-theory confidence intervals is summarized by three points: (1) they have highly asymmetric error rates; (2) they can be extremely anti-conservative; and (3) these effects worsen when higher confidence levels are used.
Abstract: For censored Weibull regression data arising from typical accelerated life tests (ALTs), the performance of small-sample normal-theory confidence intervals is summarized by three points: (1) they have highly asymmetric error rates; (2) they can be extremely anti-conservative; and (3) these effects worsen when higher confidence levels are used. Likelihood-ratio-based confidence intervals have much more symmetric error rates which are not as extremely anti-conservative as normal-theory intervals can be. For typical ALTs, likelihood-ratio-based confidence intervals are better than those based on asymptotic normal theory. Likelihood-ratio-based confidence intervals require more computation than intervals based on the asymptotic normality of the maximum-likelihood estimators. The resource spent on computing is, however, usually very small compared to the other costs involved in an ALT.

Journal ArticleDOI
D. Rosenthal, B.C. Wadell
TL;DR: In this article, a quantitative approach is proposed for setting built-in test (BIT) measurement limits and this method is applied to the specific case of a constant failure rate system whose BITE measurements are corrupted by Gaussian noise.
Abstract: Failures detected by built-in test equipment (BITE) occur because of BITE measurement noise or bias as well as actual hardware failures. A quantitative approach is proposed for setting built-in test (BIT) measurement limits, and this method is applied to the specific case of a constant-failure-rate system whose BITE measurements are corrupted by Gaussian noise. Guidelines for setting BIT measurement limits are presented for a range of system MTBF (mean time between failures) values and BIT run times. The technique was applied to a BIT for an analog VLSI test system with excellent results, showing it to be a powerful tool for predicting tests with the potential for false alarms. It was discovered that, for this test case, false alarms are avoidable.
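
A hedged sketch of the underlying trade-off is given below: a healthy unit's BITE reading is zero-mean Gaussian noise, a failed unit's reading is shifted by a known amount, and the hardware follows a constant failure rate. All numerical values are assumptions for illustration, not the paper's guidelines.

```python
import math

def phi(x):                       # standard normal cdf
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bit_limit_tradeoff(limit, sigma, fault_shift, mtbf, run_time):
    """Per-run false-alarm and missed-detection probabilities for a BIT that
    declares a failure when |measurement| > limit.  A healthy unit reads 0 plus
    Gaussian noise (std dev sigma); a failed unit's reading is shifted by
    fault_shift.  The chance that a fault is actually present during the run
    follows a constant failure rate (exponential) with the given MTBF."""
    p_false_alarm = 2.0 * (1.0 - phi(limit / sigma))
    p_missed = phi((limit - fault_shift) / sigma) - phi((-limit - fault_shift) / sigma)
    p_fault_present = 1.0 - math.exp(-run_time / mtbf)
    p_any_alarm = ((1.0 - p_fault_present) * p_false_alarm
                   + p_fault_present * (1.0 - p_missed))
    return p_false_alarm, p_missed, p_any_alarm

for limit in (2.0, 3.0, 4.0, 5.0):
    fa, miss, alarm = bit_limit_tradeoff(limit, sigma=1.0, fault_shift=6.0,
                                         mtbf=2000.0, run_time=1.0)
    print(f"limit={limit}: P(false alarm)={fa:.2e}  P(miss | fault)={miss:.2e}  "
          f"P(alarm)={alarm:.2e}")
```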

Journal ArticleDOI
TL;DR: It is shown how to schedule checkpoints to minimize the mean total time to finish a given job and applications to the M/G/1 queuing system are touched on.
Abstract: At checkpoints during the operation of a computer, the state of the system is saved. Whenever a machine fails, it is repaired and then reset to the state saved at the latest checkpoint. In the present work, save times are known constants and repair times are random variables; failures are the epochs of a given renewal process. In scheduling the checkpoints, the cost of saves must be traded off against the cost of work lost when the computer fails. It is shown how to schedule checkpoints to minimize the mean total time to finish a given job. Similar optimization results are obtained for the tails of the distribution of the finishing time. Two variants of the basic model are considered. In one, the computer receives maintenance during each save; in the other it does not. Applications to the M/G/1 queuing system are touched on.
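
The sketch below is not the paper's renewal-process optimization; it uses a simpler first-order model with Poisson failures and a constant save time s, where the overhead per unit of useful work is roughly s/T + λT/2 and the optimal checkpoint interval is near sqrt(2s/λ). The save time and failure rate are assumed values.

```python
import math

def overhead(T, save_time, failure_rate):
    """Approximate fraction of extra time per unit of useful work when checkpoints
    are taken every T units: the cost of saves (save_time/T) plus the expected
    rework after a failure (about half an interval, failure_rate*T/2)."""
    return save_time / T + failure_rate * T / 2.0

save_time, failure_rate = 0.05, 1.0 / 50.0            # assumed save cost and failure rate
T_star = math.sqrt(2.0 * save_time / failure_rate)    # first-order optimum

# crude numerical check: scan a grid of checkpoint intervals
grid = [0.1 * i for i in range(1, 200)]
T_best = min(grid, key=lambda T: overhead(T, save_time, failure_rate))
print("analytic optimum:", round(T_star, 2), "grid optimum:", round(T_best, 2))
```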

Journal ArticleDOI
Andrew L. Reibman
TL;DR: In this paper, performability modeling, the combined analysis of reliability and performance, is introduced; some examples of applications where performance and reliability need to be modeled together are given; a strategy for modeling the effect of reliability on performance is outlined and metrics that help quantify this effect are discussed.
Abstract: In many high-reliability systems, subsystem or component failures that do not cause a system failure can still degrade system performance. When modeling such systems, ignoring the effect of reliability on performance can lead to incomplete or inaccurate results. In the present work, performability modeling, the combined analysis of reliability and performance, is introduced; some examples of applications where performance and reliability need to be modeled together are given; a strategy for modeling the effect of reliability on performance is outlined and metrics that help quantify this effect are discussed. Some mathematical models for performability are introduced and an example is used to illustrate how such models can be applied.
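
A minimal Markov-reward sketch of performability is shown below for a hypothetical two-processor system: each structure state carries a performance level (reward), and the expected steady-state reward combines reliability and performance in a single figure. The rates and reward levels are assumed values.

```python
import numpy as np

# A two-unit system: state 2 = both processors up, 1 = one up (degraded), 0 = down.
lam, mu = 0.01, 0.5                      # per-unit failure and repair rates (assumed)
reward = {2: 1.0, 1: 0.6, 0: 0.0}        # relative performance level in each state

# Generator matrix of the continuous-time Markov chain (rows sum to zero);
# rows/columns are ordered as states 2, 1, 0, with a single repair facility.
Q = np.array([[-2 * lam,  2 * lam,      0.0],
              [      mu, -(mu + lam),   lam],
              [     0.0,  mu,           -mu]])

# Steady-state probabilities: pi Q = 0 with pi summing to 1 (solved by least squares).
A = np.vstack([Q.T, np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

states = [2, 1, 0]
performability = sum(p * reward[s] for p, s in zip(pi, states))
print("steady-state probabilities:", pi.round(4))
print("expected performance level:", round(performability, 4))
print("plain availability (any processor up):", round(pi[0] + pi[1], 4))
```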

Journal ArticleDOI
Attila Csenki
TL;DR: The concepts of Bayes prediction analysis are used to obtain predictive distributions of the next time to failure of software when its past failure behavior is known and can show an improved predictive performance for some data sets even when compared with some more sophisticated software-reliability models.
Abstract: The concepts of Bayes prediction analysis are used to obtain predictive distributions of the next time to failure of software when its past failure behavior is known. The technique is applied to the Jelinski-Moranda software-reliability model, which can then show improved predictive performance for some data sets, even when compared with some more sophisticated software-reliability models. A Bayes software-reliability model is presented which can be applied to obtain the PDF (probability density function) and CDF (cumulative distribution function) of the next time to failure for all testing protocols. The number of initial faults and the per-fault failure rate are assumed to be s-independent and to be Poisson and gamma distributed, respectively. For certain data sets, the technique yields better predictions than some alternative methods if the prequential-likelihood and U-plot criteria are adopted.

Journal ArticleDOI
TL;DR: An independent N-version programming reliability model which distinguishes between correctness and agreement is proposed for treating small output spaces, and the reciprocal of the cardinality of output space is a lower bound on the average reliability of fault-tolerant system versions below which the system reliability begins to deteriorate as more versions are added.
Abstract: Under a voting strategy in a fault-tolerant software system there is a difference between correctness and agreement. An independent N-version programming reliability model which distinguishes between correctness and agreement is proposed for treating small output spaces. An alternative voting strategy, consensus voting, is used to treat cases when there can be agreement among incorrect outputs, a case which can occur with small output spaces. The consensus voting strategy automatically adapts the voting to various version reliability and output-space cardinality characteristics. The majority-voting strategy provides reliability which is a lower bound, and the 2-out-of-n voting strategy provides reliability which is an upper bound, on the reliability by consensus voting. The reciprocal of the cardinality of output space is a lower bound on the average reliability of fault-tolerant system versions below which the system reliability begins to deteriorate as more versions are added.
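
The difference between majority and consensus voting can be illustrated by simulation. The sketch below assumes a simplified version model, with each version correct with the same probability and wrong outputs drawn uniformly from the remaining values of a small output space, so that incorrect versions can agree by chance; it is not the paper's full model.

```python
import random
from collections import Counter

def voting_success_rates(n_versions, p_correct, space_size, trials=50000, seed=3):
    """Monte Carlo comparison of majority voting and consensus voting for an
    N-version system with a small output space.  Each version is correct with
    probability p_correct; an incorrect version outputs one of the other
    space_size - 1 values uniformly, so wrong versions can agree by chance."""
    rng = random.Random(seed)
    wins = {"majority": 0, "consensus": 0}
    for _ in range(trials):
        outputs = [0 if rng.random() < p_correct else rng.randrange(1, space_size)
                   for _ in range(n_versions)]          # 0 denotes the correct output
        counts = Counter(outputs)
        if counts[0] > n_versions // 2:                 # strict majority on the correct value
            wins["majority"] += 1
        top = max(counts.values())
        winners = [v for v, c in counts.items() if c == top]
        if rng.choice(winners) == 0:                    # plurality, ties broken at random
            wins["consensus"] += 1
    return {k: v / trials for k, v in wins.items()}

print(voting_success_rates(n_versions=5, p_correct=0.7, space_size=3))
```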

Journal ArticleDOI
TL;DR: In this paper, proportional hazards models are developed for estimating thin-oxide dielectric reliability and time-dependent dielectric-breakdown hazard rates, and the form of the electric-field acceleration factor is examined.
Abstract: Proportional hazards models for estimating thin-oxide dielectric reliability and time-dependent dielectric-breakdown hazard rates are developed. Two groups of models are considered: group one ignores interactions between temperature and electric field, while group two considers several forms of interactions. The inclusion of interaction is not statistically significant at the 1% level. An analysis of the Hokari time-dependent dielectric-breakdown data, in terms of the form of the electric-field acceleration factor, shows that the approach of I.C. Chen et al. (1985) is more appropriate than the approach of D.L. Crook (1979).

Journal ArticleDOI
TL;DR: A software release problem based on four software reliability growth models (SRGMs) with random life-cycle length is studied and the optimal values of release times are shown to be finite and unique.
Abstract: A software release problem based on four software reliability growth models (SRGMs) with random life-cycle length is studied. Testing of the software system is terminated after time T, and the system is released (sold) to the user at a price. The price of the software system and three cost components are considered, and average total profit is used as a criterion. The optimal values of release times are shown to be finite and unique. Hence, the optimal solutions can be obtained numerically by, for example, a bisection method. A numerical example indicates that the optimal release time increases as (1) the error rate in each model decreases and (2) the difference between the error-fixing cost during the test phase and that during the operational phase increases. The case of unknown model parameters is considered only for the Jelinski-Moranda model because a Bayes model is not available for other SRGMs. The release decision depends on testing time, but other stopping rules, for example based on the number of corrected errors, can be considered.
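
As a hedged illustration of the bisection idea (using the familiar exponential-NHPP cost model rather than the paper's four SRGMs and profit criterion), the sketch below finds the release time at which the marginal benefit of further testing stops paying for the testing cost; all cost and model parameters are assumed.

```python
import math

# Exponential SRGM: expected errors found by time t is m(t) = a*(1 - exp(-b*t)).
a, b = 100.0, 0.05                         # assumed fault content and detection rate
c_test, c_field, c_time = 1.0, 5.0, 0.4    # fix-in-test, fix-in-field, testing cost per time unit

def total_cost(T):
    m = a * (1.0 - math.exp(-b * T))
    return c_test * m + c_field * (a - m) + c_time * T

def d_cost(T):
    # derivative of total_cost; increasing in T whenever c_test < c_field
    return (c_test - c_field) * a * b * math.exp(-b * T) + c_time

lo, hi = 0.0, 1000.0
if d_cost(lo) >= 0.0:
    T_star = 0.0                 # testing never pays off under these costs
else:
    for _ in range(100):         # bisection on the sign change of the derivative
        mid = 0.5 * (lo + hi)
        if d_cost(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    T_star = 0.5 * (lo + hi)

print("optimal release time:", round(T_star, 2), "cost:", round(total_cost(T_star), 2))
```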

Journal ArticleDOI
TL;DR: The authors review and extend available techniques for achieving fault-tolerant programs using models that deal with program reliability for a single run, which seems more practical and straightforward than dealing with distributions as for hardware systems.
Abstract: The authors review and extend available techniques for achieving fault-tolerant programs. The representation of the techniques is uniform and is illustrated by simple examples. For each technique a fault tree has been developed to derive failure probability from the probabilities of the basic fault events. This allows the subsequent analysis of program-failure causes and the reliability modeling of computer programs. Numerical examples are given to support the comparison of the reviewed techniques. The models can be used to evaluate numerical values of program reliability in a relatively simple way. The models deal with program reliability for a single run, which seems more practical and straightforward than dealing with distributions as for hardware systems. Evaluations obtained by using the models correspond to those reported in the literature; however, the authors' procedures are computationally simpler.

Journal ArticleDOI
J.A. Nachlas, S.R. Loney, B.A. Binney
TL;DR: In this paper, two cost models (for perfect and imperfect testing) represent the consequences of possible test realizations, and the probability that any particular component is responsible for the failure is derived and used as a basis for the two models.
Abstract: The selection of efficient testing strategies for repairable systems composed of components arranged in series is considered. Two cost models (for perfect and imperfect testing) represent the consequences of possible test realizations. The probability that any particular component is responsible for the failure is derived and used as a basis for the two models. The model for perfect testing is solved exactly. In the optimal perfect-test sequence the components are tested in decreasing order of the ratio of: (probability that the component is responsible for the system failure) to (component test cost). For imperfect testing, possible diagnostic errors are included in a model for which two heuristic solution strategies are provided. The model represents the consequences of both false-positive and false-negative component-test outcomes. The heuristic strategies yield efficient test sequences. Under reasonable assumptions, the second heuristic strategy is guaranteed to locate the optimal test sequence. The model can quantitatively evaluate the benefits of test-accuracy enhancement plans. These models and algorithms provide convenient methods for selecting efficient test sequences. This is illustrated by representative examples.
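
The perfect-testing result is easy to verify numerically: the sketch below computes the expected test cost of the decreasing p/c ordering and confirms it by brute force over all orderings for a small example with made-up probabilities and costs.

```python
from itertools import permutations

def expected_test_cost(order, p, c):
    """Expected cost of testing components one at a time (perfect tests) until the
    single faulty component is found, where p[i] is the probability that component i
    caused the failure and c[i] is its test cost."""
    cost = 0.0
    spent = 0.0
    for i in order:
        spent += c[i]
        cost += p[i] * spent
    return cost

p = [0.50, 0.30, 0.15, 0.05]          # assumed per-component fault probabilities
c = [4.0, 1.0, 2.0, 0.5]              # assumed test costs

ratio_order = sorted(range(len(p)), key=lambda i: p[i] / c[i], reverse=True)
best_order = min(permutations(range(len(p))), key=lambda o: expected_test_cost(o, p, c))
print("decreasing p/c order:", ratio_order, expected_test_cost(ratio_order, p, c))
print("brute-force optimum :", list(best_order), expected_test_cost(best_order, p, c))
```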

Journal ArticleDOI
TL;DR: In this paper, a direct exact method for computing the reliability for a consecutive-k-out-of-n:F system with homogeneous Markov dependence is presented, where the probability that any component i fails depends upon, and only upon, the state of the component (i-1).
Abstract: A direct, exact method for computing the reliability of a consecutive-k-out-of-n:F system with homogeneous Markov dependence is presented. This method calculates the reliability for a consecutive-k-out-of-n:F system where the probability that any component i fails depends upon, and only upon, the state of component (i-1).
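
A sketch of one such computation is given below: a dynamic program over the length of the trailing failure run, written from the Markov-dependence description in the abstract rather than from the paper's own formulas. The probabilities q1, q_wf, and q_ff are assumed values.

```python
def consec_k_markov_reliability(n, k, q1, q_wf, q_ff):
    """Reliability of a linear consecutive-k-out-of-n:F system whose components
    form a homogeneous Markov chain: component 1 fails with probability q1, and
    component i fails with probability q_ff if component i-1 failed, q_wf otherwise.
    state[r] = P(no run of k failures yet and the trailing failure run has length r)."""
    if k == 1:
        # the system fails as soon as any component fails
        return (1.0 - q1) * (1.0 - q_wf) ** (n - 1)
    state = [0.0] * k
    state[0], state[1] = 1.0 - q1, q1
    for _ in range(n - 1):
        new = [0.0] * k
        for r, prob in enumerate(state):
            q_fail = q_ff if r >= 1 else q_wf        # depends only on component i-1
            new[0] += prob * (1.0 - q_fail)
            if r + 1 < k:                            # a run reaching k is a system failure
                new[r + 1] += prob * q_fail
        state = new
    return sum(state)

print(consec_k_markov_reliability(n=10, k=3, q1=0.2, q_wf=0.1, q_ff=0.4))
```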

Journal ArticleDOI
TL;DR: In this article, a simulation-based fault-injection methodology for validating fault-tolerant microprocessor architectures is described, which uses mixed-mode simulation (electrical/logic analysis), and injects transient errors in run-time to assess the resulting fault-impact.
Abstract: A simulation-based fault-injection methodology for validating fault-tolerant microprocessor architectures is described. The approach uses mixed-mode simulation (electrical/logic analysis), and injects transient errors in run-time to assess the resulting fault-impact. To exemplify the methodology, a fault-tolerant architecture which models the digital aspects of a dual-channel, real-time jet-engine controller is used. The level of effectiveness of the dual configuration with respect to single and multiple transients is measured. The results indicate 100% coverage of single transients. Approximately 12% of the multiple transients affect both channels; none result in controller failure since two additional levels of redundancy exist.

Journal ArticleDOI
TL;DR: The problem of how to choose components for parallel redundancy is studied, and some results are given; some examples are presented to illustrate the approach.
Abstract: Adding parallel redundancy to different components generally yields different system reliability improvements. The effect of such parallel redundancy upon system reliability when applied at various places and in various systems is investigated. The problem of how to choose components for parallel redundancy is studied, and some results are given. Some examples are presented to illustrate the approach.
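
For the special case of a series system the choice is easy to illustrate: duplicating component i raises its reliability from p_i to 1 − (1 − p_i)², and the sketch below compares the resulting system-reliability gains. The component reliabilities are assumed values, and the example is only a special case of the general problem treated in the paper.

```python
from math import prod

def series_reliability(p):
    return prod(p)

def gain_from_duplicating(p, i):
    """System-reliability gain when component i of a series system is backed up
    by one identical unit in parallel."""
    q = list(p)
    q[i] = 1.0 - (1.0 - p[i]) ** 2
    return series_reliability(q) - series_reliability(p)

p = [0.99, 0.90, 0.95, 0.80]                       # assumed component reliabilities
gains = [gain_from_duplicating(p, i) for i in range(len(p))]
best = max(range(len(p)), key=gains.__getitem__)
print("gains:", [round(g, 4) for g in gains])
print("duplicate component", best, "(the least reliable one in a series system)")
```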

Journal ArticleDOI
TL;DR: In this paper, the hazard rate for the Gaussian random variable was shown to lie in a finite interval for all positive values of the standard deviation of the associated normal random variable.
Abstract: It is argued that plots of the hazard rate for the lognormal random variable which have appeared in some recent literature are incorrect and/or misleading; the hazard rate always begins at zero, rises to a maximum, then decreases very slowly to zero. An equation for the location of the maximum of the hazard rate is derived. The maximum lies in a finite interval for all positive values of the standard deviation of the associated normal random variable. Approximations that can be used to compute the hazard rate for parameter values outside of the usual range in the tables associated with the normal (Gaussian) random variable are presented.
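
The qualitative behavior is easy to reproduce numerically; the sketch below uses scipy to evaluate the lognormal hazard rate and locate its maximum for one assumed parameter choice (it does not reproduce the paper's equation for the maximum).

```python
from scipy.stats import lognorm
from scipy.optimize import minimize_scalar

sigma, scale = 0.8, 1.0                  # sigma is the std dev of the underlying normal

def hazard(t):
    """Lognormal hazard rate h(t) = f(t) / (1 - F(t))."""
    return lognorm.pdf(t, s=sigma, scale=scale) / lognorm.sf(t, s=sigma, scale=scale)

# The hazard is near zero for small t, rises to a single maximum, then decays slowly.
res = minimize_scalar(lambda t: -hazard(t), bounds=(1e-6, 50.0), method="bounded")
t_max = res.x
print("hazard at small t:", hazard(1e-3))
print("maximum hazard:", hazard(t_max), "at t =", round(t_max, 3))
print("hazard far in the tail:", hazard(200.0))
```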

Journal ArticleDOI
TL;DR: In this article, a method to calculate the optimal values of reliability indices for a load point in an electrical distribution system is presented, which is formulated as an optimization problem and solved by the gradient projection method.
Abstract: A method to calculate the optimal values of reliability indices for a load point in an electrical distribution system is presented. The problem is formulated as an optimization problem and solved by the gradient projection method; the objective is to minimize interruption cost. The algorithm is very useful and powerful for extending the existing network and planning new networks. It has been tested on a practical system (the Port-Fouad power network), and the results are discussed. Once the optimal reliability indices are determined, modification of the system by equipment replacement and future system planning can be done in such a way that the interruption cost is minimized.