
Showing papers in "IEEE Transactions on Reliability in 1990"


Journal ArticleDOI
TL;DR: A census of customer outages reported to Tandem indicates that software is now the major source of reported outages, followed by system operations, a dramatic shift from the statistics for 1985.
Abstract: A census of customer outages reported to Tandem has been taken; it shows a clear improvement in the reliability of hardware and maintenance. It indicates that software is now the major source of reported outages (62%), followed by system operations (15%). This is a dramatic shift from the statistics for 1985. Even after discounting systematic underreporting of operations and environmental outages, the conclusion is clear: hardware faults and hardware maintenance are no longer a major source of outages. As the other components of the system become increasingly reliable, software necessarily becomes the dominant cause of outages. Achieving higher availability requires improvement in software quality and software fault tolerance, simpler operations, and tolerance of operational faults.

347 citations


Journal ArticleDOI
TL;DR: The authors present the results of an analysis that demonstrates that the log is composed of at least two error processes: transient and intermittent, and it is shown that the DFT can extract intermittent errors from the error log and uses only one fifth of the error-log entry points required for failure prediction.
Abstract: Most error-log analysis studies perform a statistical fit to the data assuming a single underlying error process. The authors present the results of an analysis that demonstrates that the log is composed of at least two error processes: transient and intermittent. The mixing of data from multiple processes requires many more events to verify a hypothesis using traditional statistical analysis. Based on the shape of the interarrival time function of the intermittent errors observed from actual error logs, a failure-prediction heuristic, the dispersion frame technique (DFT), is developed. The DFT was implemented in a distributed system for the campus-wide Andrew file system at Carnegie Mellon University. Data collected from 13 file servers over a 22-month period were analyzed using both the DFT and conventional statistical methods. It is shown that the DFT can extract intermittent errors from the error log and uses only one fifth of the error-log entry points required by statistical methods for failure prediction. The DFT achieved a 93.7% success rate in predicting failures in both electromechanical and electronic devices.

216 citations


Journal ArticleDOI
TL;DR: The authors derive reliability functions and mean time to failure of four different memory systems subject to transient errors at exponentially distributed arrival times and derive easy-to-use expressions for MTTF of memories.
Abstract: The authors analyze the problem of transient-error recovery in fault-tolerant memory systems, using a scrubbing technique. This technique is based on single-error-correction and double-error-detection (SEC-DED) codes. When a single error is detected in a memory word, the error is corrected and the word is rewritten in its original location. Two models are discussed: (1) exponentially distributed scrubbing, where a memory word is assumed to be checked in an exponentially distributed time period, and (2) deterministic scrubbing, where a memory word is checked periodically. Reliability and mean-time-to-failure (MTTF) equations are derived and estimated. The results of the scrubbing techniques are compared with those of memory systems without redundancies and with only SEC-DED codes. A major contribution of the analysis is easy-to-use expressions for MTTF of memories. The authors derive reliability functions and mean time to failure of four different memory systems subject to transient errors at exponentially distributed arrival times.

202 citations
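
As an illustration of the deterministic-scrubbing idea, the following Monte Carlo sketch (not the paper's closed-form MTTF expressions) estimates memory MTTF with and without periodic scrubbing; the word count, per-word error rate, and scrub periods are assumed values chosen only for demonstration.

```python
import math
import random

def memory_ttf(n_words, word_error_rate, scrub_period=None):
    """One sample of the time until any SEC-DED word holds two errors at once.
    Each word sees Poisson error arrivals; deterministic scrubbing clears a
    single pending error at every multiple of scrub_period (None = no scrubbing),
    so a word fails only if two errors land in the same scrub interval."""
    t_fail = math.inf
    for _ in range(n_words):
        t, last_interval = 0.0, None
        while True:
            t += random.expovariate(word_error_rate)
            if t >= t_fail:                    # another word already failed sooner
                break
            interval = int(t // scrub_period) if scrub_period is not None else 0
            if last_interval == interval:      # second error before the next scrub
                t_fail = t
                break
            last_interval = interval           # the earlier error will be scrubbed away
    return t_fail

random.seed(1)
TRIALS, WORDS, RATE = 100, 200, 1e-3           # illustrative parameters only
for period in (None, 50.0, 5.0):
    mttf = sum(memory_ttf(WORDS, RATE, period) for _ in range(TRIALS)) / TRIALS
    print(f"scrub period {period}: estimated MTTF ~ {mttf:.0f} time units")
```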


Journal ArticleDOI
TL;DR: In this paper, the relationship between the consecutive-k-out-of-n:F system and the consecutive-k-out-of-n:G system is studied, theorems for such systems are developed, and available results for one type of system are applied to the other; a case study illustrates reliability analysis and optimal design of a train operation system.
Abstract: A consecutive-k-out-of-n:F (consecutive-k-out-of-n:G) system consists of an ordered sequence of n components such that the system is failed (good) if and only if at least k consecutive components in the system are failed (good). In the present work, the relationship between the consecutive-k-out-of-n:F system and the consecutive-k-out-of-n:G system is studied, theorems for such systems are developed, and available results for one type of system are applied to the other. The topics include system reliability, reliability bounds, component reliability importance, and optimal system design. A case study illustrates reliability analysis and optimal design of a train operation system. An optimal configuration rule is suggested by use of the Birnbaum importance index.

141 citations
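
One form of the F/G relationship can be checked numerically. The sketch below assumes s-independent, identically reliable components; the F-system value comes from a standard run-length dynamic program, and the G-system value follows from the duality between "k consecutive failed" and "k consecutive good" events.

```python
def consec_k_F_reliability(n, k, p):
    """Reliability of a linear consecutive-k-out-of-n:F system with i.i.d.
    components of reliability p: the system works iff no k consecutive failures.
    state[r] = P(no failed run of length k yet, trailing failed run has length r)."""
    q = 1.0 - p
    state = [0.0] * k
    state[0] = 1.0
    for _ in range(n):
        new = [0.0] * k
        new[0] = sum(state) * p          # a working component resets the run
        for r in range(1, k):
            new[r] = state[r - 1] * q    # one more consecutive failure
        state = new                      # runs reaching length k are dropped (system failed)
    return sum(state)

def consec_k_G_reliability(n, k, p):
    """Consecutive-k-out-of-n:G system: good iff some k consecutive components are good.
    By duality this equals the failure probability of an F system whose components
    fail with probability p (i.e., have reliability 1 - p)."""
    return 1.0 - consec_k_F_reliability(n, k, 1.0 - p)

if __name__ == "__main__":
    n, k, p = 10, 3, 0.9
    print("F-system reliability:", consec_k_F_reliability(n, k, p))
    print("G-system reliability:", consec_k_G_reliability(n, k, p))
```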


Journal ArticleDOI
TL;DR: In a preliminary effort to understand and catalog how networks behave under various conditions, two cases of anomalous behavior are analyzed in detail.
Abstract: Fault detection and diagnosis depend critically on good fault definitions, but the dynamic, noisy, and nonstationary character of networks makes it hard to define what a fault is in a network environment. The authors take the position that a fault or failure is a violation of expectations. In accordance with empirically based expectations, operating behaviors of networks (and other devices) can be classified as being either normal or anomalous. Because network failures most frequently manifest themselves as performance degradations or deviations from expected behavior, periods of anomalous performance can be attributed to causes assignable as network faults. The half-year case study presented used a system in which observations of distributed-computing network behavior were automatically and systematically classified as normal or anomalous. Anomalous behaviors were traced to faulty conditions. In a preliminary effort to understand and catalog how networks behave under various conditions, two cases of anomalous behavior are analyzed in detail. Examples are taken from the distributed file-system network at Carnegie Mellon University.

114 citations


Journal ArticleDOI
TL;DR: A software reliability growth model based on a nonhomogeneous Poisson process is introduced that describes the time-dependent behavior of software errors detected and testing-resource expenditures spent during the testing.
Abstract: Two kinds of software-testing management problems are considered: a testing-resource allocation problem, how to best use specified testing resources during module testing, and a testing-resource control problem, how to spend the allocated testing-resource expenditures during module testing. A software reliability growth model based on a nonhomogeneous Poisson process is introduced. The model describes the time-dependent behavior of software errors detected and testing-resource expenditures spent during the testing. The optimal allocation and control of testing resources among software modules can improve reliability and shorten the testing stage. Based on the model, numerical examples of these two software-testing management problems are presented.

108 citations
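
The paper's own allocation model is not reproduced here; the following hedged sketch illustrates the generic version of the problem, minimizing the expected number of remaining faults sum_i a_i*exp(-b_i*x_i) across modules subject to a fixed testing-resource budget, with the Lagrange multiplier found by bisection. The module parameters a_i, b_i and the budget are made-up values.

```python
import math

def allocate_testing_resource(a, b, total):
    """Split a total testing-resource budget across modules so that the expected
    number of remaining faults sum_i a[i]*exp(-b[i]*x[i]) is minimized.
    Classical Lagrangian solution: x_i = max(0, ln(a_i*b_i/lam)/b_i), with the
    multiplier lam found by bisection so that the budget is (almost) exactly used."""
    def spent(lam):
        return sum(max(0.0, math.log(a_i * b_i / lam) / b_i) for a_i, b_i in zip(a, b))

    lo, hi = 1e-12, max(a_i * b_i for a_i, b_i in zip(a, b))   # spent(hi) == 0
    for _ in range(200):                                        # geometric bisection on lam
        mid = math.sqrt(lo * hi)
        if spent(mid) > total:
            lo = mid
        else:
            hi = mid
    lam = hi
    return [max(0.0, math.log(a_i * b_i / lam) / b_i) for a_i, b_i in zip(a, b)]

# three modules: initial fault content a_i and detectability b_i are assumed numbers
a, b = [80.0, 50.0, 20.0], [0.010, 0.020, 0.015]
x = allocate_testing_resource(a, b, total=300.0)
print("allocation:", [round(v, 1) for v in x])
print("expected remaining faults:",
      sum(ai * math.exp(-bi * xi) for ai, bi, xi in zip(a, b, x)))
```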


Journal ArticleDOI
TL;DR: The author's method requires considerably less computer time to obtain results comparable to those of the other methods, and it has a low degree of programming difficulty.
Abstract: A computationally simple approach for reliability-redundancy optimization problems is proposed. It is compared by means of a simulation study with two existing approaches: (1) the LMBB method, which incorporates the Lagrange multiplier technique in conjunction with the Kuhn-Tucker condition and the branch-and-bound method, and (2) sequential search techniques in combination with heuristic redundancy allocation methods, including an extension of combinations of four heuristics and two search techniques. Using 100 sets of randomly generated test problems with nonlinear constraints for both series systems and a complex system, the authors measured and evaluated the performance of these approaches in terms of optimality rate, error rate, and execution time. In general, the author's method requires considerably less computer time to obtain results comparable to those of the other methods, and it has a low degree of programming difficulty.

98 citations


Journal ArticleDOI
TL;DR: In this article, a two-dimensional version of the consecutive k-out-of-n:F model is considered and bounds on system failure probabilities are determined by comparison with the usual one-dimensional model.
Abstract: A two-dimensional version of the consecutive-k-out-of-n:F model is considered. Bounds on system failure probabilities are determined by comparison with the usual one-dimensional model. Failure probabilities are determined by simulation for a variety of values of k and n.

96 citations
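
A small Monte Carlo sketch of such a simulation is shown below, assuming the square-grid variant in which the system fails if and only if some k-by-k block of components has all failed; the grid size, k, and component failure probabilities are illustrative only.

```python
import numpy as np

def two_dim_consec_k_failure_prob(n, k, q, trials=10000, seed=0):
    """Monte Carlo estimate of the failure probability of a two-dimensional
    consecutive-k-out-of-n:F system: an n-by-n grid of i.i.d. components, each
    failed with probability q, where the system fails iff some k-by-k square
    consists entirely of failed components."""
    rng = np.random.default_rng(seed)
    failures = 0
    for _ in range(trials):
        failed = rng.random((n, n)) < q
        # sum every k-row window, then every k-column window, to get k*k block sums
        rows = np.lib.stride_tricks.sliding_window_view(failed, k, axis=0).sum(axis=-1)
        blocks = np.lib.stride_tricks.sliding_window_view(rows, k, axis=1).sum(axis=-1)
        failures += bool((blocks == k * k).any())
    return failures / trials

for q in (0.05, 0.10, 0.20):
    print(q, two_dim_consec_k_failure_prob(n=10, k=2, q=q))
```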


Journal ArticleDOI
TL;DR: A minor modification of the ALR algorithm called the Abraham-Locks-Wilson (ALW) method is described, an alternative method of ordering paths and terms that obtains a shorter disjoint system formula on a test example than any previous SDP method and allows small computational savings in processing large paths of complex networks.
Abstract: The Abraham-Locks-revised (ALR) sum-of-disjoint products (SDP) algorithm is an efficient method for obtaining a system reliability formula. The author describes a minor modification of the ALR algorithm called the Abraham-Locks-Wilson (ALW) method. The new feature is an alternative method of ordering paths and terms. ALW obtains a shorter disjoint system formula on a test example than any previous SDP method and allows small computational savings in processing large paths of complex networks. As there are different ways to obtain a reliability formula, it is useful to use an approach which yields the smallest formula relative to the computational effort expended. The extra effort in ordering the terms should be reasonably small and usually leads to improved efficiency in the later stages of the algorithm. ALW allows the analyst to operate in a more efficient way on many problems, particularly if the overlap ordering is used in the early stages of processing but is probably ignored for terms that contain a majority of the Boolean variables.

81 citations


Journal ArticleDOI
A. Gandini
TL;DR: In this paper, the authors proposed a method for sensitivity analysis based on generalized perturbation theory (GPT) methodology, widely adopted in reactor-physics studies, and the concept of importance of a state in the Markov model representation of systems is introduced.
Abstract: After reviewing various importance concepts adopted in reliability, the authors propose a method for sensitivity analysis. The method uses the heuristically based generalized perturbation theory (GPT) methodology, widely adopted in reactor-physics studies. The concept of importance of a state in the Markov model representation of systems is introduced. The resulting formulations apply to any response of interest in reliability analysis. The relationship between the GPT method and Birnbaum importance is given.

77 citations


Journal ArticleDOI
TL;DR: In this article, the failure probability of m-consecutive-k-out-of-n:F systems was studied and three theorems concerning such systems were proved.
Abstract: An m-consecutive-k-out-of-n:F system consists of n components ordered on a line; the system fails if and only if there are at least m nonoverlapping runs of k consecutive failed components. Three theorems concerning such systems are stated and proved. Theorem one is a recursive formula to compute the failure probability of such a system. Theorem two is an exact formula for the failure probability. Theorem three is a limit theorem for the failure probability.
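
For small n, the failure probability can be checked by brute-force enumeration (a sanity check under assumed parameters, not the paper's recursive or exact formulas), taking s-independent components with a common failure probability q.

```python
from itertools import product

def m_consec_k_out_of_n_failure_prob(m, k, n, q):
    """Exact (brute-force) failure probability of an m-consecutive-k-out-of-n:F
    system with i.i.d. components failing with probability q: the system fails iff
    the line of n components contains at least m nonoverlapping runs of k failures.
    Only practical for small n; a recursive formula scales far better."""
    total = 0.0
    for states in product((0, 1), repeat=n):      # 1 = failed component
        # greedy left-to-right count of nonoverlapping failed runs of length k
        runs = run = 0
        for s in states:
            run = run + 1 if s else 0
            if run == k:
                runs += 1
                run = 0                           # counted runs must not overlap
        if runs >= m:
            f = sum(states)
            total += q ** f * (1 - q) ** (n - f)
    return total

print(m_consec_k_out_of_n_failure_prob(m=2, k=2, n=8, q=0.3))
```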

Journal ArticleDOI
TL;DR: In this article, the authors present a simple, easy-to-understand approximation to the renewal function that is easy to implement on a personal computer and works very well with one term if not too much accuracy is required.
Abstract: The authors present a simple, easy-to-understand approximation to the renewal function that is easy to implement on a personal computer. The key idea is that, for small values of time, the renewal function is almost equal to the cumulative distribution function of the interrenewal time, whereas for larger values of time an asymptotic expansion depending only on the first and second moments of the interrenewal time can be used. The relative error is typically smaller than a few percent for Weibull interrenewal times. The simple approximation method works very well with one term if not too much accuracy is required (e.g. in the block replacement problem) or if the interrenewal (failure) distribution is not exactly known (e.g. only the first two moments are known). Although the accuracy of the simple approximation can be improved by increasing the number of terms, this strategy is not advocated since speed and simplicity are lost. If high accuracy is required, it is better to use another approximating method (e.g. power series expansion or cubic splines method).
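
The two regimes can be illustrated for Weibull interrenewal times (a sketch under the stated assumptions, not the paper's exact switching rule): the cdf approximation for small t and the classical asymptotic expansion M(t) ≈ t/μ + σ²/(2μ²) − 1/2 for large t, compared against a Monte Carlo estimate.

```python
import math
import random

def weibull_moments(shape, scale):
    mu = scale * math.gamma(1.0 + 1.0 / shape)
    m2 = scale ** 2 * math.gamma(1.0 + 2.0 / shape)
    return mu, m2 - mu ** 2                      # mean and variance

def renewal_small_t(t, shape, scale):
    """Small-t regime: M(t) is close to the interrenewal cdf F(t)."""
    return 1.0 - math.exp(-((t / scale) ** shape))

def renewal_large_t(t, shape, scale):
    """Large-t regime: asymptotic expansion M(t) ~ t/mu + var/(2 mu^2) - 1/2."""
    mu, var = weibull_moments(shape, scale)
    return t / mu + var / (2.0 * mu ** 2) - 0.5

def renewal_simulated(t, shape, scale, trials=20000, seed=2):
    """Monte Carlo reference value: expected number of renewals in (0, t]."""
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        s = rng.weibullvariate(scale, shape)
        while s <= t:
            count += 1
            s += rng.weibullvariate(scale, shape)
    return count / trials

shape, scale = 2.0, 1.0                          # illustrative Weibull parameters
for t in (0.3, 1.0, 5.0):
    print(t, renewal_small_t(t, shape, scale),
          renewal_large_t(t, shape, scale),
          renewal_simulated(t, shape, scale))
```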

Journal ArticleDOI
TL;DR: A general theory of software reliability that proposes that software failure rates are the product of the software average error size, apparent error density, and workload is developed and models of these factors that are consistent with the assumptions of classical software-reliability models are developed.
Abstract: A general theory of software reliability that proposes that software failure rates are the product of the software average error size, apparent error density, and workload is developed. Models of these factors that are consistent with the assumptions of classical software-reliability models are developed. The linear, geometric, and Rayleigh models are special cases of the general theory. Linear reliability models result from assumptions that the average size of remaining errors and the workload are constant and that the apparent error density equals the real error density. Geometric reliability models differ from linear models in assuming that the average error size decreases geometrically as errors are corrected, whereas the Rayleigh model assumes that the average size of remaining errors increases linearly with time. The theory shows that the abstract proportionality constants of classical models are composed of more fundamental and more intuitively meaningful factors, namely, the initial values of average size of remaining errors, real error density, workload, and error content. It is shown how the assumed behavior of the reliability primitives of software (average error size, error density, and workload) is modeled to accommodate diverse reliability factors.

Journal ArticleDOI
TL;DR: The author presents an overview of a methodology for the automated generation of fault trees for electrical/electronic circuits from a representation of a schematic diagram that uses backtracking.
Abstract: The author presents an overview of a methodology for the automated generation of fault trees for electrical/electronic circuits from a representation of a schematic diagram. Existing computer programs for the generation of fault trees are briefly discussed, and their deficiencies are indicated. The approach presented here is quantitative and uses backtracking. It is illustrated by an example. A prototype computer program has been written to implement the methodology for DC circuits.

Journal ArticleDOI
TL;DR: It is concluded that the fault-injection test sequence has evidenced the limited performance of the self-checking mechanisms implemented on the tested NAC (network attachment controller) and justified the need for the improved self- checking mechanisms implemented in an enhanced NAC architecture using duplicated circuitry.
Abstract: The authors present a study of the validation of a dependable local area network providing multipoint communication services based on an atomic multicast protocol. This protocol is implemented in specialized communication servers that exhibit the fail-silent property, i.e. a kind of halt-on-failure behavior enforced by self-checking hardware. The tests that have been carried out utilize physical fault injection and have two objectives: (1) to estimate the coverage of the self-checking mechanisms of the communication servers, and (2) to test the properties that characterize the service provided by the atomic multicast protocol in the presence of faults. The testbed that has been developed to carry out the fault-injection experiments is described, and the major results are presented and analyzed. It is concluded that the fault-injection test sequence has evidenced the limited performance of the self-checking mechanisms implemented on the tested NAC (network attachment controller) and justified (especially for the main board) the need for the improved self-checking mechanisms implemented in an enhanced NAC architecture using duplicated circuitry.

Journal ArticleDOI
TL;DR: For censored Weibull regression data arising from typical accelerated life tests (ALTs), the performance of small-sample normal-theory confidence intervals is summarized by three points: (1) they have highly asymmetric error rates; (2) they can be extremely anti-conservative; and (3) these effects worsen when higher confidence levels are used.
Abstract: For censored Weibull regression data arising from typical accelerated life tests (ALTs), the performance of small-sample normal-theory confidence intervals is summarized by three points: (1) they have highly asymmetric error rates; (2) they can be extremely anti-conservative; and (3) these effects worsen when higher confidence levels are used. Likelihood-ratio-based confidence intervals have much more symmetric error rates which are not as extremely anti-conservative as normal-theory intervals can be. For typical ALTs, likelihood-ratio-based confidence intervals are better than those based on asymptotic normal theory. Likelihood-ratio-based confidence intervals require more computation than intervals based on the asymptotic normality of the maximum-likelihood estimators. The resource spent on computing is, however, usually very small compared to the other costs involved in an ALT.

Journal ArticleDOI
D. Rosenthal, B.C. Wadell
TL;DR: In this article, a quantitative approach is proposed for setting built-in test (BIT) measurement limits and this method is applied to the specific case of a constant failure rate system whose BITE measurements are corrupted by Gaussian noise.
Abstract: Failures detected by built-in test equipment (BITE) occur because of BITE measurement noise or bias as well as actual hardware failures. A quantitative approach is proposed for setting built-in test (BIT) measurement limits, and this method is applied to the specific case of a constant-failure-rate system whose BITE measurements are corrupted by Gaussian noise. Guidelines for setting BIT measurement limits are presented for a range of system MTBF (mean time between failures) values and BIT run times. The technique was applied to a BIT for an analog VLSI test system with excellent results, showing it to be a powerful tool for predicting tests with the potential for false alarms. It was discovered that, for this test case, false alarms are avoidable.
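
A hedged sketch of the underlying trade-off is given below: a healthy unit's BITE reading is zero-mean Gaussian noise, a failed unit's reading is shifted by a known amount, and the hardware follows a constant failure rate. All numerical values are assumptions for illustration, not the paper's guidelines.

```python
import math

def phi(x):                       # standard normal cdf
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bit_limit_tradeoff(limit, sigma, fault_shift, mtbf, run_time):
    """Per-run false-alarm and missed-detection probabilities for a BIT that
    declares a failure when |measurement| > limit.  A healthy unit reads 0 plus
    Gaussian noise (std dev sigma); a failed unit's reading is shifted by
    fault_shift.  The chance that a fault is actually present during the run
    follows a constant failure rate (exponential) with the given MTBF."""
    p_false_alarm = 2.0 * (1.0 - phi(limit / sigma))
    p_missed = phi((limit - fault_shift) / sigma) - phi((-limit - fault_shift) / sigma)
    p_fault_present = 1.0 - math.exp(-run_time / mtbf)
    p_any_alarm = ((1.0 - p_fault_present) * p_false_alarm
                   + p_fault_present * (1.0 - p_missed))
    return p_false_alarm, p_missed, p_any_alarm

for limit in (2.0, 3.0, 4.0, 5.0):
    fa, miss, alarm = bit_limit_tradeoff(limit, sigma=1.0, fault_shift=6.0,
                                         mtbf=2000.0, run_time=1.0)
    print(f"limit={limit}: P(false alarm)={fa:.2e}  P(miss | fault)={miss:.2e}  "
          f"P(alarm)={alarm:.2e}")
```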

Journal ArticleDOI
TL;DR: It is shown how to schedule checkpoints to minimize the mean total time to finish a given job and applications to the M/G/1 queuing system are touched on.
Abstract: At checkpoints during the operation of a computer, the state of the system is saved. Whenever a machine fails, it is repaired and then reset to the state saved at the latest checkpoint. In the present work, save times are known constants and repair times are random variables; failures are the epochs of a given renewal process. In scheduling the checkpoints, the cost of saves must be traded off against the cost of work lost when the computer fails. It is shown how to schedule checkpoints to minimize the mean total time to finish a given job. Similar optimization results are obtained for the tails of the distribution of the finishing time. Two variants of the basic model are considered. In one, the computer receives maintenance during each save; in the other it does not. Applications to the M/G/1 queuing system are touched on.
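
The sketch below is not the paper's renewal-process optimization; it uses a simpler first-order model with Poisson failures and a constant save time s, where the overhead per unit of useful work is roughly s/T + λT/2 and the optimal checkpoint interval is near sqrt(2s/λ). The save time and failure rate are assumed values.

```python
import math

def overhead(T, save_time, failure_rate):
    """Approximate fraction of extra time per unit of useful work when checkpoints
    are taken every T units: the cost of saves (save_time/T) plus the expected
    rework after a failure (about half an interval, failure_rate*T/2)."""
    return save_time / T + failure_rate * T / 2.0

save_time, failure_rate = 0.05, 1.0 / 50.0            # assumed save cost and failure rate
T_star = math.sqrt(2.0 * save_time / failure_rate)    # first-order optimum

# crude numerical check: scan a grid of checkpoint intervals
grid = [0.1 * i for i in range(1, 200)]
T_best = min(grid, key=lambda T: overhead(T, save_time, failure_rate))
print("analytic optimum:", round(T_star, 2), "grid optimum:", round(T_best, 2))
```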

Journal ArticleDOI
Andrew L. Reibman
TL;DR: In this paper, performability modeling, the combined analysis of reliability and performance, is introduced; some examples of applications where performance and reliability need to be modeled together are given; a strategy for modeling the effect of reliability on performance is outlined and metrics that help quantify this effect are discussed.
Abstract: In many high-reliability systems, subsystem or component failures that do not cause a system failure can still degrade system performance. When modeling such systems, ignoring the effect of reliability on performance can lead to incomplete or inaccurate results. In the present work, performability modeling, the combined analysis of reliability and performance, is introduced; some examples of applications where performance and reliability need to be modeled together are given; a strategy for modeling the effect of reliability on performance is outlined and metrics that help quantify this effect are discussed. Some mathematical models for performability are introduced and an example is used to illustrate how such models can be applied.
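
A minimal Markov-reward sketch of performability is shown below for a hypothetical two-processor system: each structure state carries a performance level (reward), and the expected steady-state reward combines reliability and performance in a single figure. The rates and reward levels are assumed values.

```python
import numpy as np

# A two-unit system: state 2 = both processors up, 1 = one up (degraded), 0 = down.
lam, mu = 0.01, 0.5                      # per-unit failure and repair rates (assumed)
reward = {2: 1.0, 1: 0.6, 0: 0.0}        # relative performance level in each state

# Generator matrix of the continuous-time Markov chain (rows sum to zero);
# rows/columns are ordered as states 2, 1, 0, with a single repair facility.
Q = np.array([[-2 * lam,  2 * lam,      0.0],
              [      mu, -(mu + lam),   lam],
              [     0.0,  mu,           -mu]])

# Steady-state probabilities: pi Q = 0 with pi summing to 1 (solved by least squares).
A = np.vstack([Q.T, np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

states = [2, 1, 0]
performability = sum(p * reward[s] for p, s in zip(pi, states))
print("steady-state probabilities:", pi.round(4))
print("expected performance level:", round(performability, 4))
print("plain availability (any processor up):", round(pi[0] + pi[1], 4))
```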

Journal ArticleDOI
Attila Csenki
TL;DR: The concepts of Bayes prediction analysis are used to obtain predictive distributions of the next time to failure of software when its past failure behavior is known and can show an improved predictive performance for some data sets even when compared with some more sophisticated software-reliability models.
Abstract: The concepts of Bayes prediction analysis are used to obtain predictive distributions of the next time to failure of software when its past failure behavior is known. The technique is applied to the Jelinski-Moranda software-reliability model, which can then show improved predictive performance for some data sets, even when compared with some more sophisticated software-reliability models. A Bayes software-reliability model is presented which can be applied to obtain the PDF (probability density function) and CDF (cumulative distribution function) of the next time to failure for all testing protocols. The number of initial faults and the per-fault failure rate are assumed to be s-independent and to be Poisson and gamma distributed, respectively. For certain data sets, the technique yields better predictions than some alternative methods if the prequential-likelihood and U-plot criteria are adopted.

Journal ArticleDOI
TL;DR: An independent N-version programming reliability model which distinguishes between correctness and agreement is proposed for treating small output spaces, and the reciprocal of the cardinality of output space is a lower bound on the average reliability of fault-tolerant system versions below which the system reliability begins to deteriorate as more versions are added.
Abstract: Under a voting strategy in a fault-tolerant software system there is a difference between correctness and agreement. An independent N-version programming reliability model which distinguishes between correctness and agreement is proposed for treating small output spaces. An alternative voting strategy, consensus voting, is used to treat cases when there can be agreement among incorrect outputs, a case which can occur with small output spaces. The consensus voting strategy automatically adapts the voting to various version reliability and output-space cardinality characteristics. The majority-voting strategy provides reliability which is a lower bound, and the 2-out-of-n voting strategy provides reliability which is an upper bound, on the reliability by consensus voting. The reciprocal of the cardinality of output space is a lower bound on the average reliability of fault-tolerant system versions below which the system reliability begins to deteriorate as more versions are added.
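
The difference between majority and consensus voting can be illustrated by simulation. The sketch below assumes a simplified version model, with each version correct with the same probability and wrong outputs drawn uniformly from the remaining values of a small output space, so that incorrect versions can agree by chance; it is not the paper's full model.

```python
import random
from collections import Counter

def voting_success_rates(n_versions, p_correct, space_size, trials=50000, seed=3):
    """Monte Carlo comparison of majority voting and consensus voting for an
    N-version system with a small output space.  Each version is correct with
    probability p_correct; an incorrect version outputs one of the other
    space_size - 1 values uniformly, so wrong versions can agree by chance."""
    rng = random.Random(seed)
    wins = {"majority": 0, "consensus": 0}
    for _ in range(trials):
        outputs = [0 if rng.random() < p_correct else rng.randrange(1, space_size)
                   for _ in range(n_versions)]          # 0 denotes the correct output
        counts = Counter(outputs)
        if counts[0] > n_versions // 2:                 # strict majority on the correct value
            wins["majority"] += 1
        top = max(counts.values())
        winners = [v for v, c in counts.items() if c == top]
        if rng.choice(winners) == 0:                    # plurality, ties broken at random
            wins["consensus"] += 1
    return {k: v / trials for k, v in wins.items()}

print(voting_success_rates(n_versions=5, p_correct=0.7, space_size=3))
```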

Journal ArticleDOI
TL;DR: In this paper, proportional hazards models are developed for estimating thin-oxide dielectric reliability and time-dependent dielectric-breakdown hazard rates, and the form of the electric-field acceleration factor is examined.
Abstract: Proportional hazards models for estimating thin-oxide dielectric reliability and time-dependent dielectric-breakdown hazard rates are developed. Two groups of models are considered: group one ignores interactions between temperature and electric field, while group two considers several forms of interactions. The inclusion of interaction is not statistically significant at the 1% level. An analysis of the Hokari time-dependent dielectric-breakdown data, in terms of the form of the electric-field acceleration factor, shows that the approach of I.C. Chen et al. (1985) is more appropriate than the approach of D.L. Crook (1979).

Journal ArticleDOI
TL;DR: A software release problem based on four software reliability growth models (SRGMs) with random life-cycle length is studied and the optimal values of release times are shown to be finite and unique.
Abstract: A software release problem based on four software reliability growth models (SRGMs) with random life-cycle length is studied. Testing of the software system is terminated after time T, and the system is released (sold) to the user at a price. The price of the software system and three cost components are considered, and average total profit is used as a criterion. The optimal values of release times are shown to be finite and unique. Hence, the optimal solutions can be obtained numerically by, for example, a bisection method. A numerical example indicates that the optimal release time increases as (1) the error rate in each model decreases and (2) the difference between the error-fixing cost during the test phase and that during the operational phase increases. The case of unknown model parameters is considered only for the Jelinski-Moranda model because a Bayes model is not available for other SRGMs. The release decision depends on testing time, but other stopping rules, for example based on the number of corrected errors, can be considered.
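
As a hedged illustration of the bisection idea (using the familiar exponential-NHPP cost model rather than the paper's four SRGMs and profit criterion), the sketch below finds the release time at which the marginal benefit of further testing stops paying for the testing cost; all cost and model parameters are assumed.

```python
import math

# Exponential SRGM: expected errors found by time t is m(t) = a*(1 - exp(-b*t)).
a, b = 100.0, 0.05                         # assumed fault content and detection rate
c_test, c_field, c_time = 1.0, 5.0, 0.4    # fix-in-test, fix-in-field, testing cost per time unit

def total_cost(T):
    m = a * (1.0 - math.exp(-b * T))
    return c_test * m + c_field * (a - m) + c_time * T

def d_cost(T):
    # derivative of total_cost; increasing in T whenever c_test < c_field
    return (c_test - c_field) * a * b * math.exp(-b * T) + c_time

lo, hi = 0.0, 1000.0
if d_cost(lo) >= 0.0:
    T_star = 0.0                 # testing never pays off under these costs
else:
    for _ in range(100):         # bisection on the sign change of the derivative
        mid = 0.5 * (lo + hi)
        if d_cost(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    T_star = 0.5 * (lo + hi)

print("optimal release time:", round(T_star, 2), "cost:", round(total_cost(T_star), 2))
```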

Journal ArticleDOI
TL;DR: The authors review and extend available techniques for achieving fault-tolerant programs using models that deal with program reliability for a single run, which seems more practical and straightforward than dealing with distributions as for hardware systems.
Abstract: The authors review and extend available techniques for achieving fault-tolerant programs. The representation of the techniques is uniform and is illustrated by simple examples. For each technique a fault tree has been developed to derive failure probability from the probabilities of the basic fault events. This allows the subsequent analysis of program-failure causes and the reliability modeling of computer programs. Numerical examples are given to support the comparison of the reviewed techniques. The models can be used to evaluate numerical values of program reliability in a relatively simple way. The models deal with program reliability for a single run, which seems more practical and straightforward than dealing with distributions as for hardware systems. Evaluations obtained by using the models correspond to those reported in the literature; however, the authors' procedures are computationally simpler.

Journal ArticleDOI
J.A. Nachlas, S.R. Loney, B.A. Binney
TL;DR: In this paper, two cost models (for perfect and imperfect testing) represent the consequences of possible test realizations, and the probability that any particular component is responsible for the failure is derived and used as a basis for the two models.
Abstract: The selection of efficient testing strategies for repairable systems composed of components arranged in series is considered. Two cost models (for perfect and imperfect testing) represent the consequences of possible test realizations. The probability that any particular component is responsible for the failure is derived and used as a basis for the two models. The model for perfect testing is solved exactly. In the optimal perfect-test sequence the components are tested in decreasing order of the ratio of: (probability that the component is responsible for the system failure) to (component test cost). For imperfect testing, possible diagnostic errors are included in a model for which two heuristic solution strategies are provided. The model represents the consequences of both false-positive and false-negative component-test outcomes. The heuristic strategies yield efficient test sequences. Under reasonable assumptions, the second heuristic strategy is guaranteed to locate the optimal test sequence. The model can quantitatively evaluate the benefits of test-accuracy enhancement plans. These models and algorithms provide convenient methods for selecting efficient test sequences. This is illustrated by representative examples.
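
The perfect-testing result is easy to verify numerically: the sketch below computes the expected test cost of the decreasing p/c ordering and confirms it by brute force over all orderings for a small example with made-up probabilities and costs.

```python
from itertools import permutations

def expected_test_cost(order, p, c):
    """Expected cost of testing components one at a time (perfect tests) until the
    single faulty component is found, where p[i] is the probability that component i
    caused the failure and c[i] is its test cost."""
    cost = 0.0
    spent = 0.0
    for i in order:
        spent += c[i]
        cost += p[i] * spent
    return cost

p = [0.50, 0.30, 0.15, 0.05]          # assumed per-component fault probabilities
c = [4.0, 1.0, 2.0, 0.5]              # assumed test costs

ratio_order = sorted(range(len(p)), key=lambda i: p[i] / c[i], reverse=True)
best_order = min(permutations(range(len(p))), key=lambda o: expected_test_cost(o, p, c))
print("decreasing p/c order:", ratio_order, expected_test_cost(ratio_order, p, c))
print("brute-force optimum :", list(best_order), expected_test_cost(best_order, p, c))
```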

Journal ArticleDOI
TL;DR: In this paper, a direct exact method for computing the reliability for a consecutive-k-out-of-n:F system with homogeneous Markov dependence is presented, where the probability that any component i fails depends upon, and only upon, the state of the component (i-1).
Abstract: A direct, exact method for computing the reliability of a consecutive-k-out-of-n:F system with homogeneous Markov dependence is presented. This method calculates the reliability for a consecutive-k-out-of-n:F system where the probability that any component i fails depends upon, and only upon, the state of component (i-1).
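
A sketch of one such computation is given below: a dynamic program over the length of the trailing failure run, written from the Markov-dependence description in the abstract rather than from the paper's own formulas. The probabilities q1, q_wf, and q_ff are assumed values.

```python
def consec_k_markov_reliability(n, k, q1, q_wf, q_ff):
    """Reliability of a linear consecutive-k-out-of-n:F system whose components
    form a homogeneous Markov chain: component 1 fails with probability q1, and
    component i fails with probability q_ff if component i-1 failed, q_wf otherwise.
    state[r] = P(no run of k failures yet and the trailing failure run has length r)."""
    if k == 1:
        # the system fails as soon as any component fails
        return (1.0 - q1) * (1.0 - q_wf) ** (n - 1)
    state = [0.0] * k
    state[0], state[1] = 1.0 - q1, q1
    for _ in range(n - 1):
        new = [0.0] * k
        for r, prob in enumerate(state):
            q_fail = q_ff if r >= 1 else q_wf        # depends only on component i-1
            new[0] += prob * (1.0 - q_fail)
            if r + 1 < k:                            # a run reaching k is a system failure
                new[r + 1] += prob * q_fail
        state = new
    return sum(state)

print(consec_k_markov_reliability(n=10, k=3, q1=0.2, q_wf=0.1, q_ff=0.4))
```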

Journal ArticleDOI
TL;DR: In this article, a simulation-based fault-injection methodology for validating fault-tolerant microprocessor architectures is described, which uses mixed-mode simulation (electrical/logic analysis), and injects transient errors in run-time to assess the resulting fault-impact.
Abstract: A simulation-based fault-injection methodology for validating fault-tolerant microprocessor architectures is described. The approach uses mixed-mode simulation (electrical/logic analysis), and injects transient errors in run-time to assess the resulting fault-impact. To exemplify the methodology, a fault-tolerant architecture which models the digital aspects of a dual-channel, real-time jet-engine controller is used. The level of effectiveness of the dual configuration with respect to single and multiple transients is measured. The results indicate 100% coverage of single transients. Approximately 12% of the multiple transients affect both channels; none result in controller failure since two additional levels of redundancy exist.

Journal ArticleDOI
TL;DR: The problem of how to choose components for parallel redundancy is studied, and some results are given; some examples are presented to illustrate the approach.
Abstract: Adding parallel redundancy to different components generally yields different system reliability improvements. The effect of such parallel redundancy upon system reliability when applied at various places and in various systems is investigated. The problem of how to choose components for parallel redundancy is studied, and some results are given. Some examples are presented to illustrate the approach.
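
For the special case of a series system the choice is easy to illustrate: duplicating component i raises its reliability from p_i to 1 − (1 − p_i)², and the sketch below compares the resulting system-reliability gains. The component reliabilities are assumed values, and the example is only a special case of the general problem treated in the paper.

```python
from math import prod

def series_reliability(p):
    return prod(p)

def gain_from_duplicating(p, i):
    """System-reliability gain when component i of a series system is backed up
    by one identical unit in parallel."""
    q = list(p)
    q[i] = 1.0 - (1.0 - p[i]) ** 2
    return series_reliability(q) - series_reliability(p)

p = [0.99, 0.90, 0.95, 0.80]                       # assumed component reliabilities
gains = [gain_from_duplicating(p, i) for i in range(len(p))]
best = max(range(len(p)), key=gains.__getitem__)
print("gains:", [round(g, 4) for g in gains])
print("duplicate component", best, "(the least reliable one in a series system)")
```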

Journal ArticleDOI
TL;DR: In this paper, the hazard rate for the Gaussian random variable was shown to lie in a finite interval for all positive values of the standard deviation of the associated normal random variable.
Abstract: It is argued that plots of the hazard rate for the lognormal random variable which have appeared in some recent literature are incorrect and/or misleading; the hazard rate always begins at zero, rises to a maximum, then decreases very slowly to zero. An equation for the location of the maximum of the hazard rate is derived. The maximum lies in a finite interval for all positive values of the standard deviation of the associated normal random variable. Approximations that can be used to compute the hazard rate for parameter values outside of the usual range in the tables associated with the normal (Gaussian) random variable are presented.
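
The qualitative behavior is easy to reproduce numerically; the sketch below uses scipy to evaluate the lognormal hazard rate and locate its maximum for one assumed parameter choice (it does not reproduce the paper's equation for the maximum).

```python
from scipy.stats import lognorm
from scipy.optimize import minimize_scalar

sigma, scale = 0.8, 1.0                  # sigma is the std dev of the underlying normal

def hazard(t):
    """Lognormal hazard rate h(t) = f(t) / (1 - F(t))."""
    return lognorm.pdf(t, s=sigma, scale=scale) / lognorm.sf(t, s=sigma, scale=scale)

# The hazard is near zero for small t, rises to a single maximum, then decays slowly.
res = minimize_scalar(lambda t: -hazard(t), bounds=(1e-6, 50.0), method="bounded")
t_max = res.x
print("hazard at small t:", hazard(1e-3))
print("maximum hazard:", hazard(t_max), "at t =", round(t_max, 3))
print("hazard far in the tail:", hazard(200.0))
```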

Journal ArticleDOI
TL;DR: In this article, a method to calculate the optimal values of reliability indices for a load point in an electrical distribution system is presented, which is formulated as an optimization problem and solved by the gradient projection method.
Abstract: A method to calculate the optimal values of reliability indices for a load point in an electrical distribution system is presented. The problem is formulated as an optimization problem and solved by the gradient projection method; the objective is to minimize interruption cost. The algorithm is very useful and powerful for extending the existing network and planning new networks. It has been tested on a practical system (the Port-Fouad power network), and the results are discussed. Once the optimal reliability indices are determined, modification of the system by equipment replacement and future system planning can be done in such a way that the interruption cost is minimized.