Journal Article

A2THOS: availability analysis and optimisation in SLAs

01 Mar 2012, International Journal of Network Management (Wiley), Vol. 22, Iss. 2, pp. 104-130
TL;DR: This paper presents A2thOS, a framework to calculate the availability of partially outsourced IT services in the presence of SLAs and to achieve a cost-optimal choice of availability levels for outsourced IT components while guaranteeing a target availability level for the service.
Abstract: Information technology (IT) service availability is at the core of customer satisfaction and business success for today's organisations. Many medium- to large-size organisations outsource part of their IT services to external providers, with service-level agreements describing the agreed availability of outsourced service components. Availability management of partially outsourced IT services is a non-trivial task since classic approaches for calculating availability are not applicable, and IT managers can only rely on their expertise to fulfil it. This often leads to the adoption of non-optimal solutions. In this paper we present A2thOS, a framework to calculate the availability of partially outsourced IT services in the presence of SLAs and to achieve a cost-optimal choice of availability levels for outsourced IT components while guaranteeing a target availability level for the service. Copyright © 2011 John Wiley & Sons, Ltd.

Summary


Introduction

  • The authors present a framework to calculate the availability of partially outsourced IT services in the presence of SLAs and to achieve a cost-optimal choice of availability levels for outsourced IT components while guaranteeing a target availability level for the service.
  • Figure 4 shows one possible scheduling of the failures of the components on which Service1 depends, resulting in Service1 having an availability αService1 of 0.984.
  • To this end the authors distinguish among three types of nodes in a dependency graph: target availability nodes, variable availability nodes and given availability nodes.
  • The analysis engine solves the availability analysis problem, described in Section 3.



A2THOS: Availability Analysis and Optimisation in SLAs
Emmanuele Zambon¹, Sandro Etalle¹,² and Roel J. Wieringa¹
¹ University of Twente, Enschede, The Netherlands
Email: {emmanuele.zambon, sandro.etalle, r.j.wieringa}@utwente.nl
² Technical University of Eindhoven, Eindhoven, The Netherlands
Email: s.etalle@tue.nl
SUMMARY
IT service availability is at the core of customer satisfaction and business success for today’s organisations. Many medium- to large-size
organisations outsource part of their IT services to external providers, with Service Level Agreements describing the agreed
availability of outsourced service components. Availability management of partially outsourced IT services is a non-trivial task since
classic approaches for calculating availability are not applicable, and IT managers can only rely on their expertise to fulfil it. This
often leads to the adoption of non-optimal solutions. In this paper we present A2THOS, a framework to calculate the availability of
partially outsourced IT services in the presence of SLAs and to achieve a cost-optimal choice of availability levels for outsourced
IT components while guaranteeing a target availability level for the service. Copyright © 2010 John Wiley & Sons, Ltd.
KEY WORDS: SLA Management, Availability, Optimisation, Modelling
1. Introduction
Having a functional, cost-effective and properly managed IT infrastructure has become one of the key success
factors for all kinds of organisations. Nowadays, the IT infrastructure of most large organisations is so complex that it is
often organised in terms of services that are offered as part of an internal market in which different business units offer
and buy IT services to and from each other. In some cases, services are acquired from an external organisation rather
than from an internal business unit (outsourcing). Typically, services offered by an internal provider are customised and
tailored to support the business goals of the organisation, while those offered by external providers are standardised and
large-scale, and therefore are less specific but potentially cheaper than those implemented internally. In some cases,
internal providers outsource some sub-services to external ones, for instance when they lack specific competencies (e.g.,
SAP configuration). This is a so-called mixed sourcing strategy.
Regardless of whether the service is bought internally or externally, the terms and conditions of the contract are
determined in the so-called Service Level Agreement (SLA). (Figure 1 summarises the concept of mixed-sourced IT
services regulated by SLAs.) ITIL [15], one of the most popular frameworks providing guidelines and best practices
for IT service management, describes this process in detail [17].
In this paper we focus on IT service availability, which is at the core of customer satisfaction and business success
for organisations [16], and indeed it is one of the main topics in an SLA. In fact, a typical SLA includes hard clauses on
the minimal availability of the service offered (for example, it may stipulate that the service must not be “down” for
more than two hours per week, with a penalty fee for each week in which this is not satisfied).
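To make such a clause concrete, a downtime allowance can be turned into a minimal availability figure by taking the fraction of the measurement period in which the service is up. The Python sketch below is only an illustration: the function name is hypothetical, and it assumes availability is measured over a 168-hour week.

```python
def availability_from_downtime(max_down_hours: float, period_hours: float) -> float:
    """Minimal availability implied by a downtime allowance (hypothetical helper).

    Availability is taken as the fraction of the period in which the
    service is up: 1 - downtime / period.
    """
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    if not 0 <= max_down_hours <= period_hours:
        raise ValueError("downtime must lie between 0 and the period length")
    return 1.0 - max_down_hours / period_hours


HOURS_PER_WEEK = 7 * 24  # assumed measurement period: one week = 168 hours

# "Not down for more than two hours per week" as an availability figure.
weekly = availability_from_downtime(2, HOURS_PER_WEEK)
print(f"{weekly:.4f}")  # 0.9881
```

The same conversion works in the other direction when an SLA states a percentage and one wants to know the tolerated downtime per period.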
Now, the two concerns we focus on (and at the same time the two questions to which we provide an answer within
the limits of the settings of this paper) are:
1. how can a business unit check and/or guarantee that a given (offered) service will respect some given minimal
availability levels;
2. how it can achieve (1) while minimising costs.

Figure 1: Mixed-sourced IT service provision regulated by SLAs.
Let us elaborate on these two points and explain why they are not only relevant, but also non-trivial problems.
An IT service is usually offered by a system consisting of several components. These components can interact in
non-trivial ways: for instance, a component could be so crucial to the service that if the component is unavailable
then the service becomes unavailable as well; other components may be organised in such a way (e.g., exploiting
redundancy) that the service is affected only if a number of them fail. In addition, a component may depend in
a non-trivial way on sub-services which are in turn regulated by other SLAs.
To ensure that the minimal service availability remains within the agreed margins, IT managers can take reactive
(e.g., monitoring, measuring) and/or proactive measures. A key proactive measure is planning and designing service
availability when services are created or changed. At the business level, planning service availability allows the service
provider to set availability figures in the SLAs that both satisfy the customer’s needs and can be guaranteed by the
technical infrastructure providing the service. To achieve this at the technical level the service provider needs to
(a) calculate the availability of the IT system providing the service(s) based on the information available on system
components, and (b) make appropriate system design choices to support a specific availability level by selecting the
system components based on their contribution to the availability of the system.
Reliability studies have introduced a number of by now standard techniques (e.g., Continuous Time Markov Chains
(CTMC) [19] and Petri Nets [9]) which allow one to compute system availability when the mean time between
component failures and the mean time to repair a component are known. However, in the context of mixed-sourced
IT services, this information is usually not available. Instead, SLAs between the external and the internal provider
typically only include the minimal guaranteed availability of the component. Therefore, it is not possible to apply
these standard techniques to calculate the system availability (see Section 2 for details).
Regarding the second point, the service catalogue of most IT outsourcing companies includes different availability
levels (e.g., gold, silver and bronze) with different associated prices (same service, only different availability levels, at
different costs). Service providers need to minimise the cost of outsourced (sub)services while guaranteeing that their
own service achieves the desired minimal availability level. Given the interactions mentioned above, this is a nontrivial
optimisation problem: one needs to determine the combination of minimal availability levels for the sub-services in
such a way that the total cost is minimal while ensuring that the resulting service achieves the availability specified in
the SLAs. This cannot be solved without specific optimisation algorithms, and in practice IT managers typically
choose non-optimal, conservative solutions.
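A naive way to picture this optimisation problem is to enumerate every combination of levels and keep the cheapest one that meets the target. The sketch below is not the integer programming procedure of A2THOS: it assumes a hypothetical three-level catalogue with invented prices, and a service that needs all of its outsourced components at the same time, for which the worst-case availability is one minus the sum of the component unavailabilities (downtimes that never overlap). Enumeration grows exponentially with the number of components, which is one reason dedicated optimisation algorithms are needed.

```python
from itertools import product

# Hypothetical catalogue: level -> (guaranteed availability, price per period).
CATALOGUE = {
    "bronze": (0.95, 100.0),
    "silver": (0.98, 180.0),
    "gold": (0.995, 300.0),
}


def worst_case_and_availability(levels):
    """Lower bound for a service that needs *all* components.

    In the worst case the components' downtimes never overlap, so the
    service's unavailability is the sum of the individual unavailabilities.
    """
    unavailability = sum(1.0 - a for a in levels)
    return max(0.0, 1.0 - unavailability)


def cheapest_choice(num_components, target):
    """Try every combination of levels; keep the cheapest one meeting the target."""
    best = None
    for choice in product(CATALOGUE, repeat=num_components):
        avail = worst_case_and_availability(CATALOGUE[c][0] for c in choice)
        cost = sum(CATALOGUE[c][1] for c in choice)
        if avail >= target and (best is None or cost < best[1]):
            best = (choice, cost, avail)
    return best  # None when no combination reaches the target


choice, cost, avail = cheapest_choice(2, target=0.97)
print(choice, cost, round(avail, 3))  # ('silver', 'gold') 480.0 0.975
```

With this invented catalogue, two silver components only guarantee 0.96 in the worst case, so one of them has to be upgraded to gold to reach 0.97.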
Contribution We present A2THOS, a framework for the analysis and optimisation of the availability of mixed-sourced
IT services. The framework consists of (1) a modelling technique, based on dependency graphs, to represent partially outsourced IT
systems, their components and the services they provide, (2) a procedure to calculate (a lower
bound of) the system availability given the (lower bounds of the) component availabilities, and (3) a procedure to select
the optimal availability level for outsourced components in order to guarantee a desired target availability level for
the service(s) while minimising costs.
A dependency graph is an AND/OR graph in which nodes represent system components and services, and edges
between nodes represent the functional dependency of one node on another. We use the graph in order to calculate
a state function describing the availability of each service based on the state of the components (operational or not
operational). We then use the state function and the information about components availability to determine a lower
bound for the availability of the service, by setting up a linear programming problem. Based on this procedure, we
finally present the procedure to set up an integer programming problem which allows one to determine the cost-optimal
combination of availability levels for outsourced components in order to guarantee a target service availability. We
show the practical use of A2THOS by implementing it in a tool which we apply to the service availability planning of
an industrial case.
Limitation of the approach A2THOS uses an AND/OR graph to represent the system, and is therefore unable to explicitly
represent failure recovery mechanisms such as spare parts. Spare parts are used to implement warm and cold standby
mechanisms. For example, to shorten the downtime caused by a server breakdown, the system administrators can keep
another server ready to replace the broken one. This second server is the spare part. When it is always running (but not
operating) and the workload of the broken server is automatically routed to the spare server, this mechanism is called
hot standby. When the workload of the broken server needs to be manually routed to the spare server, this mechanism
is called warm standby. When the spare server is not readily available, but it needs a setup phase before the workload
of the broken server can be redirected to it, the mechanism is called cold standby. Our representation allows us to
explicitly model hot standby mechanisms by using OR nodes, but it is not applicable in case of warm and cold standby
mechanisms. We share this limitation with other well-known modelling techniques, such as traditional Fault Trees and
Reliability Block Diagrams.
Organisation The rest of the paper is organised as follows. In Section 2 we present the related work in the fields of
reliability and IT service composition. In Section 3 we present dependency graphs and we provide the mathematical
foundation for using them to calculate service availability. In Section 4 we present the procedure to find the optimal
choice of availability level for outsourced components. In Section 5 we describe the tool we created to implement the
A2THOS framework and the benchmarks we conducted to test its scalability. Finally, in Section 6 we
show how we applied A2THOS to a practical case of service availability planning in an industrial context.
2. Related Works
In this section we discuss related works in four relevant areas for our problem: (1) the general approach to calculate
system availability, (2) modelling techniques to represent the system under analysis, (3) existing tools and (4) other
approaches taking into account availability to optimise IT service composition.
The general approach Referring to a classic formulation [2] taken from reliability theory, a repairable system
is a system which can be repaired after a failure.
In the simplest case, the system m for which availability must be determined is represented by the state function
χ(m, t), which assumes value 1 if m is operating within tolerances at time t, and 0 otherwise. The standard way of calculating
the availability of a repairable system is to assume that its failure and repair times are independent and exponentially distributed
(a so-called stationary alternating renewal process [14]). However, to do so one must know at least two properties
of the system: its failure rate λ and its repair rate µ. The first property specifies how often the system fails on
average and is the reciprocal of its Mean Time Between Failures (MTBF): $\lambda = \frac{1}{\mathrm{MTBF}}$. The second one is the
reciprocal of its Mean Time To Repair (MTTR): $\mu = \frac{1}{\mathrm{MTTR}}$. Under this assumption the limiting availability is
then obtained by the formula
$\bar{A} = \frac{\mu}{\mu + \lambda}$.
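As a quick numerical check of the formula, the sketch below computes the limiting availability from MTBF and MTTR values expressed in hours; substituting the definitions of λ and µ shows it equals MTBF / (MTBF + MTTR). The function name and the example figures are illustrative assumptions.

```python
def limiting_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Limiting availability A = mu / (mu + lambda) for exponential
    failure and repair times, with lambda = 1/MTBF and mu = 1/MTTR."""
    if mtbf_hours <= 0 or mttr_hours <= 0:
        raise ValueError("MTBF and MTTR must be positive")
    failure_rate = 1.0 / mtbf_hours  # lambda
    repair_rate = 1.0 / mttr_hours   # mu
    return repair_rate / (repair_rate + failure_rate)


# A component failing on average every 999 hours and repaired in 1 hour.
print(f"{limiting_availability(999, 1):.3f}")  # 0.999
```

This is exactly the information that, as argued below, is usually missing for outsourced components, whose SLAs state only the resulting availability figure.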
In the general case, the system can assume more than two states. Such a system is called complex. A complex
system is a system which is made of interconnected components that as a whole exhibit one or more properties
depending on the properties of the individual components. For example, a complex system can be made of two “simple”
components (i.e., two components that can independently be either in an operative or in a repairing state). The state of the
system depends on the state of the two components: the system may work properly even if only one component is
operative, or it may need both components to be operative. To model the state of the system, a state formula is used.
Components can have more than two states (e.g., operative, planned maintenance, emergency repair, etc.). To compute
the availability of complex systems, Continuous Time Markov Chains (CTMC) [19], or Petri Nets [9] are used. To
employ such techniques, one has to (1) define a state formula of the system based on the components’ states, and (2)
know the transition probability of each component from one state to the other.
In our case, the information available in the SLAs for outsourced components concerns only a minimal availability
in a given time frame (e.g., one month). Therefore, classic techniques are not applicable to this problem, as the internal
states of each component and the probability of state transition (i.e., failure and repair rate) are only known by the
outsourcing company.
System modelling Several approaches have been proposed in the literature for system reliability modelling. Fault
trees (FTs) and Reliability Block Diagrams (RBDs) are the most used ones. However, other approaches have also been
proposed, e.g., Torres-Toledano and Sucar [22] use Bayesian networks, and Leangsuksun
et al. [13] use a UML representation (although in this second case the authors do not provide the mathematical
support for reliability analysis). In FTs, a number of components (called basic events) are linked together to make up a
system according to AND/OR relationships. The same behaviour is achieved in RBDs through SERIES/PARALLEL
compositions. According to [9], FTs are easy to use, as they do not require very skilled modellers, and relatively fast
to evaluate, as it is possible to use very efficient combinatorial solving techniques to obtain most of the reliability
indexes.
In FTs, the system state is represented by the top event, i.e., the root of the tree. It is possible to build a boolean
equation from the FT, and to reduce it to the minimal cut sets, i.e., the minimal combinations of basic events
(component failures) whose joint occurrence causes the top event (system failure) [23]. Based on the
minimal cut sets, a combination of combinatorial techniques and CTMC or Petri Nets is then used to calculate the
system (limiting) availability.
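For small systems, minimal cut sets can be found by brute force: try every set of failed components in order of increasing size, and keep those that cause system failure and do not contain an already found cut set. The Python sketch below assumes the failure logic is given as a monotone predicate over the set of failed components (the function names are hypothetical); it is exponential in the number of components and only illustrates the definition, not the more efficient algorithms used by fault tree tools.

```python
from itertools import combinations


def minimal_cut_sets(components, system_fails):
    """Brute-force minimal cut sets of a monotone failure function.

    `system_fails(failed)` returns True when failing exactly the components
    in the frozenset `failed` causes the top event. Subsets are visited by
    increasing size, so a subset containing a known cut set is not minimal.
    """
    cuts = []
    for size in range(1, len(components) + 1):
        for subset in combinations(components, size):
            failed = frozenset(subset)
            if any(cut <= failed for cut in cuts):
                continue  # contains a smaller cut set, hence not minimal
            if system_fails(failed):
                cuts.append(failed)
    return cuts


def example_fails(failed):
    # Hypothetical system: A and B are redundant, C is critical.
    return {"A", "B"} <= failed or "C" in failed


cuts = minimal_cut_sets(["A", "B", "C"], example_fails)
print(sorted(sorted(c) for c in cuts))  # [['A', 'B'], ['C']]
```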
According to Flamini et al. [9], the main limitation of FTs and RBDs lies in their limited modelling power, as they
do not allow one to model maintenance-related issues explicitly. To solve this problem, FTs and RBDs have been extended
into Dynamic Fault Trees [6] and Dynamic Reliability Block Diagrams [5], allowing one to model maintenance-related
issues.
The modelling notation we use in this paper (dependency graphs) can be seen as a condensed form of fault trees.
With a single dependency graph we are able to model a forest of fault trees sharing (some of) the basic events (i.e.,
the failure of a component), but with different top events. A single dependency graph can thus model separately the
failure of all the business services which the IT system provides, and for which a specific availability level must be
calculated. In fact, it is possible to (automatically) transform any dependency graph into a forest of FTs, as well as
into a set of RBDs, as we show in Appendix B. We share with FTs the use of minimal cut sets, which in our notation
are called Dependency Sets (see Section 3), but the availability calculation we apply to dependency graphs is different
from the one used in FTs (for the reason we mentioned above).
Tools IBM Tivoli [12] and HP Business Availability Centre [11] are two of the most popular configuration
management tools. These tools are meant to support IT managers in the configuration and maintenance of complex IT
systems. Among the many features they possess, they can be used to manage SLAs, including availability levels. One
can assign to each IT component the availability level imposed by SLAs, and keep track of the actual availability levels
to check for SLA compliance. However, to the best of our knowledge there is no support for the analytical calculation
of the service availability.
Galileo [21], Coral [4], Relex [18] and BlockSim [3] are tools operating with Dynamic Fault Trees. Although
integrating the A2THOS engines in one of these tools would be useful, this was not possible: Relex and BlockSim are
commercial tools, Coral is mostly a MATLAB library without a GUI, and Galileo is free software, but not open source.
For these reasons we developed our prototype as an independent Java/Prolog tool.
Availability in service composition In the field of IT service composition, several approaches have been proposed
that consider availability as one of the QoS parameters to optimise the performance of the resulting composite IT
service. Gu et al. [10] propose QUEST, a framework to dynamically schedule a composite IT service while satisfying
QoS requirements (e.g., response time and availability) imposed by SLAs. Zeng et al. [26], Yu et al. [24] and Ardagna
et al. [1] propose scheduling techniques to create a cost-optimal execution plan for composite web services which
respect QoS parameters (including availability) defined in SLA contracts.
In all these works, an estimation of the availability of the composite service is made by multiplying the availability
levels of the components (expressed as real numbers in the interval [0,1]). This is possible thanks to two simplifying
assumptions. First, all the components must be available at the same time for the system to operate (i.e., the system
is an AND-combination of its components and it becomes unavailable as soon as any of its components is
unavailable). Second, the resulting availability is not a lower bound, i.e., there can be a run of the composite service
in which the resulting availability is lower than the calculated one. Unlike these approaches, A2THOS is
able to deal with a wider range of dependencies, namely combinations of AND and OR dependencies. In the sequel
we also argue in more detail why OR dependencies are necessary to model complex IT services correctly. A2THOS
also allows one to calculate an absolute lower bound for the availability, which can be safely included in an SLA
contract.
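The difference between the multiplicative estimate and a worst-case lower bound can be seen on a two-component example. The sketch below assumes that only the individual availability figures are known (as with SLAs) and that failures may be scheduled adversarially: in a pure AND-combination the downtimes may never overlap, and in a pure OR-combination they may overlap completely. These closed forms cover only the two pure cases and are an illustration, not the general linear programming procedure of A2THOS; the function names are hypothetical.

```python
from math import prod


def product_estimate(avail):
    """Multiplicative estimate for an AND-combination, as in the
    service-composition works; it implicitly assumes independent failures."""
    return prod(avail)


def and_lower_bound(avail):
    """Worst case for an AND-combination: the downtimes never overlap,
    so the unavailabilities add up."""
    return max(0.0, 1.0 - sum(1.0 - a for a in avail))


def or_lower_bound(avail):
    """Worst case for an OR-combination (down only if all are down): the
    downtimes overlap as much as possible, so the service is down for as
    long as its most available component is down."""
    return max(avail)


a = [0.99, 0.99]
print(round(product_estimate(a), 4), round(and_lower_bound(a), 4), or_lower_bound(a))
# 0.9801 0.98 0.99
```

Here the product (0.9801) exceeds what can be guaranteed when the two downtimes do not overlap (0.98), so quoting the product in an SLA could lead to a violation.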
3. Analysis of the minimal service availability
We now present the theoretical foundations of A2THOS. Let us first start with an intuitive explanation. We model
the system using a dependency graph, in which a node represents a component of the system that at any given time
may (or may not) be available. A directed edge from node m to node n indicates that m depends on n, i.e., that the
availability of m also depends on the availability of n in a way that we are about to explain.
In a dependency graph, a node m can be unavailable because of an internal failure, or because (some) nodes it
depends on are unavailable. To model internal failure, to each node m we associate a (virtual) internal node $m'$.
On the other hand, to model the fact that m becomes unavailable because one or more nodes it depends on are
unavailable, we consider nodes of two types: AND and OR.
Figure 2: Two simple dependency graphs, (a) with an AND node and (b) with an OR node.
If m is a node in a dependency graph and $n_1, \dots, n_k$ are the nodes m depends on, we say that
m is unavailable at time t iff its internal node $m'$ is unavailable at time t or
• at least one node in $n_1, \dots, n_k$ is unavailable at time t, in case m is an AND node;
• $n_1, \dots, n_k$ are all unavailable at time t, in case m is an OR node.
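The rule above can be read as a recursive evaluation of the state of a node at a single instant. The Python sketch below assumes a hypothetical encoding of the graph as a dictionary from node names to a type and a list of dependencies, with the state of each internal node given separately; the assignment of App1 to Srv1 and App2 to Srv2 is an invented deployment used only for illustration. The recursion does not cache shared sub-results, which is acceptable for small graphs.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: node -> (node type, nodes it depends on).
Graph = Dict[str, Tuple[str, List[str]]]


def is_available(graph: Graph, internal_up: Dict[str, bool], node: str) -> bool:
    """Return True if `node` is available at one instant.

    `internal_up[n]` is the state of the internal node n' of n.
    A node is unavailable if its internal node is down, or if
      - it is an AND node and at least one dependency is unavailable;
      - it is an OR node and all of its dependencies are unavailable.
    """
    if not internal_up[node]:
        return False
    kind, deps = graph[node]
    if not deps:  # no dependencies: only the internal node matters
        return True
    states = [is_available(graph, internal_up, d) for d in deps]
    if kind == "AND":
        return all(states)
    if kind == "OR":
        return any(states)
    raise ValueError(f"unknown node type {kind!r} for node {node!r}")


# Invented deployment: App1 runs on Srv1 and App2 on Srv2.
graph: Graph = {
    "Service1": ("OR", ["App1", "App2"]),
    "App1": ("AND", ["Srv1"]),
    "App2": ("AND", ["Srv2"]),
    "Srv1": ("AND", []),
    "Srv2": ("AND", []),
}

up = {n: True for n in graph}
up["Srv1"] = False
print(is_available(graph, up, "Service1"))  # True: App2 still provides it
up["Srv2"] = False
print(is_available(graph, up, "Service1"))  # False: both applications are down
```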
Formally,
Definition 3.1 (Dependency graph) A dependency graph $\langle N, E \rangle$ is a directed acyclic graph (DAG) where N is
the set of nodes, partitioned into AND-N and OR-N, and E is the set of edges, $E \subseteq \{\langle u, v \rangle \mid u, v \in N\}$.
Given a graph $\langle N, E \rangle$, we call $N'$ the set of the internal nodes of the graph: $N' = \{n' \mid n' \text{ is the internal node of } n,\ n \in N\}$.
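The conditions of Definition 3.1 can be checked mechanically. The sketch below (an illustration with hypothetical function names) verifies that the AND and OR node sets are disjoint and that every edge connects known nodes, and uses Kahn's topological sort to reject cycles; it also builds the set of internal nodes, naming the internal node of n as n followed by a prime.

```python
from collections import deque


def check_dependency_graph(and_nodes, or_nodes, edges):
    """Check Definition 3.1 and return a topological order of the nodes.

    `edges` contains pairs (u, v) meaning "u depends on v". Raises
    ValueError if AND-N and OR-N overlap, an edge uses an unknown node,
    or the graph contains a cycle.
    """
    and_nodes, or_nodes = set(and_nodes), set(or_nodes)
    if and_nodes & or_nodes:
        raise ValueError("AND-N and OR-N must be disjoint")
    nodes = and_nodes | or_nodes
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for u, v in edges:
        if u not in nodes or v not in nodes:
            raise ValueError(f"edge ({u}, {v}) uses an unknown node")
        successors[u].append(v)
        indegree[v] += 1
    # Kahn's algorithm: repeatedly remove nodes without incoming edges.
    queue = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)
    if len(order) != len(nodes):
        raise ValueError("the graph contains a cycle, so it is not a DAG")
    return order


def internal_nodes(nodes):
    """N' = {n' | n in N}: one virtual internal node per node."""
    return {n + "'" for n in nodes}


order = check_dependency_graph(
    and_nodes={"App1", "App2", "Srv1", "Srv2"},
    or_nodes={"Service1"},
    edges=[("Service1", "App1"), ("Service1", "App2"),
           ("App1", "Srv1"), ("App2", "Srv2")],
)
print(order)  # ['Service1', 'App1', 'App2', 'Srv1', 'Srv2']
```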
Running example - Part 1. In this example we analyse the availability of an IT system providing two IT services
(Service1 and Service2), and implemented by means of three applications (App1, App2 and App3) running
on five different servers (Srv1, Srv2, Srv3, Srv4, Srv5). Service1 is implemented by App1 and App2 in
such a way that the service goes off-line only when both applications are off-line (OR dependency). Service2 is
Citations
Dissertation
20 Jan 2011
TL;DR: A graph-based framework for modelling the availability dependencies of the components of an IT infrastructure is proposed and techniques based on this framework are developed to support availability planning.
Abstract: The availability of an organisation’s IT infrastructure is of vital importance for supporting business activities. IT outages are a cause of competitive liability, chipping away at a company’s financial performance and reputation. To achieve the maximum possible IT availability within the available budget, organisations need to carry out a set of analysis activities to prioritise efforts and take decisions based on the business needs. This set of analysis activities is called IT availability planning. Most (large) organisations address IT availability planning from one or more of the three main angles: information risk management, business continuity and service level management. Information risk management consists of identifying, analysing, evaluating and mitigating the risks that can affect the information processed by an organisation and the information-processing (IT) systems. Business continuity consists of creating a logistic plan, called business continuity plan, which contains the procedures and all the useful information needed to recover an organisation’s critical processes after major disruption. Service level management mainly consists of organising, documenting and ensuring a certain quality level (e.g. the availability level) for the services offered by IT systems to the business units of an organisation. There exist several standard documents that provide the guidelines to set up the processes of risk, business continuity and service level management. However, to be as generally applicable as possible, these standards do not include implementation details. Consequently, to do IT availability planning each organisation needs to develop the concrete techniques that suit its needs. To be of practical use, these techniques must be accurate enough to deal with the increasing complexity of IT infrastructures, but remain feasible within the budget available to organisations.
As we argue in this dissertation, basic approaches currently adopted by organisations are feasible but often lack accuracy. In this thesis we propose a graph-based framework for modelling the availability dependencies of the components of an IT infrastructure and we develop techniques based on this framework to support availability planning.

65 citations

Journal Article
TL;DR: A Petri net Monte Carlo simulation is developed that estimates the availability and costs of a specific design of an IT service redundancy allocation problem and two meta-heuristics, namely a genetic algorithm and tabu search, are adapted.

30 citations

Journal Article
01 Jan 2015
TL;DR: The approach is based on model-driven principles and uses both UML and Bayesian Networks to capture, analyse and optimise cloud deployment configurations and is extensible to the operational phases of the life-cycle.
Abstract: This paper proposes an approach to support cloud brokers finding optimal configurations in the deployment of dependability and security sensitive cloud applications. The approach is based on model-driven principles and uses both UML and Bayesian Networks to capture, analyse and optimise cloud deployment configurations. While the paper is most focused on the initial allocation phase, the approach is extensible to the operational phases of the life-cycle. In such a way, a continuous improvement of cloud applications may be realised by monitoring, enforcing and re-negotiating cloud resources following detected anomalies and failures.

18 citations

Journal Article
TL;DR: Bayesian networks are used to allocate cloud resources so as to maximise service dependability, through a model-driven approach that guides the software engineer in defining a cloud infrastructure via a semi-automated process combining high-level languages such as UML with Bayesian networks.
Abstract: Bayesian networks have demonstrated their capability in several applications spanning from reasoning under uncertainty in artificial intelligence to dependability modelling and analysis. This paper focuses on the use of this language for allocating cloud resources to maximise service dependability. This objective is accomplished by the definition of a model-driven approach able to guide the software engineer to define a cloud infrastructure (applications, services, virtual and concrete resources) using a semi-automated process. This process exploits both high-level languages such as UML and Bayesian networks. Using all their features (backward analysis, ease of usage, low analysis time), Bayesian networks are used in this process as a driver for the optimization, learning and estimation phases. The paper discusses all the issues that the application of Bayesian networks in the proposed process raises.

5 citations

References
Book
01 Jan 1965

2,722 citations

Book
01 Jan 2006
TL;DR: Researchers from other fields should find in this handbook an effective way to learn about constraint programming and to possibly use some of the constraint programming concepts and techniques in their work, thus providing a means for a fruitful cross-fertilization among different research areas.
Abstract: Constraint programming is a powerful paradigm for solving combinatorial search problems that draws on a wide range of techniques from artificial intelligence, computer science, databases, programming languages, and operations research. Constraint programming is currently applied with success to many domains, such as scheduling, planning, vehicle routing, configuration, networks, and bioinformatics. The aim of this handbook is to capture the full breadth and depth of the constraint programming field and to be encyclopedic in its scope and coverage. While there are several excellent books on constraint programming, such books necessarily focus on the main notions and techniques and cannot cover also extensions, applications, and languages. The handbook gives a reasonably complete coverage of all these lines of work, based on constraint programming, so that a reader can have a rather precise idea of the whole field and its potential. Of course each line of work is dealt with in a survey-like style, where some details may be neglected in favor of coverage. However, the extensive bibliography of each chapter will help the interested readers to find suitable sources for the missing details. Each chapter of the handbook is intended to be a self-contained survey of a topic, and is written by one or more authors who are leading researchers in the area. The intended audience of the handbook is researchers, graduate students, higher-year undergraduates and practitioners who wish to learn about the state-of-the-art in constraint programming. No prior knowledge about the field is necessary to be able to read the chapters and gather useful knowledge. Researchers from other fields should find in this handbook an effective way to learn about constraint programming and to possibly use some of the constraint programming concepts and techniques in their work, thus providing a means for a fruitful cross-fertilization among different research areas. 
The handbook is organized in two parts. The first part covers the basic foundations of constraint programming, including the history, the notion of constraint propagation, basic search methods, global constraints, tractability and computational complexity, and important issues in modeling a problem as a constraint problem. The second part covers constraint languages and solvers, several useful extensions to the basic framework (such as interval constraints, structured domains, and distributed CSPs), and successful application areas for constraint programming. - Covers the whole field of constraint programming - Survey-style chapters - Five chapters on applications Table of Contents Foreword (Ugo Montanari) Part I: Foundations Chapter 1. Introduction (Francesca Rossi, Peter van Beek, Toby Walsh) Chapter 2. Constraint Satisfaction: An Emerging Paradigm (Eugene C. Freuder, Alan K. Mackworth) Chapter 3. Constraint Propagation (Christian Bessiere) Chapter 4. Backtracking Search Algorithms (Peter van Beek) Chapter 5. Local Search Methods (Holger H. Hoos, Edward Tsang) Chapter 6. Global Constraints (Willem-Jan van Hoeve, Irit Katriel) Chapter 7. Tractable Structures for CSPs (Rina Dechter) Chapter 8. The Complexity of Constraint Languages (David Cohen, Peter Jeavons) Chapter 9. Soft Constraints (Pedro Meseguer, Francesca Rossi, Thomas Schiex) Chapter 10. Symmetry in Constraint Programming (Ian P. Gent, Karen E. Petrie, Jean-Francois Puget) Chapter 11. Modelling (Barbara M. Smith) Part II: Extensions, Languages, and Applications Chapter 12. Constraint Logic Programming (Kim Marriott, Peter J. Stuckey, Mark Wallace) Chapter 13. Constraints in Procedural and Concurrent Languages (Thom Fruehwirth, Laurent Michel, Christian Schulte) Chapter 14. Finite Domain Constraint Programming Systems (Christian Schulte, Mats Carlsson) Chapter 15. Operations Research Methods in Constraint Programming (John Hooker) Chapter 16. Continuous and Interval Constraints (Frederic Benhamou, Laurent Granvilliers) Chapter 17. Constraints over Structured Domains (Carmen Gervet) Chapter 18. Randomness and Structure (Carla Gomes, Toby Walsh) Chapter 19. Temporal CSPs (Manolis Koubarakis) Chapter 20. Distributed Constraint Programming (Boi Faltings) Chapter 21. Uncertainty and Change (Kenneth N. Brown, Ian Miguel) Chapter 22. Constraint-Based Scheduling and Planning (Philippe Baptiste, Philippe Laborie, Claude Le Pape, Wim Nuijten) Chapter 23. Vehicle Routing (Philip Kilby, Paul Shaw) Chapter 24. Configuration (Ulrich Junker) Chapter 25. Constraint Applications in Networks (Helmut Simonis) Chapter 26. Bioinformatics and Constraints (Rolf Backofen, David Gilbert)

1,527 citations

Book
17 Dec 1987
TL;DR: This handbook has been developed not only to serve as text for the System Safety and Reliability Course, but also to make available to others a set of otherwise undocumented material on fault tree construction and evaluation.
Abstract: Introduction: Since 1975, a short course entitled "System Safety and Reliability Analysis" has been presented to over 200 NRC personnel and contractors. The course has been taught jointly by David F. Haasl, Institute of System Sciences, Professor Norman H. Roberts, University of Washington, and members of the Probabilistic Analysis Staff, NRC, as part of a risk assessment training program sponsored by the Probabilistic Analysis Staff. This handbook has been developed not only to serve as text for the System Safety and Reliability Course, but also to make available to others a set of otherwise undocumented material on fault tree construction and evaluation. The publication of this handbook is in accordance with the recommendations of the Risk Assessment Review Group Report (NUREG/CR-0400) in which it was stated that the fault/event tree methodology both can and should be used more widely by the NRC. It is hoped that this document will help to codify and systematize the fault tree approach to systems analysis.

1,266 citations


"A2thOS: availability analysis and..." refers to this paper for background

  • ...the smallest set of combinations of basic events (component failures) which all need to occur for the top event to take place (system failure) [15]....

    [...]
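To make the minimal-cut-set definition quoted above concrete, here is a minimal sketch that enumerates the minimal cut sets of a static AND/OR fault tree. It assumes a hypothetical encoding (a basic event is a string; a gate is a tuple `("AND" | "OR", [children])`) and uses a naive expand-then-absorb approach; it is an illustration of the definition, not the procedure used in the cited handbook or in A2thOS.

```python
from itertools import product

# Hypothetical encoding: a node is either a basic-event name (str)
# or a tuple (gate, children) with gate in {"AND", "OR"}.

def cut_sets(node):
    """Return all cut sets (frozensets of basic events) that trigger `node`."""
    if isinstance(node, str):
        return {frozenset([node])}
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":
        # Any single child's cut set is enough to trigger an OR gate.
        return set().union(*child_sets)
    if gate == "AND":
        # One cut set from every child must occur together.
        return {frozenset().union(*combo) for combo in product(*child_sets)}
    raise ValueError(f"unknown gate: {gate}")

def minimal_cut_sets(node):
    """Keep only cut sets that do not strictly contain another cut set."""
    sets = cut_sets(node)
    return {s for s in sets if not any(other < s for other in sets)}
```

For example, the tree `("AND", [("OR", ["A", "B"]), ("OR", ["A", "C"])])` has the minimal cut sets `{A}` and `{B, C}`: the failure of A alone, or of B and C together, triggers the top event. The naive expansion can blow up combinatorially on large trees, which is why practical tools use more refined algorithms.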

Journal Article

884 citations


"A2thOS: availability analysis and..." refers to this paper for background or methods

  • ...Reliability studies have introduced a number of (by now) standard techniques (e.g. continuous‐time Markov chains (CTMC) [2] and Petri nets [3]) which allow one to compute system availability when the mean time between component failures and the mean time to repair a component are known....

    [...]

  • ...To compute the availability of complex systems, CTMC [2] or Petri nets [3] are used....

    [...]

  • ...Based on the minimal cut set, a combination of combinatorial techniques and CTMC or Petri nets is then used to calculate the system (limiting) availability....

    [...]
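The snippets above describe computing limiting availability from component failure and repair times and then combining components through minimal cut sets. A minimal sketch of that idea follows, under stated assumptions: each component is a two-state (up/down) CTMC whose mean up time is the reciprocal of its failure rate, components fail and are repaired independently, and system unavailability is obtained by inclusion-exclusion over the minimal cut sets. The function names are hypothetical, and this is one standard combinatorial technique, not the specific method of the cited paper or of A2thOS.

```python
import math
from itertools import combinations

def steady_state_availability(mean_up_time, mean_repair_time):
    # Two-state CTMC with failure rate 1/mean_up_time and repair rate
    # 1/mean_repair_time: limiting probability of the "up" state.
    return mean_up_time / (mean_up_time + mean_repair_time)

def system_availability(min_cut_sets, availability):
    """Limiting system availability, assuming independent components.

    The system is down iff every component of at least one minimal cut set
    is down; P(down) is computed exactly by inclusion-exclusion.
    """
    cuts = [frozenset(c) for c in min_cut_sets]
    p_down = 0.0
    for k in range(1, len(cuts) + 1):
        for group in combinations(cuts, k):
            # Probability that all components in the union of the group are down.
            components = frozenset().union(*group)
            term = math.prod(1 - availability[c] for c in components)
            p_down += term if k % 2 == 1 else -term
    return 1 - p_down
```

With two components of availability 0.9 each, a parallel pair (one cut set `{A, B}`) gives about 0.99, while a series pair (cut sets `{A}` and `{B}`) gives about 0.81. Inclusion-exclusion is exponential in the number of cut sets, so it only suits small examples; larger or dependent models are where CTMC and Petri net tools come in.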

Journal Article
TL;DR: HARP (Hybrid Automated Reliability Predictor) is a software package developed at Duke University and NASA Langley Research Center that can solve dynamic fault-tree models of systems that employ high levels of redundancy, dynamic redundancy management, and complex fault and error recovery techniques.
Abstract: Reliability analysis of fault-tolerant computer systems for critical applications is complicated by several factors. Systems designed to achieve high levels of reliability frequently employ high levels of redundancy, dynamic redundancy management, and complex fault and error recovery techniques. This paper describes dynamic fault-tree modeling techniques for handling these difficulties. Three advanced fault-tolerant computer systems are described: a fault-tolerant parallel processor, a mission avionics system, and a fault-tolerant hypercube. Fault-tree models for their analysis are presented. HARP (Hybrid Automated Reliability Predictor) is a software package developed at Duke University and NASA Langley Research Center that can solve those fault-tree models.

730 citations


"A2thOS: availability analysis and..." refers to this paper for background

  • ...We share with FTs the use of minimal cut sets, which in our notation are called dependency sets (see Section 3), but the availability calculation we apply to AND/OR dependency graphs is different from the one used in FTs (for the reason mentioned above)....

    [...]

  • ...In FTs, a number of components (called basic events) are linked together to make up a system according to AND/OR relationships....

    [...]

  • ...In FTs, the system state is represented by the top event, i.e. the root of the tree....

    [...]

  • ...Galileo [20], Coral [21], Relex [22] and BlockSim [23] are tools operating with dynamic FTs....

    [...]

  • ...FTs and RBDs are the most used ones....

    [...]
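The snippets above describe a fault tree as basic events combined through AND/OR relationships, with the system state given by the top event at the root. A minimal sketch of evaluating that top event for a given set of failed components follows. It assumes the same hypothetical encoding as a plain tuple (a basic event is a string; a gate is `("AND" | "OR", [children])`) and covers static gates only; dynamic gates such as those handled by the tools named above are not modelled.

```python
def top_event_occurs(node, failed):
    """Return True if the event at `node` occurs, given the set of failed basic events."""
    if isinstance(node, str):
        # A basic event occurs iff that component has failed.
        return node in failed
    gate, children = node
    results = (top_event_occurs(child, failed) for child in children)
    if gate == "AND":
        # An AND gate fires only if all of its inputs occur.
        return all(results)
    if gate == "OR":
        # An OR gate fires if any of its inputs occurs.
        return any(results)
    raise ValueError(f"unknown gate: {gate}")
```

For instance, with the hypothetical tree `("OR", [("AND", ["disk1", "disk2"]), "power"])`, losing `disk1` alone leaves the system up, while losing both disks, or losing `power`, brings it down.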

Frequently Asked Questions
Q1. What are the contributions of "A2thOS: availability analysis and optimisation in SLAs"?

In this paper the authors present A2thOS, a framework to calculate the availability of partially outsourced IT services in the presence of SLAs and to achieve a cost-optimal choice of availability levels for outsourced IT components while guaranteeing a target availability level for the service.