Journal Article

A2thOS: availability analysis and optimisation in SLAs

Emmanuele Zambon¹, Sandro Etalle¹, Roel Wieringa¹

01 Mar 2012-International Journal of Network Management (Wiley)-Vol. 22, Iss: 2, pp 104-130

TL;DR: This paper presents A2thOS, a framework to calculate the availability of partially outsourced IT services in the presence of SLAs and to achieve a cost-optimal choice of availability levels for outsourced IT components while guaranteeing a target availability level for the service.


Abstract: Information technology (IT) service availability is at the core of customer satisfaction and business success for today's organisations. Many medium- to large-size organisations outsource part of their IT services to external providers, with service-level agreements describing the agreed availability of outsourced service components. Availability management of partially outsourced IT services is a non-trivial task since classic approaches for calculating availability are not applicable, and IT managers can only rely on their expertise to fulfil it. This often leads to the adoption of non-optimal solutions. In this paper we present A2thOS, a framework to calculate the availability of partially outsourced IT services in the presence of SLAs and to achieve a cost-optimal choice of availability levels for outsourced IT components while guaranteeing a target availability level for the service. Copyright © 2011 John Wiley & Sons, Ltd.


Summary

Introduction

The authors present A2thOS, a framework to calculate the availability of partially outsourced IT services in the presence of SLAs and to achieve a cost-optimal choice of availability levels for outsourced IT components while guaranteeing a target availability level for the service.
Figure 4 shows one possible scheduling for the failure of the components on which Service1 depends, resulting in Service1 having an availability αService1 of 0.984.
To this end the authors distinguish among three types of nodes in a dependency graph: target availability nodes, variable availability nodes and given availability nodes.
The analysis engine solves the availability analysis problem, described in Section 3.


Figures (12)

Figure 3: The dependency graph representing the system we analyse in our running example. AND nodes are represented by the ∧ symbol, OR nodes by the ∨ symbol.

Figure 8: Reliability Block Diagram parallel composition

Figure 9: Dependency graph parallel composition

Figure 1: Mixed-sourced IT service provision regulated by SLAs.

Table II: Performance of the simplex algorithm for availability analysis

Table III: Performance of the availability optimisation algorithm with 50 variable availability nodes

Figure 4: One possible scheduling for the failure of FW1, App1, App2, Srv1, Srv2 and Srv3 resulting in Service1 having an availability of 0.984. System components are on the vertical axis and each component's unavailability, as a fraction of time (∈ [0, 1]), is on the horizontal axis.

Figure 2: Two simple dependency graphs, respectively with AND and OR nodes

Table IV: Results of the availability analysis on Oxygen


A2thOS: Availability Analysis and Optimisation in SLAs

Emmanuele Zambon¹, Sandro Etalle¹,² and Roel J. Wieringa¹

¹ University of Twente, Enschede, The Netherlands
Email: {emmanuele.zambon, sandro.etalle, r.j.wieringa}@utwente.nl

² Technical University of Eindhoven, Eindhoven, The Netherlands
Email: s.etalle@tue.nl

SUMMARY

IT service availability is at the core of customer satisfaction and business success for today's organisations. Many medium- to large-size organisations outsource part of their IT services to external providers, with Service Level Agreements (SLAs) describing the agreed availability of outsourced service components. Availability management of partially outsourced IT services is a non-trivial task since classic approaches for calculating availability are not applicable, and IT managers can only rely on their expertise to fulfil it. This often leads to the adoption of non-optimal solutions. In this paper we present A2thOS, a framework to calculate the availability of partially outsourced IT services in the presence of SLAs and to achieve a cost-optimal choice of availability levels for outsourced IT components while guaranteeing a target availability level for the service. Copyright © 2010 John Wiley & Sons, Ltd.

KEY WORDS: SLA Management, Availability, Optimisation, Modelling

1. Introduction

Having a functional, cost-effective and properly managed IT infrastructure has become one of the key success factors for all kinds of organisations. Nowadays, the IT infrastructure of most large organisations is so complex that it is often organised in terms of services that are offered as part of an internal market in which different business units offer and buy IT services to and from each other. In some cases, services are acquired from an external organisation rather than from an internal business unit (outsourcing). Typically, services offered by an internal provider are customised and tailored to support the business goals of the organisation, while those offered by external providers are standardised and large-scale, and therefore are less specific but potentially cheaper than those implemented internally. In some cases, internal providers outsource some sub-services to external ones, for instance when they lack specific competencies (e.g., SAP configuration). This is a so-called mixed sourcing strategy.

Regardless of whether the service is bought internally or externally, the terms and conditions of the contract are determined in the so-called Service Level Agreement (SLA). (Figure 1 summarises the concept of mixed-sourced IT services regulated by SLAs.) For instance, ITIL [15] is one of the most popular frameworks providing guidelines and best practices for correct IT service management, and it describes this process in detail in [17].

In this paper we focus on IT service availability, which is at the core of customer satisfaction and business success for organisations [16], and is indeed one of the main topics in an SLA. In fact, a typical SLA includes hard clauses on the minimal availability of the service offered (for example, it may state that the service should not be “down” for more than two hours per week, and set a penalty fee for each week in which this is not satisfied).
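To make the example clause concrete, the following minimal sketch converts a maximum allowed downtime per period into the minimal availability it implies. The function name and the choice of a 168-hour week as the default period are assumptions made for illustration.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours


def min_availability(max_downtime_hours, period_hours=HOURS_PER_WEEK):
    """Minimal availability implied by a maximum allowed downtime per period."""
    if not 0 <= max_downtime_hours <= period_hours:
        raise ValueError("downtime must lie between 0 and the period length")
    return 1.0 - max_downtime_hours / period_hours


# "Not down for more than two hours per week" as an availability figure.
weekly = min_availability(2.0)
print(f"{weekly:.4f}")  # 0.9881
```

Under this reading, a two-hours-per-week clause corresponds to a minimal availability of roughly 98.8% over each week.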

Now, the two concerns we focus on (and at the same time the two questions to which we provide an answer within the limits of the settings of this paper) are:

1. how can a business unit check and/or guarantee that a given (offered) service will respect some given minimal availability levels;

2. how can it do so while minimising costs.

Figure 1: Mixed-sourced IT service provision regulated by SLAs.

Let us elaborate on these two points and explain why they are not only relevant, but also non-trivial problems.

An IT service is usually offered by a system consisting of several components. These components can interact in non-trivial ways: for instance, a component could be crucial to the service, so that if the component is unavailable then the service becomes unavailable as well; other components may be organised in such a way (e.g., exploiting redundancy) that the service is affected only if a number of them fail. In addition, a component may depend in a non-trivial way on sub-services which are in turn regulated by other SLAs.

To ensure that the minimal service availability remains within the agreed margins, IT managers can take reactive (e.g., monitoring, measuring) and/or proactive measures. A key proactive measure is planning and designing service availability when services are created or changed. At the business level, planning service availability allows the service provider to set availability figures on the SLAs that both satisfy the customer needs and can be guaranteed by the technical infrastructure providing the service. To achieve this at the technical level the service provider needs to (a) calculate the availability of the IT system providing the service(s) based on the information available on system components, and (b) make appropriate system design choices to support a specific availability level by selecting the system components based on their contribution to the availability of the system.

Reliability studies have introduced a number of by now standard techniques (e.g., Continuous Time Markov Chains (CTMC) [19] and Petri Nets [9]) which allow one to compute system availability when the mean time between component failures and the mean time to repair a component are known. However, in the context of mixed-sourced IT services, this information is usually not available. Instead, SLAs between the external and the internal provider typically only include the minimal guaranteed availability of the component. Therefore, it is not possible to apply these standard techniques to calculate the system availability (see Section 2 for details).

Regarding the second point, the service catalogue of most IT outsourcing companies includes different availability levels (e.g., gold, silver and bronze) with different associated prices (the same service at different availability levels and different costs). Service providers need to minimise the cost of outsourced (sub)services while guaranteeing that their own service achieves the desired minimal availability level. Given the interactions mentioned above, this is a non-trivial optimisation problem: one needs to determine the combination of minimal availability levels for the sub-services in such a way that the total cost is minimal while ensuring that the resulting service achieves the availability specified in the SLAs. This cannot be solved without the use of specific optimisation algorithms, and IT managers typically choose non-optimal, conservative solutions.
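The shape of this optimisation problem can be illustrated on a toy instance. The sketch below is an assumption-laden stand-in, not the procedure of this paper: it enumerates every assignment of levels instead of solving an integer programme, it assumes a service that needs all of its outsourced components at once, and the catalogue prices, availability figures and component names are hypothetical. Exact fractions are used so that boundary cases are not distorted by floating-point rounding.

```python
from fractions import Fraction
from itertools import product

# Hypothetical catalogue: level -> (guaranteed availability, price per month).
CATALOGUE = {
    "bronze": (Fraction("0.990"), 100),
    "silver": (Fraction("0.995"), 180),
    "gold": (Fraction("0.999"), 400),
}


def worst_case_and(availabilities):
    """Lower bound when the service needs every component and downtimes never overlap."""
    return max(Fraction(0), 1 - sum(1 - a for a in availabilities))


def cheapest_choice(components, target):
    """Try every assignment of levels; return (cost, assignment, bound) or None."""
    best = None
    for levels in product(CATALOGUE, repeat=len(components)):
        bound = worst_case_and(CATALOGUE[level][0] for level in levels)
        cost = sum(CATALOGUE[level][1] for level in levels)
        if bound >= target and (best is None or cost < best[0]):
            best = (cost, dict(zip(components, levels)), bound)
    return best


best = cheapest_choice(["network", "hosting"], Fraction("0.99"))
print(best[0], best[1])  # 360 {'network': 'silver', 'hosting': 'silver'}
```

In this toy catalogue, pairing one gold component with one bronze component costs more than two silver components and still misses the 0.99 target, which hints at why intuition alone tends to produce non-optimal choices. Exhaustive enumeration grows exponentially with the number of components, so it is only viable for very small instances.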

Contribution. We present A2thOS, a framework for the analysis and optimisation of the availability of mixed-sourced IT services. The framework consists of (1) a modelling technique to represent partially outsourced IT systems, their components and the services they provide, based on dependency graphs, (2) a procedure to calculate (a lower bound of) the system availability given the (lower bounds of) component availability, and (3) a procedure to select the optimum availability level for outsourced components in order to guarantee a desired target availability level for the service(s) and to minimise costs.

A dependency graph is an AND/OR graph in which nodes represent system components and services, and edges between nodes represent the functional dependency of one node on the other. We use the graph to calculate a state function describing the availability of each service based on the state of the components (operational or not operational). We then use the state function and the information about component availability to determine a lower bound for the availability of the service, by setting up a linear programming problem. Based on this procedure, we finally present the procedure to set up an integer programming problem which allows one to determine the cost-optimal combination of availability levels for outsourced components in order to guarantee a target service availability. We show the practical use of A2thOS by implementing it in a tool which we apply to the service availability planning of an industrial case.

Limitation of the approach. A2thOS uses an AND/OR graph to represent the system, thus it is unable to explicitly represent failure recovery mechanisms such as spare parts. Spare parts are used to implement warm and cold standby mechanisms. For example, to shorten the downtime caused by a server breakdown, the system administrators can keep another server ready to replace the broken one. This second server is the spare part. When it is always running (but not operating) and the workload of the broken server is automatically routed to the spare server, this mechanism is called hot standby. When the workload of the broken server needs to be manually routed to the spare server, this mechanism is called warm standby. When the spare server is not readily available, but needs a setup phase before the workload of the broken server can be redirected to it, the mechanism is called cold standby. Our representation allows us to explicitly model hot standby mechanisms by using OR nodes, but it is not applicable in the case of warm and cold standby mechanisms. We share this limitation with other well-known modelling techniques, such as traditional Fault Trees and Reliability Block Diagrams.

Organisation. The rest of the paper is organised as follows. In Section 2 we present the related work in the fields of reliability and IT service composition. In Section 3 we present dependency graphs and we provide the mathematical foundation for using them to calculate service availability. In Section 4 we present the procedure to find the optimal choice of availability level for outsourced components. In Section 5 we describe the tool we created to implement the A2thOS framework and the benchmarks we conducted to test its scalability. Finally, in Section 6 we show how we applied A2thOS to a practical case of service availability planning in an industrial context.

2. Related Works

In this section we discuss related works in four relevant areas for our problem: (1) the general approach to calculate system availability, (2) modelling techniques to represent the system under analysis, (3) existing tools and (4) other approaches taking into account availability to optimise IT service composition.

The general approach. Referring to a classic formulation [2] taken from reliability theory, a repairable system is a system which can be repaired after a failure.

In the simplest case, the system m for which availability must be determined is represented by the state function χ(m, t), which assumes value 1 if m is operating within tolerances at time t, and 0 otherwise. The general way of calculating the availability of a repairable system is to assume it has independent, exponentially distributed failure and repair times (a so-called stationary alternating renewal process [14]). However, to do so one must know at least two properties of the system: its failure rate λ and its repair rate µ. The first property specifies how often the system will fail on average, and is the reciprocal of its Mean Time Between Failures (MTBF): $\lambda = \frac{1}{\mathrm{MTBF}}$. The second one is the reciprocal of its Mean Time To Repair (MTTR): $\mu = \frac{1}{\mathrm{MTTR}}$. Under this assumption the limiting availability is then obtained by the formula

$$A = \frac{\mu}{\mu + \lambda}$$
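As a small worked example of the classic formula, with λ = 1/MTBF and µ = 1/MTTR the limiting availability is A = µ/(µ + λ). The sketch below evaluates it for illustrative figures; the function name and the numbers are assumptions, not data from this paper.

```python
def limiting_availability(mtbf_hours, mttr_hours):
    """Limiting availability A = mu / (mu + lambda) of a two-state repairable system."""
    failure_rate = 1.0 / mtbf_hours  # lambda
    repair_rate = 1.0 / mttr_hours   # mu
    return repair_rate / (repair_rate + failure_rate)


# A component that fails on average every 990 hours and takes 10 hours to repair.
a = limiting_availability(mtbf_hours=990.0, mttr_hours=10.0)
print(round(a, 6))  # 0.99
```

The same value follows from the equivalent form MTBF / (MTBF + MTTR) = 990 / 1000. Note that both inputs are needed, which is exactly the information an SLA that only states a minimal availability does not provide.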

In the general case, the system can assume more than two states. Such a system is called complex. A complex system is a system which is made of interconnected components that as a whole exhibit one or more properties depending on the properties of the individual components. For example, a complex system can be made of two “simple” components (i.e., two components that can independently be either in operative or in repairing state). The state of the system depends on the state of the two components: the system may work properly even if only one component is operative, or it may need both components to be operative. To model the state of the system, a state formula is used. Components can have more than two states (e.g., operative, planned maintenance, emergency repair, etc.). To compute the availability of complex systems, Continuous Time Markov Chains (CTMC) [19] or Petri Nets [9] are used. To employ such techniques, one has to (1) define a state formula of the system based on the components' states, and (2) know the transition probability of each component from one state to the other.

In our case, the information available in the SLAs for outsourced components concerns only a minimal availability in a given time frame (e.g., one month). Therefore, classic techniques are not applicable to this problem, as the internal states of each component and the probability of state transition (i.e., failure and repair rate) are only known by the outsourcing company.

System modelling. Several approaches have been proposed in the literature for system reliability modelling. Fault Trees (FTs) and Reliability Block Diagrams (RBDs) are the most used ones. However, other approaches have also been proposed: e.g., Torres-Toledano and Sucar [22] use Bayesian networks, and Leangsuksun et al. [13] use a UML representation (although in this second case the authors do not provide the mathematical support for reliability analysis). In FTs, a number of components (called basic events) are linked together to make up a system according to AND/OR relationships. The same behaviour is achieved in RBDs through SERIES/PARALLEL compositions. According to [9], FTs are easy to use, as they do not require very skilled modellers, and relatively fast to evaluate, as it is possible to use very efficient combinatorial solving techniques to obtain most of the reliability indexes.

In FTs, the system state is represented by the top event, i.e., the root of the tree. It is possible to build a boolean equation from the FT, and to reduce it to the minimal cut set, i.e., the smallest set of combinations of basic events (component failures) which all need to occur for the top event to take place (system failure) [23]. Based on the minimal cut set, a combination of combinatorial techniques and CTMC or Petri Nets is then used to calculate the system (limiting) availability.
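The reduction of a fault tree to its minimal cut sets can be sketched in a few lines. The code below is an illustrative top-down expansion, assuming a hypothetical encoding of the tree as nested tuples whose first element is the gate type and whose leaves are basic-event names; it is not taken from any of the tools or references discussed here.

```python
from itertools import product


def minimise(sets):
    """Drop every cut set that strictly contains another cut set."""
    return {s for s in sets if not any(other < s for other in sets)}


def cut_sets(tree):
    """Minimal cut sets of a fault tree given as a basic event or (gate, child, ...)."""
    if isinstance(tree, str):  # a basic event (component failure)
        return {frozenset([tree])}
    gate, *children = tree
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":    # the event occurs if any child event occurs
        combined = set().union(*child_sets)
    elif gate == "AND":  # the event occurs only if all child events occur
        combined = {frozenset().union(*combo) for combo in product(*child_sets)}
    else:
        raise ValueError(f"unknown gate: {gate}")
    return minimise(combined)


# Hypothetical top event: both branches fail, and basic event "A" is shared.
top = ("AND", ("OR", "A", "B"), ("OR", "A", "C"))
print(sorted(sorted(s) for s in cut_sets(top)))  # [['A'], ['B', 'C']]
```

The shared event "A" alone is enough to trigger the top event, so the larger combinations containing it are discarded as non-minimal.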

According to Flamini et al. [9], the main limitation of FTs and RBDs consists in the lack of modelling power, as they do not allow one to model maintenance-related issues explicitly. To solve this problem, FTs and RBDs have been extended into Dynamic Fault Trees [6] and Dynamic Reliability Block Diagrams [5], allowing one to model maintenance-related issues.

The modelling notation we use in this paper (dependency graphs) can be seen as a condensed form of fault trees. With a single dependency graph we are able to model a forest of fault trees sharing (some of) the basic events (i.e., the failure of a component), but with different top events. A single dependency graph can thus model separately the failure of all the business services which the IT system provides, and for which a specific availability level must be calculated. In fact, it is possible to (automatically) transform any dependency graph into a forest of FTs, as well as into a set of RBDs, as we show in Appendix B. We share with FTs the use of minimal cut sets, which in our notation are called Dependency Sets (see Section 3), but the availability calculation we apply to dependency graphs is different from the one used in FTs (for the reason we mentioned above).

Tools. IBM Tivoli [12] and HP Business Availability Centre [11] are two of the most popular configuration management tools. These tools are meant to support IT managers in the configuration and maintenance of complex IT systems. Among the many features they possess, they can be used to manage SLAs, including availability levels. One can assign to each IT component the availability level imposed by SLAs, and keep track of the actual availability levels to check for SLA compliance. However, to the best of our knowledge there is no support for the analytical calculation of the service availability.

Galileo [21], Coral [4], Relex [18] and BlockSim [3] are tools operating with Dynamic Fault Trees. Although integrating the A2thOS engines in one of these tools would be useful, this was not possible: Relex and BlockSim are commercial tools, Coral is mostly a MATLAB library without a GUI, and Galileo is free software, but not open source. For these reasons we developed our prototype as an independent Java/Prolog tool.

Availability in service composition. In the field of IT service composition, several approaches have been proposed that consider availability as one of the QoS parameters to optimise the performance of the resulting composite IT service. Gu et al. [10] propose QUEST, a framework to dynamically schedule a composite IT service while satisfying QoS requirements (e.g., response time and availability) imposed by SLAs. Zeng et al. [26], Yu et al. [24] and Ardagna et al. [1] propose scheduling techniques to create a cost-optimal execution plan for composite web services which respect QoS parameters (including availability) defined in SLA contracts.

In all these works, an estimation of the availability of the composite service is made by multiplying the availability levels of the components (expressed as real numbers in the interval [0,1]). This is possible thanks to two simplifying assumptions. First, all the components must be available at the same time for the system to operate (i.e., the system is an AND-combination of its components and it becomes unavailable the moment any of its components is unavailable). Secondly, the resulting availability is not a lower bound, i.e., there can be a run of the composite service in which the resulting availability is lower than the calculated one. Differently from these approaches, A2thOS is able to deal with a wider range of dependencies, namely combinations of AND and OR dependencies. In the sequel we also argue in more detail why OR dependencies are necessary to model complex IT services correctly. A2thOS also allows one to calculate an absolute lower bound for the availability, which can be safely included in an SLA contract.
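The gap between the product estimate and a guaranteed lower bound can be seen on a two-component example. The sketch below assumes a service that needs both components (an AND-combination) and compares the product with the worst case, in which the two downtimes never overlap. The availability values and function names are illustrative, and the closed-form worst case is a simplified stand-in for the linear programming analysis, valid only for a pure AND-combination.

```python
from math import prod


def product_estimate(availabilities):
    """Estimate used in service composition work: multiply component availabilities."""
    return prod(availabilities)


def worst_case_series(availabilities):
    """Guaranteed bound for an AND-combination: component downtimes never overlap."""
    return max(0.0, 1.0 - sum(1.0 - a for a in availabilities))


levels = [0.99, 0.98]
estimate = product_estimate(levels)  # approximately 0.9702
bound = worst_case_series(levels)    # approximately 0.97
print(bound < estimate)  # True
```

Here a run in which one component is down for 1% of the time and the other for a disjoint 2% of the time yields a service availability of about 0.97, below the product estimate of about 0.9702.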

3. Analysis of the minimal service availability

We now present the theoretical foundations of A2thOS. Let us first start with an intuitive explanation. We model the system using a dependency graph, in which a node represents a component of the system that at any given time may (or may not) be available. A directed edge from node m to node n indicates that m depends on n, i.e., that the availability of m also depends on the availability of n in a way that we are about to explain.

In a dependency graph, a node m can be unavailable because of an internal failure, or because (some) nodes it depends on are unavailable. To model internal failure, we associate a (virtual) internal node with each node m. On the other hand, to model the fact that m becomes unavailable because one or more nodes it depends on are unavailable, we consider nodes of two types: AND and OR.

Figure 2: Two simple dependency graphs, with (a) AND and (b) OR nodes

If m is a node in a dependency graph and $n_1, \dots, n_k$ are the nodes m depends on, we say that

• m is unavailable at time t iff its internal node is unavailable at time t or
  – at least one node in $n_1, \dots, n_k$ is unavailable at time t, in case m is an AND node,
  – $n_1, \dots, n_k$ are all unavailable at time t, in case m is an OR node.
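The intuition can be sketched as a recursive check. The code follows the reading used in the running example and in the discussion of hot standby: an AND node needs all of the nodes it depends on, while an OR node (redundancy) goes down only when all of them are down. The dictionary encoding, the function name and the assignment of applications to servers are assumptions made for illustration.

```python
def is_unavailable(node, graph, failed_internal):
    """True if `node` is unavailable, given the set of nodes with an internal failure."""
    kind, dependencies = graph[node]
    if node in failed_internal:  # the node's own (virtual) internal node is down
        return True
    if not dependencies:         # a leaf can only fail internally
        return False
    down = [is_unavailable(d, graph, failed_internal) for d in dependencies]
    return any(down) if kind == "AND" else all(down)


# Hypothetical fragment: node -> (type, nodes it depends on).
GRAPH = {
    "Service1": ("OR", ["App1", "App2"]),
    "App1": ("AND", ["Srv1"]),
    "App2": ("AND", ["Srv2"]),
    "Srv1": ("AND", []),
    "Srv2": ("AND", []),
}

print(is_unavailable("Service1", GRAPH, {"Srv1"}))          # False
print(is_unavailable("Service1", GRAPH, {"Srv1", "App2"}))  # True
```

With only Srv1 failed, App1 is down but App2 still serves Service1; once App2 also fails internally, both alternatives are gone and the service is unavailable.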

Formally,

Definition 3.1 (Dependency graph) A dependency graph ⟨N, E⟩ is a directed acyclic graph (DAG) where N is the set of nodes, partitioned into AND-N and OR-N, and E is the set of edges, E ⊆ {⟨u, v⟩ | u, v ∈ N}.

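Given a graph ⟨N, E⟩, we call $N_{\mathrm{int}}$ the set of the internal nodes of the graph: $N_{\mathrm{int}} = \{ n_{\mathrm{int}} \mid n_{\mathrm{int}} \text{ is the internal node of } n,\ n \in N \}$.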
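One possible in-memory representation of such a graph, with a check of the three structural requirements (disjoint AND and OR node sets, edges between known nodes, no cycles), is sketched below. The class and field names are hypothetical, and the acyclicity test uses Kahn's topological ordering, which is a standard technique rather than something prescribed here.

```python
from dataclasses import dataclass


@dataclass
class DependencyGraph:
    and_nodes: set
    or_nodes: set
    edges: set  # pairs (u, v) meaning "u depends on v"

    def validate(self):
        """Raise ValueError unless the structure is a well-formed dependency graph."""
        if self.and_nodes & self.or_nodes:
            raise ValueError("a node cannot be both an AND node and an OR node")
        nodes = self.and_nodes | self.or_nodes
        for u, v in self.edges:
            if u not in nodes or v not in nodes:
                raise ValueError(f"edge ({u}, {v}) uses an unknown node")
        # Kahn's algorithm: repeatedly remove nodes that nothing depends on.
        indegree = {n: 0 for n in nodes}
        for _, v in self.edges:
            indegree[v] += 1
        ready = [n for n, degree in indegree.items() if degree == 0]
        removed = 0
        while ready:
            u = ready.pop()
            removed += 1
            for a, b in self.edges:
                if a == u:
                    indegree[b] -= 1
                    if indegree[b] == 0:
                        ready.append(b)
        if removed != len(nodes):
            raise ValueError("the graph contains a cycle")


graph = DependencyGraph(
    and_nodes={"App1", "App2"},
    or_nodes={"Service1"},
    edges={("Service1", "App1"), ("Service1", "App2")},
)
graph.validate()  # passes silently for a well-formed graph
```

Storing edges as a set of pairs mirrors the definition of E directly; an adjacency list would be a better fit for large graphs.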

Running example - Part 1. In this example we analyse the availability of an IT system providing two IT services (Service1 and Service2), and implemented by means of three applications (App1, App2 and App3) running on five different servers (Srv1, Srv2, Srv3, Srv4, Srv5). Service1 is implemented by App1 and App2 in such a way that the service goes off-line only when both applications are off-line (OR dependency). Service2 is
