Voice over IP performance monitoring

doi:10.1145/505666.505669

VOICE OVER IP PERFORMANCE MONITORING

COLE, R. G. AND ROSENBLUTH, J. H.

AT&T LABORATORIES

MIDDLETOWN, NJ

{rgcole, jrosenbluth}@att.com

ABSTRACT

We describe a method for monitoring Voice over

IP (VoIP) applications based upon a reduction of the

ITU-T's E-Model to transport level, measurable

quantities. In the process, 1) we identify the relevant

transport level quantities, 2) we discuss the tradeoffs

between placing the monitors within the VoIP

gateways versus placement of the monitors within the

transport path, and 3) we identify several areas where

further work and consensus within the industry are

required. We discover that the relevant transport level

quantities are the delay, network packet loss and the

decoder's de-jitter buffer packet loss. We find that an

in-path monitor requires the definition of a reference

de-jitter buffer implementation to estimate voice

quality based upon observed transport measurements.

Finally, we suggest that more studies are required,

which evaluate the quality of various VoIP codecs in

the presence of representative packet loss patterns.

1 INTRODUCTION

There is great interest in supporting voice

applications over both the public Internet and private

intra-nets, i.e., Voice over IP (VoIP). Several popular

Internet implementations are the Video Audio Tool

(VAT) [1] and the Robust Audio Tool (RAT) [2], as

well as a host of ITU-T H.323 implementations. An

important aspect of VoIP is developing a performance

monitoring capability to track the quality of the voice

transport. In this paper, we discuss one approach to

monitoring the performance of conversational voice

applications over Internet transport. Specifically, we

investigate the use of the ITU-T's E-Model [3] as a tool

to relate several transport level metrics to an estimate

of conversational voice quality. To accomplish this,

we analyze the reduction of the existing E-Model in

terms of transport-level metrics for the purpose of

monitoring conversational voice quality. In the

process, we discuss the advantages and shortcomings

of our approach and identify a set of issues which we

believe need to be addressed within the open

literature.

The ITU-T's E-Model is a network planning tool

used in the design of hybrid circuit-switched and

packet-switched networks for carrying high quality

voice applications. The tool estimates the relative

impairments to voice quality when comparing

different network equipment and network designs.

The tool provides a means to estimate the subjective

Mean Opinion Score (MOS) rating of voice quality

over these planned network environments. We

describe the E-Model in more detail in Section 3

below.

The specific method we advocate is to:

1) Measure the low-level transport metrics (characterizing the channel) which impact voice performance, i.e., delay, delay variation and packet loss,

2) Combine the packet loss and delay variation measurements, de-jitter buffer operations, packet size and coder frame size into an error mask (the exact sequence of good and bad coder frames) that can be characterized in a simple manner (e.g., average frame loss rate along with some measure of burstiness),

3) Combine the characterized error mask with the coder and its frame-loss concealment algorithm via a look-up table (or curve fit) based on subjective testing to produce an E-Model equipment impairment factor (Ief), and

4) Combine the Ief with other E-Model low-level measurable elements, i.e., delay and echo, to produce a predicted opinion score on the quality of the voice conversation.

ACM SIGCOMM 9 Computer Communication Review

We illustrate this measurement and data reduction

methodology in Figure 1 below. In this figure we

capture the channel characteristics via a set of

transport level measurements, e.g., packet loss and

delay distributions. We combine this with various

architectural characteristics of the VoIP gateways,

specifically the de-jitter buffer implementations, the

transport packet size and the codec frame size.

[Figure 1: A measurement and data reduction methodology for VoIP quality monitoring, which highlights the equipment impairment factor elements. Channel Characteristics (Packet Loss Distribution, Delay Jitter Distribution) and Architecture Choices (De-jitter Buffer, Packet Size, Codec Frame Size) combine into the Frame Erasure Distribution (Error Mask); the Error Mask, together with the Codec and its Loss Concealment Algorithm, yields the Equipment Impairment Factor, which leads to the Opinion score.]

The result of combining the channel characterization and the architectural characterization of the gateways is a Frame Erasure Distribution (or Error Mask). The Error Mask characterizes the salient features of the loss distribution as observed by the decoder. (Note: This loss distribution captures both the transport packet loss and the loss in the decoder's de-jitter buffer (1) due to late packet arrivals.) When the Error Mask is combined with the specific loss concealment algorithm implemented within the codec, we generate an Equipment Impairment Factor, which captures the expected impairment of the codec under the above conditions. From this, the E-Model provides a means to estimate a quality score for the conversational voice application. We discuss the methodology in detail in Section 4 below.

(1) The decoder must intentionally delay the variably delayed arriving voice packets in its de-jitter buffer in order to reconstruct a synchronous bit stream. In some cases this de-jitter buffer delay is not large enough to absorb the transport delay variation. This results in de-jitter buffer losses as observed by the decoder.
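The Error Mask construction described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions of our own choosing: one codec frame per packet, and a fixed playout deadline (a packet whose one-way delay exceeds the deadline is treated as a de-jitter buffer loss). The function names, the 80 ms deadline and the sample delays are hypothetical and are not taken from any gateway implementation.

```python
from typing import List, Optional, Tuple


def error_mask(delays_ms: List[Optional[float]], playout_delay_ms: float) -> List[bool]:
    """Return True for every frame the decoder cannot play out.

    A delay of None marks a packet lost in the network; a delay above the
    (assumed fixed) playout deadline marks a de-jitter buffer loss.
    """
    return [d is None or d > playout_delay_ms for d in delays_ms]


def summarize(mask: List[bool]) -> Tuple[float, float]:
    """Return (average frame loss rate, mean length of consecutive-loss bursts)."""
    bursts = []
    run = 0
    for bad in mask:
        if bad:
            run += 1
        elif run:
            bursts.append(run)
            run = 0
    if run:
        bursts.append(run)
    loss_rate = sum(mask) / len(mask) if mask else 0.0
    mean_burst = sum(bursts) / len(bursts) if bursts else 0.0
    return loss_rate, mean_burst


# Hypothetical one-way delays in milliseconds; None is a network loss.
delays = [40.0, 42.0, None, 95.0, 41.0, 120.0, 130.0, 39.0]
mask = error_mask(delays, playout_delay_ms=80.0)
rate, burst = summarize(mask)
print(mask, rate, burst)
```

The loss rate and mean burst length are one possible "simple characterization" of the mask; other burstiness measures could be substituted without changing the structure of the sketch.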

Following this methodology, it is possible to build a relatively simple VoIP performance monitoring capability. It is a further goal of this paper to help foster an industry consensus around the specific methodology to follow in developing such a VoIP performance monitoring capability. Only then would it be possible to obtain consistent quality estimates from this type of VoIP quality monitor.


The remainder of this paper is organized as follows: We next discuss the relationship of the E-Model approach we are advocating to other methods of monitoring VoIP performance. In Section 3, we present an overview of the ITU-T's E-Model. Section 4 covers our efforts to reduce the E-Model's formulae to transport-level, directly measurable quantities in an unambiguous fashion. Section 5 discusses several issues with this reduction and, in particular, discusses the issue of identifying a 'reference model' for performance monitoring. We follow this with a discussion of measurement methodologies and report on some field-work we have performed with this approach. We finish the paper with a section on our conclusions.

2 RELATIONSHIP TO OTHER METHODS

We know of a few commercial products that implement monitoring approaches similar to the one we advocate within this paper. Also, the approach we advocate is not the only approach to monitoring the quality of VoIP applications. Other approaches range from "objective models of quality", involving the injection of sample speech segments across the network, to simple packet level measurements. In this section we discuss these alternatives.

We have run across several commercial products,

which monitor VoIP quality in a fashion similar to the

approach we advocate. These include the Cisco voice

dial control MIB [4] and Telchemy's monitoring

software [5]. The information on these products

refers to the E-Model in the description of their

approach. However, both of these products appear to

rely on information extracted from within the voice

gateways. As such, they require implementation

within the VoIP gateways themselves. In this paper,


we discuss a generalized approach, which is

independent of the implementation location of the

monitors. As such, we attempt to highlight the

relative advantages of each approach.

An alternative method, commonly in use, requires

the injection of sample speech segments across a voice

transport path, i.e., the coder-to-network-to-decoder

path. This method is often referred to as "objective

models of quality". The models compare the output

speech to the input speech, using psycho-acoustic

fundamentals, to produce opinion without reference

to the underlying channel conditions. However, it is

still necessary to overlay the low-level transport

measurements of delay and echo on top of this, using

the E-Model, in order to capture the conversational

speech impairments due to delay. The ITU-T has

standardized one such model [6] and continues to

investigate improved models [7].

The advantages of objective models of quality are:

1) no knowledge of, or assumptions about, the

underlying network is required (coder, de-jitter buffer,

error mask, packet size), and 2) predicted opinion is

based on fundamental psycho-acoustics rather than an

interpolation of subjective testing results as with the

E-Model, and thus the results may be more accurate

and more robust. With regard to accuracy, the E-

Model was intended to be used as a network planning

tool, not a network maintenance tool. As such, it only

needs to be accurate enough to distinguish between

broad ranges of quality (see Table 1). On the other

hand, objective models of quality can often distinguish

between quality levels within a broad range. With

regard to robustness, the E-Model cannot predict

opinion for conditions that have not been previously

scored by subjective panels. On the other hand,

objective models of quality can predict opinion for

such conditions, although their accuracy in such cases

is not always good.

The disadvantages of objective models are: 1) they

are complex and costly, 2) there are some conditions

for which they are known not to be accurate (e.g.,

temporal clipping), 3) they are intrusive whereas the E-

Model can be implemented as either intrusive or non-

intrusive, and 4) they reveal nothing about the

underlying cause of quality problems. Because we

make low-level measurements with the E-Model, we

have causality information.

Another method is to rely on direct packet level

measurements or straightforward combinations of

packet level measurements. Thresholds are then

defined as to when the quality of voice conversations

would degrade to a critical point. The advantage of

this approach is that it is relatively simple to

implement. Its disadvantage is that the thresholds it

relies upon are somewhat arbitrarily chosen. Further,

this approach does not attempt to combine the

transport metrics in a meaningful way with respect to

voice quality.

3 THE E-MODEL: AN END-TO-END VOICE QUALITY MODEL

The E-Model, defined in the ITU-T Rec. G.107

[3] as well as other associated ITU-T

recommendations [8], is an analytic model of voice

quality used for network planning purposes.

Specifically, the E-Model presents a method for

estimating the relative voice quality when comparing

two reference connections [3]. According to [3], the

E-Model has proven useful as a transmission planning

tool; however, further studies are underway to address

the assumptions of the E-Model under specific

parameter combinations. For a fuller discussion of the

validity of the E-Model, refer to [3].

A basic result of the E-Model is the calculation of

the R-factor, which is a simple measure of voice

quality ranging from a best case of 100 to a worst case

of 0. The R-factor uniquely determines the familiar

Mean Opinion Score (MOS), which is the arithmetic

average of opinion when "excellent" quality is given a

score of 5, "good" a 4, "fair" a 3, "poor" a 2, and "bad"

a 1. The R-factor is defined in terms of several

parameters associated with a voice channel across a

mixed Switched Circuit Network (SCN) and a Packet

Switched Network (PSN). The parameters included in

the computation of the R-factor are fairly extensive

covering such factors as echo, background noise,

signal loss, codec impairments, and others. An

excellent discussion of the E-Model is found in [9].

The R-factor is related to the MOS through the

following set of expressions:

\[ \mathrm{MOS} = \begin{cases} 1, & R < 0 \\ 1 + 0.035\,R + 7 \times 10^{-6}\, R\,(R - 60)(100 - R), & 0 < R < 100 \\ 4.5, & R > 100 \end{cases} \]

(Equation 1)

For reference, Eq.(1) is plotted in Figure 2.
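As a quick check on Eq. (1), the mapping from R-factor to MOS can be written as a small Python function. This is a sketch only; the function name `r_to_mos` is ours, and we fold the boundary points R = 0 and R = 100 into the constant branches, which is harmless because the polynomial evaluates to 1 and 4.5 at those points.

```python
def r_to_mos(r: float) -> float:
    """Map an E-Model R-factor to an estimated MOS following Eq. (1)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r)


# Evaluate the curve at a few R-factor values.
for r in (50, 60, 70, 80, 90):
    print(r, round(r_to_mos(r), 2))
```

The values printed for R = 50 through R = 90 can be compared against the MOS boundaries listed in Table 1.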


[Figure 2: A plot showing the relationship between the R-factor (horizontal axis, 0 to 100) and the MOS (see Eq. (1)).]

Typically, the values of the R-factor are categorized as

shown in Table 1 below. Here we see that

connections with R-factors of less than 60 are

expected to provide a 'poor' quality of service to users.

Table 1: R-factors, quality ratings and the associated MOS.

R-factor        Quality of voice rating    MOS
90 < R < 100    Best                       4.34 - 4.5
80 < R < 90     High                       4.03 - 4.34
70 < R < 80     Medium                     3.60 - 4.03
60 < R < 70     Low                        3.10 - 3.60
50 < R < 60     Poor                       2.58 - 3.10
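A monitor would typically report the category of Table 1 alongside the raw R-factor. The following Python sketch shows one way to do the lookup. Table 1 uses strict inequalities and so leaves the boundary values unassigned; treating each lower bound as inclusive is our assumption, as is returning None for R-factors below the tabulated range. The name `quality_band` is hypothetical.

```python
from typing import Optional

# (lower bound, rating) pairs from Table 1, highest band first.
BANDS = [(90, "Best"), (80, "High"), (70, "Medium"), (60, "Low"), (50, "Poor")]


def quality_band(r: float) -> Optional[str]:
    """Return the Table 1 rating for an R-factor, or None below the table."""
    for lower, label in BANDS:
        if r >= lower:  # assumption: lower bounds are inclusive
            return label
    return None


print(quality_band(85), quality_band(62), quality_band(45))
```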

The R-factor is expressed as the sum of four

terms:

\[ R = 100 - I_s - I_d - I_{ef} + A \]

(Equation 2)

where Is is the signal-to-noise impairment associated with typical SCN paths, Id is the impairment associated with the mouth-to-ear delay of the path, Ief is an equipment impairment factor associated with the losses within the gateway codecs and A is the Expectation Factor. An interesting aspect of the E-Model is that these terms, i.e., Is, Id, and Ief, are additive and further, that the delay and packet loss contributions are isolated into Id and Ief, respectively. This does not imply that delay and packet loss are uncorrelated in the underlying transport media, but only that their contributions to the estimated impairments are separable.

The Expectation Factor covers those intangible

quantities that are difficult (or impossible) to quantify.

This term accounts for lowered customer expectations

of quality because of, e.g., a cell phone user's tendency

to tolerate lower quality in exchange for the

convenience afforded by mobility, or in exchange for a

lower price. For the most part it is difficult to estimate

the Expectation Factor, although there appears to be

some agreement that an Expectation Factor of around

10 for a cellular network is appropriate [10]. However,

no such agreement has been reached for the case of

lower prices as expected with some VoIP services.

For this reason, we will drop the Expectation Factor

from our future discussions of the R-factor.

The signal-to-noise impairment factor, Is, is a

function of several parameters, none of which are a

function of the underlying packet transport. However,

the ITU-T Rec. G.107 [3] recommends a set of default

values for these parameters for planning purposes.

Because this is not the focus of our discussion, and is

dependent upon the method to access the VoIP

network, we will rely upon the default

recommendations for all but a few parameters, e.g., all

except for the delay and packet loss parameters. For

example, it is sufficient for our purposes to assume

that echo cancelers are present and working properly

(no echo). Choosing these default values, we can

reduce the expression for the R-factor [3] to:

\[ R = 94.2 - I_d - I_{ef} \]

(Equation 3)

Not only have we chosen the default values for the

various SCN signal impairments, but we have also

dropped reference to the Expectation Factor.

The delay components within the function Id are:

1) Ta, the average, absolute one-way mouth-to-ear

delay, 2) T, the average, one-way delay from the receive

side to the point in the end-to-end path where a signal

coupling occurs as a source of echo, and 3) Tr, the

average, round trip delay in the four-wire loop. Note

that Ta, T, and Tr represent various measures of delay

from different points within a general reference


connection. G.107 gives a fully analytical expression

for the function

Id,

in terms of Ta, T, Tr, and

parameters associated with a general reference

connection describing various circuit switched and

packet-switched interworking scenarios.

Table 2: Values of the delay impairment for selected one-way delay values. (Note: The one-way delay is defined as mouth-to-ear delay.)

One-way delay (msec)    Id
25                      0.9
50                      1.5
75                      2.1
100                     2.6
125                     3.1
150                     3.7
175                     5.0
200                     7.4
225                     10.6
250                     14.1
275                     17.4
300                     20.6
325                     23.5
350                     26.2
375                     28.7
400                     31.0

Since the focus of this paper is on the

development of an IP-based monitoring system, we

choose to simplify the expression for Id in three ways

(and hence focus our discussion on IP-based transport

and VoIP gateway issues). First, for the cases we are

interested in, i.e., VoIP with no circuit switched

network interworking, the various measurement points

for the delay measures collapse into a single pair of

points, such that

\[ d = T_a = T = T_r / 2 \]

(Equation 4)

and that Id(d) is now a function only of the single delay

measurement d. Second, we choose to use the default

values listed in [3] for all terms in the Id expression

other than Ta, T and Tr. Third, we plot out the delay

component and then fit the resulting curve to a simple

expression for discussion purposes. For reference, the

full expression for Id, assuming only the default values

listed in G.107, could be used for our purposes instead

of our simplified expression derived below. But, we

find it much more convenient for discussion and

modeling purposes to use our simplified expression

below. Table 2 above gives the values for the delay

component for selected values of the one way delay

[11].

In Figure 3, we plot these values and find that Id

has two roughly linear regions. A knee in the curve

occurs at a delay of 177 msec. For one way delays less

than 177 msec, conversations occur naturally, whereas

at delays in excess of 177 msec conversations begin to

strain and break down, often degenerating into simplex

communications at high delay values.

[Figure 3: A plot of Id as a function of one-way delay (ms), along with a simple fit. Legend: Delay component; Estimate.]

Also on this plot, we fit the values of Id to the

expression:

\[ I_d = 0.024\,d + 0.11\,(d - 177.3)\,H(d - 177.3) \]

(Equation 5)

Here d is the one-way delay (in milliseconds) and H(x)

is the Heaviside (or step) function:

\[ H(x) = \begin{cases} 0, & x < 0 \\ 1, & x \ge 0 \end{cases} \]

(Equation 6)
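To see how closely the simplified expression of Eq. (5) tracks the tabulated values, the fit can be evaluated against Table 2 directly. The Python sketch below does this; the names `delay_impairment` and `TABLE_2` are ours, and the dictionary simply transcribes Table 2.

```python
# Table 2: one-way delay (msec) -> tabulated delay impairment Id.
TABLE_2 = {
    25: 0.9, 50: 1.5, 75: 2.1, 100: 2.6, 125: 3.1, 150: 3.7,
    175: 5.0, 200: 7.4, 225: 10.6, 250: 14.1, 275: 17.4, 300: 20.6,
    325: 23.5, 350: 26.2, 375: 28.7, 400: 31.0,
}


def delay_impairment(d_ms: float) -> float:
    """Evaluate the two-segment fit of Eq. (5) for a one-way delay in msec."""
    step = 1.0 if d_ms - 177.3 >= 0 else 0.0  # Heaviside function of Eq. (6)
    return 0.024 * d_ms + 0.11 * (d_ms - 177.3) * step


# Print the fit next to the tabulated value and find the largest deviation.
errors = {d: delay_impairment(d) - tabulated for d, tabulated in TABLE_2.items()}
for d, tabulated in TABLE_2.items():
    print(f"{d:4d} ms  table={tabulated:5.1f}  fit={delay_impairment(d):6.2f}")
worst_delay = max(errors, key=lambda d: abs(errors[d]))
print("largest deviation at", worst_delay, "ms:", round(errors[worst_delay], 2))
```

For example, at 200 msec the fit gives about 7.3 against a tabulated 7.4.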

We can now express the R-factor in the form:

\[ R \approx 94.2 - 0.024\,d - 0.11\,(d - 177.3)\,H(d - 177.3) - I_{ef} \]

(Equation 7)
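Equation (7) reduces the R-factor to two inputs: the one-way delay d and the equipment impairment Ief. A minimal Python sketch follows. The function name `r_factor` is ours, and the Ief value of 11 used in the example is purely illustrative; it is not a measurement reported here.

```python
def r_factor(d_ms: float, ief: float) -> float:
    """Evaluate Eq. (7): R from one-way delay (msec) and equipment impairment."""
    step = 1.0 if d_ms - 177.3 >= 0 else 0.0  # Heaviside function of Eq. (6)
    i_d = 0.024 * d_ms + 0.11 * (d_ms - 177.3) * step  # delay impairment, Eq. (5)
    return 94.2 - i_d - ief


# Illustrative inputs only: 150 msec one-way delay and an assumed Ief of 11.
print(round(r_factor(150, 11), 1))  # 94.2 - 3.6 - 11 = 79.6

# The same codec impairment at 250 msec, past the knee at 177.3 msec.
print(round(r_factor(250, 11), 1))
```

Because the delay and equipment terms are additive, the same function shows directly how much of the R-factor budget a given delay consumes before any codec or loss impairment is applied.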

All that remains is to find suitable estimates for

the equipment impairment factors. Currently, no

analytic expressions exist for the equipment
