
Blurring Snapshots: Temporal Inference of Missing and Uncertain Data
TR-UTEDGE-2009-005
Vasanth Rajamani, The University of Texas at Austin
Christine Julien, The University of Texas at Austin
© Copyright 2009
The University of Texas at Austin

Blurring Snapshots: Temporal Inference of Missing and Uncertain Data
Vasanth Rajamani and Christine Julien
Department of Electrical and Computer Engineering
The University of Texas at Austin
{vasanthrajamani,c.julien}@mail.utexas.edu
Abstract—Many pervasive computing applications continuously monitor state changes in the environment by acquiring, interpreting and responding to information from sensors embedded in the environment. However, it is extremely difficult and expensive to obtain a continuous, complete, and consistent picture of a continuously evolving operating environment. One standard technique to mitigate this problem is to employ mathematical models that compute missing data from sampled observations, thereby approximating a continuous and complete stream of information. However, existing models have traditionally not incorporated a notion of temporal validity, or the quantification of imprecision associated with inferring data values from past or future observations. In this paper, we support continuous monitoring of dynamic pervasive computing phenomena through the use of a series of snapshot queries. We define a decay function and a set of inference approaches to filling in missing and uncertain data in this continuous query. We evaluate the usefulness of this abstraction in its application to complex spatio-temporal pattern queries in pervasive computing networks.
Keywords—sensor networks, queries, dynamics, interpolation
I. INTRODUCTION
As applications place an increased focus on using distributed embedded networks to monitor both physical and network phenomena, it becomes necessary to support efficient and robust continuous monitoring that can communicate the uncertainty associated with data collected from a dynamic network. The emergence of pervasive computing is characterized by increased instrumentation of the physical world, including small sensing devices that allow applications to query a local area using a dynamic and distributed network for support. On the roadways, all vehicles may be equipped with devices that sense and share location, and that information can be queried by other nearby vehicles to understand traffic flow patterns. On an intelligent construction site, workers, equipment, assets, and even parts of buildings may be equipped with sensors to measure location, temperature, humidity, stress, etc., with the goal of generating meaningful pictures of the project's progress and maintaining safe working conditions.
Central to these and other applications is the ability to monitor some condition and its evolution over a period of time. On a construction site, the amount of an available material at a particular time may be useful, but it may be just as useful to monitor how that material is consumed (and resupplied) over time. Such trends are usually measured through continuous queries that are often registered at the remote information sources and periodically push sensed data back to the consumers [2], [9]. Such a "push" approach to continuous query processing requires maintaining a distributed data structure, which can be costly in dynamic settings. In addition, it often requires that a query issuer interact with a collector that is known in advance and reachable at any instant, which is often unreasonable. We have demonstrated that, in dynamic networks, it often makes sense to generate a continuous query using a sequence of snapshot queries [18]. A snapshot query is distributed through the network at a particular point in time, takes measurements of the target phenomenon, and sends the results back to the query issuer. In our model (Section II), a continuous query is the integration over time across a sequence of snapshot queries.
In generating a continuous and accurate reflection of an evolving environment, uncertainty is introduced in several ways [15], [16]. First, there is a significant tradeoff between the cost of generating the continuous query result and the quality of the result. For instance, the more frequently the snapshot queries execute, the more closely the continuous query reflects the ground truth, but the more expensive it is to execute in terms of communication bandwidth and battery power. In addition, the snapshot queries can be executed using different protocols that consider the same tradeoff (e.g., consider the differences in quality and cost of a query flooded to all hosts in the network and one probabilistically gossiped to some subset). On a more fundamental level, the quality of any interaction with a dynamic network is inherently affected by the unreliability of the network: packets may be dropped or corrupted, and communication links may break. The fact that a continuous query fails to sense a value at a particular instant may simply be a reflection of this inherent uncertainty.
Even when these uncertainties weaken a continuous query, applications can still benefit if the query processing can provide some knowledge about the degree of the uncertainty. For example, in a continuous query on a construction site for the amount of available material, it would be useful to know that, with some degree of certainty (i.e., a confidence), there is a given amount of available material. This may be based on information collected directly from the environment (in which case the confidence is quite high), historical trends, or knowledge about the nature of the phenomenon. Model-driven approaches that estimate missing data using mathematical models can alleviate these uncertainties [6], [7]. In these approaches, the goal is to build a model of the phenomenon being observed and to query the network to rebuild the model only when the confidence in the model has degraded so much that relying on it becomes unacceptable. Section VII examines these approaches and their relationship to our work in more detail.
Because we build a continuous query from a sequence of snapshot queries, handling uncertainty is twofold. First, we must be able to provide estimates of the continuous query result between adjacent snapshot queries. Second, even if we fail to sample a data point in a given snapshot, we may have some information about that data point at a previous time (and potentially a future time) that we may use to infer something about the missing data. In both cases, we are not actually changing the amount of information available to the application; instead we are blurring the snapshot queries and associating a level of confidence with inferred results.
Our approach relies on a simple abstraction called a decay function (Section III) that quantifies the temporal validity associated with sensing a particular phenomenon. We use this decay function as the basis for performing model-assisted inference (Section IV) that uses sampled data values from the snapshot queries to infer values into the past and future. This inference can allow us to fill in gaps in the sequence of snapshot queries to enable trend analysis on the components of the continuous query. The inference and its associated confidence can also provide the application a concrete sense of the degree of the uncertainty. Finally, by smoothing across the available data, this inference makes the information that is available more viewable and understandable by the application and its user. We examine these benefits in Sections V and VI.
Our novel contributions are threefold. First, we introduce decay functions that allow applications to define temporal validity in a principled way. Second, we build a set of simple statistical models that allow us to effectively blur snapshot queries into continuous queries, and we use them to study the use of model-assisted inference for a variety of different types of dynamic phenomena. Finally, we demonstrate, through an implementation, an evaluation, and a set of usage scenarios, the efficacy and usefulness of using inference to fill in missing data in real world situations. If the network supporting data collection is highly dynamic, our approaches help mitigate the impact of the dynamics on the inherent uncertainty; however, even in less dynamic situations, our approach helps applications reasonably trade off the cost of executing continuous queries for the quality of the result.
II. BACKGROUND
This paper builds on our previous approaches defining snapshot and continuous query fidelity and an associated middleware [15], [18]. These approaches approximate a continuous query using a sequence of snapshot queries evaluated over the network at discrete times. We model a dynamic pervasive computing network as a closed system of hosts, where each host has a location and a data value (though a single data value may represent a collection of values). A host is represented as a triple (ι, ζ, ν), where ι is the host's identifier, ζ is its context, and ν is its data value. The context can be simply a host's location, but it can be extended to include a list of neighbors, routing tables, and other system or network information.
The global state of a network, a configuration (C), is a set of host tuples. Given a host h in a configuration, an effective configuration (E) is the projection of the configuration with respect to the hosts reachable from h. Practically, h is a host initiating a query, and E contains the hosts expected to receive and respond to the query. To capture connectivity, we define a binary logical connectivity relation, K, to express the ability of a host to communicate with a neighboring host. Using the values of the host triple, we can derive physical and logical connectivity relations. As one example, if the host's context, ζ, includes the host's location, we can define a physical connectivity relation based on communication range. K is not necessarily symmetric; in the cases where it is symmetric, K specifies bi-directional communication.
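To make these definitions concrete, the following Python sketch encodes the host triple, a range-based instance of K, and the effective configuration as the set of hosts reachable from h. The names and the fixed communication range are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of the network model, assuming 2-D locations as context.
from dataclasses import dataclass

@dataclass(frozen=True)
class Host:
    iota: str    # identifier
    zeta: tuple  # context, e.g., an (x, y) location
    nu: float    # data value

def k(a: Host, b: Host, comm_range: float = 10.0) -> bool:
    """A physical connectivity relation K derived from location context."""
    (ax, ay), (bx, by) = a.zeta, b.zeta
    return a != b and ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 <= comm_range

def effective_configuration(c: frozenset, h: Host) -> frozenset:
    """E: the projection of configuration C onto the hosts reachable from h."""
    reachable, frontier = {h}, [h]
    while frontier:
        cur = frontier.pop()
        for other in c:
            if other not in reachable and k(cur, other):
                reachable.add(other)
                frontier.append(other)
    return frozenset(reachable)
```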
The environment evolves as the network changes, values change, and hosts exchange messages. We model network evolution as a state transition system where the state space is the set of possible configurations, and transitions are configuration changes. A single configuration change consists of one of the following: 1) a neighbor change: changes in hosts' states impact the connectivity relation, K; 2) a value change: a single host changes its stored data value; or 3) a message exchange: a host sends a message that is received by one or more neighboring nodes. To refer to the connectivity relation for a particular configuration, we assign configurations subscripts (e.g., C_0, C_1, etc.) and use K_i to refer to the connectivity of configuration C_i. We have also extended K to define query reachability. Informally, this determines whether it was possible to deliver a one-time query to and receive a response from some host h within the sequence of configurations [17].
A snapshot query's result (ρ) is a subset of a configuration: it is a collection of host tuples that constitute responses to the query. No host in the network is represented more than once in ρ, though it is possible that a host is not represented at all (e.g., because it was never reachable from the query issuer). Depending on both the protocol used to execute the snapshot query (e.g., whether the query was flooded to all hosts in the network or whether it was gossiped) and inherent network failures, only a subset of the reachable hosts may respond. This results in missing and uncertain data in the results of snapshot queries, which may result in a degradation in the quality of and confidence in the continuous query's result. A minimal sketch of such a result appears below.
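Continuing the sketch above, a snapshot result ρ can be modeled as a random subset of the effective configuration; the response probability is purely illustrative and stands in for protocol choice and network failures.

```python
import random

def snapshot_result(effective: frozenset, p_respond: float = 0.7) -> frozenset:
    """A snapshot result rho: each reachable host responds with some
    probability, standing in for protocol behavior and network failures.
    No host appears more than once because the result is a set."""
    return frozenset(h for h in effective if random.random() < p_respond)
```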
III. MODELING UNCERTAINTY
Our approach to query processing allows users to pose continuous queries to an evolving network and receive a result that resembles a data stream even though it is obtained using discrete snapshot queries. This stream can then be analyzed to evaluate trends in the sensed data. However, missing and uncertain sensed items can be a bane to this process, especially in monitoring the evolution of the data. For example, on a construction site, a site supervisor may use a continuous query to monitor the total number of available bricks on the site. This query may be accomplished by associating a sensor with each pallet of bricks; the snapshot queries collect the identity of the pallets and the number of bricks each pallet holds. If consecutive snapshot queries do not sample the same subset of pallets, the sums they report are not comparable, resulting in inconsistent information supplied to the site supervisor.
Consider the continuous query in Fig. 1. The three networks on the left of the dark line show the results of the continuous query's first three snapshot queries. Each circle represents a host; a circle's color represents the host's data value; and lines represent connectivity. Throughout the continuous query, some hosts depart, some arrive, and others change their data values. In this case, the trend the application is analyzing is the set of data items that remain available and unchanged throughout the continuous query. When our snapshot queries are not impacted by any missing or uncertain data, the stable set the trend analysis generates is the actual stable set.
!""#"$%&'('$
)*(*$+*,-#$./*01#'$
2
3
$ 2
4
$ 2
5
$
60*7'/&($8-#9:#'$
6(*;,#$6#($
Fig. 1: A Continuous Query
Consider, however, what happens when data is missing or uncertain, as depicted in Fig. 2. In this situation, the ground truth (i.e., what the snapshot queries should have returned) is equivalent to that shown in Fig. 1, but due to network dynamics or other sources of uncertainty, the sample from host A was not collected in the second snapshot query (ρ_1), and the sample from host B was not collected in the third snapshot query (ρ_2). Consequently, the result of the trend analysis in Fig. 2 is quite different from that in Fig. 1. On a construction site, if the data items represent pallets of bricks, this trend analysis may cause the site supervisor to have additional supplies delivered when it is unnecessary or even impractical.
!""#"$%&'('$
)*(*$+*,-#$./*01#'$
2
3
$ 2
4
$ 2
5
$
60*7'/&($8-#9:#'$
6(*;,#$6#($
Fig. 2: A Continuous Query with Missing Data
One way to handle this uncertainty is to blur the snapshot queries. In Fig. 2, given the fact that we know the network to be dynamic, we can say with some confidence that host A should have been represented in ρ_1; the level of this confidence depends on the temporal validity of the phenomenon sensed (i.e., how long we expect a data value to remain valid), the frequency with which the snapshot queries are issued, and the degree of network dynamics. The fact that A "reappeared" in ρ_2 further increases our confidence that it may have, in fact, been present in ρ_1 as well. Fig. 3 shows a simple example of how this inference can be used to project data values into future snapshots (e.g., from ρ_1 to ρ_2) and into past snapshots (e.g., from ρ_1 to ρ_0). In this figure, the black circles represent hosts the snapshot query directly sampled; gray circles represent hosts for which data values have been inferred. The question that remains, however, is how to determine both the values that should be associated with the inferred results and the confidence we have in their correctness. We deal with the former concern in the next section; here we introduce decay functions to ascribe temporal validity to observations and calculate confidence in unsampled (inferred) values.
!"#$%%$&'()*+*' ,-./0$&'()*+*'
1
2
' 1
3
'1
4
'
Fig. 3: Projection Forward and Backwards in Time
To address temporal validity, we rely on the intuitive observation that the closer in time an inferred value is to a sensed sample, the more likely it is to be a correct inference. For example, in Fig. 3, the value projected from ρ_0 to ρ_1 is more likely to be correct than the value projected from ρ_0 to ρ_2. If the sample missing in ρ_1 is also missing in ρ_2, it becomes increasingly likely that the host generating the sample has, in fact, departed. We exploit this observation by allowing applications to specify the temporal validity of different sensed phenomena using a decay function that defines the validity of a measured observation as a function of time.
Formally, a decay function is a function d(t) = f(|t - t_l|), where t is the current time and t_l is the time of the nearest (in time) actual sample of the data value, in either the past or the future. The period |t - t_l| is the period of uncertainty; the larger the period of uncertainty, the less likely it is that the sampled value retains any correlation with the actual value. The decay function's value falls between 0 and 1; it is a measure of percentage likelihood. These decay functions are an intuitive representation of confidence and are easy for application developers to grasp. It is also straightforward to define decay functions to describe a variety of phenomena. For instance, on a construction site, a moving truck's GPS location might be associated with a decay function of the form d(t) = e^(-|t - t_l|), which is a rapid exponential drop in confidence over time. On the other hand, a GPS unit mounted on a stationary sensor on the site might have a decay function of the form d(t) = 1 because the location value, once measured, is not expected to change. Possibilities for formulating decay functions are numerous and depend on the nature of the phenomenon being sensed and the sensing environment.

Given a user-defined decay function, it is straightforward to determine a confidence measure for an inferred value. We measure this confidence probabilistically. At any time instant t, the inferred data value's degree of confidence, p, is updated using the following rule:
- if time t is the time at which an actual data reading was acquired, then the value of p at time t is set to 1;
- otherwise, p is updated using the formula p_t = d(t).

Thus, at every point in time, a data value of interest has an associated confidence that ranges from one to zero depending on when it was last sampled. The further in time the inferred value is from an actual sensed value, the less confidence it has. The sketch below illustrates this computation. With this understanding, we look next at how to estimate how a sampled value may have changed during periods where it is not sampled, allowing us to infer its value.
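As a concrete illustration, here is a minimal Python sketch of the exponential decay example and the confidence-update rule; the function names and the sample schedule are illustrative assumptions.

```python
import math

def exponential_decay(t: float, t_l: float) -> float:
    """d(t) = e^(-|t - t_l|): rapid confidence drop, e.g., a moving truck's GPS."""
    return math.exp(-abs(t - t_l))

def constant_decay(t: float, t_l: float) -> float:
    """d(t) = 1: a value that, once measured, is not expected to change."""
    return 1.0

def confidence(t: float, sample_times: list, decay) -> float:
    """p = 1 at an actual sample time; otherwise p_t = d(t), computed
    against the nearest (in time) actual sample t_l."""
    t_l = min(sample_times, key=lambda s: abs(s - t))
    return 1.0 if t == t_l else decay(t, t_l)

# Samples taken at t = 0 and t = 10; confidence in an inferred value at t = 4:
p = confidence(4.0, [0.0, 10.0], exponential_decay)  # e^(-4), roughly 0.018
```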
IV. TEMPORAL INFERENCE FOR CONTINUOUS QUERIES
Decay functions allow applications to define the validity of projecting information across time. We now address the question of what the value of that projected data should be. Specifically, we present a suite of simple techniques that estimate inferred values. We also demonstrate how this inference can be combined with decay functions to associate confidence with inferred values. In later sections, we evaluate the applicability of these inference approaches to real phenomena.
A. Nearest Neighbor Inference
For some applications, data value changes may be difficult to predict, for instance when the underlying process observed is unknown or arbitrary. These changes are usually discrete; at some instant in time, the value changes to some potentially unpredictable value. Consider a construction site where pallets of bricks are distributed to different locations around the site for storage and use. A distributed query may execute across the site, measuring how many bricks are present at each location at query time. The bricks are laid and restocked during the day as trucks and construction workers perform their tasks. Without any knowledge of the project's goals and the rate of brick laying at different sites, it is difficult to create a model that effectively estimates the number of bricks at any given location for instants that have no recorded observations.
In such cases, one technique to estimate missing data is to assume the sampled value closest in time is still correct. As the temporal validity decays, the sensed value is increasingly unreliable. Consider again the pallets of bricks on a construction site and an application that samples the number of available bricks periodically (e.g., every 10 minutes). The application then sums across all of the data readings to generate a total number of bricks on the site. Fig. 4 shows an example where the value for the number of pallets at node A changes between two samples. Up until t = 5, the total number of pallets is estimated using the original sample; after that, it is assumed that the value is the sample taken at t = 10.
Fig. 4: Nearest Neighbor Inference for Uncertain Data

The example in Fig. 4 focuses on uncertain data, i.e., inferring data values that the application did not attempt to sample. The same approach can be used to infer missing data, e.g., if the application failed to sample a value for node A at time t = 10 but did resample it at time t = 20. This example also demonstrates the importance of inferring missing data. Because this data is used to monitor the total number of pallets of bricks on the site, if data values are missing from a particular snapshot, the site supervisor might observe radical fluctuations in the number of bricks that actually did not occur.
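A minimal sketch of nearest neighbor inference, assuming per-node readings keyed by sample time (the data and names are illustrative):

```python
def nearest_neighbor_infer(t: float, samples: dict) -> float:
    """Infer the value at time t as the sampled value closest in time.
    `samples` maps sample times to observed values for one node."""
    t_l = min(samples, key=lambda s: abs(s - t))
    return samples[t_l]

# Node A's pallet count sampled at t = 0 and t = 10 (illustrative values):
readings = {0.0: 40.0, 10.0: 80.0}
assert nearest_neighbor_infer(4.0, readings) == 40.0  # t = 0 sample is nearest
assert nearest_neighbor_infer(6.0, readings) == 80.0  # t = 10 sample is nearest
```

Paired with a decay function, the same nearest sample time t_l yields the confidence to report alongside the inferred value.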
B. Interpolation and Regression
The evolution of many pervasive computing phenomena can be fairly accurately represented by continuous functions. If a truck is driving at a steady speed across the site, and we sample its location at t = 0 and t = 10, it may be reasonable to infer that at t = 5, the truck was at the midpoint of a line drawn between the two sample points. In such cases, standard statistical techniques like interpolation and regression can be employed to infer data across snapshots. In interpolation, the observed values are fit to a function, where the domain is typically the time of observation and the range is the attribute's value. For any point in time where there is no recorded observation, the value is estimated using the function. Interpolation approaches range from simple (e.g., linear interpolation) to complex (e.g., spline interpolation). Linear interpolation connects consecutive observations of a data item with a line segment. Polynomial interpolation generalizes the function to a degree higher than one; in general, one can fit a curve through n data points using a function of degree n-1. Spline interpolation breaks the set of data points into subsets and applies polynomial interpolation to each subset.
Fig. 5 shows an example of interpolation. The data values sensed are the locations of the devices on a 3x4 grid; the moving truck's data is missing from snapshots ρ_1 and ρ_3. The bottom figures show how linear interpolation and an example of polynomial interpolation estimate the missing data.
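As a sketch of linear interpolation for the truck example (times and positions are illustrative assumptions):

```python
import numpy as np

# One coordinate of the truck, sampled at t = 0 and t = 10; the t = 5
# snapshot missed it. Linear interpolation places it at the midpoint.
sample_t = np.array([0.0, 10.0])
sample_x = np.array([0.0, 20.0])
x_at_5 = np.interp(5.0, sample_t, sample_x)  # -> 10.0
```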
Regression identifies relationships between a dependent sensed variable (e.g., location or temperature at a particular device) and an independent variable (e.g., time). However, regression does not try to fit a curve or a function through every observed data point. Instead, the end result of regression encodes an approximation of the relationship between the independent and dependent variables. As with interpolation, regression comes in several flavors, ranging from simple techniques like linear regression to more complex non-linear variants. Effectively, regression provides a "looser fit" function for the data; this can be effective when the underlying data is noisy (e.g., when the samples may contain errors), and it may not be useful to fit a curve through every observed data point, since those data points may not be an accurate reflection of the underlying phenomenon.
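A corresponding sketch of linear regression over noisy samples (the data is invented for illustration):

```python
import numpy as np

# Noisy observations of the truck's position over time.
t_obs = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])
x_obs = np.array([0.1, 4.2, 7.8, 12.1, 16.3, 19.9])

# Fit a degree-1 polynomial: a "looser fit" that need not pass through
# every (possibly erroneous) sample.
slope, intercept = np.polyfit(t_obs, x_obs, deg=1)
x_at_5 = slope * 5.0 + intercept  # regression estimate at t = 5
```

The choice between interpolation and regression mirrors the tradeoff in the text: interpolation honors every sample exactly, while regression tolerates measurement error at individual points.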

References

- Directed diffusion for wireless sensor networking (journal article).
- Model-driven data acquisition in sensor networks (book chapter).
- TelegraphCQ: continuous dataflow processing (proceedings article).
- Evaluating probabilistic queries over imprecise data (proceedings article).
- Inferring High-Level Behavior from Low-Level Sensors (book chapter).