
Blurring Snapshots: Temporal Inference of Missing and Uncertain Data
TR-UTEDGE-2009-005
Vasanth Rajamani, The University of Texas at Austin
Christine Julien, The University of Texas at Austin
© Copyright 2009
The University of Texas at Austin

Blurring Snapshots: Temporal Inference of Missing and Uncertain Data
Vasanth Rajamani and Christine Julien
Department of Electrical and Computer Engineering
The University of Texas at Austin
{vasanthrajamani,c.julien}@mail.utexas.edu
Abstract—Many pervasive computing applications continuously monitor state changes in the environment by acquiring, interpreting and responding to information from sensors embedded in the environment. However, it is extremely difficult and expensive to obtain a continuous, complete, and consistent picture of a continuously evolving operating environment. One standard technique to mitigate this problem is to employ mathematical models that compute missing data from sampled observations, thereby approximating a continuous and complete stream of information. However, existing models have traditionally not incorporated a notion of temporal validity, or the quantification of imprecision associated with inferring data values from past or future observations. In this paper, we support continuous monitoring of dynamic pervasive computing phenomena through the use of a series of snapshot queries. We define a decay function and a set of inference approaches to filling in missing and uncertain data in this continuous query. We evaluate the usefulness of this abstraction in its application to complex spatio-temporal pattern queries in pervasive computing networks.
Keywords—sensor networks, queries, dynamics, interpolation
I. INTRODUCTION
As applications place an increased focus on using distributed embedded networks to monitor both physical and network phenomena, it becomes necessary to support efficient and robust continuous monitoring that can communicate the uncertainty associated with data collected from a dynamic network. The emergence of pervasive computing is characterized by increased instrumentation of the physical world, including small sensing devices that allow applications to query a local area using a dynamic and distributed network for support. On the roadways, all vehicles may be equipped with devices that sense and share location, and that information can be queried by other nearby vehicles to understand traffic flow patterns. On an intelligent construction site, workers, equipment, assets, and even parts of buildings may be equipped with sensors to measure location, temperature, humidity, stress, etc., with the goal of generating meaningful pictures of the project's progress and maintaining safe working conditions.
Central to these and other applications is the ability to monitor some condition and its evolution over a period of time. On a construction site, the amount of an available material at a particular time may be useful, but it may be just as useful to monitor how that material is consumed (and resupplied) over time. Such trends are usually measured through continuous queries that are often registered at the remote information sources and periodically push sensed data back to the consumers [2], [9]. Such a "push" approach to continuous query processing requires maintaining a distributed data structure, which can be costly in dynamic settings. In addition, it often requires that a query issuer interact with a collector that is known in advance and reachable at any instant, which is often unreasonable. We have demonstrated that, in dynamic networks, it often makes sense to generate a continuous query using a sequence of snapshot queries [18]. A snapshot query is distributed through the network at a particular point in time, takes measurements of the target phenomenon, and sends the results back to the query issuer. In our model (Section II), a continuous query is the integration over time across a sequence of snapshot queries.
In generating a continuous and accurate reflection of an evolving environment, uncertainty is introduced in several ways [15], [16]. First, there is a significant tradeoff between the cost of generating the continuous query result and the quality of the result. For instance, the more frequently the snapshot queries execute, the more closely the continuous query reflects the ground truth, but the more expensive it is to execute in terms of communication bandwidth and battery power. In addition, the snapshot queries can be executed using different protocols that consider the same tradeoff (e.g., consider the differences in quality and cost of a query flooded to all hosts in the network and one probabilistically gossiped to some subset). On a more fundamental level, the quality of any interaction with a dynamic network is inherently affected by the unreliability of the network: packets may be dropped or corrupted, and communication links may break. The fact that a continuous query fails to sense a value at a particular instant may simply be a reflection of this inherent uncertainty.
Even when these uncertainties weaken a continuous query, applications can still benefit if the query processing can provide some knowledge about the degree of the uncertainty. For example, in a continuous query on a construction site for the amount of available material, it would be useful to know that, with some degree of certainty (i.e., a confidence), there is a given amount of available material. This may be based on information collected directly from the environment (in which case the confidence is quite high), historical trends, or knowledge about the nature of the phenomenon. Model-driven approaches that estimate missing data using mathematical models can alleviate these uncertainties [6], [7]. In these approaches, the goal is to build a model of the phenomenon being observed and to query the network to rebuild the model only when the confidence in the model has degraded so much that relying on it becomes unacceptable. Section VII examines these approaches and their relationship to our work in more detail.
Because we build a continuous query from a sequence of snapshot queries, handling uncertainty is twofold. First, we must be able to provide estimates of the continuous query result between adjacent snapshot queries. Second, even if we fail to sample a data point in a given snapshot, we may have some information about that data point at a previous time (and potentially a future time) that we may use to infer something about the missing data. In both cases, we are not actually changing the amount of information available to the application; instead we are blurring the snapshot queries and associating a level of confidence with inferred results.
Our approach relies on a simple abstraction called a decay function (Section III) that quantifies the temporal validity associated with sensing a particular phenomenon. We use this decay function as the basis for performing model-assisted inference (Section IV) that uses sampled data values from the snapshot queries to infer values into the past and future. This inference can allow us to fill in gaps in the sequence of snapshot queries to enable trend analysis on the components of the continuous query. The inference and its associated confidence can also provide the application a concrete sense of the degree of the uncertainty. Finally, by smoothing across the available data, this inference makes the information that is available more viewable and understandable by the application and its user. We examine these benefits in Sections V and VI.
Our novel contributions are threefold. First, we introduce decay functions that allow applications to define temporal validity in a principled way. Second, we build a set of simple statistical models that allow us to effectively blur snapshot queries into continuous queries, and we use them to study the use of model-assisted inference for a variety of different types of dynamic phenomena. Finally, we demonstrate, through an implementation, an evaluation, and a set of usage scenarios, the efficacy and usefulness of using inference to fill in missing data in real world situations. If the network supporting data collection is highly dynamic, our approaches help mitigate the impact of the dynamics on the inherent uncertainty; however, even in less dynamic situations, our approach helps applications reasonably trade off the cost of executing continuous queries for the quality of the result.
II. BACKGROUND
This paper builds on our previous approaches defining snapshot and continuous query fidelity and an associated middleware [15], [18]. These approaches approximate a continuous query using a sequence of snapshot queries evaluated over the network at discrete times. We model a dynamic pervasive computing network as a closed system of hosts, where each host has a location and a data value (though a single data value may represent a collection of values). A host is represented as a triple (ι, ζ, ν), where ι is the host's identifier, ζ is its context, and ν is its data value. The context can be simply a host's location, but it can be extended to include a list of neighbors, routing tables, and other system or network information.
The global state of a network, a configuration (C), is a set of host tuples. Given a host h in a configuration, an effective configuration (E) is the projection of the configuration with respect to the hosts reachable from h. Practically, h is a host initiating a query, and E contains the hosts expected to receive and respond to the query. To capture connectivity, we define a binary logical connectivity relation, K, to express the ability of a host to communicate with a neighboring host. Using the values of the host triple, we can derive physical and logical connectivity relations. As one example, if the host's context, ζ, includes the host's location, we can define a physical connectivity relation based on communication range. K is not necessarily symmetric; in the cases where it is symmetric, K specifies bi-directional communication.
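To make these definitions concrete, the following Python sketch encodes the host triple, a range-based instance of K, and the effective configuration as the set of hosts reachable from h. The names and the fixed communication range are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of the network model, assuming 2-D locations as context.
from dataclasses import dataclass

@dataclass(frozen=True)
class Host:
    iota: str    # identifier
    zeta: tuple  # context, e.g., an (x, y) location
    nu: float    # data value

def k(a: Host, b: Host, comm_range: float = 10.0) -> bool:
    """A physical connectivity relation K derived from location context."""
    (ax, ay), (bx, by) = a.zeta, b.zeta
    return a != b and ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 <= comm_range

def effective_configuration(c: frozenset, h: Host) -> frozenset:
    """E: the projection of configuration C onto the hosts reachable from h."""
    reachable, frontier = {h}, [h]
    while frontier:
        cur = frontier.pop()
        for other in c:
            if other not in reachable and k(cur, other):
                reachable.add(other)
                frontier.append(other)
    return frozenset(reachable)
```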
The environment evolves as the network changes, values change, and hosts exchange messages. We model network evolution as a state transition system where the state space is the set of possible configurations, and transitions are configuration changes. A single configuration change consists of one of the following: 1) a neighbor change: changes in hosts' states impact the connectivity relation, K; 2) a value change: a single host changes its stored data value; or 3) a message exchange: a host sends a message that is received by one or more neighboring nodes. To refer to the connectivity relation for a particular configuration, we assign configurations subscripts (e.g., C_0, C_1, etc.) and use K_i to refer to the connectivity of configuration C_i. We have also extended K to define query reachability. Informally, this determines whether it was possible to deliver a one-time query to and receive a response from some host h within the sequence of configurations [17].
A snapshot query's result (ρ) is a subset of a configuration: it is a collection of host tuples that constitute responses to the query. No host in the network is represented more than once in ρ, though it is possible that a host is not represented at all (e.g., because it was never reachable from the query issuer). Depending on both the protocol used to execute the snapshot query (e.g., whether the query was flooded to all hosts in the network or whether it was gossiped) and inherent network failures, only a subset of the reachable hosts may respond. This results in missing and uncertain data in the results of snapshot queries, which may result in a degradation in the quality of and confidence in the continuous query's result. A minimal sketch of such a result appears below.
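Continuing the sketch above, a snapshot result ρ can be modeled as a random subset of the effective configuration; the response probability is purely illustrative and stands in for protocol choice and network failures.

```python
import random

def snapshot_result(effective: frozenset, p_respond: float = 0.7) -> frozenset:
    """A snapshot result rho: each reachable host responds with some
    probability, standing in for protocol behavior and network failures.
    No host appears more than once because the result is a set."""
    return frozenset(h for h in effective if random.random() < p_respond)
```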
III. MODELING UNCERTAINTY
Our approach to query processing allows users to pose continuous queries to an evolving network and receive a result that resembles a data stream even though it is obtained using discrete snapshot queries. This stream can then be analyzed to evaluate trends in the sensed data. However, missing and uncertain sensed items can be a bane to this process, especially in monitoring the evolution of the data. For example, on a construction site, a site supervisor may use a continuous query to monitor the total number of available bricks on the site. This query may be accomplished by associating a sensor with each pallet of bricks; the snapshot queries collect the identity of the pallets and the number of bricks each pallet holds. If consecutive snapshot queries do not sample the same subset of pallets, the sums they report are not comparable, resulting in inconsistent information supplied to the site supervisor.
Consider the continuous query in Fig. 1. The three networks on the left of the dark line show the results of the continuous query's first three snapshot queries. Each circle represents a host; a circle's color represents the host's data value; and lines represent connectivity. Throughout the continuous query, some hosts depart, some arrive, and others change their data values. In this case, the trend the application is analyzing is the set of data items that remain available and unchanged throughout the continuous query. When our snapshot queries are not impacted by any missing or uncertain data, the stable set the trend analysis generates is the actual stable set.
!""#"$%&'('$
)*(*$+*,-#$./*01#'$
2
3
$ 2
4
$ 2
5
$
60*7'/&($8-#9:#'$
6(*;,#$6#($
Fig. 1: A Continuous Query
Consider, however, what happens when data is missing or uncertain, as depicted in Fig. 2. In this situation, the ground truth (i.e., what the snapshot queries should have returned) is equivalent to that shown in Fig. 1, but due to network dynamics or other sources of uncertainty, the sample from host A was not collected in the second snapshot query (ρ_1), and the sample from host B was not collected in the third snapshot query (ρ_2). Consequently, the result of the trend analysis in Fig. 2 is quite different from that in Fig. 1. On a construction site, if the data items represent pallets of bricks, this trend analysis may cause the site supervisor to have additional supplies delivered when it is unnecessary or even impractical.
!""#"$%&'('$
)*(*$+*,-#$./*01#'$
2
3
$ 2
4
$ 2
5
$
60*7'/&($8-#9:#'$
6(*;,#$6#($
Fig. 2: A Continuous Query with Missing Data
One way to handle this uncertainty is to blur the snapshot queries. In Fig. 2, given the fact that we know the network to be dynamic, we can say with some confidence that host A should have been represented in ρ_1; the level of this confidence depends on the temporal validity of the phenomenon sensed (i.e., how long we expect a data value to remain valid), the frequency with which the snapshot queries are issued, and the degree of network dynamics. The fact that A "reappeared" in ρ_2 further increases our confidence that it may have, in fact, been present in ρ_1 as well. Fig. 3 shows a simple example of how this inference can be used to project data values into future snapshots (e.g., from ρ_1 to ρ_2) and into past snapshots (e.g., from ρ_1 to ρ_0). In this figure, the black circles represent hosts the snapshot query directly sampled; gray circles represent hosts for which data values have been inferred. The question that remains, however, is how to determine both the values that should be associated with the inferred results and the confidence we have in their correctness. We deal with the former concern in the next section; here we introduce decay functions to ascribe temporal validity to observations and calculate confidence in unsampled (inferred) values.
!"#$%%$&'()*+*' ,-./0$&'()*+*'
1
2
' 1
3
'1
4
'
Fig. 3: Projection Forward and Backwards in Time
To address temporal validity, we rely on the intuitive observation that the closer in time an inferred value is to a sensed sample, the more likely it is to be a correct inference. For example, in Fig. 3, the value projected from ρ_0 to ρ_1 is more likely to be correct than the value projected from ρ_0 to ρ_2. If the sample missing in ρ_1 is also missing in ρ_2, it becomes increasingly likely that the host generating the sample has, in fact, departed. We exploit this observation by allowing applications to specify the temporal validity of different sensed phenomena using a decay function that defines the validity of a measured observation as a function of time.
Formally, a decay function is a function d(t) = f(|t - t_l|), where t is the current time and t_l is the time of the nearest (in time) actual sample of the data value, in either the past or the future. The period |t - t_l| is the period of uncertainty; the larger the period of uncertainty, the less likely it is that the sampled value retains any correlation with the actual value. The decay function's value falls between 0 and 1; it is a measure of percentage likelihood. These decay functions are an intuitive representation of confidence and are easy for application developers to grasp. It is also straightforward to define decay functions to describe a variety of phenomena. For instance, on a construction site, a moving truck's GPS location might be associated with a decay function of the form d(t) = e^(-|t - t_l|), which is a rapid exponential drop in confidence over time. On the other hand, a GPS unit mounted on a stationary sensor on the site might have a decay function of the form d(t) = 1 because the location value, once measured, is not expected to change. Possibilities for formulating decay functions are numerous and depend on the nature of the phenomenon being sensed and the sensing environment.

Given a user-defined decay function, it is straightforward to determine a confidence measure for an inferred value. We measure this confidence probabilistically. At any time instant t, the inferred data value's degree of confidence, p, is updated using the following rule:
- if time t is the time at which an actual data reading was acquired, then the value of p at time t is set to 1;
- otherwise, p is updated using the formula p_t = d(t).

Thus, at every point in time, a data value of interest has an associated confidence that ranges from one to zero depending on when it was last sampled. The further in time the inferred value is from an actual sensed value, the less confidence it has. The sketch below illustrates this computation. With this understanding, we look next at how to estimate how a sampled value may have changed during periods where it is not sampled, allowing us to infer its value.
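As a concrete illustration, here is a minimal Python sketch of the exponential decay example and the confidence-update rule; the function names and the sample schedule are illustrative assumptions.

```python
import math

def exponential_decay(t: float, t_l: float) -> float:
    """d(t) = e^(-|t - t_l|): rapid confidence drop, e.g., a moving truck's GPS."""
    return math.exp(-abs(t - t_l))

def constant_decay(t: float, t_l: float) -> float:
    """d(t) = 1: a value that, once measured, is not expected to change."""
    return 1.0

def confidence(t: float, sample_times: list, decay) -> float:
    """p = 1 at an actual sample time; otherwise p_t = d(t), computed
    against the nearest (in time) actual sample t_l."""
    t_l = min(sample_times, key=lambda s: abs(s - t))
    return 1.0 if t == t_l else decay(t, t_l)

# Samples taken at t = 0 and t = 10; confidence in an inferred value at t = 4:
p = confidence(4.0, [0.0, 10.0], exponential_decay)  # e^(-4), roughly 0.018
```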
IV. TEMPORAL INFERENCE FOR CONTINUOUS QUERIES
Decay functions allow applications to define the validity of projecting information across time. We now address the question of what the value of that projected data should be. Specifically, we present a suite of simple techniques that estimate inferred values. We also demonstrate how this inference can be combined with decay functions to associate confidence with inferred values. In later sections, we evaluate the applicability of these inference approaches to real phenomena.
A. Nearest Neighbor Inference
For some applications, data value changes may be difficult to predict, for instance when the underlying process observed is unknown or arbitrary. These changes are usually discrete; at some instant in time, the value changes to some potentially unpredictable value. Consider a construction site where pallets of bricks are distributed to different locations around the site for storage and use. A distributed query may execute across the site, measuring how many bricks are present at each location at query time. The bricks are laid and restocked during the day as trucks and construction workers perform their tasks. Without any knowledge of the project's goals and the rate of brick laying at different sites, it is difficult to create a model that effectively estimates the number of bricks at any given location for instants that have no recorded observations.
In such cases, one technique to estimate missing data is to assume the sampled value closest in time is still correct. As the temporal validity decays, the sensed value is increasingly unreliable. Consider again the pallets of bricks on a construction site and an application that samples the number of available bricks periodically (e.g., every 10 minutes). The application then sums across all of the data readings to generate a total number of bricks on the site. Fig. 4 shows an example where the value for the number of pallets at node A changes between two samples. Up until t = 5, the total number of pallets is estimated using the original sample; after that, it is assumed that the value is the sample taken at t = 10.
Fig. 4: Nearest Neighbor Inference for Uncertain Data

The example in Fig. 4 focuses on uncertain data, i.e., inferring data values that the application did not attempt to sample. The same approach can be used to infer missing data, e.g., if the application failed to sample a value for node A at time t = 10 but did resample it at time t = 20. This example also demonstrates the importance of inferring missing data. Because this data is used to monitor the total number of pallets of bricks on the site, if data values are missing from a particular snapshot, the site supervisor might observe radical fluctuations in the number of bricks that actually did not occur.
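A minimal sketch of nearest neighbor inference, assuming per-node readings keyed by sample time (the data and names are illustrative):

```python
def nearest_neighbor_infer(t: float, samples: dict) -> float:
    """Infer the value at time t as the sampled value closest in time.
    `samples` maps sample times to observed values for one node."""
    t_l = min(samples, key=lambda s: abs(s - t))
    return samples[t_l]

# Node A's pallet count sampled at t = 0 and t = 10 (illustrative values):
readings = {0.0: 40.0, 10.0: 80.0}
assert nearest_neighbor_infer(4.0, readings) == 40.0  # t = 0 sample is nearest
assert nearest_neighbor_infer(6.0, readings) == 80.0  # t = 10 sample is nearest
```

Paired with a decay function, the same nearest sample time t_l yields the confidence to report alongside the inferred value.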
B. Interpolation and Regression
The evolution of many pervasive computing phenomena can be fairly accurately represented by continuous functions. If a truck is driving at a steady speed across the site, and we sample its location at t = 0 and t = 10, it may be reasonable to infer that at t = 5, the truck was at the midpoint of a line drawn between the two sample points. In such cases, standard statistical techniques like interpolation and regression can be employed to infer data across snapshots. In interpolation, the observed values are fit to a function, where the domain is typically the time of observation and the range is the attribute's value. For any point in time where there is no recorded observation, the value is estimated using the function. Interpolation approaches range from simple (e.g., linear interpolation) to complex (e.g., spline interpolation). Linear interpolation connects consecutive observations of a data item with a line segment. Polynomial interpolation generalizes the function to a degree higher than one; in general, one can fit a curve through n data points using a function of degree n-1. Spline interpolation breaks the set of data points into subsets and applies polynomial interpolation to each subset.
Fig. 5 shows an example of interpolation. The data values sensed are the locations of the devices on a 3x4 grid; the moving truck's data is missing from snapshots ρ_1 and ρ_3. The bottom figures show how linear interpolation and an example of polynomial interpolation estimate the missing data.
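As a sketch of linear interpolation for the truck example (times and positions are illustrative assumptions):

```python
import numpy as np

# One coordinate of the truck, sampled at t = 0 and t = 10; the t = 5
# snapshot missed it. Linear interpolation places it at the midpoint.
sample_t = np.array([0.0, 10.0])
sample_x = np.array([0.0, 20.0])
x_at_5 = np.interp(5.0, sample_t, sample_x)  # -> 10.0
```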
Regression identifies relationships between a dependent sensed variable (e.g., location or temperature at a particular device) and an independent variable (e.g., time). However, regression does not try to fit a curve or a function through every observed data point. Instead, the end result of regression encodes an approximation of the relationship between the independent and dependent variables. As with interpolation, regression comes in several flavors, ranging from simple techniques like linear regression to more complex non-linear variants. Effectively, regression provides a "looser fit" function for the data; this can be effective when the underlying data is noisy (e.g., when the samples may contain errors), and it may not be useful to fit a curve through every observed data point, since those data points may not be an accurate reflection of the underlying phenomenon.
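A corresponding sketch of linear regression over noisy samples (the data is invented for illustration):

```python
import numpy as np

# Noisy observations of the truck's position over time.
t_obs = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])
x_obs = np.array([0.1, 4.2, 7.8, 12.1, 16.3, 19.9])

# Fit a degree-1 polynomial: a "looser fit" that need not pass through
# every (possibly erroneous) sample.
slope, intercept = np.polyfit(t_obs, x_obs, deg=1)
x_at_5 = slope * 5.0 + intercept  # regression estimate at t = 5
```

The choice between interpolation and regression mirrors the tradeoff in the text: interpolation honors every sample exactly, while regression tolerates measurement error at individual points.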

References

- Directed diffusion for wireless sensor networking (journal article).
- Model-driven data acquisition in sensor networks (book chapter).
- TelegraphCQ: continuous dataflow processing (proceedings article).
- Evaluating probabilistic queries over imprecise data (proceedings article).
- Inferring High-Level Behavior from Low-Level Sensors (book chapter).