
Showing papers on "Uncertain data" published in 2011


BookDOI
27 Jun 2011
TL;DR: This textbook provides a first course in stochastic programming suitable for students with a basic knowledge of linear programming, elementary analysis, and probability to help students develop an intuition on how to model uncertainty into mathematical problems.
Abstract: The aim of stochastic programming is to find optimal decisions in problems which involve uncertain data. This field is currently developing rapidly with contributions from many disciplines including operations research, mathematics, and probability. At the same time, it is now being applied in a wide variety of subjects ranging from agriculture to financial planning and from industrial engineering to computer networks. This textbook provides a first course in stochastic programming suitable for students with a basic knowledge of linear programming, elementary analysis, and probability. The authors aim to present a broad overview of the main themes and methods of the subject. Its prime goal is to help students develop an intuition on how to model uncertainty into mathematical problems, what uncertainty changes bring to the decision process, and what techniques help to manage uncertainty in solving the problems. In this extensively updated new edition there is more material on methods and examples, including several new approaches for discrete variables, new results on risk measures in modeling and Monte Carlo sampling methods, and a new chapter on relationships to other methods including approximate dynamic programming, robust optimization and online methods. The book is highly illustrated with chapter summaries and many examples and exercises. Students, researchers and practitioners in operations research and the optimization area will find it particularly of interest. Review of First Edition: "The discussion on modeling issues, the large number of examples used to illustrate the material, and the breadth of the coverage make 'Introduction to Stochastic Programming' an ideal textbook for the area." (Interfaces, 1998)

5,398 citations


Journal ArticleDOI
TL;DR: This work discovers that the accuracy of a decision tree classifier can be much improved if the "complete information" of a data item (taking into account the probability density function (pdf)) is utilized.
Abstract: Traditional decision tree classifiers work with data whose values are known and precise. We extend such classifiers to handle data with uncertain information. Value uncertainty arises in many applications during the data collection process. Example sources of uncertainty include measurement/quantization errors, data staleness, and multiple repeated measurements. With uncertainty, the value of a data item is often represented not by one single value, but by multiple values forming a probability distribution. Rather than abstracting uncertain data by statistical derivatives (such as mean and median), we discover that the accuracy of a decision tree classifier can be much improved if the "complete information" of a data item (taking into account the probability density function (pdf)) is utilized. We extend classical decision tree building algorithms to handle data tuples with uncertain values. Extensive experiments have been conducted which show that the resulting classifiers are more accurate than those using value averages. Since processing pdfs is computationally more costly than processing single values (e.g., averages), decision tree construction on uncertain data is more CPU demanding than that for certain data. To tackle this problem, we propose a series of pruning techniques that can greatly improve construction efficiency.
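The idea of exploiting the full pdf rather than a point summary can be pictured with a small sketch. This is a minimal illustration, not the authors' algorithm or pruning techniques: it assumes each tuple's uncertain numeric attribute is given as a discretized pdf of (value, probability) pairs, and it scores a candidate split threshold by letting each tuple contribute fractional class counts to both branches. The function names (`split_entropy`) and the toy data are invented for the example.

```python
import math
from collections import defaultdict

def split_entropy(tuples, threshold):
    """Weighted entropy of splitting uncertain numeric tuples at `threshold`.

    Each tuple is (label, [(value, prob), ...]) where the (value, prob)
    pairs form a discretized pdf summing to 1.  A tuple contributes the
    pdf mass at or below the threshold to the left branch and the rest
    to the right branch (fractional counts instead of its mean value).
    """
    left, right = defaultdict(float), defaultdict(float)
    for label, pdf in tuples:
        p_left = sum(p for v, p in pdf if v <= threshold)
        left[label] += p_left
        right[label] += 1.0 - p_left

    def entropy(counts):
        total = sum(counts.values())
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log2(c / total)
                    for c in counts.values() if c > 0)

    n = float(len(tuples))
    n_left = sum(left.values())
    return (n_left / n) * entropy(left) + ((n - n_left) / n) * entropy(right)

# pick the candidate threshold with the lowest weighted entropy
data = [("A", [(1.0, 0.6), (2.5, 0.4)]),
        ("B", [(2.0, 0.5), (3.0, 0.5)])]
best = min([1.5, 2.0, 2.5], key=lambda t: split_entropy(data, t))
```

Choosing the threshold that minimizes this weighted entropy is the same greedy criterion as in a classical decision tree, just applied to fractional counts derived from the pdfs.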

193 citations


Journal ArticleDOI
TL;DR: The proposed linguistic approach is based on fuzzy set theory and Dempster-Shafer theory of evidence, where the latter has been used to combine the risk of components to determine the system risk.
Abstract: Performing risk analysis can be a challenging task for complex systems due to lack of data and insufficient understanding of the failure mechanisms. A semi-quantitative approach that can utilize imprecise information, uncertain data and domain experts' knowledge can be an effective way to perform risk analysis for complex systems. Though the definition of risk varies considerably across disciplines, it is a well-accepted notion to use a composition of likelihood of system failure and the associated consequences (severity of loss). A complex system consists of various components, where these two elements of risk for each component can be linguistically described by the domain experts. The proposed linguistic approach is based on fuzzy set theory and Dempster-Shafer theory of evidence, where the latter has been used to combine the risk of components to determine the system risk. The proposed risk analysis approach is demonstrated through a numerical example.
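Since the abstract leans on Dempster-Shafer combination of component-level evidence, a small sketch of Dempster's rule of combination may help. This is a generic implementation of the rule, not the paper's linguistic/fuzzy machinery; the frame of discernment, mass values and function name `dempster_combine` are made up for illustration.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions (dict: frozenset -> mass) with
    Dempster's rule: multiply masses of intersecting focal elements
    and renormalize by 1 - K, where K is the mass that falls on the
    empty (fully conflicting) intersection."""
    combined, conflict = {}, 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb
    if conflict >= 1.0:
        raise ValueError("total conflict: evidence cannot be combined")
    return {s: m / (1.0 - conflict) for s, m in combined.items()}

# toy frame of discernment for a component's risk level: {low, medium, high}
m_expert1 = {frozenset({"low"}): 0.6, frozenset({"low", "medium"}): 0.4}
m_expert2 = {frozenset({"medium"}): 0.3,
             frozenset({"low", "medium", "high"}): 0.7}
print(dempster_combine(m_expert1, m_expert2))
```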

131 citations


Journal ArticleDOI
TL;DR: This work presents mathematical formulations for uncertain equivalents of isocontours based on standard probability theory and statistics and employs them in interactive visualization methods to evaluate and display these measures and apply them to 2D and 3D data sets.
Abstract: Uncertainty is ubiquitous in science, engineering and medicine. Drawing conclusions from uncertain data is the normal case, not an exception. While the field of statistical graphics is well established, only a few 2D and 3D visualization and feature extraction methods have been devised that consider uncertainty. We present mathematical formulations for uncertain equivalents of isocontours based on standard probability theory and statistics and employ them in interactive visualization methods. As input data, we consider discretized uncertain scalar fields and model these as random fields. To create a continuous representation suitable for visualization we introduce interpolated probability density functions. Furthermore, we introduce numerical condition as a general means in feature-based visualization. The condition number-which potentially diverges in the isocontour problem-describes how errors in the input data are amplified in feature computation. We show how the average numerical condition of isocontours aids the selection of thresholds that correspond to robust isocontours. Additionally, we introduce the isocontour density and the level crossing probability field; these two measures for the spatial distribution of uncertain isocontours are directly based on the probabilistic model of the input data. Finally, we adapt interactive visualization methods to evaluate and display these measures and apply them to 2D and 3D data sets.
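The level crossing probability mentioned above can be sketched for a single grid cell under a simplifying assumption: if the vertex values are modeled as independent Gaussians, the isocontour at level theta crosses the cell unless all vertices fall on the same side of theta. This is only an illustration of the concept (the paper works with general random fields and interpolated pdfs, possibly correlated); the function `level_crossing_probability` and the toy cell are hypothetical.

```python
from math import erf, sqrt

def below_prob(mu, sigma, theta):
    """P(X <= theta) for X ~ N(mu, sigma^2)."""
    return 0.5 * (1.0 + erf((theta - mu) / (sigma * sqrt(2.0))))

def level_crossing_probability(nodes, theta):
    """Probability that the isocontour at level `theta` crosses a cell
    whose vertex values are independent Gaussians given as (mu, sigma).
    A crossing occurs unless all vertices lie on the same side of theta."""
    p_below = [below_prob(mu, sigma, theta) for mu, sigma in nodes]
    all_below, all_above = 1.0, 1.0
    for p in p_below:
        all_below *= p
        all_above *= (1.0 - p)
    return 1.0 - all_below - all_above

# 2D cell with four uncertain vertex values, isovalue 0.5
cell = [(0.2, 0.1), (0.4, 0.2), (0.6, 0.1), (0.7, 0.3)]
print(level_crossing_probability(cell, 0.5))
```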

131 citations


Dissertation
01 Jan 2011
TL;DR: This talk focuses on two problems in decision making under uncertainty: ranking and top-k query processing over probabilistic databases, and utility maximization for stochastic combinatorial problems.
Abstract: Almost all (important) decision problems are inevitably subject to some level of uncertainty either about data measurements, the parameters, or predictions describing future evolution. The significance of handling uncertainty is further amplified by the large volume of uncertain data automatically generated by modern data gathering or integration systems. Examples include imprecise sensor measurements in a sensor network, inconsistent information collected from different sources in a data integration application, noisy observation data in scientific domains, and so on. Various types of problems of decision making under uncertainty have been a subject of extensive research in computer science, economics and social science. In this talk, I will focus on two important problems in this domain: (1) ranking and top-k query processing over probabilistic databases and (2) utility maximization for stochastic combinatorial problems. I will also briefly discuss some of my other research work, such as stochastic matching and distributed multi-query processing, if time allows.

123 citations


Journal ArticleDOI
TL;DR: INTAMAP is a Web Processing Service for the automatic spatial interpolation of measured point data, using open standards for spatial data such as those developed in the context of the Open Geospatial Consortium, and producing an integrated, open-source solution.

78 citations


Journal ArticleDOI
TL;DR: A novel network condition simulator (NetCoS) produces a synthetic population of sewer sections with a given condition-class distribution, which can be used to benchmark deterioration models and guide utilities in selecting appropriate models and data management strategies.

75 citations


Proceedings ArticleDOI
12 Jun 2011
TL;DR: A unified framework is proposed that can handle both the issues mentioned above to facilitate robust query processing over probabilistic databases and naturally enables highly efficient incremental evaluation when input probabilities are modified.
Abstract: Probabilistic database systems have successfully established themselves as a tool for managing uncertain data. However, much of the research in this area has focused on efficient query evaluation and has largely ignored two key issues that commonly arise in uncertain data management. First, how to provide explanations for query results, e.g., "Why is this tuple in my result?" or "Why does this output tuple have such a high probability?". Second, the problem of determining the sensitive input tuples for the given query, e.g., users are interested in knowing the input tuples that can substantially alter the output when their probabilities are modified (since they may be unsure about the input probability values). Existing systems provide, in addition to the output probabilities, the lineage/provenance of each output tuple: a boolean formula indicating the dependence of the output tuple on the input tuples. However, lineage does not immediately provide a quantitative relationship and it is not informative when we have multiple output tuples. In this paper, we propose a unified framework that can handle both the issues mentioned above to facilitate robust query processing. We formally define the notions of influence and explanations and provide algorithms to determine the top-l influential set of variables and the top-l set of explanations for a variety of queries, including conjunctive queries, probabilistic threshold queries, top-k queries and aggregation queries. Further, our framework naturally enables highly efficient incremental evaluation when input probabilities are modified (e.g., if uncertainty is resolved). Our preliminary experimental results demonstrate the benefits of our framework for performing robust query processing over probabilistic databases.
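One way to picture the notion of influence for independent input tuples is via the lineage formula: the output probability is multilinear in each input tuple probability, so its sensitivity to a tuple t is the difference between the output probability with t forced present and with t forced absent. The sketch below computes this by brute-force possible-world enumeration; it is exponential and purely illustrative, not the paper's algorithms, and the helper names (`prob`, `influence`) are invented.

```python
from itertools import product

def prob(lineage, probs, fixed=None):
    """P(lineage is true) under independent tuple probabilities `probs`
    (dict: tuple id -> probability), optionally conditioning on some
    tuples' presence via `fixed` (dict: tuple id -> True/False).
    Brute-force enumeration over possible worlds; for illustration only."""
    fixed = fixed or {}
    ids = [t for t in probs if t not in fixed]
    total = 0.0
    for bits in product([False, True], repeat=len(ids)):
        world, w = dict(fixed), 1.0
        for t, present in zip(ids, bits):
            world[t] = present
            w *= probs[t] if present else 1.0 - probs[t]
        if lineage(world):
            total += w
    return total

def influence(lineage, probs, t):
    """Since P is multilinear in p_t, dP/dp_t = P(t present) - P(t absent)."""
    return prob(lineage, probs, {t: True}) - prob(lineage, probs, {t: False})

# lineage of one output tuple: (t1 AND t2) OR t3
lineage = lambda w: (w["t1"] and w["t2"]) or w["t3"]
probs = {"t1": 0.9, "t2": 0.5, "t3": 0.2}
print({t: influence(lineage, probs, t) for t in probs})
```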

70 citations


Journal ArticleDOI
TL;DR: This work proposes an intuitive new ranking definition based on the observation that the ranks of a tuple across all possible worlds represent a well-founded rank distribution, and is able to prove that the expected rank, median rank, and quantile rank satisfy all these properties for a ranking query.
Abstract: Recently, there have been several attempts to propose definitions and algorithms for ranking queries on probabilistic data. However, these lack many intuitive properties of a top-k over deterministic data. We define several fundamental properties, including exact-k, containment, unique rank, value invariance, and stability, which are satisfied by ranking queries on certain data. We argue that these properties should also be carefully studied in defining ranking queries in probabilistic data, and fulfilled by definition for ranking uncertain data for most applications. We propose an intuitive new ranking definition based on the observation that the ranks of a tuple across all possible worlds represent a well-founded rank distribution. We studied the ranking definitions based on the expectation, the median, and other statistics of this rank distribution for a tuple and derived the expected rank, median rank, and quantile rank correspondingly. We are able to prove that the expected rank, median rank, and quantile rank satisfy all these properties for a ranking query. We provide efficient solutions to compute such rankings across the major models of uncertain data, such as attribute-level and tuple-level uncertainty. Finally, a comprehensive experimental study confirms the effectiveness of our approach.
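For the attribute-level model with independent, discretely distributed scores, the expected rank has a particularly simple form: by linearity of expectation, the expected rank of a tuple is the sum over other tuples of the probability that they score higher. The sketch below, with hypothetical function and variable names, only illustrates that definition; it does not implement the paper's efficient algorithms or the tuple-level model.

```python
def expected_rank(scores):
    """Expected ranks under attribute-level uncertainty.

    `scores` maps each tuple id to its score distribution as a list of
    (value, prob) pairs (probabilities sum to 1, tuples independent).
    The rank of t in a possible world is the number of tuples scoring
    strictly higher, so by linearity of expectation
        E[rank(t)] = sum over s != t of P(score_s > score_t).
    """
    def p_greater(dist_s, dist_t):
        return sum(ps * pt for vs, ps in dist_s for vt, pt in dist_t if vs > vt)

    return {t: sum(p_greater(scores[s], scores[t]) for s in scores if s != t)
            for t in scores}

dists = {"a": [(10, 0.5), (2, 0.5)],
         "b": [(7, 1.0)],
         "c": [(5, 0.8), (12, 0.2)]}
ranks = expected_rank(dists)           # lower expected rank = better
top_k = sorted(ranks, key=ranks.get)   # e.g. the top-2 tuples: top_k[:2]
```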

64 citations


Journal ArticleDOI
TL;DR: A model-free approach for data mining in engineering based on artificial neural networks is presented; recurrent neural networks for fuzzy data are verified with a fuzzy fractional rheological material model and applied to the identification and prediction of time-dependent structural behavior under dynamic loading.
Abstract: In this paper, a model-free approach for data mining in engineering is presented. The numerical approach is based on artificial neural networks. Recurrent neural networks for fuzzy data are developed to identify and predict complex dependencies from uncertain data. Uncertain structural processes obtained from measurements or numerical analyses are used to identify the time-dependent behavior of engineering structures. Structural action and response processes are treated as fuzzy processes. The identification of uncertain dependencies between structural action and response processes is realized by recurrent neural networks for fuzzy data. Algorithms for signal processing and network training are presented. The new recurrent neural network approach is verified by a fuzzy fractional rheological material model. An application for the identification and prediction of time-dependent structural behavior under dynamic loading is presented.

58 citations


Journal ArticleDOI
TL;DR: Experimental results on synthetic and real-world datasets show that the proposed classifiers are better equipped to handle data uncertainty and outperform state-of-the-art in many cases.
Abstract: This paper studies the problem of constructing robust classifiers when the training is plagued with uncertainty. The problem is posed as a Chance-Constrained Program (CCP) which ensures that the uncertain data points are classified correctly with high probability. Unfortunately such a CCP turns out to be intractable. The key novelty is in employing Bernstein bounding schemes to relax the CCP as a convex second order cone program whose solution is guaranteed to satisfy the probabilistic constraint. Prior to this work, only the Chebyshev based relaxations were exploited in learning algorithms. Bernstein bounds employ richer partial information and hence can be far less conservative than Chebyshev bounds. Due to this efficient modeling of uncertainty, the resulting classifiers achieve higher classification margins and hence better generalization. Methodologies for classifying uncertain test data points and error measures for evaluating classifiers robust to uncertain data are discussed. Experimental results on synthetic and real-world datasets show that the proposed classifiers are better equipped to handle data uncertainty and outperform state-of-the-art in many cases.

Journal ArticleDOI
TL;DR: Recent algorithmic developments on mining frequent patterns from probabilistic databases of uncertain data are reviewed.
Abstract: As an important data mining and knowledge discovery task, association rule mining searches for implicit, previously unknown, and potentially useful pieces of information—in the form of rules revealing associative relationships—that are embedded in the data. In general, the association rule mining process comprises two key steps. The first key step, which mines frequent patterns (i.e., frequently occurring sets of items) from data, is more computationally intensive than the second key step of using the mined frequent patterns to form association rules. In the early days, many developed algorithms mined frequent patterns from traditional transaction databases of precise data such as shopping market basket data, in which the contents of databases are known. However, we are living in an uncertain world, in which uncertain data can be found almost everywhere. Hence, in recent years, researchers have paid more attention to frequent pattern mining from probabilistic databases of uncertain data. In this paper, we review recent algorithmic development on mining uncertain data in these probabilistic databases for frequent patterns. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 316–329 DOI: 10.1002/widm.31
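A common starting point in this line of work is the expected support of an itemset under independent existential item probabilities: each uncertain transaction contributes the product of the probabilities of the itemset's items. The sketch below illustrates that measure only; real miners avoid naive candidate enumeration with Apriori- or tree-based structures, and the toy data and function name here are made up.

```python
def expected_support(db, itemset):
    """Expected support of `itemset` in an uncertain transaction database.

    `db` is a list of transactions, each a dict mapping an item to its
    existential probability; items are assumed independent, so a
    transaction contributes the product of the probabilities of the
    items in the itemset (0 if any item is absent).
    """
    support = 0.0
    for txn in db:
        p = 1.0
        for item in itemset:
            p *= txn.get(item, 0.0)
        support += p
    return support

uncertain_db = [{"bread": 0.9, "milk": 0.7},
                {"bread": 0.4, "beer": 1.0},
                {"milk": 0.8, "beer": 0.5, "bread": 0.6}]
print(expected_support(uncertain_db, {"bread", "milk"}))  # 0.9*0.7 + 0 + 0.6*0.8
```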

Journal ArticleDOI
01 Jul 2011
TL;DR: This paper proposes an algorithm for efficiently answering PRNN queries using new pruning mechanisms taking distance dependencies into account, and compares it to state-of-the-art approaches recently proposed.
Abstract: Given a query object q, a reverse nearest neighbor (RNN) query in a common certain database returns the objects having q as their nearest neighbor. A new challenge for databases is dealing with uncertain objects. In this paper we consider probabilistic reverse nearest neighbor (PRNN) queries, which return the uncertain objects having the query object as nearest neighbor with a sufficiently high probability. We propose an algorithm for efficiently answering PRNN queries using new pruning mechanisms taking distance dependencies into account. We compare our algorithm to state-of-the-art approaches recently proposed. Our experimental evaluation shows that our approach is able to significantly outperform previous approaches. In addition, we show how our approach can easily be extended to PRkNN (where k > 1) query processing for which there is currently no efficient solution.
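The query semantics can be pictured with a brute-force Monte Carlo sketch: repeatedly sample one instance per uncertain object, check whether the candidate object has the query as its nearest neighbor, and average. This only illustrates what a PRNN probability is; it is not the paper's pruning-based algorithm and ignores the distance dependencies the paper exploits. The function names and toy objects are hypothetical.

```python
import random

def prnn_probability(query, objects, target, threshold=0.5, trials=20000):
    """Monte Carlo estimate of P(`target` has `query` as its nearest
    neighbor), i.e. the probability that `target` is a reverse nearest
    neighbor of the (certain) query point.

    Each uncertain object is a list of ((x, y), prob) instances whose
    probabilities sum to 1; objects are sampled independently.
    """
    def draw(instances):
        r, acc = random.random(), 0.0
        for pt, p in instances:
            acc += p
            if r <= acc:
                return pt
        return instances[-1][0]

    def sqdist(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    hits = 0
    for _ in range(trials):
        t = draw(objects[target])
        others = [draw(o) for i, o in enumerate(objects) if i != target]
        if sqdist(t, query) < min(sqdist(t, o) for o in others):
            hits += 1
    prob = hits / trials
    return prob, prob >= threshold   # probability and "sufficiently high?" flag

objs = [[((1.0, 1.0), 0.5), ((4.0, 4.0), 0.5)],   # candidate RNN object
        [((3.0, 0.5), 1.0)],
        [((5.0, 5.0), 0.7), ((0.0, 3.0), 0.3)]]
print(prnn_probability((1.5, 1.5), objs, target=0))
```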

Journal ArticleDOI
TL;DR: An interactive robust data envelopment analysis (IRDEA) model is proposed to determine the input and output target values of electricity distribution companies while accounting for perturbations in the data.
Abstract: One of the primary concerns in target setting for electricity distribution companies is the uncertainty of input/output data. In this paper, an interactive robust data envelopment analysis (IRDEA) model is proposed to determine the input and output target values of electricity distribution companies while considering the existence of perturbations in the data. Target setting is implemented with the uncertain data, and the decision maker (DM) can search the envelope frontier and find the targets based on his preference. In order to search the envelope frontier, the paper combines DEA with multi-objective linear programming methods such as STEM. The proposed method is capable of handling uncertainty in data and finding the target values according to the DM's preferences. To illustrate the ability of the proposed model, a numerical example is solved. Also, the input and output target values for some of the electricity distribution companies in Iran are reported. The results indicate that the IRDEA model is suitable for target setting based on the DM's preferences and while considering uncertain data.

Journal ArticleDOI
TL;DR: A novel Bayesian classification algorithm for uncertain data is proposed and it is shown that the proposed method classifies uncertain data with potentially higher accuracies than the Naive Bayesian approach and has a more stable performance than the existing extended Naïve Bayesian method.
Abstract: Data uncertainty can be caused by numerous factors such as measurement precision limitations, network latency, data staleness and sampling errors. When mining knowledge from emerging applications such as sensor networks or location based services, data uncertainty should be handled cautiously to avoid erroneous results. In this paper, we apply probabilistic and statistical theory on uncertain data and develop a novel method to calculate conditional probabilities of Bayes theorem. Based on that, we propose a novel Bayesian classification algorithm for uncertain data. The experimental results show that the proposed method classifies uncertain data with potentially higher accuracies than the Naive Bayesian approach. It also has a more stable performance than the existing extended Naive Bayesian method.
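To see how a conditional probability can be evaluated against a pdf rather than a point value, here is a minimal sketch under an assumption the paper does not necessarily make: both the class-conditional attribute model and the measurement uncertainty are Gaussian, in which case the expected likelihood of an uncertain reading is the integral of the product of the two Gaussians, i.e. a Gaussian with the two variances added. The function names and toy parameters are invented.

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mu, var):
    return exp(-(x - mu) ** 2 / (2.0 * var)) / sqrt(2.0 * pi * var)

def class_posteriors(sample, classes, priors):
    """Naive Bayes over uncertain numeric attributes.

    `sample` is a list of (mean, var) pairs: each attribute reading is a
    Gaussian describing measurement uncertainty.  `classes[c]` is a list
    of (mu, var) Gaussians learned per attribute for class c.  The
    expected class-conditional likelihood of an uncertain reading is
    the integral of the two Gaussians: N(mean; mu, var + reading_var).
    """
    posts = {}
    for c, params in classes.items():
        like = priors[c]
        for (m, s2), (mu, var) in zip(sample, params):
            like *= gaussian_pdf(m, mu, var + s2)
        posts[c] = like
    z = sum(posts.values())
    return {c: p / z for c, p in posts.items()}

classes = {"pos": [(5.0, 1.0), (2.0, 0.5)],
           "neg": [(3.0, 1.5), (4.0, 0.5)]}
print(class_posteriors([(4.6, 0.4), (2.4, 0.2)], classes,
                       {"pos": 0.5, "neg": 0.5}))
```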

Proceedings ArticleDOI
21 Mar 2011
TL;DR: This paper proposes tree-based algorithms that use the damped window model to mine frequent itemsets from streams of uncertain data.
Abstract: With advances in technology, large amounts of streaming data can be generated continuously by sensors in applications like environment surveillance. Due to the inherent limitations of sensors, these continuous data can be uncertain. This calls for stream mining of uncertain data. In recent years, tree-based algorithms have been proposed to use the sliding window model for mining frequent itemsets from streams of uncertain data. Besides the sliding window model, there are other window models for processing data streams. In this paper, we propose tree-based algorithms that use the damped window model to mine frequent itemsets from streams of uncertain data.
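The damped window model can be sketched independently of the tree structures the paper proposes: every itemset's accumulated expected support is discounted by a decay factor whenever a new uncertain transaction arrives, so older transactions contribute less. The class below is a flat, illustrative version that only enumerates small itemsets and uses invented names such as `DampedExpectedSupport`; it is not the authors' algorithm.

```python
from collections import defaultdict
from itertools import combinations

class DampedExpectedSupport:
    """Track damped-window expected supports of small itemsets over a
    stream of uncertain transactions.  Each transaction maps items to
    existential probabilities; existing contributions are discounted by
    `alpha` at every new arrival (damped window / time-decay model)."""

    def __init__(self, alpha=0.9, max_size=2):
        self.alpha = alpha
        self.max_size = max_size
        self.support = defaultdict(float)

    def add(self, txn):
        for itemset in self.support:          # decay everything seen so far
            self.support[itemset] *= self.alpha
        items = sorted(txn)
        for k in range(1, self.max_size + 1):  # add the new transaction's mass
            for itemset in combinations(items, k):
                p = 1.0
                for item in itemset:
                    p *= txn[item]
                self.support[itemset] += p

    def frequent(self, minsup):
        return {s: v for s, v in self.support.items() if v >= minsup}

miner = DampedExpectedSupport(alpha=0.8)
miner.add({"a": 0.9, "b": 0.6})
miner.add({"a": 0.7, "c": 1.0})
print(miner.frequent(0.5))
```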

Book ChapterDOI
24 May 2011
TL;DR: In this article, the authors consider sequential pattern mining in situations where there is uncertainty about which source an event is associated with and use dynamic programming (DP) to compute the probability that a source supports a sequence.
Abstract: We consider sequential pattern mining in situations where there is uncertainty about which source an event is associated with. We model this in the probabilistic database framework and consider the problem of enumerating all sequences whose expected support is sufficiently large. Unlike frequent itemset mining in probabilistic databases [C. Aggarwal et al. KDD'09; Chui et al., PAKDD'07; Chui and Kao, PAKDD'08], we use dynamic programming (DP) to compute the probability that a source supports a sequence, and show that this suffices to compute the expected support of a sequential pattern. Next, we embed this DP algorithm into candidate generate-and-test approaches, and explore the pattern lattice both in a breadth-first (similar to GSP) and a depth-first (similar to SPAM) manner. We propose optimizations for efficiently computing the frequent 1-sequences, for re-using previously-computed results through incremental support computation, and for eliminating candidate sequences without computing their support via probabilistic pruning. Preliminary experiments show that our optimizations are effective in reducing the CPU cost.
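The core dynamic program can be sketched under a simplified event-level model in which each event independently belongs to the source with a known probability (the paper's source-level uncertainty is modeled differently, but the DP idea is analogous). The recurrence keeps, for each pattern prefix, the probability that it is a subsequence of the events realized so far; the names and data below are illustrative.

```python
def support_probability(events, pattern):
    """Probability that a source's realized event sequence contains
    `pattern` as a subsequence.  `events` is a list of (item, prob)
    pairs: each event independently belongs to the source with the
    given probability (a simplified event-level uncertainty model).

    q[j] = P(pattern[:j] is a subsequence of the events seen so far);
    processing event (item, p):
        q[j] <- p * q[j-1] + (1 - p) * q[j]   if item == pattern[j-1]
        q[j] <- q[j]                          otherwise
    """
    m = len(pattern)
    q = [1.0] + [0.0] * m          # q[0] = 1: the empty pattern is always supported
    for item, p in events:
        for j in range(m, 0, -1):  # go backwards so q[j-1] is still the old value
            if item == pattern[j - 1]:
                q[j] = p * q[j - 1] + (1.0 - p) * q[j]
    return q[m]

stream = [("a", 0.9), ("b", 0.5), ("a", 0.3), ("c", 0.8)]
print(support_probability(stream, ["a", "c"]))   # 0.744
# expected support of the pattern = sum of this probability over all sources
```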

Book ChapterDOI
29 Aug 2011
TL;DR: Mining algorithms that use the time-fading model to discover frequent patterns from streams of uncertain data are proposed.
Abstract: Nowadays, streams of data can be continuously generated by sensors in various real-life applications such as environment surveillance. Partly due to the inherent limitations of the sensors, data in these streams can be uncertain. To discover useful knowledge in the form of frequent patterns from streams of uncertain data, a few algorithms have been developed. They mostly use the sliding window model for processing and mining data streams. However, for some applications, other stream processing models such as the time-fading model are more appropriate. In this paper, we propose mining algorithms that use the time-fading model to discover frequent patterns from streams of uncertain data.

Proceedings ArticleDOI
11 Apr 2011
TL;DR: It is shown that the problem of stochastic skyline is NP-complete with respect to the dimensionality, and novel and efficient algorithms are developed to efficiently compute stoChastic skyline over multi-dimensional uncertain data, which run in polynomial time if thedimensionality is fixed.
Abstract: In many applications involving the multiple criteria optimal decision making, users may often want to make a personal trade-off among all optimal solutions. As a key feature, the skyline in a multi-dimensional space provides the minimum set of candidates for such purposes by removing all points not preferred by any (monotonic) utility/scoring functions; that is, the skyline removes all objects not preferred by any user no matter how their preferences vary. Driven by many applications with uncertain data, the probabilistic skyline model is proposed to retrieve uncertain objects based on skyline probabilities. Nevertheless, skyline probabilities cannot capture the preferences of monotonic utility functions. Motivated by this, in this paper we propose a novel skyline operator, namely stochastic skyline. In the light of the expected utility principle, stochastic skyline guarantees to provide the minimum set of candidates for the optimal solutions over all possible monotonic multiplicative utility functions. In contrast to the conventional skyline or the probabilistic skyline computation, we show that the problem of stochastic skyline is NP-complete with respect to the dimensionality. Novel and efficient algorithms are developed to efficiently compute stochastic skyline over multi-dimensional uncertain data, which run in polynomial time if the dimensionality is fixed. We also show, by theoretical analysis and experiments, that the size of stochastic skyline is quite similar to that of conventional skyline over certain data. Comprehensive experiments demonstrate that our techniques are efficient and scalable regarding both CPU and IO costs.

Book ChapterDOI
15 Aug 2011
TL;DR: Surprisingly, one can compute the distribution of the radius of the smallest enclosing ball exactly in polynomial time, but computing the same distribution for the diameter is #P-hard.
Abstract: We study computing with indecisive point sets. Such points have spatial uncertainty where the true location is one of a finite number of possible locations. This data arises from probing distributions a few times or when the location is one of a few locations from a known database. In particular, we study computing distributions of geometric functions such as the radius of the smallest enclosing ball and the diameter. Surprisingly, we can compute the distribution of the radius of the smallest enclosing ball exactly in polynomial time, but computing the same distribution for the diameter is #P-hard. We generalize our polynomial-time algorithm to all LP-type problems. We also utilize our indecisive framework to deterministically and approximately compute on a more general class of uncertain data where the location of each point is given by a probability distribution.
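The notion of a distribution over a geometric function of indecisive points can be illustrated by brute force: enumerate every joint choice of locations, weight it by the product of the location probabilities, and record the value of the function. The sketch below does this for the diameter; it is exponential and only usable for tiny inputs (consistent with the #P-hardness result quoted above), and it does not reproduce the paper's polynomial-time algorithm for the radius of the smallest enclosing ball. Names and data are invented.

```python
from itertools import product
from math import dist
from collections import defaultdict

def diameter_distribution(points):
    """Distribution of the diameter of an indecisive point set.

    `points` is a list of indecisive points; each is a list of
    (location, prob) pairs over its finitely many possible locations.
    Enumerates all joint realizations, so it is exponential in the
    number of points and meant only to illustrate the concept.
    """
    distribution = defaultdict(float)
    for choice in product(*points):
        locs = [loc for loc, _ in choice]
        prob = 1.0
        for _, p in choice:
            prob *= p
        diam = max(dist(a, b) for i, a in enumerate(locs) for b in locs[i + 1:])
        distribution[round(diam, 9)] += prob
    return dict(distribution)

pts = [[((0, 0), 0.5), ((1, 0), 0.5)],
       [((0, 3), 1.0)],
       [((4, 0), 0.3), ((2, 1), 0.7)]]
print(diameter_distribution(pts))
```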

Posted Content
TL;DR: In this article, a lower bound to the Earth Mover's distance (EMD) and an index structure are proposed to improve the performance of K-NN queries on uncertain databases.
Abstract: Querying uncertain data sets (represented as probability distributions) presents many challenges due to the large amount of data involved and the difficulties comparing uncertainty between distributions. The Earth Mover's Distance (EMD) has increasingly been employed to compare uncertain data due to its ability to effectively capture the differences between two distributions. Computing the EMD entails finding a solution to the transportation problem, which is computationally intensive. In this paper, we propose a new lower bound to the EMD and an index structure to significantly improve the performance of EMD based K-nearest neighbor (K-NN) queries on uncertain databases. We propose a new lower bound to the EMD that approximates the EMD on a projection vector. Each distribution is projected onto a vector and approximated by a normal distribution, as well as an accompanying error term. We then represent each normal as a point in a Hough transformed space. We then use the concept of stochastic dominance to implement an efficient index structure in the transformed space. We show that our method significantly decreases K-NN query time on uncertain databases. The index structure also scales well with database cardinality. It is well suited for heterogeneous data sets, helping to keep EMD based queries tractable as uncertain data sets become larger and more complex.

Posted Content
TL;DR: In this article, a geometric pruning filter is proposed to estimate the probabilistic domination count, which is used to answer a wide range of probability similarity queries on uncertain data.
Abstract: In this paper, we propose a novel, effective and efficient probabilistic pruning criterion for probabilistic similarity queries on uncertain data. Our approach supports a general uncertainty model using continuous probabilistic density functions to describe the (possibly correlated) uncertain attributes of objects. In a nutshell, the problem to be solved is to compute the PDF of the random variable denoted by the probabilistic domination count: Given an uncertain database object B, an uncertain reference object R and a set D of uncertain database objects in a multi-dimensional space, the probabilistic domination count denotes the number of uncertain objects in D that are closer to R than B. This domination count can be used to answer a wide range of probabilistic similarity queries. Specifically, we propose a novel geometric pruning filter and introduce an iterative filter-refinement strategy for conservatively and progressively estimating the probabilistic domination count in an efficient way while keeping correctness according to the possible world semantics. In an experimental evaluation, we show that our proposed technique allows to acquire tight probability bounds for the probabilistic domination count quickly, even for large uncertain databases.

Journal ArticleDOI
TL;DR: This paper formalizes similarity join processing on stream data that inherently contain uncertainty, where the incoming data at each time stamp are uncertain and imprecise, as join on uncertain data streams (USJ), which guarantees the accuracy of USJ answers over uncertain data.
Abstract: Similarity join processing in the streaming environment has many practical applications such as sensor networks, object tracking and monitoring, and so on. Previous works usually assume that stream processing is conducted over precise data. In this paper, we study an important problem of similarity join processing on stream data that inherently contain uncertainty (or called uncertain data streams), where the incoming data at each time stamp are uncertain and imprecise. Specifically, we formalize this problem as join on uncertain data streams (USJ), which can guarantee the accuracy of USJ answers over uncertain data. To tackle the challenges with respect to efficiency and effectiveness such as limited memory and small response time, we propose effective pruning methods on both object and sample levels to filter out false alarms. We integrate the proposed pruning methods into an efficient query procedure that can incrementally maintain the USJ answers. Most importantly, we further design a novel strategy, namely, adaptive superset prejoin (ASP), to maintain a superset of USJ candidate pairs. ASP is in light of our proposed formal cost model such that the average USJ processing cost is minimized. We have conducted extensive experiments to demonstrate the efficiency and effectiveness of our proposed approaches.

Journal ArticleDOI
TL;DR: A modern state estimation algorithm (the Local Ensemble Transform Kalman Filter) is applied to two different mathematical models of glioblastoma, taking into account likely errors in model parameters and measurement uncertainties in magnetic resonance imaging.
Abstract: Data assimilation refers to methods for updating the state vector (initial condition) of a complex spatiotemporal model (such as a numerical weather model) by combining new observations with one or more prior forecasts. We consider the potential feasibility of this approach for making short-term (60-day) forecasts of the growth and spread of a malignant brain cancer (glioblastoma multiforme) in individual patient cases, where the observations are synthetic magnetic resonance images of a hypothetical tumor. We apply a modern state estimation algorithm (the Local Ensemble Transform Kalman Filter), previously developed for numerical weather prediction, to two different mathematical models of glioblastoma, taking into account likely errors in model parameters and measurement uncertainties in magnetic resonance imaging. The filter can accurately shadow the growth of a representative synthetic tumor for 360 days (six 60-day forecast/update cycles) in the presence of a moderate degree of systematic model error and measurement noise. The mathematical methodology described here may prove useful for other modeling efforts in biology and oncology. An accurate forecast system for glioblastoma may prove useful in clinical settings for treatment planning and patient counseling. This article was reviewed by Anthony Almudevar, Tomas Radivoyevitch, and Kristin Swanson (nominated by Georg Luebeck).

Proceedings ArticleDOI
11 Apr 2011
TL;DR: A novel geometric pruning filter is proposed and an iterative filter-refinement strategy is introduced for conservatively and progressively estimating the probabilistic domination count in an efficient way while keeping correctness according to the possible world semantics.
Abstract: In this paper, we propose a novel, effective and efficient probabilistic pruning criterion for probabilistic similarity queries on uncertain data. Our approach supports a general uncertainty model using continuous probabilistic density functions to describe the (possibly correlated) uncertain attributes of objects. In a nutshell, the problem to be solved is to compute the PDF of the random variable denoted by the probabilistic domination count: Given an uncertain database object B, an uncertain reference object R and a set D of uncertain database objects in a multi-dimensional space, the probabilistic domination count denotes the number of uncertain objects in D that are closer to R than B. This domination count can be used to answer a wide range of probabilistic similarity queries. Specifically, we propose a novel geometric pruning filter and introduce an iterative filter-refinement strategy for conservatively and progressively estimating the probabilistic domination count in an efficient way while keeping correctness according to the possible world semantics. In an experimental evaluation, we show that our proposed technique allows to acquire tight probability bounds for the probabilistic domination count quickly, even for large uncertain databases.
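If one already knows, for each uncertain object, the probability that it lies closer to the reference object R than B does, and treats the objects as independent, then the probabilistic domination count follows a Poisson-binomial distribution that a short DP can compute. That is a much weaker setting than the paper's, which handles correlated continuous pdfs via geometric pruning under possible-world semantics, so the sketch below is only a conceptual illustration with invented names.

```python
def domination_count_pdf(domination_probs):
    """PDF of the probabilistic domination count.

    `domination_probs[i]` is the probability that uncertain object i is
    closer to the reference object R than B is.  Assuming the objects
    are independent, the number of dominating objects follows a
    Poisson-binomial distribution, built up with a simple DP.
    """
    pdf = [1.0]                      # P(count = 0) before any object is added
    for q in domination_probs:
        nxt = [0.0] * (len(pdf) + 1)
        for k, p in enumerate(pdf):
            nxt[k] += p * (1.0 - q)  # this object does not dominate
            nxt[k + 1] += p * q      # this object dominates
        pdf = nxt
    return pdf                       # pdf[k] = P(exactly k objects dominate)

probs = [0.2, 0.7, 0.5]
pdf = domination_count_pdf(probs)
print(pdf, sum(pdf))                 # e.g. P(count <= 1) = pdf[0] + pdf[1]
```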

Journal ArticleDOI
TL;DR: This paper presents two classification approaches based on Rough Sets (RS) that are able to learn decision rules from uncertain data, and aims at simplifying the Uncertain Decision Table (UDT) in order to generate significant decision rules for the classification process.

Journal ArticleDOI
01 Nov 2011
TL;DR: A new lower bound to the EMD and an index structure to significantly improve the performance of EMD based K-nearest neighbor (K-NN) queries on uncertain databases are proposed, and it is shown that the method significantly decreases K-NN query time on uncertain databases.
Abstract: Querying uncertain data sets (represented as probability distributions) presents many challenges due to the large amount of data involved and the difficulties comparing uncertainty between distributions. The Earth Mover's Distance (EMD) has increasingly been employed to compare uncertain data due to its ability to effectively capture the differences between two distributions. Computing the EMD entails finding a solution to the transportation problem, which is computationally intensive. In this paper, we propose a new lower bound to the EMD and an index structure to significantly improve the performance of EMD based K-nearest neighbor (K-NN) queries on uncertain databases. We propose a new lower bound to the EMD that approximates the EMD on a projection vector. Each distribution is projected onto a vector and approximated by a normal distribution, as well as an accompanying error term. We then represent each normal as a point in a Hough transformed space. We then use the concept of stochastic dominance to implement an efficient index structure in the transformed space. We show that our method significantly decreases K-NN query time on uncertain databases. The index structure also scales well with database cardinality. It is well suited for heterogeneous data sets, helping to keep EMD based queries tractable as uncertain data sets become larger and more complex.
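The projection idea behind the lower bound can be sketched directly: project both distributions onto a unit vector and compute the exact 1D EMD of the projections as the integral of the absolute difference of their CDFs. Since Euclidean distances can only shrink under orthogonal projection, this 1D value never exceeds the full EMD. The normal approximation, error term and Hough-space index from the paper are not shown, and the function names below are invented.

```python
import numpy as np

def emd_1d(xs, wx, ys, wy):
    """Exact 1D EMD between two weighted point sets (weights sum to 1):
    the integral over t of |F_x(t) - F_y(t)|, evaluated on the merged
    support because both CDFs are step functions."""
    grid = np.unique(np.concatenate([xs, ys]))
    fx = np.array([wx[xs <= t].sum() for t in grid])
    fy = np.array([wy[ys <= t].sum() for t in grid])
    return float(np.sum(np.abs(fx - fy)[:-1] * np.diff(grid)))

def projected_emd_lower_bound(px, wx, py, wy, direction):
    """Lower bound on the Euclidean EMD between two discrete
    distributions in R^d: distances only shrink when points are
    projected onto a unit vector, so the 1D EMD of the projections
    never exceeds the full EMD."""
    v = np.asarray(direction, dtype=float)
    v /= np.linalg.norm(v)
    return emd_1d(px @ v, wx, py @ v, wy)

# two small 2D distributions with unit total mass
px = np.array([[0.0, 0.0], [1.0, 2.0]]); wx = np.array([0.5, 0.5])
py = np.array([[2.0, 1.0], [3.0, 3.0]]); wy = np.array([0.7, 0.3])
print(projected_emd_lower_bound(px, wx, py, wy, [1.0, 1.0]))
```

Because it is a lower bound, the value can be used to discard candidates in a K-NN search before the exact (and expensive) EMD is computed.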

Journal ArticleDOI
Ying Zhang, Wenjie Zhang, Xuemin Lin, Bin Jiang, Jian Pei
TL;DR: An efficient exact algorithm for computing the top-k skyline objects is developed for discrete cases, and an efficient randomized algorithm with an ε-approximation guarantee is developed to address applications where each object may have a massive set of instances or a continuous probability density function.

Journal ArticleDOI
TL;DR: This work proposes a new algorithm for computing all skyline probabilities that is asymptotically faster than prior work, and studies the online version of the problem: returning, for a query point in d-dimensional data, the probability that no instance in the data set dominates it.
Abstract: Skyline computation is widely used in multicriteria decision making. As research in uncertain databases draws increasing attention, skyline queries with uncertain data have also been studied. Some earlier work focused on probabilistic skylines with a given threshold; Atallah and Qi [2009] studied the problem to compute skyline probabilities for all instances of uncertain objects without the use of thresholds, and proposed an algorithm with subquadratic time complexity. In this work, we propose a new algorithm for computing all skyline probabilities that is asymptotically faster: worst-case O(n √n log n) time and O(n) space for 2D data; O(n^{2-1/d} log^{d-1} n) time and O(n log^{d-2} n) space for d-dimensional data. Furthermore, we study the online version of the problem: Given any query point p (unknown until the query time), return the probability that no instance in the given data set dominates p. We propose an algorithm for answering such an online query for d-dimensional data in O(n^{1-1/d} log^{d-1} n) time after preprocessing the data in O(n^{2-1/d} log^{d-1} n) time and space.
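The online query itself has a simple closed form when uncertain objects are independent and each is given as a finite set of weighted instances: the probability that no instance dominates the query point is the product, over objects, of one minus the object's total instance mass that dominates it. The sketch below evaluates that formula by a linear scan; the paper's contribution is answering the query in sublinear time after preprocessing, which this illustration does not attempt. Names and data are invented.

```python
def dominates(a, b):
    """a dominates b if a is no worse in every dimension and strictly
    better in at least one (smaller values are better here)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline_probability(query, objects):
    """P(no instance in the data set dominates `query`).

    `objects` is a list of uncertain objects, each a list of
    (point, prob) instances whose probabilities sum to at most 1;
    objects are assumed independent.  Linear scan per query.
    """
    result = 1.0
    for instances in objects:
        p_dominated = sum(p for pt, p in instances if dominates(pt, query))
        result *= 1.0 - p_dominated
    return result

data = [[((1, 4), 0.5), ((3, 1), 0.5)],
        [((2, 2), 0.4), ((5, 5), 0.6)]]
print(skyline_probability((3, 3), data))   # 0.5 * 0.6 = 0.3
```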

Proceedings ArticleDOI
27 Jun 2011
TL;DR: Experiments with classic Box-Jenkins and Mackey-Glass benchmarks as well as with actual Global40 bond data suggest that the FBeM approach outperforms alternative approaches.
Abstract: Modeling large volumes of flowing data from complex systems motivates rethinking several aspects of the machine learning theory. Data stream mining is concerned with extracting structured knowledge from spatio-temporally correlated data. A profusion of systems and algorithms devoted to this end has been constructed under the conceptual framework of granular computing. This paper outlines a fuzzy set based granular evolving modeling (FBeM) approach for learning from imprecise data. Granulation arises because modeling uncertain data dispenses with attention to fine detail. The evolving aspect is fundamental to account for endless flows of nonstationary data and the structural adaptation of models. Experiments with classic Box-Jenkins and Mackey-Glass benchmarks as well as with actual Global40 bond data suggest that the FBeM approach outperforms alternative approaches.