
Showing papers on "Uncertain data" published in 2011


BookDOI
27 Jun 2011
TL;DR: This textbook provides a first course in stochastic programming suitable for students with a basic knowledge of linear programming, elementary analysis, and probability to help students develop an intuition on how to model uncertainty into mathematical problems.
Abstract: The aim of stochastic programming is to find optimal decisions in problems which involve uncertain data. This field is currently developing rapidly with contributions from many disciplines including operations research, mathematics, and probability. At the same time, it is now being applied in a wide variety of subjects ranging from agriculture to financial planning and from industrial engineering to computer networks. This textbook provides a first course in stochastic programming suitable for students with a basic knowledge of linear programming, elementary analysis, and probability. The authors aim to present a broad overview of the main themes and methods of the subject. Its prime goal is to help students develop an intuition on how to model uncertainty into mathematical problems, what uncertainty changes bring to the decision process, and what techniques help to manage uncertainty in solving the problems. In this extensively updated new edition there is more material on methods and examples, including several new approaches for discrete variables, new results on risk measures in modeling and Monte Carlo sampling methods, and a new chapter on relationships to other methods including approximate dynamic programming, robust optimization and online methods. The book is highly illustrated with chapter summaries and many examples and exercises. Students, researchers and practitioners in operations research and the optimization area will find it particularly of interest. Review of First Edition: "The discussion on modeling issues, the large number of examples used to illustrate the material, and the breadth of the coverage make 'Introduction to Stochastic Programming' an ideal textbook for the area." (Interfaces, 1998)

5,398 citations


Journal ArticleDOI
TL;DR: This work discovers that the accuracy of a decision tree classifier can be much improved if the "complete information" of a data item (taking into account the probability density function (pdf)) is utilized.
Abstract: Traditional decision tree classifiers work with data whose values are known and precise. We extend such classifiers to handle data with uncertain information. Value uncertainty arises in many applications during the data collection process. Example sources of uncertainty include measurement/quantization errors, data staleness, and multiple repeated measurements. With uncertainty, the value of a data item is often represented not by one single value, but by multiple values forming a probability distribution. Rather than abstracting uncertain data by statistical derivatives (such as mean and median), we discover that the accuracy of a decision tree classifier can be much improved if the "complete information" of a data item (taking into account the probability density function (pdf)) is utilized. We extend classical decision tree building algorithms to handle data tuples with uncertain values. Extensive experiments have been conducted which show that the resulting classifiers are more accurate than those using value averages. Since processing pdfs is computationally more costly than processing single values (e.g., averages), decision tree construction on uncertain data is more CPU demanding than that for certain data. To tackle this problem, we propose a series of pruning techniques that can greatly improve construction efficiency.
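The idea of exploiting the full pdf rather than a point summary can be pictured with a small sketch. This is a minimal illustration, not the authors' algorithm or pruning techniques: it assumes each tuple's uncertain numeric attribute is given as a discretized pdf of (value, probability) pairs, and it scores a candidate split threshold by letting each tuple contribute fractional class counts to both branches. The function names (`split_entropy`) and the toy data are invented for the example.

```python
import math
from collections import defaultdict

def split_entropy(tuples, threshold):
    """Weighted entropy of splitting uncertain numeric tuples at `threshold`.

    Each tuple is (label, [(value, prob), ...]) where the (value, prob)
    pairs form a discretized pdf summing to 1.  A tuple contributes the
    pdf mass at or below the threshold to the left branch and the rest
    to the right branch (fractional counts instead of its mean value).
    """
    left, right = defaultdict(float), defaultdict(float)
    for label, pdf in tuples:
        p_left = sum(p for v, p in pdf if v <= threshold)
        left[label] += p_left
        right[label] += 1.0 - p_left

    def entropy(counts):
        total = sum(counts.values())
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log2(c / total)
                    for c in counts.values() if c > 0)

    n = float(len(tuples))
    n_left = sum(left.values())
    return (n_left / n) * entropy(left) + ((n - n_left) / n) * entropy(right)

# pick the candidate threshold with the lowest weighted entropy
data = [("A", [(1.0, 0.6), (2.5, 0.4)]),
        ("B", [(2.0, 0.5), (3.0, 0.5)])]
best = min([1.5, 2.0, 2.5], key=lambda t: split_entropy(data, t))
```

Choosing the threshold that minimizes this weighted entropy is the same greedy criterion as in a classical decision tree, just applied to fractional counts derived from the pdfs.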

193 citations


Journal ArticleDOI
TL;DR: The proposed linguistic approach is based on fuzzy set theory and Dempster-Shafer theory of evidence, where the latter has been used to combine the risk of components to determine the system risk.
Abstract: Performing risk analysis can be a challenging task for complex systems due to lack of data and insufficient understanding of the failure mechanisms. A semi-quantitative approach that can utilize imprecise information, uncertain data and domain experts' knowledge can be an effective way to perform risk analysis for complex systems. Though the definition of risk varies considerably across disciplines, it is a well-accepted notion to use a composition of likelihood of system failure and the associated consequences (severity of loss). A complex system consists of various components, where these two elements of risk for each component can be linguistically described by the domain experts. The proposed linguistic approach is based on fuzzy set theory and Dempster-Shafer theory of evidence, where the latter has been used to combine the risk of components to determine the system risk. The proposed risk analysis approach is demonstrated through a numerical example.
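Since the abstract leans on Dempster-Shafer combination of component-level evidence, a small sketch of Dempster's rule of combination may help. This is a generic implementation of the rule, not the paper's linguistic/fuzzy machinery; the frame of discernment, mass values and function name `dempster_combine` are made up for illustration.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions (dict: frozenset -> mass) with
    Dempster's rule: multiply masses of intersecting focal elements
    and renormalize by 1 - K, where K is the mass that falls on the
    empty (fully conflicting) intersection."""
    combined, conflict = {}, 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb
    if conflict >= 1.0:
        raise ValueError("total conflict: evidence cannot be combined")
    return {s: m / (1.0 - conflict) for s, m in combined.items()}

# toy frame of discernment for a component's risk level: {low, medium, high}
m_expert1 = {frozenset({"low"}): 0.6, frozenset({"low", "medium"}): 0.4}
m_expert2 = {frozenset({"medium"}): 0.3,
             frozenset({"low", "medium", "high"}): 0.7}
print(dempster_combine(m_expert1, m_expert2))
```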

131 citations


Journal ArticleDOI
TL;DR: This work presents mathematical formulations for uncertain equivalents of isocontours based on standard probability theory and statistics and employs them in interactive visualization methods to evaluate and display these measures and apply them to 2D and 3D data sets.
Abstract: Uncertainty is ubiquitous in science, engineering and medicine. Drawing conclusions from uncertain data is the normal case, not an exception. While the field of statistical graphics is well established, only a few 2D and 3D visualization and feature extraction methods have been devised that consider uncertainty. We present mathematical formulations for uncertain equivalents of isocontours based on standard probability theory and statistics and employ them in interactive visualization methods. As input data, we consider discretized uncertain scalar fields and model these as random fields. To create a continuous representation suitable for visualization we introduce interpolated probability density functions. Furthermore, we introduce numerical condition as a general means in feature-based visualization. The condition number-which potentially diverges in the isocontour problem-describes how errors in the input data are amplified in feature computation. We show how the average numerical condition of isocontours aids the selection of thresholds that correspond to robust isocontours. Additionally, we introduce the isocontour density and the level crossing probability field; these two measures for the spatial distribution of uncertain isocontours are directly based on the probabilistic model of the input data. Finally, we adapt interactive visualization methods to evaluate and display these measures and apply them to 2D and 3D data sets.
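The level crossing probability mentioned above can be sketched for a single grid cell under a simplifying assumption: if the vertex values are modeled as independent Gaussians, the isocontour at level theta crosses the cell unless all vertices fall on the same side of theta. This is only an illustration of the concept (the paper works with general random fields and interpolated pdfs, possibly correlated); the function `level_crossing_probability` and the toy cell are hypothetical.

```python
from math import erf, sqrt

def below_prob(mu, sigma, theta):
    """P(X <= theta) for X ~ N(mu, sigma^2)."""
    return 0.5 * (1.0 + erf((theta - mu) / (sigma * sqrt(2.0))))

def level_crossing_probability(nodes, theta):
    """Probability that the isocontour at level `theta` crosses a cell
    whose vertex values are independent Gaussians given as (mu, sigma).
    A crossing occurs unless all vertices lie on the same side of theta."""
    p_below = [below_prob(mu, sigma, theta) for mu, sigma in nodes]
    all_below, all_above = 1.0, 1.0
    for p in p_below:
        all_below *= p
        all_above *= (1.0 - p)
    return 1.0 - all_below - all_above

# 2D cell with four uncertain vertex values, isovalue 0.5
cell = [(0.2, 0.1), (0.4, 0.2), (0.6, 0.1), (0.7, 0.3)]
print(level_crossing_probability(cell, 0.5))
```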

131 citations


Dissertation
01 Jan 2011
TL;DR: This talk focuses on two problems in decision making under uncertainty: ranking and top-k query processing over probabilistic databases, and utility maximization for stochastic combinatorial problems.
Abstract: Almost all (important) decision problems are inevitably subject to some level of uncertainty either about data measurements, the parameters, or predictions describing future evolution. The significance of handling uncertainty is further amplified by the large volume of uncertain data automatically generated by modern data gathering or integration systems. Examples include imprecise sensor measurements in a sensor network, inconsistent information collected from different sources in a data integration application, noisy observation data in scientific domains, and so on. Various types of problems of decision making under uncertainty have been a subject of extensive research in computer science, economics and social science. In this talk, I will focus on two important problems in this domain: (1) ranking and top-k query processing over probabilistic databases and (2) utility maximization for stochastic combinatorial problems. I will also briefly discuss some of my other research work, such as stochastic matching and distributed multi-query processing, if time allows.

123 citations


Journal ArticleDOI
TL;DR: INTAMAP is a Web Processing Service for the automatic spatial interpolation of measured point data, using open standards for spatial data such as those developed in the context of the Open Geospatial Consortium, and producing an integrated, open-source solution.

78 citations


Journal ArticleDOI
TL;DR: A novel network condition simulator (NetCoS) produces a synthetic population of sewer sections with a given condition-class distribution, which can be used to benchmark deterioration models and guide utilities in selecting appropriate models and data management strategies.

75 citations


Proceedings ArticleDOI
12 Jun 2011
TL;DR: A unified framework is proposed that can handle both the issues mentioned above to facilitate robust query processing over probabilistic databases and naturally enables highly efficient incremental evaluation when input probabilities are modified.
Abstract: Probabilistic database systems have successfully established themselves as a tool for managing uncertain data. However, much of the research in this area has focused on efficient query evaluation and has largely ignored two key issues that commonly arise in uncertain data management. First, how to provide explanations for query results, e.g., "Why is this tuple in my result?" or "Why does this output tuple have such a high probability?". Second, the problem of determining the sensitive input tuples for the given query, e.g., users are interested in knowing the input tuples that can substantially alter the output when their probabilities are modified (since they may be unsure about the input probability values). Existing systems provide, in addition to the output probabilities, the lineage/provenance of each output tuple: a boolean formula indicating the dependence of the output tuple on the input tuples. However, lineage does not immediately provide a quantitative relationship and it is not informative when we have multiple output tuples. In this paper, we propose a unified framework that can handle both the issues mentioned above to facilitate robust query processing. We formally define the notions of influence and explanations and provide algorithms to determine the top-l influential set of variables and the top-l set of explanations for a variety of queries, including conjunctive queries, probabilistic threshold queries, top-k queries and aggregation queries. Further, our framework naturally enables highly efficient incremental evaluation when input probabilities are modified (e.g., if uncertainty is resolved). Our preliminary experimental results demonstrate the benefits of our framework for performing robust query processing over probabilistic databases.
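One way to picture the notion of influence for independent input tuples is via the lineage formula: the output probability is multilinear in each input tuple probability, so its sensitivity to a tuple t is the difference between the output probability with t forced present and with t forced absent. The sketch below computes this by brute-force possible-world enumeration; it is exponential and purely illustrative, not the paper's algorithms, and the helper names (`prob`, `influence`) are invented.

```python
from itertools import product

def prob(lineage, probs, fixed=None):
    """P(lineage is true) under independent tuple probabilities `probs`
    (dict: tuple id -> probability), optionally conditioning on some
    tuples' presence via `fixed` (dict: tuple id -> True/False).
    Brute-force enumeration over possible worlds; for illustration only."""
    fixed = fixed or {}
    ids = [t for t in probs if t not in fixed]
    total = 0.0
    for bits in product([False, True], repeat=len(ids)):
        world, w = dict(fixed), 1.0
        for t, present in zip(ids, bits):
            world[t] = present
            w *= probs[t] if present else 1.0 - probs[t]
        if lineage(world):
            total += w
    return total

def influence(lineage, probs, t):
    """Since P is multilinear in p_t, dP/dp_t = P(t present) - P(t absent)."""
    return prob(lineage, probs, {t: True}) - prob(lineage, probs, {t: False})

# lineage of one output tuple: (t1 AND t2) OR t3
lineage = lambda w: (w["t1"] and w["t2"]) or w["t3"]
probs = {"t1": 0.9, "t2": 0.5, "t3": 0.2}
print({t: influence(lineage, probs, t) for t in probs})
```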

70 citations


Journal ArticleDOI
TL;DR: This work proposes an intuitive new ranking definition based on the observation that the ranks of a tuple across all possible worlds represent a well-founded rank distribution, and is able to prove that the expected rank, median rank, and quantile rank satisfy all these properties for a ranking query.
Abstract: Recently, there have been several attempts to propose definitions and algorithms for ranking queries on probabilistic data. However, these lack many intuitive properties of a top-k over deterministic data. We define several fundamental properties, including exact-k, containment, unique rank, value invariance, and stability, which are satisfied by ranking queries on certain data. We argue that these properties should also be carefully studied in defining ranking queries in probabilistic data, and fulfilled by definition for ranking uncertain data for most applications. We propose an intuitive new ranking definition based on the observation that the ranks of a tuple across all possible worlds represent a well-founded rank distribution. We studied the ranking definitions based on the expectation, the median, and other statistics of this rank distribution for a tuple and derived the expected rank, median rank, and quantile rank correspondingly. We are able to prove that the expected rank, median rank, and quantile rank satisfy all these properties for a ranking query. We provide efficient solutions to compute such rankings across the major models of uncertain data, such as attribute-level and tuple-level uncertainty. Finally, a comprehensive experimental study confirms the effectiveness of our approach.
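For the attribute-level model with independent, discretely distributed scores, the expected rank has a particularly simple form: by linearity of expectation, the expected rank of a tuple is the sum over other tuples of the probability that they score higher. The sketch below, with hypothetical function and variable names, only illustrates that definition; it does not implement the paper's efficient algorithms or the tuple-level model.

```python
def expected_rank(scores):
    """Expected ranks under attribute-level uncertainty.

    `scores` maps each tuple id to its score distribution as a list of
    (value, prob) pairs (probabilities sum to 1, tuples independent).
    The rank of t in a possible world is the number of tuples scoring
    strictly higher, so by linearity of expectation
        E[rank(t)] = sum over s != t of P(score_s > score_t).
    """
    def p_greater(dist_s, dist_t):
        return sum(ps * pt for vs, ps in dist_s for vt, pt in dist_t if vs > vt)

    return {t: sum(p_greater(scores[s], scores[t]) for s in scores if s != t)
            for t in scores}

dists = {"a": [(10, 0.5), (2, 0.5)],
         "b": [(7, 1.0)],
         "c": [(5, 0.8), (12, 0.2)]}
ranks = expected_rank(dists)           # lower expected rank = better
top_k = sorted(ranks, key=ranks.get)   # e.g. the top-2 tuples: top_k[:2]
```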

64 citations


Journal ArticleDOI
TL;DR: A model-free approach for data mining in engineering based on artificial neural networks is presented; recurrent neural networks for fuzzy data are verified with a fuzzy fractional rheological material model and applied to the identification and prediction of time-dependent structural behavior under dynamic loading.
Abstract: In this paper, a model-free approach for data mining in engineering is presented. The numerical approach is based on artificial neural networks. Recurrent neural networks for fuzzy data are developed to identify and predict complex dependencies from uncertain data. Uncertain structural processes obtained from measurements or numerical analyses are used to identify the time-dependent behavior of engineering structures. Structural action and response processes are treated as fuzzy processes. The identification of uncertain dependencies between structural action and response processes is realized by recurrent neural networks for fuzzy data. Algorithms for signal processing and network training are presented. The new recurrent neural network approach is verified by a fuzzy fractional rheological material model. An application for the identification and prediction of time-dependent structural behavior under dynamic loading is presented.

58 citations


Journal ArticleDOI
TL;DR: Experimental results on synthetic and real-world datasets show that the proposed classifiers are better equipped to handle data uncertainty and outperform state-of-the-art in many cases.
Abstract: This paper studies the problem of constructing robust classifiers when the training is plagued with uncertainty. The problem is posed as a Chance-Constrained Program (CCP) which ensures that the uncertain data points are classified correctly with high probability. Unfortunately such a CCP turns out to be intractable. The key novelty is in employing Bernstein bounding schemes to relax the CCP as a convex second order cone program whose solution is guaranteed to satisfy the probabilistic constraint. Prior to this work, only the Chebyshev based relaxations were exploited in learning algorithms. Bernstein bounds employ richer partial information and hence can be far less conservative than Chebyshev bounds. Due to this efficient modeling of uncertainty, the resulting classifiers achieve higher classification margins and hence better generalization. Methodologies for classifying uncertain test data points and error measures for evaluating classifiers robust to uncertain data are discussed. Experimental results on synthetic and real-world datasets show that the proposed classifiers are better equipped to handle data uncertainty and outperform state-of-the-art in many cases.

Journal ArticleDOI
TL;DR: Recent algorithmic developments on mining frequent patterns from probabilistic databases of uncertain data are reviewed.
Abstract: As an important data mining and knowledge discovery task, association rule mining searches for implicit, previously unknown, and potentially useful pieces of information—in the form of rules revealing associative relationships—that are embedded in the data. In general, the association rule mining process comprises two key steps. The first key step, which mines frequent patterns (i.e., frequently occurring sets of items) from data, is more computationally intensive than the second key step of using the mined frequent patterns to form association rules. In the early days, many developed algorithms mined frequent patterns from traditional transaction databases of precise data such as shopping market basket data, in which the contents of databases are known. However, we are living in an uncertain world, in which uncertain data can be found almost everywhere. Hence, in recent years, researchers have paid more attention to frequent pattern mining from probabilistic databases of uncertain data. In this paper, we review recent algorithmic development on mining uncertain data in these probabilistic databases for frequent patterns. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 316–329 DOI: 10.1002/widm.31
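A common starting point in this line of work is the expected support of an itemset under independent existential item probabilities: each uncertain transaction contributes the product of the probabilities of the itemset's items. The sketch below illustrates that measure only; real miners avoid naive candidate enumeration with Apriori- or tree-based structures, and the toy data and function name here are made up.

```python
def expected_support(db, itemset):
    """Expected support of `itemset` in an uncertain transaction database.

    `db` is a list of transactions, each a dict mapping an item to its
    existential probability; items are assumed independent, so a
    transaction contributes the product of the probabilities of the
    items in the itemset (0 if any item is absent).
    """
    support = 0.0
    for txn in db:
        p = 1.0
        for item in itemset:
            p *= txn.get(item, 0.0)
        support += p
    return support

uncertain_db = [{"bread": 0.9, "milk": 0.7},
                {"bread": 0.4, "beer": 1.0},
                {"milk": 0.8, "beer": 0.5, "bread": 0.6}]
print(expected_support(uncertain_db, {"bread", "milk"}))  # 0.9*0.7 + 0 + 0.6*0.8
```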

Journal ArticleDOI
01 Jul 2011
TL;DR: This paper proposes an algorithm for efficiently answering PRNN queries using new pruning mechanisms taking distance dependencies into account, and compares it to state-of-the-art approaches recently proposed.
Abstract: Given a query object q, a reverse nearest neighbor (RNN) query in a common certain database returns the objects having q as their nearest neighbor. A new challenge for databases is dealing with uncertain objects. In this paper we consider probabilistic reverse nearest neighbor (PRNN) queries, which return the uncertain objects having the query object as nearest neighbor with a sufficiently high probability. We propose an algorithm for efficiently answering PRNN queries using new pruning mechanisms taking distance dependencies into account. We compare our algorithm to state-of-the-art approaches recently proposed. Our experimental evaluation shows that our approach is able to significantly outperform previous approaches. In addition, we show how our approach can easily be extended to PRkNN (where k > 1) query processing for which there is currently no efficient solution.
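The query semantics can be pictured with a brute-force Monte Carlo sketch: repeatedly sample one instance per uncertain object, check whether the candidate object has the query as its nearest neighbor, and average. This only illustrates what a PRNN probability is; it is not the paper's pruning-based algorithm and ignores the distance dependencies the paper exploits. The function names and toy objects are hypothetical.

```python
import random

def prnn_probability(query, objects, target, threshold=0.5, trials=20000):
    """Monte Carlo estimate of P(`target` has `query` as its nearest
    neighbor), i.e. the probability that `target` is a reverse nearest
    neighbor of the (certain) query point.

    Each uncertain object is a list of ((x, y), prob) instances whose
    probabilities sum to 1; objects are sampled independently.
    """
    def draw(instances):
        r, acc = random.random(), 0.0
        for pt, p in instances:
            acc += p
            if r <= acc:
                return pt
        return instances[-1][0]

    def sqdist(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    hits = 0
    for _ in range(trials):
        t = draw(objects[target])
        others = [draw(o) for i, o in enumerate(objects) if i != target]
        if sqdist(t, query) < min(sqdist(t, o) for o in others):
            hits += 1
    prob = hits / trials
    return prob, prob >= threshold   # probability and "sufficiently high?" flag

objs = [[((1.0, 1.0), 0.5), ((4.0, 4.0), 0.5)],   # candidate RNN object
        [((3.0, 0.5), 1.0)],
        [((5.0, 5.0), 0.7), ((0.0, 3.0), 0.3)]]
print(prnn_probability((1.5, 1.5), objs, target=0))
```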

Journal ArticleDOI
TL;DR: An interactive robust data envelopment analysis (IRDEA) model is proposed to determine the input and output target values of electricity distribution companies while accounting for perturbations in the data.
Abstract: One of the primary concerns in target setting for electricity distribution companies is the uncertainty of input/output data. In this paper, an interactive robust data envelopment analysis (IRDEA) model is proposed to determine the input and output target values of electricity distribution companies while considering the existence of perturbations in the data. Target setting is implemented with the uncertain data, and the decision maker (DM) can search the envelope frontier and find the targets based on his preference. In order to search the envelope frontier, the paper combines DEA with multi-objective linear programming methods such as STEM. The proposed method is capable of handling uncertainty in data and finding the target values according to the DM's preferences. To illustrate the ability of the proposed model, a numerical example is solved. Also, the input and output target values for some of the electricity distribution companies in Iran are reported. The results indicate that the IRDEA model is suitable for target setting based on the DM's preferences and while considering uncertain data.

Journal ArticleDOI
TL;DR: A novel Bayesian classification algorithm for uncertain data is proposed and it is shown that the proposed method classifies uncertain data with potentially higher accuracies than the Naive Bayesian approach and has a more stable performance than the existing extended Naïve Bayesian method.
Abstract: Data uncertainty can be caused by numerous factors such as measurement precision limitations, network latency, data staleness and sampling errors. When mining knowledge from emerging applications such as sensor networks or location based services, data uncertainty should be handled cautiously to avoid erroneous results. In this paper, we apply probabilistic and statistical theory on uncertain data and develop a novel method to calculate conditional probabilities of Bayes theorem. Based on that, we propose a novel Bayesian classification algorithm for uncertain data. The experimental results show that the proposed method classifies uncertain data with potentially higher accuracies than the Naive Bayesian approach. It also has a more stable performance than the existing extended Naive Bayesian method.
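To see how a conditional probability can be evaluated against a pdf rather than a point value, here is a minimal sketch under an assumption the paper does not necessarily make: both the class-conditional attribute model and the measurement uncertainty are Gaussian, in which case the expected likelihood of an uncertain reading is the integral of the product of the two Gaussians, i.e. a Gaussian with the two variances added. The function names and toy parameters are invented.

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mu, var):
    return exp(-(x - mu) ** 2 / (2.0 * var)) / sqrt(2.0 * pi * var)

def class_posteriors(sample, classes, priors):
    """Naive Bayes over uncertain numeric attributes.

    `sample` is a list of (mean, var) pairs: each attribute reading is a
    Gaussian describing measurement uncertainty.  `classes[c]` is a list
    of (mu, var) Gaussians learned per attribute for class c.  The
    expected class-conditional likelihood of an uncertain reading is
    the integral of the two Gaussians: N(mean; mu, var + reading_var).
    """
    posts = {}
    for c, params in classes.items():
        like = priors[c]
        for (m, s2), (mu, var) in zip(sample, params):
            like *= gaussian_pdf(m, mu, var + s2)
        posts[c] = like
    z = sum(posts.values())
    return {c: p / z for c, p in posts.items()}

classes = {"pos": [(5.0, 1.0), (2.0, 0.5)],
           "neg": [(3.0, 1.5), (4.0, 0.5)]}
print(class_posteriors([(4.6, 0.4), (2.4, 0.2)], classes,
                       {"pos": 0.5, "neg": 0.5}))
```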

Proceedings ArticleDOI
21 Mar 2011
TL;DR: This paper proposes tree-based algorithms that use the damped window model to mine frequent itemsets from streams of uncertain data.
Abstract: With advances in technology, large amounts of streaming data can be generated continuously by sensors in applications like environment surveillance. Due to the inherent limitations of sensors, these continuous data can be uncertain. This calls for stream mining of uncertain data. In recent years, tree-based algorithms have been proposed to use the sliding window model for mining frequent itemsets from streams of uncertain data. Besides the sliding window model, there are other window models for processing data streams. In this paper, we propose tree-based algorithms that use the damped window model to mine frequent itemsets from streams of uncertain data.
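The damped window model can be sketched independently of the tree structures the paper proposes: every itemset's accumulated expected support is discounted by a decay factor whenever a new uncertain transaction arrives, so older transactions contribute less. The class below is a flat, illustrative version that only enumerates small itemsets and uses invented names such as `DampedExpectedSupport`; it is not the authors' algorithm.

```python
from collections import defaultdict
from itertools import combinations

class DampedExpectedSupport:
    """Track damped-window expected supports of small itemsets over a
    stream of uncertain transactions.  Each transaction maps items to
    existential probabilities; existing contributions are discounted by
    `alpha` at every new arrival (damped window / time-decay model)."""

    def __init__(self, alpha=0.9, max_size=2):
        self.alpha = alpha
        self.max_size = max_size
        self.support = defaultdict(float)

    def add(self, txn):
        for itemset in self.support:          # decay everything seen so far
            self.support[itemset] *= self.alpha
        items = sorted(txn)
        for k in range(1, self.max_size + 1):  # add the new transaction's mass
            for itemset in combinations(items, k):
                p = 1.0
                for item in itemset:
                    p *= txn[item]
                self.support[itemset] += p

    def frequent(self, minsup):
        return {s: v for s, v in self.support.items() if v >= minsup}

miner = DampedExpectedSupport(alpha=0.8)
miner.add({"a": 0.9, "b": 0.6})
miner.add({"a": 0.7, "c": 1.0})
print(miner.frequent(0.5))
```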

Book ChapterDOI
24 May 2011
TL;DR: In this article, the authors consider sequential pattern mining in situations where there is uncertainty about which source an event is associated with and use dynamic programming (DP) to compute the probability that a source supports a sequence.
Abstract: We consider sequential pattern mining in situations where there is uncertainty about which source an event is associated with. We model this in the probabilistic database framework and consider the problem of enumerating all sequences whose expected support is sufficiently large. Unlike frequent itemset mining in probabilistic databases [C. Aggarwal et al. KDD'09; Chui et al., PAKDD'07; Chui and Kao, PAKDD'08], we use dynamic programming (DP) to compute the probability that a source supports a sequence, and show that this suffices to compute the expected support of a sequential pattern. Next, we embed this DP algorithm into candidate generate-and-test approaches, and explore the pattern lattice both in a breadth-first (similar to GSP) and a depth-first (similar to SPAM) manner. We propose optimizations for efficiently computing the frequent 1-sequences, for re-using previously-computed results through incremental support computation, and for eliminating candidate sequences without computing their support via probabilistic pruning. Preliminary experiments show that our optimizations are effective in reducing the CPU cost.
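The core dynamic program can be sketched under a simplified event-level model in which each event independently belongs to the source with a known probability (the paper's source-level uncertainty is modeled differently, but the DP idea is analogous). The recurrence keeps, for each pattern prefix, the probability that it is a subsequence of the events realized so far; the names and data below are illustrative.

```python
def support_probability(events, pattern):
    """Probability that a source's realized event sequence contains
    `pattern` as a subsequence.  `events` is a list of (item, prob)
    pairs: each event independently belongs to the source with the
    given probability (a simplified event-level uncertainty model).

    q[j] = P(pattern[:j] is a subsequence of the events seen so far);
    processing event (item, p):
        q[j] <- p * q[j-1] + (1 - p) * q[j]   if item == pattern[j-1]
        q[j] <- q[j]                          otherwise
    """
    m = len(pattern)
    q = [1.0] + [0.0] * m          # q[0] = 1: the empty pattern is always supported
    for item, p in events:
        for j in range(m, 0, -1):  # go backwards so q[j-1] is still the old value
            if item == pattern[j - 1]:
                q[j] = p * q[j - 1] + (1.0 - p) * q[j]
    return q[m]

stream = [("a", 0.9), ("b", 0.5), ("a", 0.3), ("c", 0.8)]
print(support_probability(stream, ["a", "c"]))   # 0.744
# expected support of the pattern = sum of this probability over all sources
```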

Book ChapterDOI
29 Aug 2011
TL;DR: Mining algorithms that use the time-fading model to discover frequent patterns from streams of uncertain data are proposed.
Abstract: Nowadays, streams of data can be continuously generated by sensors in various real-life applications such as environment surveillance. Partly due to the inherent limitations of the sensors, data in these streams can be uncertain. To discover useful knowledge in the form of frequent patterns from streams of uncertain data, a few algorithms have been developed. They mostly use the sliding window model for processing and mining data streams. However, for some applications, other stream processing models such as the time-fading model are more appropriate. In this paper, we propose mining algorithms that use the time-fading model to discover frequent patterns from streams of uncertain data.

Proceedings ArticleDOI
11 Apr 2011
TL;DR: It is shown that the problem of stochastic skyline is NP-complete with respect to the dimensionality, and novel and efficient algorithms are developed to efficiently compute stoChastic skyline over multi-dimensional uncertain data, which run in polynomial time if thedimensionality is fixed.
Abstract: In many applications involving the multiple criteria optimal decision making, users may often want to make a personal trade-off among all optimal solutions. As a key feature, the skyline in a multi-dimensional space provides the minimum set of candidates for such purposes by removing all points not preferred by any (monotonic) utility/scoring functions; that is, the skyline removes all objects not preferred by any user no matter how their preferences vary. Driven by many applications with uncertain data, the probabilistic skyline model is proposed to retrieve uncertain objects based on skyline probabilities. Nevertheless, skyline probabilities cannot capture the preferences of monotonic utility functions. Motivated by this, in this paper we propose a novel skyline operator, namely stochastic skyline. In the light of the expected utility principle, stochastic skyline guarantees to provide the minimum set of candidates for the optimal solutions over all possible monotonic multiplicative utility functions. In contrast to the conventional skyline or the probabilistic skyline computation, we show that the problem of stochastic skyline is NP-complete with respect to the dimensionality. Novel and efficient algorithms are developed to efficiently compute stochastic skyline over multi-dimensional uncertain data, which run in polynomial time if the dimensionality is fixed. We also show, by theoretical analysis and experiments, that the size of stochastic skyline is quite similar to that of conventional skyline over certain data. Comprehensive experiments demonstrate that our techniques are efficient and scalable regarding both CPU and IO costs.

Book ChapterDOI
15 Aug 2011
TL;DR: Surprisingly, one can compute the distribution of the radius of the smallest enclosing ball exactly in polynomial time, but computing the same distribution for the diameter is #P-hard.
Abstract: We study computing with indecisive point sets. Such points have spatial uncertainty where the true location is one of a finite number of possible locations. This data arises from probing distributions a few times or when the location is one of a few locations from a known database. In particular, we study computing distributions of geometric functions such as the radius of the smallest enclosing ball and the diameter. Surprisingly, we can compute the distribution of the radius of the smallest enclosing ball exactly in polynomial time, but computing the same distribution for the diameter is #P-hard. We generalize our polynomial-time algorithm to all LP-type problems. We also utilize our indecisive framework to deterministically and approximately compute on a more general class of uncertain data where the location of each point is given by a probability distribution.
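The notion of a distribution over a geometric function of indecisive points can be illustrated by brute force: enumerate every joint choice of locations, weight it by the product of the location probabilities, and record the value of the function. The sketch below does this for the diameter; it is exponential and only usable for tiny inputs (consistent with the #P-hardness result quoted above), and it does not reproduce the paper's polynomial-time algorithm for the radius of the smallest enclosing ball. Names and data are invented.

```python
from itertools import product
from math import dist
from collections import defaultdict

def diameter_distribution(points):
    """Distribution of the diameter of an indecisive point set.

    `points` is a list of indecisive points; each is a list of
    (location, prob) pairs over its finitely many possible locations.
    Enumerates all joint realizations, so it is exponential in the
    number of points and meant only to illustrate the concept.
    """
    distribution = defaultdict(float)
    for choice in product(*points):
        locs = [loc for loc, _ in choice]
        prob = 1.0
        for _, p in choice:
            prob *= p
        diam = max(dist(a, b) for i, a in enumerate(locs) for b in locs[i + 1:])
        distribution[round(diam, 9)] += prob
    return dict(distribution)

pts = [[((0, 0), 0.5), ((1, 0), 0.5)],
       [((0, 3), 1.0)],
       [((4, 0), 0.3), ((2, 1), 0.7)]]
print(diameter_distribution(pts))
```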

Posted Content
TL;DR: In this article, a lower bound to the Earth Mover's distance (EMD) and an index structure are proposed to improve the performance of K-NN queries on uncertain databases.
Abstract: Querying uncertain data sets (represented as probability distributions) presents many challenges due to the large amount of data involved and the difficulties comparing uncertainty between distributions. The Earth Mover's Distance (EMD) has increasingly been employed to compare uncertain data due to its ability to effectively capture the differences between two distributions. Computing the EMD entails finding a solution to the transportation problem, which is computationally intensive. In this paper, we propose a new lower bound to the EMD and an index structure to significantly improve the performance of EMD based K-nearest neighbor (K-NN) queries on uncertain databases. We propose a new lower bound to the EMD that approximates the EMD on a projection vector. Each distribution is projected onto a vector and approximated by a normal distribution, as well as an accompanying error term. We then represent each normal as a point in a Hough transformed space. We then use the concept of stochastic dominance to implement an efficient index structure in the transformed space. We show that our method significantly decreases K-NN query time on uncertain databases. The index structure also scales well with database cardinality. It is well suited for heterogeneous data sets, helping to keep EMD based queries tractable as uncertain data sets become larger and more complex.

Posted Content
TL;DR: In this article, a geometric pruning filter is proposed to estimate the probabilistic domination count, which is used to answer a wide range of probability similarity queries on uncertain data.
Abstract: In this paper, we propose a novel, effective and efficient probabilistic pruning criterion for probabilistic similarity queries on uncertain data. Our approach supports a general uncertainty model using continuous probabilistic density functions to describe the (possibly correlated) uncertain attributes of objects. In a nutshell, the problem to be solved is to compute the PDF of the random variable denoted by the probabilistic domination count: Given an uncertain database object B, an uncertain reference object R and a set D of uncertain database objects in a multi-dimensional space, the probabilistic domination count denotes the number of uncertain objects in D that are closer to R than B. This domination count can be used to answer a wide range of probabilistic similarity queries. Specifically, we propose a novel geometric pruning filter and introduce an iterative filter-refinement strategy for conservatively and progressively estimating the probabilistic domination count in an efficient way while keeping correctness according to the possible world semantics. In an experimental evaluation, we show that our proposed technique allows to acquire tight probability bounds for the probabilistic domination count quickly, even for large uncertain databases.

Journal ArticleDOI
TL;DR: This paper formalizes similarity join processing on stream data that inherently contain uncertainty, where the incoming data at each time stamp are uncertain and imprecise, as join on uncertain data streams (USJ), which guarantees the accuracy of USJ answers over uncertain data.
Abstract: Similarity join processing in the streaming environment has many practical applications such as sensor networks, object tracking and monitoring, and so on. Previous works usually assume that stream processing is conducted over precise data. In this paper, we study an important problem of similarity join processing on stream data that inherently contain uncertainty (or called uncertain data streams), where the incoming data at each time stamp are uncertain and imprecise. Specifically, we formalize this problem as join on uncertain data streams (USJ), which can guarantee the accuracy of USJ answers over uncertain data. To tackle the challenges with respect to efficiency and effectiveness such as limited memory and small response time, we propose effective pruning methods on both object and sample levels to filter out false alarms. We integrate the proposed pruning methods into an efficient query procedure that can incrementally maintain the USJ answers. Most importantly, we further design a novel strategy, namely, adaptive superset prejoin (ASP), to maintain a superset of USJ candidate pairs. ASP is in light of our proposed formal cost model such that the average USJ processing cost is minimized. We have conducted extensive experiments to demonstrate the efficiency and effectiveness of our proposed approaches.

Journal ArticleDOI
TL;DR: A modern state estimation algorithm (the Local Ensemble Transform Kalman Filter) is applied to two different mathematical models of glioblastoma, taking into account likely errors in model parameters and measurement uncertainties in magnetic resonance imaging.
Abstract: Data assimilation refers to methods for updating the state vector (initial condition) of a complex spatiotemporal model (such as a numerical weather model) by combining new observations with one or more prior forecasts. We consider the potential feasibility of this approach for making short-term (60-day) forecasts of the growth and spread of a malignant brain cancer (glioblastoma multiforme) in individual patient cases, where the observations are synthetic magnetic resonance images of a hypothetical tumor. We apply a modern state estimation algorithm (the Local Ensemble Transform Kalman Filter), previously developed for numerical weather prediction, to two different mathematical models of glioblastoma, taking into account likely errors in model parameters and measurement uncertainties in magnetic resonance imaging. The filter can accurately shadow the growth of a representative synthetic tumor for 360 days (six 60-day forecast/update cycles) in the presence of a moderate degree of systematic model error and measurement noise. The mathematical methodology described here may prove useful for other modeling efforts in biology and oncology. An accurate forecast system for glioblastoma may prove useful in clinical settings for treatment planning and patient counseling. This article was reviewed by Anthony Almudevar, Tomas Radivoyevitch, and Kristin Swanson (nominated by Georg Luebeck).

Proceedings ArticleDOI
11 Apr 2011
TL;DR: A novel geometric pruning filter is proposed and an iterative filter-refinement strategy is introduced for conservatively and progressively estimating the probabilistic domination count in an efficient way while keeping correctness according to the possible world semantics.
Abstract: In this paper, we propose a novel, effective and efficient probabilistic pruning criterion for probabilistic similarity queries on uncertain data. Our approach supports a general uncertainty model using continuous probabilistic density functions to describe the (possibly correlated) uncertain attributes of objects. In a nutshell, the problem to be solved is to compute the PDF of the random variable denoted by the probabilistic domination count: Given an uncertain database object B, an uncertain reference object R and a set D of uncertain database objects in a multi-dimensional space, the probabilistic domination count denotes the number of uncertain objects in D that are closer to R than B. This domination count can be used to answer a wide range of probabilistic similarity queries. Specifically, we propose a novel geometric pruning filter and introduce an iterative filter-refinement strategy for conservatively and progressively estimating the probabilistic domination count in an efficient way while keeping correctness according to the possible world semantics. In an experimental evaluation, we show that our proposed technique allows to acquire tight probability bounds for the probabilistic domination count quickly, even for large uncertain databases.
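If one already knows, for each uncertain object, the probability that it lies closer to the reference object R than B does, and treats the objects as independent, then the probabilistic domination count follows a Poisson-binomial distribution that a short DP can compute. That is a much weaker setting than the paper's, which handles correlated continuous pdfs via geometric pruning under possible-world semantics, so the sketch below is only a conceptual illustration with invented names.

```python
def domination_count_pdf(domination_probs):
    """PDF of the probabilistic domination count.

    `domination_probs[i]` is the probability that uncertain object i is
    closer to the reference object R than B is.  Assuming the objects
    are independent, the number of dominating objects follows a
    Poisson-binomial distribution, built up with a simple DP.
    """
    pdf = [1.0]                      # P(count = 0) before any object is added
    for q in domination_probs:
        nxt = [0.0] * (len(pdf) + 1)
        for k, p in enumerate(pdf):
            nxt[k] += p * (1.0 - q)  # this object does not dominate
            nxt[k + 1] += p * q      # this object dominates
        pdf = nxt
    return pdf                       # pdf[k] = P(exactly k objects dominate)

probs = [0.2, 0.7, 0.5]
pdf = domination_count_pdf(probs)
print(pdf, sum(pdf))                 # e.g. P(count <= 1) = pdf[0] + pdf[1]
```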

Journal ArticleDOI
TL;DR: This paper presents two classification approaches based on Rough Sets (RS) that are able to learn decision rules from uncertain data, and aims at simplifying the Uncertain Decision Table (UDT) in order to generate significant decision rules for the classification process.

Journal ArticleDOI
01 Nov 2011
TL;DR: A new lower bound to the EMD and an index structure to significantly improve the performance of EMD based K-nearest neighbor (K-NN) queries on uncertain databases are proposed, and it is shown that the method significantly decreases K-NN query time on uncertain databases.
Abstract: Querying uncertain data sets (represented as probability distributions) presents many challenges due to the large amount of data involved and the difficulties comparing uncertainty between distributions. The Earth Mover's Distance (EMD) has increasingly been employed to compare uncertain data due to its ability to effectively capture the differences between two distributions. Computing the EMD entails finding a solution to the transportation problem, which is computationally intensive. In this paper, we propose a new lower bound to the EMD and an index structure to significantly improve the performance of EMD based K-nearest neighbor (K-NN) queries on uncertain databases. We propose a new lower bound to the EMD that approximates the EMD on a projection vector. Each distribution is projected onto a vector and approximated by a normal distribution, as well as an accompanying error term. We then represent each normal as a point in a Hough transformed space. We then use the concept of stochastic dominance to implement an efficient index structure in the transformed space. We show that our method significantly decreases K-NN query time on uncertain databases. The index structure also scales well with database cardinality. It is well suited for heterogeneous data sets, helping to keep EMD based queries tractable as uncertain data sets become larger and more complex.
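The projection idea behind the lower bound can be sketched directly: project both distributions onto a unit vector and compute the exact 1D EMD of the projections as the integral of the absolute difference of their CDFs. Since Euclidean distances can only shrink under orthogonal projection, this 1D value never exceeds the full EMD. The normal approximation, error term and Hough-space index from the paper are not shown, and the function names below are invented.

```python
import numpy as np

def emd_1d(xs, wx, ys, wy):
    """Exact 1D EMD between two weighted point sets (weights sum to 1):
    the integral over t of |F_x(t) - F_y(t)|, evaluated on the merged
    support because both CDFs are step functions."""
    grid = np.unique(np.concatenate([xs, ys]))
    fx = np.array([wx[xs <= t].sum() for t in grid])
    fy = np.array([wy[ys <= t].sum() for t in grid])
    return float(np.sum(np.abs(fx - fy)[:-1] * np.diff(grid)))

def projected_emd_lower_bound(px, wx, py, wy, direction):
    """Lower bound on the Euclidean EMD between two discrete
    distributions in R^d: distances only shrink when points are
    projected onto a unit vector, so the 1D EMD of the projections
    never exceeds the full EMD."""
    v = np.asarray(direction, dtype=float)
    v /= np.linalg.norm(v)
    return emd_1d(px @ v, wx, py @ v, wy)

# two small 2D distributions with unit total mass
px = np.array([[0.0, 0.0], [1.0, 2.0]]); wx = np.array([0.5, 0.5])
py = np.array([[2.0, 1.0], [3.0, 3.0]]); wy = np.array([0.7, 0.3])
print(projected_emd_lower_bound(px, wx, py, wy, [1.0, 1.0]))
```

Because it is a lower bound, the value can be used to discard candidates in a K-NN search before the exact (and expensive) EMD is computed.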

Journal ArticleDOI
Ying Zhang, Wenjie Zhang, Xuemin Lin, Bin Jiang, Jian Pei
TL;DR: An efficient exact algorithm for computing the top-k skyline objects is developed for discrete cases, and an efficient randomized algorithm with an ε-approximation guarantee is developed to address applications where each object may have a massive set of instances or a continuous probability density function.

Journal ArticleDOI
TL;DR: This work proposes a new algorithm for computing all skyline probabilities that is asymptotically faster than prior work, and studies the online version of the problem: returning, for a query point in d-dimensional data, the probability that no instance in the data set dominates it.
Abstract: Skyline computation is widely used in multicriteria decision making. As research in uncertain databases draws increasing attention, skyline queries with uncertain data have also been studied. Some earlier work focused on probabilistic skylines with a given threshold; Atallah and Qi [2009] studied the problem to compute skyline probabilities for all instances of uncertain objects without the use of thresholds, and proposed an algorithm with subquadratic time complexity. In this work, we propose a new algorithm for computing all skyline probabilities that is asymptotically faster: worst-case O(n √n log n) time and O(n) space for 2D data; O(n^{2-1/d} log^{d-1} n) time and O(n log^{d-2} n) space for d-dimensional data. Furthermore, we study the online version of the problem: Given any query point p (unknown until the query time), return the probability that no instance in the given data set dominates p. We propose an algorithm for answering such an online query for d-dimensional data in O(n^{1-1/d} log^{d-1} n) time after preprocessing the data in O(n^{2-1/d} log^{d-1} n) time and space.
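The online query itself has a simple closed form when uncertain objects are independent and each is given as a finite set of weighted instances: the probability that no instance dominates the query point is the product, over objects, of one minus the object's total instance mass that dominates it. The sketch below evaluates that formula by a linear scan; the paper's contribution is answering the query in sublinear time after preprocessing, which this illustration does not attempt. Names and data are invented.

```python
def dominates(a, b):
    """a dominates b if a is no worse in every dimension and strictly
    better in at least one (smaller values are better here)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline_probability(query, objects):
    """P(no instance in the data set dominates `query`).

    `objects` is a list of uncertain objects, each a list of
    (point, prob) instances whose probabilities sum to at most 1;
    objects are assumed independent.  Linear scan per query.
    """
    result = 1.0
    for instances in objects:
        p_dominated = sum(p for pt, p in instances if dominates(pt, query))
        result *= 1.0 - p_dominated
    return result

data = [[((1, 4), 0.5), ((3, 1), 0.5)],
        [((2, 2), 0.4), ((5, 5), 0.6)]]
print(skyline_probability((3, 3), data))   # 0.5 * 0.6 = 0.3
```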

Proceedings ArticleDOI
27 Jun 2011
TL;DR: Experiments with classic Box-Jenkins and Mackey-Glass benchmarks as well as with actual Global40 bond data suggest that the FBeM approach outperforms alternative approaches.
Abstract: Modeling large volumes of flowing data from complex systems motivates rethinking several aspects of the machine learning theory. Data stream mining is concerned with extracting structured knowledge from spatio-temporally correlated data. A profusion of systems and algorithms devoted to this end has been constructed under the conceptual framework of granular computing. This paper outlines a fuzzy set based granular evolving modeling (FBeM) approach for learning from imprecise data. Granulation arises because modeling uncertain data dispenses with attention to fine detail. The evolving aspect is fundamental to account for endless flows of nonstationary data and the structural adaptation of models. Experiments with classic Box-Jenkins and Mackey-Glass benchmarks as well as with actual Global40 bond data suggest that the FBeM approach outperforms alternative approaches.