
Showing papers on "Tuple published in 2008"


Proceedings ArticleDOI
09 Jun 2008
TL;DR: MQL provides an easy-to-use object-oriented interface to the tuple data in Freebase and is designed to facilitate the creation of collaborative, Web-based data-oriented applications.
Abstract: Freebase is a practical, scalable tuple database used to structure general human knowledge. The data in Freebase is collaboratively created, structured, and maintained. Freebase currently contains more than 125,000,000 tuples, more than 4000 types, and more than 7000 properties. Public read/write access to Freebase is allowed through an HTTP-based graph-query API using the Metaweb Query Language (MQL) as a data query and manipulation language. MQL provides an easy-to-use object-oriented interface to the tuple data in Freebase and is designed to facilitate the creation of collaborative, Web-based data-oriented applications.
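
To make the query interface concrete, here is a minimal sketch of an MQL-style query-by-example written as a Python dictionary. The specific type and property paths, and the JSON envelope shown, are illustrative assumptions rather than details taken from the abstract.

```python
import json

# Hypothetical MQL query-by-example: find all albums by a given artist.
# The type and property paths here are illustrative, not taken from the paper.
query = [{
    "type": "/music/artist",   # match objects typed as music artists
    "name": "The Police",      # constrain the name property
    "album": []                # an empty list asks the service to fill in albums
}]

# MQL queries are shipped to the HTTP graph-query API inside a JSON envelope.
envelope = {"query": query}
print(json.dumps(envelope, indent=2))
```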

4,813 citations


Journal ArticleDOI
TL;DR: In this paper, a self-supervised learner employs a parser and heuristics to determine criteria that will be used by an extraction classifier (or other ranking model) for evaluating the trustworthiness of candidate tuples that have been extracted from the corpus of text.
Abstract: To implement open information extraction, a new extraction paradigm has been developed in which a system makes a single data-driven pass over a corpus of text, extracting a large set of relational tuples without requiring any human input. Using training data, a Self-Supervised Learner employs a parser and heuristics to determine criteria that an extraction classifier (or other ranking model) will use to evaluate the trustworthiness of candidate tuples extracted from the corpus. The classifier retains tuples with a sufficiently high probability of being trustworthy. A redundancy-based assessor assigns a probability to each retained tuple indicating the likelihood that it is an actual instance of a relationship among the objects it mentions. The retained tuples form an extraction graph that can be queried for information.
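
As a rough illustration of the extract-classify-assess pipeline described above, here is a toy Python sketch. The candidate tuples, confidence scores, threshold, and redundancy formula are all invented stand-ins, not the actual self-supervised learner or assessor from the paper.

```python
from collections import Counter

# Candidate (arg1, relation, arg2) tuples with a classifier confidence score.
# In the real system these come from a single pass over the corpus plus a
# self-supervised classifier; here both are stubbed with made-up values.
candidates = [
    ("Edison", "invented", "the phonograph", 0.91),
    ("Edison", "invented", "the phonograph", 0.88),
    ("Paris", "is capital of", "France", 0.95),
    ("Paris", "likes", "France", 0.31),
]

THRESHOLD = 0.5  # keep only tuples the classifier deems trustworthy
kept = [(a, r, b) for a, r, b, p in candidates if p >= THRESHOLD]

# Redundancy-based assessment (simplified): tuples extracted more often get a
# higher probability of being a real instance of the relationship.
counts = Counter(kept)
def redundancy_prob(count, base=0.6):
    return 1.0 - (1.0 - base) ** count

extraction_graph = {t: redundancy_prob(c) for t, c in counts.items()}
for (a, r, b), p in extraction_graph.items():
    print(f"{a} --{r}--> {b}  p={p:.2f}")
```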

545 citations


Proceedings ArticleDOI
Ravi Jampani1, Fei Xu1, Mingxi Wu1, Luis Perez1, Chris Jermaine1, Peter J. Haas2 
09 Jun 2008
TL;DR: MCDB is introduced, a system for managing uncertain data that is based on a Monte Carlo approach, which can easily handle arbitrary joint probability distributions over discrete or continuous attributes, arbitrarily complex SQL queries, and arbitrary functionals of the query-result distribution such as means, variances, and quantiles.
Abstract: To deal with data uncertainty, existing probabilistic database systems augment tuples with attribute-level or tuple-level probability values, which are loaded into the database along with the data itself. This approach can severely limit the system's ability to gracefully handle complex or unforeseen types of uncertainty, and does not permit the uncertainty model to be dynamically parameterized according to the current state of the database. We introduce MCDB, a system for managing uncertain data that is based on a Monte Carlo approach. MCDB represents uncertainty via "VG functions," which are used to pseudorandomly generate realized values for uncertain attributes. VG functions can be parameterized on the results of SQL queries over "parameter tables" that are stored in the database, facilitating what-if analyses. By storing parameters, and not probabilities, and by estimating, rather than exactly computing, the probability distribution over possible query answers, MCDB avoids many of the limitations of prior systems. For example, MCDB can easily handle arbitrary joint probability distributions over discrete or continuous attributes, arbitrarily complex SQL queries, and arbitrary functionals of the query-result distribution such as means, variances, and quantiles. To achieve good performance, MCDB uses novel query processing techniques, executing a query plan exactly once, but over "tuple bundles" instead of ordinary tuples. Experiments indicate that our enhanced functionality can be obtained with acceptable overheads relative to traditional systems.
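
A minimal Monte Carlo sketch of the idea behind VG functions follows: uncertain attribute values are realized pseudorandomly from parameters stored in the database, and the distribution of a query answer is estimated over repeated realizations. The parameter values and the query here are invented for illustration, not taken from MCDB.

```python
import random
import statistics

# "Parameter table": per-customer mean and std. dev. of an uncertain attribute
# (e.g. a predicted order amount). In MCDB these would come from SQL queries.
params = {"c1": (100.0, 15.0), "c2": (250.0, 40.0), "c3": (80.0, 10.0)}

def vg_realize(rng):
    """VG-style function: pseudorandomly generate one realized relation."""
    return {cust: rng.gauss(mu, sigma) for cust, (mu, sigma) in params.items()}

def query(relation):
    """An ordinary SQL-style aggregate over one realized ("possible") world."""
    return sum(v for v in relation.values() if v > 90.0)

rng = random.Random(42)
samples = [query(vg_realize(rng)) for _ in range(1000)]

# Arbitrary functionals of the query-result distribution:
print("mean  :", statistics.mean(samples))
print("stdev :", statistics.stdev(samples))
print("p90   :", sorted(samples)[int(0.9 * len(samples))])
```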

305 citations


Proceedings ArticleDOI
09 Jun 2008
TL;DR: This paper proposes Lahar, an event processing system for probabilistic event streams that yields a much higher recall and precision than deterministic techniques operating over only the most probable tuples by using a novel static analysis and novel algorithms.
Abstract: A major problem in detecting events in streams of data is that the data can be imprecise (e.g. RFID data). However, current state-of-the-art event detection systems such as Cayuga [14], SASE [46] or SnoopIB [1] assume the data is precise. Noise in the data can be captured using techniques such as hidden Markov models. Inference on these models creates streams of probabilistic events which cannot be directly queried by existing systems. To address this challenge we propose Lahar, an event processing system for probabilistic event streams. By exploiting the probabilistic nature of the data, Lahar yields a much higher recall and precision than deterministic techniques operating over only the most probable tuples. By using a novel static analysis and novel algorithms, Lahar processes data orders of magnitude more efficiently than a naive approach based on sampling. In this paper, we present Lahar's static analysis and core algorithms. We demonstrate the quality and performance of our approach through experiments with our prototype implementation and comparisons with alternate methods.
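
A small sketch of how HMM inference turns noisy readings into a probabilistic event stream: forward filtering yields, at each time step, a distribution over hidden states (for example, rooms for an RFID-tagged object). The states, transition and emission probabilities below are invented; this illustrates the kind of input Lahar consumes, not Lahar itself.

```python
# Hidden states: rooms an RFID-tagged object may be in; readings are noisy.
states = ["hall", "office", "lab"]
trans = {s: {t: (0.8 if s == t else 0.1) for t in states} for s in states}
emit = {                      # P(antenna reading | true room), made-up values
    "hall":   {"A1": 0.7, "A2": 0.2, "A3": 0.1},
    "office": {"A1": 0.2, "A2": 0.7, "A3": 0.1},
    "lab":    {"A1": 0.1, "A2": 0.2, "A3": 0.7},
}
prior = {s: 1.0 / len(states) for s in states}

def forward_filter(readings):
    """Yield one probabilistic event (a distribution over rooms) per reading."""
    belief = dict(prior)
    for obs in readings:
        # predict: push the belief through the transition model
        predicted = {t: sum(belief[s] * trans[s][t] for s in states) for t in states}
        # update: weight by the emission probability of the noisy reading
        unnorm = {t: predicted[t] * emit[t][obs] for t in states}
        z = sum(unnorm.values())
        belief = {t: p / z for t, p in unnorm.items()}
        yield dict(belief)

for event in forward_filter(["A1", "A1", "A2", "A3"]):
    print(event)
```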

213 citations


Journal ArticleDOI
TL;DR: SaLSa, as discussed by the authors, is a novel skyline algorithm that exploits the idea of presorting the input data so as to effectively limit the number of tuples to be read and compared, which also makes SaLSa attractive when skyline queries are executed on top of systems that do not understand skyline semantics.
Abstract: Skyline queries compute the set of Pareto-optimal tuples in a relation, that is, those tuples that are not dominated by any other tuple in the same relation. Although several algorithms have been proposed for efficiently evaluating skyline queries, they either necessitate the relation to have been indexed or have to perform the dominance tests on all the tuples in order to determine the result. In this article we introduce SaLSa, a novel skyline algorithm that exploits the idea of presorting the input data so as to effectively limit the number of tuples to be read and compared. This makes SaLSa also attractive when skyline queries are executed on top of systems that do not understand skyline semantics, or when the skyline logic runs on clients with limited power and/or bandwidth. We prove that, if one considers symmetric sorting functions, the number of tuples to be read is minimized by sorting data according to a “minimum coordinate,” minC, criterion, and that performance can be further improved if data distribution is known and an asymmetric sorting function is used. Experimental results obtained on synthetic and real datasets show that SaLSa consistently outperforms state-of-the-art sequential skyline algorithms and that its performance can be accurately predicted.
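
To make the presorting idea concrete, here is a small Python sketch of a minC-sorted skyline scan, assuming smaller values are better in every dimension: tuples are sorted by their minimum coordinate, and the scan stops early once some skyline point's maximum coordinate falls below the sort key of everything still unread (such a point dominates all remaining tuples). It illustrates the idea rather than reproducing the paper's exact algorithm and stopping test.

```python
def dominates(p, q):
    """p dominates q if p is <= q everywhere and < q somewhere (minimization)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def presorted_skyline(tuples):
    # Presort by the "minimum coordinate" (minC) criterion.
    data = sorted(tuples, key=min)
    skyline, stop_val = [], float("inf")
    for t in data:
        if min(t) > stop_val:
            break                       # every unread tuple is dominated: stop early
        if not any(dominates(s, t) for s in skyline):
            skyline.append(t)
            # A point whose max coordinate is below later tuples' minC dominates them.
            stop_val = min(stop_val, max(t))
    return skyline

pts = [(1, 9), (2, 2), (5, 4), (6, 6), (9, 1), (8, 8)]
print(presorted_skyline(pts))   # [(1, 9), (9, 1), (2, 2)]
```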

206 citations


Journal ArticleDOI
01 Aug 2008
TL;DR: This paper unifies two different SQL extensions for streams under a single semantics, encapsulated in a new operator called SPREAD that gives the user control over the granularity at which simultaneity can be expressed.
Abstract: This paper describes a unification of two different SQL extensions for streams and its associated semantics. We use the data models from Oracle and StreamBase as our examples. Oracle uses a time-based execution model while StreamBase uses a tuple-based execution model. Time-based execution provides a way to model simultaneity while tuple-based execution provides a way to react to primitive events as soon as they are seen by the system. The result is a new model that gives the user control over the granularity at which one can express simultaneity. Of course, it is possible to ignore simultaneity altogether. The proposed model captures ordering and simultaneity through partial orders on batches of tuples. The batching and the ordering are encapsulated in and can be modified by means of a powerful new operator that we call SPREAD. This paper describes the semantics of SPREAD and gives several examples of its use.

188 citations


Journal ArticleDOI
01 Aug 2008
TL;DR: This paper is the first to formally characterize a "good" pattern tableau, based on naturally desirable properties of support, confidence and parsimony, and shows that the problem of generating an optimal tableau for a given FD is NP-complete but can be approximated in polynomial time via a greedy algorithm.
Abstract: Conditional functional dependencies (CFDs) have recently been proposed as a useful integrity constraint to summarize data semantics and identify data inconsistencies. A CFD augments a functional dependency (FD) with a pattern tableau that defines the context (i.e., the subset of tuples) in which the underlying FD holds. While many aspects of CFDs have been studied, including static analysis and detecting and repairing violations, there has not been prior work on generating pattern tableaux, which is critical to realize the full potential of CFDs. This paper is the first to formally characterize a "good" pattern tableau, based on naturally desirable properties of support, confidence and parsimony. We show that the problem of generating an optimal tableau for a given FD is NP-complete but can be approximated in polynomial time via a greedy algorithm. For large data sets, we propose an "on-demand" algorithm that provides the same approximation bound and outperforms the basic greedy algorithm in running time by an order of magnitude. For ordered attributes, we propose the range tableau as a generalization of a pattern tableau, which can achieve even more parsimony. The effectiveness and efficiency of our techniques are experimentally demonstrated on real data.
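
A simplified greedy sketch of the tableau-generation idea follows: repeatedly add the candidate pattern whose confidence clears a threshold and that covers the most not-yet-covered tuples, until a support target is reached. The toy relation, the embedded FD, the thresholds, and the single-constant candidate patterns are all invented for illustration and are much cruder than the paper's algorithms.

```python
from collections import Counter

# Toy relation and embedded FD [zip] -> [city]; all values/thresholds invented.
R = [
    ("07974", "Murray Hill"), ("07974", "Murray Hill"), ("07974", "Summit"),
    ("10001", "New York"),    ("10001", "New York"),    ("19104", "Philadelphia"),
]
MIN_CONF, SUPPORT_TARGET = 0.9, 0.5

def stats(pattern):
    """Support set and confidence of a single-constant (or '_' wildcard) pattern."""
    rows = [i for i, (z, _) in enumerate(R) if pattern == "_" or z == pattern]
    if not rows:
        return rows, 0.0
    # confidence: fraction kept if each zip group keeps its majority city
    kept = 0
    for z in {R[i][0] for i in rows}:
        cities = Counter(R[i][1] for i in rows if R[i][0] == z)
        kept += cities.most_common(1)[0][1]
    return rows, kept / len(rows)

candidates = sorted({z for z, _ in R}) + ["_"]
tableau, covered = [], set()
while len(covered) < SUPPORT_TARGET * len(R):
    # greedily pick the high-confidence pattern covering most uncovered tuples
    scored = [(len(set(rows) - covered), p)
              for p in candidates
              for rows, conf in [stats(p)] if conf >= MIN_CONF]
    if not scored or max(scored)[0] == 0:
        break
    gain, best = max(scored)
    tableau.append(best)
    covered |= set(stats(best)[0])
    candidates.remove(best)

print("pattern tableau:", tableau)   # ['10001', '19104'] on this toy data
```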

187 citations


Journal ArticleDOI
01 Aug 2008
TL;DR: This work focuses on providing provenance-style explanations for non-answers, develops a mechanism for providing this new type of provenance, and suggests that the approach can provide effective provenance information that helps a user resolve their doubts over non-answers to a query.
Abstract: In information extraction, uncertainty is ubiquitous. For this reason, it is useful to provide users querying extracted data with explanations for the answers they receive. Providing the provenance for tuples in a query result partially addresses this problem, in that provenance can explain why a tuple is in the result of a query. However, in some cases explaining why a tuple is not in the result may be just as helpful. In this work we focus on providing provenance-style explanations for non-answers and develop a mechanism for providing this new type of provenance. Our experience with an information extraction prototype suggests that our approach can provide effective provenance information that can help a user resolve their doubts over non-answers to a query.

186 citations


Proceedings ArticleDOI
09 Jun 2008
TL;DR: The core mining problem of clustering on uncertain data is studied, appropriate natural generalizations of standard clustering optimization criteria are defined, and a variety of bicriteria approximation algorithms are given, including the first known guaranteed approximation algorithms for clustering uncertain data.
Abstract: There is an increasing quantity of data with uncertainty arising from applications such as sensor network measurements, record linkage, and as output of mining algorithms. This uncertainty is typically formalized as probability density functions over tuple values. Beyond storing and processing such data in a DBMS, it is necessary to perform other data analysis tasks such as data mining. We study the core mining problem of clustering on uncertain data, and define appropriate natural generalizations of standard clustering optimization criteria. Two variations arise, depending on whether a point is automatically associated with its optimal center, or whether it must be assigned to a fixed cluster no matter where it is actually located. For uncertain versions of k-means and k-median, we show reductions to their corresponding weighted versions on data with no uncertainties. These are simple in the unassigned case, but require some care for the assigned version. Our most interesting results are for uncertain k-center, which generalizes both traditional k-center and k-median objectives. We show a variety of bicriteria approximation algorithms. One picks O(k ε^-1 log^2 n) centers and achieves a (1 + ε) approximation to the best uncertain k-centers. Another picks 2k centers and achieves a constant factor approximation. Collectively, these results are the first known guaranteed approximation algorithms for the problems of clustering uncertain data.
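
To see why the unassigned uncertain k-means case reduces easily, note that E||X - c||^2 = ||E[X] - c||^2 + Var(X), so replacing each distribution by its expected point leaves the optimal centers unchanged (the variance term is a constant). The sketch below, with invented discrete pdfs, equal weights, and a plain Lloyd's iteration, illustrates this reduction; it is not the paper's algorithm for the assigned or k-center variants.

```python
import random

# Each uncertain point is a discrete pdf: a list of (location, probability).
uncertain = [
    [((0.0, 0.0), 0.5), ((2.0, 0.0), 0.5)],
    [((0.5, 1.0), 1.0)],
    [((9.0, 9.0), 0.3), ((10.0, 10.0), 0.7)],
    [((9.5, 8.5), 1.0)],
]

def expectation(pdf):
    return tuple(sum(p * x[d] for x, p in pdf) for d in range(2))

# Reduction: uncertain k-means -> ordinary k-means on the expected points.
points = [expectation(pdf) for pdf in uncertain]

def lloyd(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[j].append(p)
        centers = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers

print(lloyd(points, k=2))
```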

181 citations


Proceedings ArticleDOI
07 Apr 2008
TL;DR: This work introduces novel polynomial-time algorithms for processing top-k queries in uncertain databases under the generally adopted model of x-relations, including the first known polynomial algorithms for the multi-alternative case, where the current best algorithms have exponential complexity in both time and space.
Abstract: This work introduces novel polynomial-time algorithms for processing top-k queries in uncertain databases, under the generally adopted model of x-relations. An x-relation consists of a number of x-tuples, and each x-tuple randomly instantiates into one tuple from one or more alternatives. Our results significantly improve the best known algorithms for top-k query processing in uncertain databases, in terms of both running time and memory usage. In the single-alternative case, the new algorithms are orders of magnitude faster.
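
A tiny illustration of the kind of probability such algorithms compute: in the single-alternative case with independent tuples, the probability that a tuple is the top-1 answer is its own existence probability times the probability that every higher-scoring tuple is absent, computable in one pass over the tuples sorted by score. The data below are invented, and the paper's actual top-k algorithms are considerably more involved.

```python
# Independent single-alternative x-tuples: (id, score, existence probability).
tuples = [("a", 90, 0.4), ("b", 80, 0.9), ("c", 70, 0.6), ("d", 60, 0.9)]

def top1_probabilities(tuples):
    """P(t is the top-1 answer) = p_t * prod over higher-scored t' of (1 - p_t')."""
    out, none_above = {}, 1.0
    for tid, _, p in sorted(tuples, key=lambda t: -t[1]):
        out[tid] = p * none_above
        none_above *= (1.0 - p)
    return out

print(top1_probabilities(tuples))
# {'a': 0.4, 'b': 0.54, 'c': 0.036, 'd': 0.0216}
```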

176 citations


Journal ArticleDOI
01 Aug 2008
TL;DR: This paper introduces BayesStore, a novel probabilistic data management architecture built on the principle of handling statistical models and probabilistic inference tools as first-class citizens of the database system; it presents BayesStore's uncertainty model, based on a novel first-order statistical model, and redefines traditional query processing operators to manipulate the data and the probabilistic models of the database in an efficient manner.
Abstract: Several real-world applications need to effectively manage and reason about large amounts of data that are inherently uncertain. For instance, pervasive computing applications must constantly reason about volumes of noisy sensory readings for a variety of reasons, including motion prediction and human behavior modeling. Such probabilistic data analyses require sophisticated machine-learning tools that can effectively model the complex spatio/temporal correlation patterns present in uncertain sensory data. Unfortunately, to date, most existing approaches to probabilistic database systems have relied on somewhat simplistic models of uncertainty that can be easily mapped onto existing relational architectures: Probabilistic information is typically associated with individual data tuples, with only limited or no support for effectively capturing and reasoning about complex data correlations. In this paper, we introduce BayesStore, a novel probabilistic data management architecture built on the principle of handling statistical models and probabilistic inference tools as first-class citizens of the database system. Adopting a machine-learning view, BAYESSTORE employs concise statistical relational models to effectively encode the correlation patterns between uncertain data, and promotes probabilistic inference and statistical model manipulation as part of the standard DBMS operator repertoire to support efficient and sound query processing. We present BAYESSTORE's uncertainty model based on a novel, first-order statistical model, and we redefine traditional query processing operators, to manipulate the data and the probabilistic models of the database in an efficient manner. Finally, we validate our approach, by demonstrating the value of exploiting data correlations during query processing, and by evaluating a number of optimizations which significantly accelerate query processing.

Proceedings ArticleDOI
Erik Vee1, Utkarsh Srivastava1, Jayavel Shanmugasundaram1, P. Bhat1, Sihem Amer Yahia1 
07 Apr 2008
TL;DR: In this paper, the problem of efficiently computing diverse query results in online shopping applications was studied, where users specify queries through a form interface that allows a mix of structured and content-based selection conditions.
Abstract: We study the problem of efficiently computing diverse query results in online shopping applications, where users specify queries through a form interface that allows a mix of structured and content-based selection conditions. Intuitively, the goal of diverse query answering is to return a representative set of top-k answers from all the tuples that satisfy the user selection condition. For example, if a user is searching for Honda cars and we can only display five results, we wish to return cars from five different Honda models, as opposed to returning cars from only one or two Honda models. A key contribution of this paper is to formally define the notion of diversity, and to show that existing score based techniques commonly used in web applications are not sufficient to guarantee diversity. Another contribution of this paper is to develop novel and efficient query processing techniques that guarantee diversity. Our experimental results using Yahoo! Autos data show that our proposed techniques are scalable and efficient.
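
A simple sketch of the notion of diversity from the Honda example: when only k results can be shown, take them round-robin across the distinct values of a diversity attribute (the model), each group ordered by score, rather than just returning the k highest-scored tuples. The data are invented, and the paper's one-pass, index-based technique is more elaborate than this.

```python
from collections import defaultdict
from itertools import chain, zip_longest

# (model, score) tuples matching "make = Honda"; values are invented.
cars = [("Civic", 98), ("Civic", 97), ("Civic", 95), ("Accord", 93),
        ("Civic", 92), ("Accord", 91), ("Fit", 88), ("Odyssey", 85)]

def diverse_top_k(rows, k):
    by_model = defaultdict(list)
    for model, score in sorted(rows, key=lambda r: -r[1]):
        by_model[model].append((model, score))
    # round-robin over models (each ordered by score) until k results are picked
    interleaved = chain.from_iterable(zip_longest(*by_model.values()))
    return [r for r in interleaved if r is not None][:k]

print(diverse_top_k(cars, 5))
# [('Civic', 98), ('Accord', 93), ('Fit', 88), ('Odyssey', 85), ('Civic', 97)]
```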

Journal ArticleDOI
01 Aug 2008
TL;DR: This paper designs a unified framework for processing sliding-window top-k queries on uncertain streams, and shows that all the existing top-K definitions in the literature can be plugged into this framework, resulting in several succinct synopses that use space much smaller than the window size.
Abstract: Query processing on uncertain data streams has attracted a lot of attention lately, due to the imprecise nature of the data generated from a variety of streaming applications, such as readings from a sensor network. However, all of the existing works on uncertain data streams study unbounded streams. This paper takes the first step towards the important and challenging problem of answering sliding-window queries on uncertain data streams, with a focus on arguably one of the most important types of queries: top-k queries. The challenge of answering sliding-window top-k queries on uncertain data streams stems from the strict space and time requirements of processing both arriving and expiring tuples in high-speed streams, combined with the difficulty of coping with the exponential blowup in the number of possible worlds induced by the uncertain data model. In this paper, we design a unified framework for processing sliding-window top-k queries on uncertain streams. We show that all the existing top-k definitions in the literature can be plugged into our framework, resulting in several succinct synopses that use space much smaller than the window size, while also being highly efficient in terms of processing time. In addition to the theoretical space and time bounds that we prove for these synopses, we also present a thorough experimental report to verify their practical efficiency on both synthetic and real data.

Proceedings ArticleDOI
07 Apr 2008
TL;DR: Blink is presented as a first attempt at constant response times for ad hoc queries: it runs every query as a table scan over a fully denormalized database, with hash group-by done along the way, and uses a scheme for evaluating a conjunction of range and equality predicates in SIMD fashion over compressed tuples.
Abstract: Query performance in current systems depends significantly on tuning: how well the query matches the available indexes, materialized views etc. Even in a well tuned system, there are always some queries that take much longer than others. This frustrates users who increasingly want consistent response times to ad hoc queries. We argue that query processors should instead aim for constant response times for all queries, with no assumption about tuning. We present Blink, our first attempt at this goal, that runs every query as a table scan over a fully denormalized database, with hash group-by done along the way. To make this scan efficient, Blink uses a novel compression scheme that horizontally partitions tuples by frequency, thereby compressing skewed data almost down to entropy, even while producing long runs of fixed-length, easily-parseable values. We also present a scheme for evaluating a conjunction of range and equality predicates in SIMD fashion over compressed tuples, and different schemes for efficient hash-based aggregation within the L2 cache. An experimental study with a suite of arbitrary single block SQL queries over a TPC-H-like schema suggests that constant-time queries can be efficient.

Proceedings ArticleDOI
15 Dec 2008
TL;DR: A novel graph OLAP framework is developed, which presents a multi-dimensional and multi-level view over graphs and shows how a graph cube can be materialized by calculating a special kind of measure called aggregated graph and how to implement it efficiently.
Abstract: OLAP (On-Line Analytical Processing) is an important notion in data analysis. Recently, more and more graph or networked data sources come into being. There exists a similar need to deploy graph analysis from different perspectives and with multiple granularities. However, traditional OLAP technology cannot handle such demands because it does not consider the links among individual data tuples. In this paper, we develop a novel graph OLAP framework, which presents a multi-dimensional and multi-level view over graphs. The contributions of this work are two-fold. First, starting from basic definitions, i.e., what are dimensions and measures in the graph OLAP scenario, we develop a conceptual framework for data cubes on graphs. We also look into different semantics of OLAP operations, and classify the framework into two major subcases: informational OLAP and topological OLAP. Then, with more emphasis on informational OLAP (topological OLAP will be covered in a future study due to the lack of space), we show how a graph cube can be materialized by calculating a special kind of measure called aggregated graph and how to implement it efficiently. This includes both full materialization and partial materialization where constraints are enforced to obtain an iceberg cube. We can see that the aggregated graphs, which depend on the graph properties of underlying networks, are much harder to compute than their traditional OLAP counterparts, due to the increased structural complexity of data. Empirical studies show insightful results on real datasets and demonstrate the efficiency of our proposed optimizations.
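
A small sketch of the "aggregated graph" measure for informational OLAP: a network over individuals is rolled up to a coarser dimension (for example, author to institution) by merging nodes with the same dimension value and summing the edge weights between groups. The node attributes and weights below are invented.

```python
from collections import defaultdict

# Node -> dimension value (e.g. author -> institution), and weighted edges.
dim = {"alice": "MIT", "bob": "MIT", "carol": "CMU", "dave": "CMU", "erin": "UW"}
edges = [("alice", "bob", 3), ("alice", "carol", 1), ("bob", "dave", 2),
         ("carol", "dave", 5), ("dave", "erin", 1)]

def aggregated_graph(edges, dim):
    """Roll up: merge nodes with the same dimension value, summing edge weights."""
    agg = defaultdict(int)
    for u, v, w in edges:
        gu, gv = sorted((dim[u], dim[v]))
        agg[(gu, gv)] += w
    return dict(agg)

print(aggregated_graph(edges, dim))
# {('MIT', 'MIT'): 3, ('CMU', 'MIT'): 3, ('CMU', 'CMU'): 5, ('CMU', 'UW'): 1}
```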

Journal ArticleDOI
TL;DR: Essence, as discussed by the authors, is a formal language for specifying combinatorial problems in a manner similar to natural rigorous specifications that use a mixture of natural language and discrete mathematics, providing a high level of abstraction.
Abstract: Essence is a formal language for specifying combinatorial problems in a manner similar to natural rigorous specifications that use a mixture of natural language and discrete mathematics. Essence provides a high level of abstraction, much of which is the consequence of the provision of decision variables whose values can be combinatorial objects, such as tuples, sets, multisets, relations, partitions and functions. Essence also allows these combinatorial objects to be nested to arbitrary depth, providing for example sets of partitions, sets of sets of partitions, and so forth. Therefore, a problem that requires finding a complex combinatorial object can be specified directly by using a decision variable whose type is precisely that combinatorial object.

Journal ArticleDOI
TL;DR: This work introduces novel polynomial algorithms for processing top-k queries in uncertain databases under the generally adopted model of x-relations, including the first known polynomial algorithms for the multi-alternative case, where the current best algorithms have exponential complexity in both time and space.
Abstract: This work introduces novel polynomial algorithms for processing top-k queries in uncertain databases under the generally adopted model of x-relations. An x-relation consists of a number of x-tuples, and each x-tuple randomly instantiates into one tuple from one or more alternatives. Our results significantly improve the best known algorithms for top-k query processing in uncertain databases, in terms of both runtime and memory usage. In the single-alternative case, the new algorithms are 2 to 3 orders of magnitude faster than the previous algorithms. In the multialternative case, we introduce the first-known polynomial algorithms, while the current best algorithms have exponential complexity in both time and space. Our algorithms run in near linear or low polynomial time and cover both types of top-k queries in uncertain databases. We provide both the theoretical analysis and an extensive experimental evaluation to demonstrate the superiority of the new approaches over existing solutions.

Journal ArticleDOI
TL;DR: This paper presents a mechanism for proof of ownership based on the secure embedding of a robust imperceptible watermark in relational data, formulates the watermarking of relational databases as a constrained optimization problem, and discusses efficient techniques to solve the optimization problem and to handle the constraints.
Abstract: Proving ownership rights on outsourced relational databases is a crucial issue in today's internet-based application environments and in many content distribution applications. In this paper, we present a mechanism for proof of ownership based on the secure embedding of a robust imperceptible watermark in relational data. We formulate the watermarking of relational databases as a constrained optimization problem and discuss efficient techniques to solve the optimization problem and to handle the constraints. Our watermarking technique is resilient to watermark synchronization errors because it uses a partitioning approach that does not require marker tuples. Our approach overcomes a major weakness in previously proposed watermarking techniques. Watermark decoding is based on a threshold-based technique characterized by an optimal threshold that minimizes the probability of decoding errors. We implemented a proof of concept implementation of our watermarking technique and showed by experimental results that our technique is resilient to tuple deletion, alteration, and insertion attacks.
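
A sketch of the marker-free partitioning idea the abstract alludes to: each tuple is assigned to a partition by a keyed hash of its primary key, so the partitions can be re-derived from the secret key alone and are not disturbed by tuple insertions or deletions. The hash construction and partition count below are illustrative assumptions, not the paper's exact scheme.

```python
import hashlib
import hmac

SECRET_KEY = b"owner-secret"      # known only to the data owner
NUM_PARTITIONS = 4                # illustrative

def partition(primary_key: str) -> int:
    """Marker-free partitioning: a keyed hash of the primary key picks the partition."""
    digest = hmac.new(SECRET_KEY, primary_key.encode(), hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

tuples = ["emp-1001", "emp-1002", "emp-1003", "emp-1004", "emp-1005"]
groups = {}
for pk in tuples:
    groups.setdefault(partition(pk), []).append(pk)

# One watermark bit would then be embedded per partition by slightly perturbing
# a numeric attribute of its tuples (the constrained-optimization step).
print(groups)
```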

Proceedings ArticleDOI
09 Jun 2008
TL;DR: This work demonstrates how Orion simplifies the design and enhances the capabilities of two example applications: managing sensor data (continuous uncertainty) and inferring missing values (discrete uncertainty).
Abstract: Orion is a state-of-the-art uncertain database management system with built-in support for probabilistic data as first class data types. In contrast to other uncertain databases, Orion supports both attribute and tuple uncertainty with arbitrary correlations. This enables the database engine to handle both discrete and continuous pdfs in a natural and accurate manner. The underlying model is closed under the basic relational operators and is consistent with Possible Worlds Semantics. We demonstrate how Orion simplifies the design and enhances the capabilities of two example applications: managing sensor data (continuous uncertainty) and inferring missing values (discrete uncertainty).

Book ChapterDOI
15 Sep 2008
TL;DR: This paper uses the TextRunner system to extract tuples from text, and then induces general concepts and relations from them by jointly clustering the objects and relational strings in the tuples using Markov logic.
Abstract: Extracting knowledge from text has long been a goal of AI. Initial approaches were purely logical and brittle. More recently, the availability of large quantities of text on the Web has led to the development of machine learning approaches. However, to date these have mainly extracted ground facts, as opposed to general knowledge. Other learning approaches can extract logical forms, but require supervision and do not scale. In this paper we present an unsupervised approach to extracting semantic networks from large volumes of text. We use the TextRunner system [1] to extract tuples from text, and then induce general concepts and relations from them by jointly clustering the objects and relational strings in the tuples. Our approach is defined in Markov logic using four simple rules. Experiments on a dataset of two million tuples show that it outperforms three other relational clustering approaches, and extracts meaningful semantic networks.

Patent
Scott M. Heimendinger1
26 Nov 2008
TL;DR: In this paper, performance metrics data in a multi-dimensional structure such as a nested scorecard matrix is transformed into a flat structure or de-normalized for efficient querying of individual records.
Abstract: Performance metrics data in a multi-dimensional structure such as a nested scorecard matrix is transformed into a flat structure, or de-normalized, for efficient querying of individual records. Each dimension and header is converted to a column, and data values are resolved at the intersection of dimension levels through an iterative process covering all dimensions and headers of the data structure. A key corresponding to a tuple representation of each cell, or a transform of the tuple, may be used to identify rows corresponding to the resolved data in cells for further enhanced query capabilities.

Proceedings ArticleDOI
07 Apr 2008
TL;DR: This paper presents a model for handling arbitrary probabilistic uncertain data natively at the database level, and develops a model that is consistent with possible worlds semantics and closed under basic relational operators.
Abstract: The inherent uncertainty of data present in numerous applications such as sensor databases, text annotations, and information retrieval motivate the need to handle imprecise data at the database level. Uncertainty can be at the attribute or tuple level and is present in both continuous and discrete data domains. This paper presents a model for handling arbitrary probabilistic uncertain data (both discrete and continuous) natively at the database level. Our approach leads to a natural and efficient representation for probabilistic data. We develop a model that is consistent with possible worlds semantics and closed under basic relational operators. This is the first model that accurately and efficiently handles both continuous and discrete uncertainty. The model is implemented in a real database system (PostgreSQL) and the effectiveness and efficiency of our approach is validated experimentally.

Proceedings ArticleDOI
26 Oct 2008
TL;DR: This paper proposes minimum-effort driven navigational techniques for enterprise database systems based on the faceted search paradigm that dynamically suggest facets for drilling down into the database such that the cost of navigation is minimized.
Abstract: In this paper, we propose minimum-effort driven navigational techniques for enterprise database systems based on the faceted search paradigm. Our proposed techniques dynamically suggest facets for drilling down into the database such that the cost of navigation is minimized. At every step, the system asks the user a question or a set of questions on different facets and depending on the user response, dynamically fetches the next most promising set of facets, and the process repeats. Facets are selected based on their ability to rapidly drill down to the most promising tuples, as well as on the ability of the user to provide desired values for them. Our facet selection algorithms also work in conjunction with any ranked retrieval model where a ranking function imposes a bias over the user preferences for the selected tuples. Our methods are principled as well as efficient, and our experimental study validates their effectiveness on several application scenarios.
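
A toy sketch of cost-driven facet suggestion: among the remaining facets, ask about the one that minimizes the expected number of tuples left after the user answers, assuming each current tuple is equally likely to be the target. The catalog and the cost model below are simplifications invented for illustration; the paper's algorithms also account for user effort and ranked retrieval.

```python
from collections import Counter

# Remaining candidate tuples, each a dict of facet -> value (values invented).
tuples = [
    {"brand": "Honda", "body": "sedan", "color": "red"},
    {"brand": "Honda", "body": "coupe", "color": "blue"},
    {"brand": "Ford",  "body": "sedan", "color": "red"},
    {"brand": "Ford",  "body": "truck", "color": "black"},
    {"brand": "BMW",   "body": "sedan", "color": "blue"},
]

def expected_remaining(facet, tuples):
    """E[# tuples left] after asking about `facet`, target uniform over tuples."""
    counts = Counter(t[facet] for t in tuples)
    n = len(tuples)
    return sum((c / n) * c for c in counts.values())

def next_facet(tuples, asked=()):
    facets = [f for f in tuples[0] if f not in asked]
    return min(facets, key=lambda f: expected_remaining(f, tuples))

print(next_facet(tuples))        # the facet that narrows the result set fastest
```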

Proceedings ArticleDOI
01 Apr 2008
TL;DR: The content-addressable confidentiality scheme developed for DepSpace bridges the gap between Byzantine fault-tolerant replication and confidentiality of replicated data and can be used in other systems that store critical data.
Abstract: The tuple space coordination model is one of the most interesting coordination models for open distributed systems due to its space and time decoupling and its synchronization power. Several works have tried to improve the dependability of tuple spaces through the use of replication for fault tolerance and access control for security. However, many practical applications in the Internet require both fault tolerance and security. This paper describes the design and implementation of DepSpace, a Byzantine fault-tolerant coordination service that provides a tuple space abstraction. The service offered by DepSpace is secure, reliable and available as long as less than a third of service replicas are faulty. Moreover, the content-addressable confidentiality scheme developed for DepSpace bridges the gap between Byzantine fault-tolerant replication and confidentiality of replicated data and can be used in other systems that store critical data.
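
For readers unfamiliar with the coordination model, here is a minimal single-process sketch of the tuple space abstraction that DepSpace replicates: out inserts a tuple, rdp reads a tuple matching a template (None acting as a wildcard), and inp additionally removes it. The Byzantine fault-tolerant replication and the confidentiality scheme are the paper's contribution and are not shown.

```python
class TupleSpace:
    """A local, single-process tuple space; None in a template is a wildcard."""

    def __init__(self):
        self._tuples = []

    def out(self, t):                     # insert a tuple
        self._tuples.append(tuple(t))

    def _match(self, template, t):
        return len(template) == len(t) and all(
            f is None or f == v for f, v in zip(template, t))

    def rdp(self, template):              # non-blocking read
        return next((t for t in self._tuples if self._match(template, t)), None)

    def inp(self, template):              # non-blocking read-and-remove
        t = self.rdp(template)
        if t is not None:
            self._tuples.remove(t)
        return t

ts = TupleSpace()
ts.out(("task", 7, "pending"))
print(ts.rdp(("task", None, "pending")))  # ('task', 7, 'pending')
print(ts.inp(("task", None, None)))       # removes it
print(ts.rdp(("task", None, None)))       # None
```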

Journal ArticleDOI
01 Aug 2008
TL;DR: This work presents the PWS-quality metric, a universal measure that quantifies the ambiguity of query answers under the possible world semantics, and investigates how such a metric can be used for data cleaning purposes.
Abstract: Uncertain or imprecise data are pervasive in applications like location-based services, sensor monitoring, and data collection and integration. For these applications, probabilistic databases can be used to store uncertain data, and querying facilities are provided to yield answers with statistical confidence. Given that a limited amount of resources is available to "clean" the database (e.g., by probing some sensor data values to get their latest values), we address the problem of choosing the set of uncertain objects to be cleaned, in order to achieve the best improvement in the quality of query answers. For this purpose, we present the PWS-quality metric, which is a universal measure that quantifies the ambiguity of query answers under the possible world semantics. We study how PWS-quality can be efficiently evaluated for two major query classes: (1) queries that examine the satisfiability of tuples independent of other tuples (e.g., range queries); and (2) queries that require the knowledge of the relative ranking of the tuples (e.g., MAX queries). We then propose a polynomial-time solution to achieve an optimal improvement in PWS-quality. Other fast heuristics are presented as well. Experiments, performed on both real and synthetic datasets, show that the PWS-quality metric can be evaluated quickly, and that our cleaning algorithm provides an optimal solution with high efficiency. To our best knowledge, this is the first work that develops a quality metric for a probabilistic database, and investigates how such a metric can be used for data cleaning purposes.
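
A small sketch of an entropy-style quality score for the first query class (tuples satisfying a range query independently): each tuple's ambiguity is the binary entropy of its satisfaction probability, and probing (cleaning) the object with the largest entropy gives the biggest reduction in ambiguity. This is a simplified stand-in; the exact PWS-quality definition and the optimal cleaning algorithm in the paper differ in detail.

```python
import math

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Probability that each uncertain object satisfies a range query (invented).
satisfaction = {"o1": 0.95, "o2": 0.50, "o3": 0.10, "o4": 0.65}

ambiguity = {oid: binary_entropy(p) for oid, p in satisfaction.items()}
total = sum(ambiguity.values())
print("answer ambiguity:", round(total, 3))

# Greedy cleaning: probe the object whose resolution removes the most entropy.
to_clean = max(ambiguity, key=ambiguity.get)
print("clean first:", to_clean)           # 'o2' is the most ambiguous here
```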

Journal ArticleDOI
TL;DR: A specialized join algorithm, termed mesh join (MESHJOIN), is proposed, which compensates for the difference in the access cost of the two join inputs by 1) relying entirely on fast sequential scans of R and 2) sharing the I/O cost of accessing R across multiple tuples of S.
Abstract: Active data warehousing has emerged as an alternative to conventional warehousing practices in order to meet the high demand of applications for up-to-date information. In a nutshell, an active warehouse is refreshed online and thus achieves a higher consistency between the stored information and the latest data updates. The need for online warehouse refreshment introduces several challenges in the implementation of data warehouse transformations, with respect to their execution time and their overhead to the warehouse processes. In this paper, we focus on a frequently encountered operation in this context, namely, the join of a fast stream S of source updates with a disk-based relation R, under the constraint of limited memory. This operation lies at the core of several common transformations such as surrogate key assignment, duplicate detection, or identification of newly inserted tuples. We propose a specialized join algorithm, termed mesh join (MESHJOIN), which compensates for the difference in the access cost of the two join inputs by 1) relying entirely on fast sequential scans of R and 2) sharing the I/O cost of accessing R across multiple tuples of S. We detail the MESHJOIN algorithm and develop a systematic cost model that enables the tuning of MESHJOIN for two objectives: maximizing throughput under a specific memory budget or minimizing memory consumption for a specific throughput. We present an experimental study that validates the performance of MESHJOIN on synthetic and real-life data. Our results verify the scalability of MESHJOIN to fast streams and large relations and demonstrate its numerous advantages over existing join algorithms.
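
A compact sketch of the mesh-join idea follows: stream tuples are buffered while R is scanned cyclically in pages; each page read is joined against every buffered stream tuple, and a stream tuple expires once a full cycle of R has passed it. The page size, batch size, and data are illustrative, and the real algorithm's cost model, buffering, and I/O handling are not reproduced here.

```python
from collections import deque

R = [("k%d" % i, "dim-%d" % i) for i in range(12)]      # disk-based relation
PAGE = 4                                                 # tuples per "page"
pages = [R[i:i + PAGE] for i in range(0, len(R), PAGE)]

stream = [("k3", "s1"), ("k7", "s2"), ("k3", "s3"), ("k11", "s4"), ("k1", "s5")]

def meshjoin(stream, pages, batch=2):
    buffer = deque()          # each entry: [stream tuple, pages still to meet]
    page_no, it = 0, iter(stream)
    while True:
        # admit a small batch of newly arrived stream tuples each step
        for _ in range(batch):
            s = next(it, None)
            if s is not None:
                buffer.append([s, len(pages)])
        if not buffer:
            break
        # one sequential page of R is read and joined with *all* buffered tuples
        page = pages[page_no]
        page_no = (page_no + 1) % len(pages)
        for entry in buffer:
            (skey, sval), _ = entry
            for rkey, rval in page:
                if skey == rkey:
                    yield (skey, sval, rval)
            entry[1] -= 1
        while buffer and buffer[0][1] == 0:   # expire tuples that met every page
            buffer.popleft()

for out in meshjoin(stream, pages):
    print(out)
```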

Journal ArticleDOI
TL;DR: The formulations in this article are based on a marriage of traditional top-k semantics with possible worlds semantics; a generic processing framework is constructed that supports both query types and leverages existing query processing and indexing capabilities in current RDBMSs.
Abstract: Ranking and aggregation queries are widely used in data exploration, data analysis, and decision-making scenarios. While most of the currently proposed ranking and aggregation techniques focus on deterministic data, several emerging applications involve data that is unclean or uncertain. Ranking and aggregating uncertain (probabilistic) data raises new challenges in query semantics and processing, making conventional methods inapplicable. Furthermore, uncertainty imposes probability as a new ranking dimension that does not exist in the traditional settings. In this article we introduce new probabilistic formulations for top-k and ranking-aggregate queries in probabilistic databases. Our formulations are based on a marriage of traditional top-k semantics with possible worlds semantics. In the light of these formulations, we construct a generic processing framework supporting both query types, and leveraging existing query processing and indexing capabilities in current RDBMSs. The framework encapsulates a state space model and efficient search algorithms to compute query answers. Our proposed techniques minimize the number of accessed tuples and the size of materialized search space to compute query answers. Our experimental study shows the efficiency of our techniques under different data distributions with orders of magnitude improvement over naive methods.

Patent
27 Aug 2008
TL;DR: A stream data processing method that can effectively handle delay data is presented: for data whose lifetime is defined by a window, an operation result excluding a delay tuple is output immediately along with an unconfirmed flag, and when the delay tuple arrives, a correct processing result is calculated from the delay tuple and a retained restore tuple.
Abstract: Provided is a stream data processing method that can effectively handle delay data. In this method, which processes data whose lifetime is defined by a window, an operation result excluding a delay tuple is immediately output along with an unconfirmed flag according to the delay processing HBT, while a midway processing result necessary for reproduction is retained along with the lifetime; when the delay tuple arrives, a correct processing result is calculated from the delay tuple and the retained processing-result restore tuple.

Journal ArticleDOI
TL;DR: This article uses the method of hypertree decompositions to derive new algorithms and upper bounds for query containment checking and for computing cores of arbitrary database instances, and shows that computing the core of a data exchange problem is fixed-parameter intractable with respect to a number of relevant parameters.
Abstract: Data exchange deals with inserting data from one database into another database having a different schema. Fagin et al. [2005] have shown that among the universal solutions of a solvable data exchange problem, there exists—up to isomorphism—a unique most compact one, “the core”, and have convincingly argued that this core should be the database to be materialized. They stated as an important open problem whether the core can be computed in polynomial time in the general setting where the mapping between the source and target schemas is given by source-to-target constraints that are arbitrary tuple generating dependencies (tgds) and target constraints consisting of equality generating dependencies (egds) and a weakly acyclic set of tgds. In this article, we solve this problem by developing new methods for efficiently computing the core of a universal solution. This positive result shows that data exchange based on cores is feasible and applicable in a very general setting. In addition to our main result, we use the method of hypertree decompositions to derive new algorithms and upper bounds for query containment checking and computing cores of arbitrary database instances. We also show that computing the core of a data exchange problem is fixed-parameter intractable with respect to a number of relevant parameters, and that computing cores is NP-complete if the rule bodies of target tgds are augmented by a special predicate that distinguishes a null value from a constant data value.

Proceedings ArticleDOI
07 Apr 2008
TL;DR: This paper focuses on a novel and complementary problem: how to guide a seller in selecting the best attributes of a new tuple to highlight such that it stands out in the crowd of existing competitive products and is widely visible to the pool of potential buyers.
Abstract: In recent years, there has been significant interest in development of ranking functions and efficient top-k retrieval algorithms to help users in ad-hoc search and retrieval in databases (e.g., buyers searching for products in a catalog). In this paper we focus on a novel and complementary problem: how to guide a seller in selecting the best attributes of a new tuple (e.g., new product) to highlight such that it stands out in the crowd of existing competitive products and is widely visible to the pool of potential buyers. We develop several interesting formulations of this problem. Although these problems are NP-complete, we can give several exact algorithms as well as approximation heuristics that work well in practice. Our exact algorithms are based on integer programming (IP) formulations of the problems, as well as on adaptations of maximal frequent itemset mining algorithms, while our approximation algorithms are based on greedy heuristics. We conduct a performance study illustrating the benefits of our methods on real as well as synthetic data.
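
A toy version of the greedy heuristic flavor described above: given past buyer queries (each a set of desired attribute values) and a limit on how many attributes of the new product can be highlighted, repeatedly add the attribute value that makes the product visible to the most additional queries. The product, queries, budget, and the simplifying rule that a query is satisfied once all of its desired values are highlighted are all invented; the paper's exact formulations and IP-based algorithms are more involved.

```python
# Attribute values of the new product, and past buyer queries (all invented).
product = {"color": "red", "transmission": "auto", "sunroof": "yes", "awd": "yes"}
queries = [
    {("color", "red")},
    {("transmission", "auto"), ("sunroof", "yes")},
    {("awd", "yes")},
    {("color", "red"), ("awd", "yes")},
    {("sunroof", "yes")},
]
BUDGET = 2   # at most two attributes may be highlighted

def visible(highlighted, query):
    return query <= highlighted          # all desired values are highlighted

highlighted, remaining = set(), set(product.items())
while len(highlighted) < BUDGET and remaining:
    def gain(attr_val):
        h = highlighted | {attr_val}
        return sum(visible(h, q) for q in queries)
    best = max(remaining, key=gain)
    highlighted.add(best)
    remaining.remove(best)

print("highlight:", highlighted)
print("visible to", sum(visible(highlighted, q) for q in queries), "of", len(queries), "queries")
```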