
Showing papers on "Tuple published in 2010"


Proceedings ArticleDOI
04 Feb 2010
TL;DR: This paper proposes models and algorithms for learning influence-model parameters from a social graph and an action log and for testing the learned models to make predictions, and develops techniques for predicting the time by which a user may be expected to perform an action.
Abstract: Recently, there has been tremendous interest in the phenomenon of influence propagation in social networks. The studies in this area assume they have as input to their problems a social graph with edges labeled with probabilities of influence between users. However, the question of where these probabilities come from or how they can be computed from real social network data has been largely ignored until now. Thus it is interesting to ask whether from a social graph and a log of actions by its users, one can build models of influence. This is the main problem attacked in this paper. In addition to proposing models and algorithms for learning the model parameters and for testing the learned models to make predictions, we also develop techniques for predicting the time by which a user may be expected to perform an action. We validate our ideas and techniques using the Flickr data set consisting of a social graph with 1.3M nodes, 40M edges, and an action log consisting of 35M tuples referring to 300K distinct actions. Beyond showing that there is genuine influence happening in a real social network, we show that our techniques have excellent prediction performance.
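As a rough illustration of the kind of model the paper learns, the sketch below estimates an influence probability for each social edge as the fraction of a user's actions later repeated by a neighbour. This Bernoulli-style ratio, the helper names, and the toy log are assumptions for illustration only, not the paper's exact estimators.

```python
from collections import defaultdict

def learn_influence_probs(action_log, edges):
    """Estimate pairwise influence probabilities from an action log.

    action_log: iterable of (user, action, timestamp) tuples.
    edges: set of directed (v, u) pairs meaning v is a neighbour of u.
    Returns p[(v, u)] = (#actions u performed after neighbour v) / (#actions v performed).
    """
    performed = defaultdict(dict)          # action -> {user: earliest time}
    for user, action, t in action_log:
        performed[action][user] = min(t, performed[action].get(user, t))

    actions_by = defaultdict(int)          # A_v: actions performed by v
    propagated = defaultdict(int)          # A_{v->u}: actions that flowed from v to u
    for action, times in performed.items():
        for v in times:
            actions_by[v] += 1
        for (v, u) in edges:
            if v in times and u in times and times[v] < times[u]:
                propagated[(v, u)] += 1

    return {e: propagated[e] / actions_by[e[0]]
            for e in edges if actions_by[e[0]] > 0}

# Example: u2 repeats u1's action a1 after seeing it, but not a2.
log = [("u1", "a1", 1), ("u2", "a1", 5), ("u1", "a2", 2)]
print(learn_influence_probs(log, {("u1", "u2")}))   # {('u1', 'u2'): 0.5}
```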

1,116 citations


Journal ArticleDOI
TL;DR: The Distributional Memory approach is shown to be tenable despite the constraints imposed by its multi-purpose nature, and performs competitively against task-specific algorithms recently reported in the literature for the same tasks, and against several state-of-the-art methods.
Abstract: Research into corpus-based semantics has focused on the development of ad hoc models that treat single tasks, or sets of closely related tasks, as unrelated challenges to be tackled by extracting different kinds of distributional information from the corpus. As an alternative to this "one task, one model" approach, the Distributional Memory framework extracts distributional information once and for all from the corpus, in the form of a set of weighted word-link-word tuples arranged into a third-order tensor. Different matrices are then generated from the tensor, and their rows and columns constitute natural spaces to deal with different semantic problems. In this way, the same distributional information can be shared across tasks such as modeling word similarity judgments, discovering synonyms, concept categorization, predicting selectional preferences of verbs, solving analogy problems, classifying relations between word pairs, harvesting qualia structures with patterns or example pairs, predicting the typical properties of concepts, and classifying verbs into alternation classes. Extensive empirical testing in all these domains shows that a Distributional Memory implementation performs competitively against task-specific algorithms recently reported in the literature for the same tasks, and against our implementations of several state-of-the-art methods. The Distributional Memory approach is thus shown to be tenable despite the constraints imposed by its multi-purpose nature.
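A minimal sketch of the core data structure, assuming a toy corpus: weighted word-link-word tuples stored as a sparse third-order tensor, matricized into a word-by-(link, word) space and compared with cosine similarity. The tuple weights and the similarity task are illustrative, not taken from the paper.

```python
from collections import defaultdict

# Toy set of weighted word-link-word tuples (a sparse third-order tensor).
tuples = {
    ("soldier", "use", "gun"):   8.0,
    ("teacher", "use", "book"):  5.0,
    ("soldier", "in", "war"):    6.0,
    ("teacher", "in", "school"): 7.0,
}

def word_by_link_word(tensor):
    """Matricize the tensor into the word x (link, word) space used for word similarity."""
    rows = defaultdict(dict)
    for (w1, link, w2), weight in tensor.items():
        rows[w1][(link, w2)] = weight
    return rows

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = lambda x: sum(a * a for a in x.values()) ** 0.5
    return dot / (norm(u) * norm(v)) if u and v else 0.0

m = word_by_link_word(tuples)
print(cosine(m["soldier"], m["teacher"]))   # 0.0 here: no shared (link, word) contexts
```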

671 citations


Journal ArticleDOI
01 Sep 2010
TL;DR: Schism consistently outperforms simple partitioning schemes, and in some cases proves superior to the best known manual partitioning, reducing the cost of distributed transactions up to 30%.
Abstract: We present Schism, a novel workload-aware approach for database partitioning and replication designed to improve scalability of shared-nothing distributed databases. Because distributed transactions are expensive in OLTP settings (a fact we demonstrate through a series of experiments), our partitioner attempts to minimize the number of distributed transactions, while producing balanced partitions. Schism consists of two phases: i) a workload-driven, graph-based replication/partitioning phase and ii) an explanation and validation phase. The first phase creates a graph with a node per tuple (or group of tuples) and edges between nodes accessed by the same transaction, and then uses a graph partitioner to split the graph into k balanced partitions that minimize the number of cross-partition transactions. The second phase exploits machine learning techniques to find a predicate-based explanation of the partitioning strategy (i.e., a set of range predicates that represent the same replication/partitioning scheme produced by the partitioner).The strengths of Schism are: i) independence from the schema layout, ii) effectiveness on n-to-n relations, typical in social network databases, iii) a unified and fine-grained approach to replication and partitioning. We implemented and tested a prototype of Schism on a wide spectrum of test cases, ranging from classical OLTP workloads (e.g., TPC-C and TPC-E), to more complex scenarios derived from social network websites (e.g., Epinions.com), whose schema contains multiple n-to-n relationships, which are known to be hard to partition. Schism consistently outperforms simple partitioning schemes, and in some cases proves superior to the best known manual partitioning, reducing the cost of distributed transactions up to 30%.
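The first phase can be pictured with the sketch below: tuples become graph nodes, co-accessed tuples are connected by weighted edges, and a graph partitioner splits the graph. Here networkx's Kernighan-Lin bisection stands in for the k-way partitioner (e.g., METIS) that Schism actually relies on, and the transactions are made up.

```python
import networkx as nx
from networkx.algorithms.community import kernighan_lin_bisection
from itertools import combinations

# Each transaction is the set of tuple ids it touches.
transactions = [
    {"cust:1", "acct:10"}, {"cust:1", "acct:10"},   # frequently co-accessed pair
    {"cust:2", "acct:20"}, {"cust:2", "acct:20"},
    {"cust:1", "acct:20"},                          # one cross pair
]

# Build the co-access graph: edge weight = number of transactions touching both tuples.
G = nx.Graph()
for txn in transactions:
    for a, b in combinations(sorted(txn), 2):
        w = G.get_edge_data(a, b, {"weight": 0})["weight"]
        G.add_edge(a, b, weight=w + 1)

# 2-way min-cut-style split; Schism itself uses a k-way partitioner such as METIS.
p0, p1 = kernighan_lin_bisection(G, weight="weight", seed=1)
print(p0, p1)
```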

602 citations


Proceedings ArticleDOI
22 Mar 2010
TL;DR: The problem of optimizing the shares, given a fixed number of Reduce processes, is studied, and an algorithm for detecting and fixing problems where an attribute is "mistakenly" included in the map-key is given.
Abstract: Implementations of map-reduce are being used to perform many operations on very large data. We examine strategies for joining several relations in the map-reduce environment. Our new approach begins by identifying the "map-key," the set of attributes that identify the Reduce process to which a Map process must send a particular tuple. Each attribute of the map-key gets a "share," which is the number of buckets into which its values are hashed, to form a component of the identifier of a Reduce process. Relations have their tuples replicated in limited fashion, the degree of replication depending on the shares for those map-key attributes that are missing from their schema. We study the problem of optimizing the shares, given a fixed number of Reduce processes. An algorithm for detecting and fixing problems where an attribute is "mistakenly" included in the map-key is given. Then, we consider two important special cases: chain joins and star joins. In each case we are able to determine the map-key and determine the shares that yield the least replication. While the method we propose is not always superior to the conventional way of using map-reduce to implement joins, there are some important cases involving large-scale data where our method wins, including: (1) analytic queries in which a very large fact table is joined with smaller dimension tables, and (2) queries involving paths through graphs with high out-degree, such as the Web or a social network.
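The routing idea can be sketched as follows, assuming a chain join R(A,B) ⋈ S(B,C) ⋈ T(C,D) with map-key (B, C): a tuple hashes on the map-key attributes it has and is replicated over every bucket of the attributes it lacks. The share values and relations are invented; the paper's share-optimization algorithm is not reproduced here.

```python
import itertools

def reducer_ids(tup, schema, map_key, shares):
    """Return the reducer coordinates a tuple must be sent to.

    map_key: ordered list of join attributes, e.g. ["B", "C"].
    shares:  dict attribute -> number of hash buckets for that attribute.
    Attributes missing from the tuple's schema act as wildcards: the tuple
    is replicated along every bucket of that attribute.
    """
    axes = []
    for attr in map_key:
        if attr in schema:
            axes.append([hash(tup[schema.index(attr)]) % shares[attr]])
        else:
            axes.append(range(shares[attr]))       # replicate along this axis
    return list(itertools.product(*axes))

# Chain join R(A,B) |x| S(B,C) |x| T(C,D), map-key (B, C), shares b=2, c=3 (6 reducers).
shares = {"B": 2, "C": 3}
print(reducer_ids(("a1", "b1"), ["A", "B"], ["B", "C"], shares))  # 3 reducers (C is a wildcard)
print(reducer_ids(("b1", "c1"), ["B", "C"], ["B", "C"], shares))  # exactly 1 reducer
print(reducer_ids(("c1", "d1"), ["C", "D"], ["B", "C"], shares))  # 2 reducers (B is a wildcard)
```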

382 citations


Proceedings ArticleDOI
01 Mar 2010
TL;DR: This paper develops a data publishing technique that ensures differential privacy while providing accurate answers for range-count queries, i.e., count queries where the predicate on each attribute is a range.
Abstract: Privacy preserving data publishing has attracted considerable research interest in recent years. Among the existing solutions, ε-differential privacy provides one of the strongest privacy guarantees. Existing data publishing methods that achieve ε-differential privacy, however, offer little data utility. In particular, if the output dataset is used to answer count queries, the noise in the query answers can be proportional to the number of tuples in the data, which renders the results useless. In this paper, we develop a data publishing technique that ensures ε-differential privacy while providing accurate answers for range-count queries, i.e., count queries where the predicate on each attribute is a range. The core of our solution is a framework that applies wavelet transforms on the data before adding noise to it. We present instantiations of the proposed framework for both ordinal and nominal data, and we provide a theoretical analysis on their privacy and utility guarantees. In an extensive experimental study on both real and synthetic data, we show the effectiveness and efficiency of our solution.
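A minimal sketch of the wavelet idea, assuming a one-dimensional vector of per-bucket counts: take an (unnormalised) Haar transform, perturb the coefficients with Laplace noise, and invert. The paper calibrates the noise scale per level to obtain its polylogarithmic range-count error; a single scale is used here for brevity.

```python
import numpy as np

def haar(v):
    """Full (unnormalised) Haar decomposition: per-level detail coefficients plus the overall average."""
    v = np.asarray(v, dtype=float)
    out = []
    while len(v) > 1:
        avg = (v[0::2] + v[1::2]) / 2.0
        det = (v[0::2] - v[1::2]) / 2.0
        out.append(det)
        v = avg
    out.append(v)                 # overall average
    return out[::-1]              # coarsest coefficients first

def inverse_haar(coeffs):
    v = coeffs[0]
    for det in coeffs[1:]:
        up = np.empty(2 * len(det))
        up[0::2] = v + det
        up[1::2] = v - det
        v = up
    return v

rng = np.random.default_rng(0)
counts = np.array([4, 0, 3, 5, 1, 0, 2, 7], dtype=float)   # per-bucket tuple counts
coeffs = haar(counts)
# Add Laplace noise to every coefficient; one scale here, per-level scales in the paper.
noisy = [c + rng.laplace(scale=1.0, size=len(c)) for c in coeffs]
released = inverse_haar(noisy)
print(released.round(2))          # noisy counts; a range sum touches only O(log n) coefficients
```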

302 citations


Journal ArticleDOI
Guiwu Wei
TL;DR: A method based on the ET-WG and ET-OWG operators for multiple attribute group decision-making is presented, and the ranking of alternatives or selection of the most desirable alternative(s) is obtained by comparing 2-tuple linguistic information.
Abstract: With respect to multiple attribute group decision-making problems with linguistic information for attribute values and weight values, a group decision analysis is proposed. Some new aggregation operators are introduced: the extended 2-tuple weighted geometric (ET-WG) operator and the extended 2-tuple ordered weighted geometric (ET-OWG) operator, and properties of these operators are analyzed. Then, a method based on the ET-WG and ET-OWG operators for multiple attribute group decision-making is presented. In the approach, alternative appraisal values are calculated by aggregating 2-tuple linguistic information, and the ranking of alternatives or selection of the most desirable alternative(s) is obtained by comparing 2-tuple linguistic information. Finally, a numerical example is used to illustrate the applicability and effectiveness of the proposed method.
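A sketch of the standard 2-tuple linguistic machinery such operators build on, with a made-up label set: Δ maps a numeric value to a (label, symbolic translation) pair, Δ⁻¹ inverts it, and a weighted geometric mean of the numeric images gives an ET-WG-style aggregation. The exact operator definitions in the paper are more general than this simplification.

```python
S = ["none", "very_low", "low", "medium", "high", "very_high", "perfect"]  # labels s_0..s_6

def delta(beta):
    """beta in [0, len(S)-1]  ->  2-tuple (label index, symbolic translation alpha)."""
    i = int(round(beta))
    return i, beta - i              # alpha lies in [-0.5, 0.5)

def delta_inv(two_tuple):
    i, alpha = two_tuple
    return i + alpha

def weighted_geometric(two_tuples, weights):
    """Weighted geometric mean of the numeric images of 2-tuples (the flavour of the ET-WG operator)."""
    prod = 1.0
    for t, w in zip(two_tuples, weights):
        prod *= delta_inv(t) ** w
    return delta(prod)

ratings = [delta(4), delta(3), delta(5)]                # high, medium, very_high
print(weighted_geometric(ratings, [0.5, 0.3, 0.2]))     # ~ (4, -0.16): close to "high"
```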

249 citations


Proceedings ArticleDOI
06 Jun 2010
TL;DR: A query language for provenance is developed, which can express all of the aforementioned types of queries, as well as many more, and the feasibility of provenance querying and the benefits of the indexing techniques across a variety of application classes and queries are experimentally validated.
Abstract: Many advanced data management operations (e.g., incremental maintenance, trust assessment, debugging schema mappings, keyword search over databases, or query answering in probabilistic databases), involve computations that look at how a tuple was produced, e.g., to determine its score or existence. This requires answers to queries such as, "Is this data derivable from trusted tuples?"; "What tuples are derived from this relation?"; or "What score should this answer receive, given initial scores of the base tuples?". Such questions can be answered by consulting the provenance of query results. In recent years there has been significant progress on formal models for provenance. However, the issues of provenance storage, maintenance, and querying have not yet been addressed in an application-independent way. In this paper, we adopt the most general formalism for tuple-based provenance, semiring provenance. We develop a query language for provenance, which can express all of the aforementioned types of queries, as well as many more; we propose storage, processing and indexing schemes for data provenance in support of these queries; and we experimentally validate the feasibility of provenance querying and the benefits of our indexing techniques across a variety of application classes and queries.
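A small sketch of semiring provenance itself (the paper's query language, storage and indexing schemes are not shown): base tuples carry provenance variables, joins multiply annotations and projections add them, so each result tuple ends up with a provenance polynomial. sympy symbols stand in for the annotation semiring.

```python
from sympy import symbols, expand

# Base tuples of R(a, b) and S(b, c), each annotated with a provenance variable.
r1, r2, s1, s2 = symbols("r1 r2 s1 s2")
R = [(("a1", "b1"), r1), (("a2", "b1"), r2)]
S = [(("b1", "c1"), s1), (("b1", "c2"), s2)]

def join_project(R, S):
    """pi_c(R |x| S): join multiplies annotations, projection adds them (semiring operations)."""
    out = {}
    for (a, b), pr in R:
        for (b2, c), ps in S:
            if b == b2:
                out[c] = out.get(c, 0) + pr * ps
    return out

for c, poly in join_project(R, S).items():
    print(c, expand(poly))    # c1: r1*s1 + r2*s1    c2: r1*s2 + r2*s2
```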

215 citations


Journal Article
TL;DR: This work surveys the developments on finding structural information among tuples in an RDB using an l-keyword query, Q, which is a set of keywords of size l, denoted as Q = {k1, k2, · · · , kl}.
Abstract: The integration of DB and IR provides flexible ways for users to query information in the same platform [6, 2, 3, 7, 5, 28]. On one hand, the sophisticated DB facilities provided by RDBMSs assist users to query well-structured information using SQL. On the other hand, IR techniques allow users to search unstructured information using keywords based on scoring and ranking, and do not require users to understand any database schemas. We survey the developments on finding structural information among tuples in an RDB using an l-keyword query, Q, which is a set of keywords of size l, denoted as Q = {k1, k2, · · · , kl}. Here, an RDB is viewed as a data graph GD(V,E), where V represents a set of tuples, and E represents a set of edges between tuples. An edge exists between two tuples if there is at least one foreign key reference from one to the other. A tuple consists of attribute values, some of which are strings or full-text. The structural information to be returned for an l-keyword query is a set of connected structures, R, where a connected structure represents how the tuples that contain the required keywords are interconnected in a database GD. R can be either all trees or all subgraphs. When a function score(·) is given to score a structure, we can find the top-k structures instead of all structures in GD. Such a score(·) function can be based on either the text information maintained in tuples (node weights), or the connections among tuples (edge weights), or both. In Section 2, we focus on supporting keyword search in an RDBMS using SQL. Since this implies making use of the database schema information to issue SQL queries in order to find structures for an l-keyword query, it is called the schema-based approach. The two main steps in the schema-based approach are how to generate a set of SQL queries that can find all the structures among tuples in an RDB completely, and how to evaluate the generated set of SQL queries efficiently. Due to the nature of set operations used in SQL and the underlying relational algebra, a data graph GD is considered as an undirected graph by ignoring the direction of references between tuples, and therefore a returned structure is undirected (either a tree or a subgraph). The existing algorithms use a parameter to control the maximum size of a structure allowed. Such a size-control parameter limits the number of SQL queries to be executed; otherwise, the number of SQL queries to be executed for finding all or even top-k structures is too large. The score(·) functions used to rank the structures are all based on the text information of tuples. In Section 3, we focus on supporting keyword search in an RDBMS from a different viewpoint, by materializing an RDB as a directed graph GD. Unlike in an undirected graph, the fact that a tuple v can reach another tuple u in a directed graph does not necessarily mean that v is reachable from u. In this context, a returned structure (either a Steiner tree, distinct rooted tree, r-radius Steiner graph, or multi-center subgraph) is directed. Such direction handling provides users with more information on how the tuples are interconnected.
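A toy sketch of the graph-based view, with an invented three-tuple database: tuples are nodes, foreign-key references are edges, and an answer connects one matching tuple per keyword through some central tuple. Real systems compute Steiner-tree or rooted-tree answers with ranking; this BFS-based stand-in only conveys the idea.

```python
from collections import deque
from itertools import product

# Toy data graph: tuple id -> (text content, neighbours via foreign-key references)
graph = {
    "author:1": ("Jim Gray", ["paper:7"]),
    "paper:7":  ("Transaction Concept", ["author:1", "conf:3"]),
    "conf:3":   ("VLDB", ["paper:7"]),
}

def bfs_levels(start):
    dist, q = {start: 0}, deque([start])
    while q:
        u = q.popleft()
        for v in graph[u][1]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def connect(keywords):
    """Cheapest tuple connecting one matching tuple per keyword (sum of graph distances)."""
    matches = [[t for t, (txt, _) in graph.items() if kw.lower() in txt.lower()]
               for kw in keywords]
    levels = {t: bfs_levels(t) for group in matches for t in group}
    best = None
    for centre in graph:
        for combo in product(*matches):              # one matching tuple per keyword
            if all(centre in levels[t] for t in combo):
                cost = sum(levels[t][centre] for t in combo)
                if best is None or cost < best[0]:
                    best = (cost, centre, combo)
    return best

print(connect(["Gray", "VLDB"]))   # (2, 'author:1', ('author:1', 'conf:3'))
```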

197 citations


Proceedings ArticleDOI
06 Jun 2010
TL;DR: This paper proposes a new paradigm for explaining a why-not question that is based on automatically generating a refined query whose result includes both the original query's result as well as the user-specified missing tuple(s).
Abstract: One useful feature that is missing from today's database systems is an explain capability that enables users to seek clarifications on unexpected query results. There are two types of unexpected query results that are of interest: the presence of unexpected tuples, and the absence of expected tuples (i.e., missing tuples). Clearly, it would be very helpful to users if they could pose follow-up why and why-not questions to seek clarifications on, respectively, unexpected and expected (but missing) tuples in query results. While the why questions can be addressed by applying established data provenance techniques, the problem of explaining the why-not questions has received very little attention. There are currently two explanation models proposed for why-not questions. The first model explains a missing tuple t in terms of modifications to the database such that t appears in the query result wrt the modified database. The second model explains by identifying the data manipulation operator in the query evaluation plan that is responsible for excluding t from the result. In this paper, we propose a new paradigm for explaining a why-not question that is based on automatically generating a refined query whose result includes both the original query's result as well as the user-specified missing tuple(s). In contrast to the existing explanation models, our approach goes beyond merely identifying the "culprit" query operator responsible for the missing tuple(s) and is useful for applications where it is not appropriate to modify the database to obtain missing tuples.

179 citations


Proceedings Article
11 Jul 2010
TL;DR: TIE is presented, a novel, information-extraction system, which distills facts from text while inducing as much temporal information as possible, and performs global inference, enforcing transitivity to bound the start and ending times for each event.
Abstract: Research on information extraction (IE) seeks to distill relational tuples from natural language text, such as the contents of the WWW. Most IE work has focussed on identifying static facts, encoding them as binary relations. This is unfortunate, because the vast majority of facts are fluents, only holding true during an interval of time. It is less helpful to extract PresidentOf(Bill-Clinton, USA) without the temporal scope 1/20/93 - 1/20/01. This paper presents TIE, a novel, information-extraction system, which distills facts from text while inducing as much temporal information as possible. In addition to recognizing temporal relations between times and events, TIE performs global inference, enforcing transitivity to bound the start and ending times for each event. We introduce the notion of temporal entropy as a way to evaluate the performance of temporal IE systems and present experiments showing that TIE outperforms three alternative approaches.

158 citations


Proceedings ArticleDOI
06 Jun 2010
TL;DR: A class of extended CRPQs, called ECRPQs, is proposed; these add regular relations on tuples of paths and allow path variables in the heads of queries, and their properties are studied.
Abstract: For many problems arising in the setting of graph querying (such as finding semantic associations in RDF graphs, exact and approximate pattern matching, sequence alignment, etc.), the power of standard languages such as the widely studied conjunctive regular path queries (CRPQs) is insufficient in at least two ways. First, they cannot output paths and second, more crucially, they cannot express relations among paths. We thus propose a class of extended CRPQs, called ECRPQs, which add regular relations on tuples of paths, and allow path variables in the heads of queries. We provide several examples of their usefulness in querying graph-structured data, and study their properties. We analyze query evaluation and representation of tuples of paths in the output by means of automata. We present a detailed analysis of data and combined complexity of queries, and consider restrictions that lower the complexity of ECRPQs to that of relational conjunctive queries. We study the containment problem, and look at further extensions with first-order features, and with non-regular relations that express arithmetic properties of paths, based on the lengths and numbers of occurrences of labels.

Journal ArticleDOI
01 Sep 2010
TL;DR: The algorithms used to generate a correct, finite, and, when possible, minimal set of explanations in queries that include selection, projection, join, union, aggregation and grouping (SPJUA) are described.
Abstract: This paper addresses the problem of explaining missing answers in queries that include selection, projection, join, union, aggregation and grouping (SPJUA). Explaining missing answers of queries is useful in various scenarios, including query understanding and debugging. We present a general framework for the generation of these explanations based on source data. We describe the algorithms used to generate a correct, finite, and, when possible, minimal set of explanations. These algorithms are part of Artemis, a system that assists query developers in analyzing queries by, for instance, allowing them to ask why certain tuples are not in the query results. Experimental results demonstrate that Artemis generates explanations of missing tuples at a pace that allows developers to effectively use them for query analysis.

Journal ArticleDOI
TL;DR: A new formal EA model based on the integration of Fuzzy set theory with Grey Relational Analysis (GRA) is proposed that produced credible estimates when compared with the results obtained using Case-Based Reasoning, Multiple Linear Regression and Artificial Neural Networks methods.
Abstract: Accurate and credible software effort estimation is a challenge for academic research and the software industry. Among the many software effort estimation models in existence, Estimation by Analogy (EA) is still one of the preferred techniques of software engineering practitioners because it mimics the human problem-solving approach. Accuracy of such a model depends on the characteristics of the dataset, which is subject to considerable uncertainty. The inherent uncertainty in software attribute measurement has a significant impact on estimation accuracy because these attributes are measured based on human judgment and are often vague and imprecise. To overcome this challenge we propose a new formal EA model based on the integration of Fuzzy set theory with Grey Relational Analysis (GRA). Fuzzy set theory is employed to reduce uncertainty in the distance measure between two tuples at the k-th continuous feature, $$\left| x_o(k) - x_i(k) \right|$$. GRA is a problem-solving method that is used to assess the similarity between two tuples with M features. Since some of these features need not be continuous and may have nominal or ordinal scale types, aggregating different forms of similarity measures increases uncertainty in the similarity degree. Thus the GRA is mainly used to reduce uncertainty in the distance measure between two software projects for both continuous and categorical features. Both techniques are suitable when the relationship between effort and other effort drivers is complex. Experimental results showed that the integration of GRA with Fuzzy set theory produced credible estimates when compared with the results obtained using Case-Based Reasoning, Multiple Linear Regression and Artificial Neural Networks methods.
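For concreteness, the sketch below computes plain (non-fuzzy) grey relational grades between a target project and historical projects over normalised continuous features; the paper's fuzzy treatment of imprecise measurements and its handling of categorical features are omitted, and the projects are invented.

```python
def grey_relational_grades(target, candidates, zeta=0.5):
    """Grey relational grade of each candidate project against the target project.

    Features are assumed continuous and already normalised to [0, 1]; higher grade
    means more similar. zeta is the usual distinguishing coefficient.
    """
    deltas = {name: [abs(t - c) for t, c in zip(target, feats)]
              for name, feats in candidates.items()}
    all_d = [d for ds in deltas.values() for d in ds]
    dmin, dmax = min(all_d), max(all_d)

    def grade(ds):
        coeffs = [(dmin + zeta * dmax) / (d + zeta * dmax) for d in ds]
        return sum(coeffs) / len(coeffs)

    return {name: grade(ds) for name, ds in deltas.items()}

# Estimation by analogy: reuse the effort of the most similar historical project.
target = [0.4, 0.7, 0.2]
history = {"p1": [0.5, 0.6, 0.3], "p2": [0.9, 0.1, 0.8]}
grades = grey_relational_grades(target, history)
print(max(grades, key=grades.get), grades)    # p1 is the closer analogue
```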

Journal Article
TL;DR: A discussion on causality in databases is initiated, some simple definitions are given, and this formalism is motivated through a number of example applications.
Abstract: Provenance is often used to validate data, by verifying its origin and explaining its derivation. When searching for “causes” of tuples in the query results or in general observations, the analysis of lineage becomes an essential tool for providing such justifications. However, lineage can quickly grow very large, limiting its immediate use for providing intuitive explanations to the user. The formal notion of causality is a more refined concept that identifies causes for observations based on user-defined criteria, and that assigns to them gradual degrees of responsibility based on their respective contributions. In this paper, we initiate a discussion on causality in databases, give some simple definitions, and motivate this formalism through a number of example applications.

Proceedings ArticleDOI
17 Jan 2010
TL;DR: The experience of implementing a lightweight, fully verified relational database management system (RDBMS) in Coq shows that though many challenges remain, building fully-verified systems software in Coq is within reach.
Abstract: We report on our experience implementing a lightweight, fully verified relational database management system (RDBMS). The functional specification of RDBMS behavior, RDBMS implementation, and proof that the implementation meets the specification are all written and verified in Coq. Our contributions include: (1) a complete specification of the relational algebra in Coq; (2) an efficient realization of that model (B+ trees) implemented with the Ynot extension to Coq; and (3) a set of simple query optimizations proven to respect both semantics and run-time cost. In addition to describing the design and implementation of these artifacts, we highlight the challenges we encountered formalizing them, including the choice of representation for finite relations of typed tuples and the challenges of reasoning about data structures with complex sharing. Our experience shows that though many challenges remain, building fully-verified systems software in Coq is within reach.

Journal ArticleDOI
TL;DR: More expressive query languages for K-relations are defined, extending RA_K^+ with a difference operation and constant annotations on annotated tuples, and basic properties of the resulting query languages are established.

Posted Content
TL;DR: In this paper, the authors define the semantics of clean query answering in terms of certain/possible answers as the greatest lower bound/least upper bound of all possible answers obtained from the clean instances.
Abstract: Matching dependencies were recently introduced as declarative rules for data cleaning and entity resolution. Enforcing a matching dependency on a database instance identifies the values of some attributes for two tuples, provided that the values of some other attributes are sufficiently similar. Assuming the existence of matching functions for making two attribute values equal, we formally introduce the process of cleaning an instance using matching dependencies, as a chase-like procedure. We show that matching functions naturally introduce a lattice structure on attribute domains, and a partial order of semantic domination between instances. Using the latter, we define the semantics of clean query answering in terms of certain/possible answers as the greatest lower bound/least upper bound of all possible answers obtained from the clean instances. We show that clean query answering is intractable in some cases. Then we study queries that behave monotonically with respect to the semantic domination order, and show that we can provide an under/over approximation for clean answers to monotone queries. Moreover, non-monotone positive queries can be relaxed into monotone queries.

Journal ArticleDOI
Stephen Soderland, Brendan Roof, Bo Qin, Shi Xu, Oren Etzioni
TL;DR: The steps needed to adapt Open IE to a domain-specific ontology are explored and the approach of mapping domain-independent tuples to an ontology using domains from DARPA’s Machine Reading Project is demonstrated.
Abstract: Information extraction (IE) can identify a set of relations from free text to support question answering (QA). Until recently, IE systems were domain-specific and needed a combination of manual engineering and supervised learning to adapt to each target domain. A new paradigm, Open IE operates on large text corpora without any manual tagging of relations, and indeed without any pre-specified relations. Due to its open-domain and open-relation nature, Open IE is purely textual and is unable to relate the surface forms to an ontology, if known in advance. We explore the steps needed to adapt Open IE to a domain-specific ontology and demonstrate our approach of mapping domain-independent tuples to an ontology using domains from DARPA’s Machine Reading Project. Our system achieves precision over 0.90 from as few as 8 training examples for an NFL-scoring domain.

Journal ArticleDOI
01 Sep 2010
TL;DR: This paper develops novel, more efficient factorization algorithms that directly construct the read-once expression for a result tuple Boolean formula (if one exists), for a large subclass of queries (specifically, conjunctive queries without self-joins).
Abstract: Probabilistic databases hold promise of being a viable means for large-scale uncertainty management, increasingly needed in a number of real world applications domains. However, query evaluation in probabilistic databases remains a computational challenge. Prior work on efficient exact query evaluation in probabilistic databases has largely concentrated on query-centric formulations (e.g., safe plans, hierarchical queries), in that, they only consider characteristics of the query and not the data in the database. It is easy to construct examples where a supposedly hard query run on an appropriate database gives rise to a tractable query evaluation problem. In this paper, we develop efficient query evaluation techniques that leverage characteristics of both the query and the data in the database. We focus on tuple-independent databases where the query evaluation problem is equivalent to computing marginal probabilities of Boolean formulas associated with the result tuples. This latter task is easy if the Boolean formulas can be factorized into a form that has every variable appearing at most once (called read-once). However, a naive approach that directly uses previously developed Boolean formula factorization algorithms is inefficient, because those algorithms require the input formulas to be in the disjunctive normal form (DNF). We instead develop novel, more efficient factorization algorithms that directly construct the read-once expression for a result tuple Boolean formula (if one exists), for a large subclass of queries (specifically, conjunctive queries without self-joins). We empirically demonstrate that (1) our proposed techniques are orders of magnitude faster than generic inference algorithms for queries where the result Boolean formulas can be factorized into read-once expressions, and (2) for the special case of hierarchical queries, they rival the efficiency of prior techniques specifically designed to handle such queries.
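The payoff of a read-once form is easy to show in a few lines: when every tuple variable occurs at most once, probabilities compose directly through AND/OR, with no general model counting. The sketch below only evaluates an already read-once formula; constructing that form directly from the query, which is the paper's contribution, is not shown.

```python
def prob(formula, p):
    """Probability of a read-once Boolean formula over independent tuple variables.

    formula: a variable name, or ("and", f1, f2, ...) / ("or", f1, f2, ...).
    Because each variable occurs at most once, probabilities simply compose:
    P(f1 AND f2) = P(f1) * P(f2),  P(f1 OR f2) = 1 - (1 - P(f1)) * (1 - P(f2)).
    """
    if isinstance(formula, str):
        return p[formula]
    op, *args = formula
    probs = [prob(a, p) for a in args]
    out = 1.0
    if op == "and":
        for q in probs:
            out *= q
        return out
    for q in probs:
        out *= (1.0 - q)
    return 1.0 - out

# (x1 AND y1) OR (x2 AND y2): read-once, so no general #SAT machinery is needed.
p = {"x1": 0.5, "y1": 0.8, "x2": 0.4, "y2": 0.5}
f = ("or", ("and", "x1", "y1"), ("and", "x2", "y2"))
print(prob(f, p))   # 1 - (1 - 0.4) * (1 - 0.2) = 0.52
```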

Proceedings ArticleDOI
06 Jun 2010
TL;DR: This paper studies partially closed databases from which both tuples and values may be missing, and proposes three models to characterize whether a c-instance T is complete for a query Q relative to master data.
Abstract: Databases in real life are often neither entirely closed-world nor entirely open-world. Indeed, databases in an enterprise are typically partially closed, in which a part of the data is constrained by master data that contains complete information about the enterprise in certain aspects [21]. It has been shown that despite missing tuples, such a database may turn out to have complete information for answering a query [9]. This paper studies partially closed databases from which both tuples and values may be missing. We specify such a database in terms of conditional tables constrained by master data, referred to as c-instances. We first propose three models to characterize whether a c-instance T is complete for a query Q relative to master data. That is, no matter how missing values in T are instantiated, the answer to Q in T remains unchanged when new tuples are added. We then investigate four problems, to determine (a) whether a given c-instance is complete for a query Q, (b) whether there exists a c-instance that is complete for Q relative to master data available, (c) whether a c-instance is a minimal-size database that is complete for Q, and (d) whether there exists a c-instance of a bounded size that is complete for Q. We establish matching lower and upper bounds on these problems for queries expressed in a variety of languages, in each of the three models for specifying relative completeness.

Proceedings ArticleDOI
Jef Wijsen
06 Jun 2010
TL;DR: A decision procedure for first-order expressibility of CERTAINTY(q) when q is acyclic and without self-join is obtained, and it is shown that if CERTAINTY(q) is first-order expressible, its first-order definition, commonly called (certain) first-order rewriting, can be constructed in a rather straightforward way.
Abstract: A natural way for capturing uncertainty in the relational data model is by having relations that violate their primary key constraint, that is, relations in which distinct tuples agree on the primary key. A repair (or possible world) of a database is then obtained by selecting a maximal number of tuples without ever selecting two distinct tuples that have the same primary key value. For a Boolean query q, CERTAINTY(q) is the problem that takes as input a database db and asks whether q evaluates to true on every repair of db. We are interested in determining queries q for which CERTAINTY(q) is first-order expressible (and hence in the low complexity class AC0). For queries q in the class of conjunctive queries without self-join, we provide a necessary syntactic condition for first-order expressibility of CERTAINTY(q). For acyclic queries, this necessary condition is also a sufficient condition. So we obtain a decision procedure for first-order expressibility of CERTAINTY(q) when q is acyclic and without self-join. We also show that if CERTAINTY(q) is first-order expressible, its first-order definition, commonly called (certain) first-order rewriting, can be constructed in a rather straightforward way.

Journal ArticleDOI
TL;DR: In this paper, the authors introduce definitions and algorithms for building histogram-and Haar wavelet-based synopses on probabilistic data, which can form the foundation for human understanding and interactive data exploration.
Abstract: There is a growing realization that uncertain information is a first-class citizen in modern database management. As such, we need techniques to correctly and efficiently process uncertain data in database systems. In particular, data reduction techniques that can produce concise, accurate synopses of large probabilistic relations are crucial. Similar to their deterministic relation counterparts, such compact probabilistic data synopses can form the foundation for human understanding and interactive data exploration, probabilistic query planning and optimization, and fast approximate query processing in probabilistic database systems. In this paper, we introduce definitions and algorithms for building histogram- and Haar wavelet-based synopses on probabilistic data. The core problem is to choose a set of histogram bucket boundaries or wavelet coefficients to optimize the accuracy of the approximate representation of a collection of probabilistic tuples under a given error metric. For a variety of different error metrics, we devise efficient algorithms that construct optimal or near optimal size B histogram and wavelet synopses. This requires careful analysis of the structure of the probability distributions, and novel extensions of known dynamic-programming-based techniques for the deterministic domain. Our experiments show that this approach clearly outperforms simple ideas, such as building summaries for samples drawn from the data distribution, while taking equal or less time.
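As background for the dynamic-programming flavour mentioned above, here is the classical V-optimal-style DP for a size-B histogram minimising sum-squared error on deterministic values; the paper's extensions to probabilistic tuples and to other error metrics are not reproduced here.

```python
import numpy as np

def v_optimal_histogram(values, B):
    """Dynamic program choosing B bucket boundaries that minimise sum-squared error."""
    vals = np.asarray(values, dtype=float)
    n = len(vals)
    pre = np.concatenate([[0.0], np.cumsum(vals)])
    pre2 = np.concatenate([[0.0], np.cumsum(vals ** 2)])

    def sse(i, j):                         # error of one bucket spanning vals[i:j]
        s, s2, m = pre[j] - pre[i], pre2[j] - pre2[i], j - i
        return s2 - s * s / m

    INF = float("inf")
    dp = [[INF] * (B + 1) for _ in range(n + 1)]   # dp[j][b]: best error for prefix j with b buckets
    cut = [[0] * (B + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for j in range(1, n + 1):
        for b in range(1, B + 1):
            for i in range(b - 1, j):      # last bucket spans vals[i:j]
                cand = dp[i][b - 1] + sse(i, j)
                if cand < dp[j][b]:
                    dp[j][b], cut[j][b] = cand, i
    bounds, j, b = [], n, B
    while b:
        i = cut[j][b]
        bounds.append((i, j))
        j, b = i, b - 1
    return dp[n][B], bounds[::-1]

err, buckets = v_optimal_histogram([1, 1, 1, 9, 9, 9, 5, 5], B=3)
print(err, buckets)   # 0.0 [(0, 3), (3, 6), (6, 8)]
```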

Proceedings Article
09 May 2010
TL;DR: This paper proposes novel semantical perspectives on first-order (or relational) probabilistic conditionals that are motivated by considering them as subjective, but population-based statements, and presents two inference operators that are shown to yield reasonable inferences.
Abstract: It seems to be a common view that in order to interpret probabilistic first-order sentences, either a statistical approach that counts (tuples of) individuals has to be used, or the knowledge base has to be grounded to make a possible worlds semantics applicable, for a subjective interpretation of probabilities. In this paper, we propose novel semantical perspectives on first-order (or relational) probabilistic conditionals that are motivated by considering them as subjective, but population-based statements. We propose two different semantics for relational probabilistic conditionals, and a set of postulates for suitable inference operators in this framework. Finally, we present two inference operators by applying the maximum entropy principle to the respective model theories. Both operators are shown to yield reasonable inferences according to the postulates.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: This paper presents an algorithm for processing preference queries that uses the preferential order between keywords to direct the joining of relevant tuples from multiple relations and shows how to reduce the complexity of this algorithm by sharing computational steps.
Abstract: Keyword-based search in relational databases allows users to discover relevant information without knowing the database schema or using complicated queries. However, such searches may return an overwhelming number of results, often loosely related to the user intent. In this paper, we propose personalizing keyword database search by utilizing user preferences. Query results are ranked based on both their relevance to the query and their preference degree for the user. To further increase the quality of results, we consider two new metrics that evaluate the goodness of the result as a set, namely coverage of many user interests and content diversity. We present an algorithm for processing preference queries that uses the preferential order between keywords to direct the joining of relevant tuples from multiple relations. We then show how to reduce the complexity of this algorithm by sharing computational steps. Finally, we report evaluation results of the efficiency and effectiveness of our approach.

Journal ArticleDOI
Bin Liu, Laura Chiticariu, Vivian Chu, H. V. Jagadish, Frederick Reiss
01 Sep 2010
TL;DR: This paper has developed a technique to suggest a ranked list of rule modifications that an expert rule specifier can consider, and implemented it in the SystemT information extraction system developed at IBM Research -- Almaden.
Abstract: Rule-based information extraction from text is increasingly being used to populate databases and to support structured queries on unstructured text. Specification of suitable information extraction rules requires considerable skill and standard practice is to refine rules iteratively, with substantial effort. In this paper, we show that techniques developed in the context of data provenance, to determine the lineage of a tuple in a database, can be leveraged to assist in rule refinement. Specifically, given a set of extraction rules and correct and incorrect extracted data, we have developed a technique to suggest a ranked list of rule modifications that an expert rule specifier can consider. We implemented our technique in the SystemT information extraction system developed at IBM Research -- Almaden and experimentally demonstrate its effectiveness.

Journal ArticleDOI
TL;DR: In this paper, a tuple-pruning algorithm that reduces the search space through Bloom filter queries, which do not require off-chip memory accesses, is proposed for packet classification.
Abstract: Tuple pruning for packet classification provides fast search and a low implementation complexity. The tuple pruning algorithm reduces the search space to a subset of tuples determined by individual field lookups that cause off-chip memory accesses. The authors propose a tuple-pruning algorithm that reduces the search space through Bloom filter queries, which do not require off-chip memory accesses.
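A toy sketch of the pruning idea, assuming rules grouped by (source prefix length, destination prefix length) tuples and one tiny Bloom filter per field and prefix length: a packet's prefixes are first checked against the filters, and only tuples passing both checks are probed. The filter parameters and rules are made up.

```python
import hashlib

class Bloom:
    """Tiny Bloom filter: membership queries may give false positives but never false negatives."""
    def __init__(self, m=256, k=3):
        self.m, self.k, self.bits = m, k, 0
    def _hashes(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.m
    def add(self, item):
        for h in self._hashes(item):
            self.bits |= 1 << h
    def __contains__(self, item):
        return all((self.bits >> h) & 1 for h in self._hashes(item))

# Rules grouped by tuple = (source prefix length, destination prefix length), prefixes as bit strings.
rules = {
    (8, 8):  [("00001010", "11000000", "r1")],
    (16, 8): [("0000101000000001", "10101100", "r2")],
}
src_bf, dst_bf = {}, {}
for (ls, ld), rs in rules.items():
    for sp, dp, _ in rs:
        src_bf.setdefault(ls, Bloom()).add(sp)
        dst_bf.setdefault(ld, Bloom()).add(dp)

def pruned_tuples(src_bits, dst_bits):
    """Only tuples whose per-field prefixes pass the (on-chip) Bloom queries are probed."""
    src_ok = {ls for ls in src_bf if src_bits[:ls] in src_bf[ls]}
    dst_ok = {ld for ld in dst_bf if dst_bits[:ld] in dst_bf[ld]}
    return [t for t in rules if t[0] in src_ok and t[1] in dst_ok]

pkt_src = "00001010" + "0" * 24
pkt_dst = "11000000" + "0" * 24
print(pruned_tuples(pkt_src, pkt_dst))   # [(8, 8)]: the (16, 8) tuple is pruned by the source filter
```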

Proceedings ArticleDOI
26 Apr 2010
TL;DR: A formal model applying REST architectural principles to the description of semantic web services is introduced, including the discussion of its syntax and operational semantics, which allows for a complete and rigorous description of resource based web systems.
Abstract: In this article a formal model applying REST architectural principles to the description of semantic web services is introduced, including a discussion of its syntax and operational semantics. RESTful semantic resources are described using the concept of tuple spaces being manipulated by HTTP methods that are related to classical tuple space operations. On the other hand, RESTful resource creation, destruction and other dynamic aspects of distributed HTTP computations involving coordination between HTTP agents and services are modeled using process-calculus-style named channels and message-passing mechanisms. The resulting model allows for a complete and rigorous description of resource-based web systems, where agents taking part in a computation publish data encoded according to semantic standards through public triple repositories identified by well-known URIs. The model can be used to describe complex interaction scenarios where coordination and composition of resources are required. One such scenario, taken from the literature on web services choreography, is analyzed from the point of view of the proposed model. Finally, possible extensions to the formalism, such as the inclusion of a description-logics-based type system associated to the semantic resources or possible extensions to HTTP operations, are briefly explored.
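One plausible reading of the mapping is sketched below (the paper's formal calculus is not reproduced): a URI names a triple space, PUT plays the role of the tuple-space out operation, GET a non-destructive rd with pattern matching, and DELETE a destructive in. The method-to-operation assignment and the toy store are assumptions for illustration.

```python
# Toy in-memory mapping from RESTful HTTP methods to classical tuple-space
# operations on a triple repository identified by a URI.
spaces = {}   # uri -> set of (subject, predicate, object) triples

def http(method, uri, triple=None, pattern=(None, None, None)):
    space = spaces.setdefault(uri, set())
    matches = lambda t: all(p is None or p == v for p, v in zip(pattern, t))
    if method == "PUT":                      # out: publish a triple into the space
        space.add(triple)
        return 201, triple
    if method == "GET":                      # rd: non-destructive pattern read
        return 200, [t for t in space if matches(t)]
    if method == "DELETE":                   # in: destructive read of matching triples
        taken = [t for t in space if matches(t)]
        space.difference_update(taken)
        return 200, taken
    return 405, None

http("PUT", "/orders", ("order:1", "status", "paid"))
http("PUT", "/orders", ("order:2", "status", "pending"))
print(http("GET", "/orders", pattern=(None, "status", "paid")))     # rd
print(http("DELETE", "/orders", pattern=("order:2", None, None)))   # in
```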

Proceedings ArticleDOI
01 Mar 2010
TL;DR: This paper develops techniques for detecting violations of conditional functional dependencies (CFDs) in relations that are fragmented and distributed across different sites, and shows that it is intractable to minimally refine a partition and make it dependency preserving.
Abstract: One of the central problems for data quality is inconsistency detection. Given a database D and a set Σ of dependencies as data quality rules, we want to identify tuples in D that violate some rules in Σ. When D is a centralized database, there have been effective SQL-based techniques for finding violations. It is, however, far more challenging when data in D is distributed, in which inconsistency detection often necessarily requires shipping data from one site to another. This paper develops techniques for detecting violations of conditional functional dependencies (CFDs) in relations that are fragmented and distributed across different sites. (1) We formulate the detection problem in various distributed settings as optimization problems, measured by either network traffic or response time. (2) We show that it is beyond reach in practice to find optimal detection methods: the detection problem is NP-complete when the data is partitioned either horizontally or vertically, and when we aim to minimize either data shipment or response time. (3) For data that is horizontally partitioned, we provide several algorithms to find violations of a set of CFDs, leveraging the structure of CFDs to reduce data shipment or increase parallelism. (4) We verify experimentally that our algorithms are scalable on large relations and complex CFDs. (5) For data that is vertically partitioned, we provide a characterization for CFDs to be checked locally without requiring data shipment, in terms of dependency preservation. We show that it is intractable to minimally refine a partition and make it dependency preserving.
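Ignoring distribution entirely, the sketch below shows what a single CFD violation check looks like on one site: a pattern tableau row with constants and wildcards, where constant mismatches violate directly and wildcarded right-hand sides must agree within each left-hand-side group. The example CFD and relation are invented.

```python
from collections import defaultdict

def cfd_violations(rows, lhs, rhs, pattern):
    """Find rows violating one conditional functional dependency lhs -> rhs.

    pattern maps each attribute to a constant or "_" (wildcard). Rows matching the
    LHS pattern must (a) match any RHS constant and (b) agree on an RHS wildcard
    within each LHS group.
    """
    def matches_lhs(r):
        return all(pattern[a] in ("_", r[a]) for a in lhs)

    bad = set()
    groups = defaultdict(list)
    for i, r in enumerate(rows):
        if not matches_lhs(r):
            continue
        if pattern[rhs] != "_" and r[rhs] != pattern[rhs]:
            bad.add(i)                       # constant pattern violated outright
        groups[tuple(r[a] for a in lhs)].append(i)
    for idx in groups.values():              # wildcard: the FD must hold inside each group
        if pattern[rhs] == "_" and len({rows[i][rhs] for i in idx}) > 1:
            bad.update(idx)
    return sorted(bad)

rows = [
    {"CC": "44", "zip": "EH4", "city": "Edinburgh"},
    {"CC": "44", "zip": "EH4", "city": "London"},     # same CC and zip, different city
    {"CC": "01", "zip": "10001", "city": "NYC"},
]
# CFD: (CC, zip) -> city with pattern (44, _, _): for UK rows, zip determines city.
print(cfd_violations(rows, ["CC", "zip"], "city",
                     {"CC": "44", "zip": "_", "city": "_"}))   # [0, 1]
```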

Journal ArticleDOI
TL;DR: An algorithm is developed to precisely interpret such non-atomic values and to transform the fuzzy database tuples into forms acceptable to many regular (i.e., atomic-value-based) data mining algorithms.

Journal ArticleDOI
TL;DR: In this approach, test predicates are used to formalize combinatorial testing as a logical problem and an external formal logic tool is applied to solve it; constraints over the input domain are expressed as logical predicates and effectively handled by the same tool.
Abstract: Combinatorial testing is an effective testing technique to reveal failures in a given system, based on input combinations coverage and combinatorial optimization. Combinatorial testing of strength t (t ≥ 2) requires that each t-wise tuple of values of the different system input parameters is covered by at least one test case. Combinatorial test suite generation algorithms aim at producing a test suite covering all the required tuples in a small (possibly minimal) number of test cases, in order to reduce the cost of testing. The most used combinatorial technique is pairwise testing (t = 2), which requires coverage of all pairs of input values. Constrained combinatorial testing also takes into account constraints over the system parameters, for instance forbidden tuples of inputs, modeling invalid or unrealizable input value combinations. In this paper a new approach to combinatorial testing, tightly integrated with formal logic, is presented. In this approach, test predicates are used to formalize combinatorial testing as a logical problem, and an external formal logic tool is applied to solve it. Constraints over the input domain are expressed as logical predicates too, and effectively handled by the same tool. Moreover, inclusion or exclusion of selected tuples is supported, allowing the user to customize the test suite layout. The proposed approach is supported by a prototype tool implementation and results of experimental assessment are also presented.
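A baseline sketch of constrained pairwise (t = 2) generation, with an invented parameter model and one constraint expressed as a Python predicate: a greedy loop picks the valid test case covering the most still-uncovered pairs. The paper instead encodes test predicates and constraints in formal logic and delegates the search to an external tool.

```python
from itertools import combinations, product

# Parameters of the system under test and a constraint forbidding one combination.
params = {"os": ["linux", "win"], "db": ["pg", "mysql"], "ui": ["web", "cli"]}

def valid(tc):                        # constraint: the CLI ui is not supported on win
    return not (tc.get("os") == "win" and tc.get("ui") == "cli")

def pairs(tc):
    return {((a, tc[a]), (b, tc[b])) for a, b in combinations(sorted(tc), 2)}

def pairwise_suite():
    """Greedy t=2 generation: repeatedly pick the valid full assignment that covers
    the most still-uncovered pairs; forbidden pairs never enter the target set."""
    names = sorted(params)
    candidates = [dict(zip(names, vals)) for vals in product(*(params[n] for n in names))]
    candidates = [c for c in candidates if valid(c)]
    uncovered = set().union(*(pairs(c) for c in candidates))   # only satisfiable pairs
    suite = []
    while uncovered:
        best = max(candidates, key=lambda c: len(pairs(c) & uncovered))
        suite.append(best)
        uncovered -= pairs(best)
    return suite

for tc in pairwise_suite():
    print(tc)
```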