Showing papers on "Tuple published in 2012"

PDF

Open Access

Proceedings Article•

Open Language Learning for Information Extraction

[...]

Michael Schmitz¹, Stephen Soderland¹, Robert Bart¹, Oren Etzioni¹•Institutions (1)

12 Jul 2012

TL;DR: Ollie as mentioned in this paper improves ReVerb by extracting relations mediated by nouns, adjectives, and more, and adds context information from the sentence in the extractions to increase precision.

...read moreread less

Abstract: Open Information Extraction (IE) systems extract relational tuples from text, without requiring a pre-specified vocabulary, by identifying relation phrases and associated arguments in arbitrary sentences. However, state-of-the-art Open IE systems such as ReVerb and woe share two important weaknesses -- (1) they extract only relations that are mediated by verbs, and (2) they ignore context, thus extracting tuples that are not asserted as factual. This paper presents ollie, a substantially improved Open IE system that addresses both these limitations. First, ollie achieves high yield by extracting relations mediated by nouns, adjectives, and more. Second, a context-analysis step increases precision by including contextual information from the sentence in the extractions. ollie obtains 2.7 times the area under precision-yield curve (AUC) compared to ReVerb and 1.9 times the AUC of woeparse.

...read moreread less

792 citations

Journal Article•DOI•

Expressive Languages for Path Queries over Graph-Structured Data

[...]

Pablo Barceló¹, Leonid Libkin², Anthony W. Lin³, Peter T. Wood⁴•Institutions (4)

University of Chile¹, University of Edinburgh², University of Oxford³, Birkbeck, University of London⁴

01 Dec 2012-ACM Transactions on Database Systems

TL;DR: A class of extended CRPQs, called ECRPZs, are proposed, which add regular relations on tuples of paths, and allow path variables in the heads of queries, and are studied for their usefulness in querying graph structured data.

...read moreread less

Abstract: For many problems arising in the setting of graph querying (such as finding semantic associations in RDF graphs, exact and approximate pattern matching, sequence alignment, etc.), the power of standard languages such as the widely studied conjunctive regular path queries (CRPQs) is insufficient in at least two ways. First, they cannot output paths and second, more crucially, they cannot express relationships among paths.We thus propose a class of extended CRPQs, called ECRPQs, which add regular relations on tuples of paths, and allow path variables in the heads of queries. We provide several examples of their usefulness in querying graph structured data, and study their properties. We analyze query evaluation and representation of tuples of paths in the output by means of automata. We present a detailed analysis of data and combined complexity of queries, and consider restrictions that lower the complexity of ECRPQs to that of relational conjunctive queries. We study the containment problem, and look at further extensions with first-order features, and with nonregular relations that add arithmetic constraints on the lengths of paths and numbers of occurrences of labels.

...read moreread less

149 citations

Patent•

Information extraction from a database

[...]

Sergey Brin¹•Institutions (1)

Google¹

14 Sep 2012

TL;DR: In this article, techniques for extracting information from a database such as the Web are described. But they do not address the problem of identifying a pattern in which the tuples of information were stored.

...read moreread less

Abstract: Techniques for extracting information from a database are provided. A database such as the Web is searched for occurrences of tuples of information. The occurrences of the tuples of information that were found in the database are analyzed to identify a pattern in which the tuples of information were stored. Additional tuples of information can then be extracted from the database utilizing the pattern. This process can be repeated with the additional tuples of information, if desired.

...read moreread less

149 citations

Patent•

Variable length encoding in a storage system

[...]

John Colgrove, John Hayes, Ethan L. Miller

27 Sep 2012

TL;DR: In this article, a system and method for maintaining a mapping table in a data storage subsystem is described, where the mapping table may be organized as a plurality of time ordered levels, with each level including one or more mapping table entries.

...read moreread less

Abstract: A system and method for maintaining a mapping table in a data storage subsystem. A data storage subsystem supports multiple mapping tables including a plurality of entries. Each of the entries comprise a tuple including a key. A data storage controller is configured to encode each tuple in the mapping table using a variable length encoding. Additionally, the mapping table may be organized as a plurality of time ordered levels, with each level including one or more mapping table entries. Further, a particular encoding of a plurality of encodings for a given tuple may be selected based at least in part on a size of the given tuple as unencoded, a size of the given tuple as encoded, and a time to encode the given tuple.

...read moreread less

145 citations

Proceedings Article•DOI•

Dynamic distributed dimensional data model (D4M) database and computation system

[...]

Jeremy Kepner¹, William Arcand¹, William Bergeron¹, Nadya T. Bliss¹, Robert A. Bond¹, Chansup Byun¹, Gary R. Condon¹, Kenneth Gregson¹, Matthew Hubbell¹, Jonathan Kurz¹, Andrew McCabe¹, Peter Michaleas¹, Andrew Prout¹, Albert Reuther¹, Antonio Rosa¹, Charles Yee¹ - Show less +12 more•Institutions (1)

Massachusetts Institute of Technology¹

25 Mar 2012

TL;DR: D4M (Dynamic Distributed Dimensional Data Model) has been developed to provide a mathematically rich interface to tuple stores (and structured query language “SQL” databases) and it is possible to create composable analytics with significantly less effort than using traditional approaches.

...read moreread less

Abstract: A crucial element of large web companies is their ability to collect and analyze massive amounts of data. Tuple store databases are a key enabling technology employed by many of these companies (e.g., Google Big Table and Amazon Dynamo). Tuple stores are highly scalable and run on commodity clusters, but lack interfaces to support efficient development of mathematically based analytics. D4M (Dynamic Distributed Dimensional Data Model) has been developed to provide a mathematically rich interface to tuple stores (and structured query language “SQL” databases). D4M allows linear algebra to be readily applied to databases. Using D4M, it is possible to create composable analytics with significantly less effort than using traditional approaches. This work describes the D4M technology and its application and performance.

...read moreread less

143 citations

Patent•

Automatic Detection Of Fraud And Error Using A Vector-Cluster Model

[...]

Amy Tai¹, Dane Roberts¹, Appala Jagadesh Padala¹, Bhaskar Ghosh¹•Institutions (1)

Business International Corporation¹

04 Mar 2012

TL;DR: In this article, one or more computers retrieve records of transactions to be analyzed together, each record identifies a date of a transaction, an amount of the transaction, a person associated with the transaction and a category into which the transaction is classified.

...read moreread less

Abstract: One or more computers retrieve records of transactions to be analyzed together. Each record identifies a date of a transaction, an amount of the transaction, a person associated with the transaction, and a category into which the transaction is classified. The one or more computers automatically prepare in computer memory, a set of tuples (also called “vectors”) corresponding to a set of persons identified in the retrieved records. Each tuple corresponds to one person, and each tuple includes at least one number representing a count within each category, of transactions classified therein, e.g. total number of cash transactions in category X. Then, the one or more computers automatically identify a subset of outliers, e.g. by grouping the tuples into clusters using k-means clustering, followed by marking in memory an indication of inappropriateness of any transaction that had been included in the count of a tuple now identified to be outlier.

...read moreread less

90 citations

Proceedings Article•DOI•

Answering Why-not Questions on Top-k Queries

[...]

Zhian He¹, Eric Lo¹•Institutions (1)

Hong Kong Polytechnic University¹

01 Apr 2012

TL;DR: This paper develops algorithms to answer why-not questions on top-k queries efficiently and case studies and experimental results show that these algorithms are able to return high quality explanations efficiently.

...read moreread less

Abstract: After decades of effort working on database performance, the quality and the usability of database systems have received more attention in recent years. In particular, the feature of explaining missing tuples in a query result, or the so-called "why-not" questions, has recently become an active topic. In this paper, we study the problem of answering why-not questions on top-k queries. Our motivation is that we know many users love to use top-k queries when they are making multi-criteria decisions. However, they often feel frustrated when they are asked to quantify their feeling as a set of numeric weightings, and feel even more frustrated after they see the query results do not include their expected answers. In this paper, we use the query-refinement method to approach the problem. Given as inputs the original top-k query and a set of missing tuples, our algorithm returns to the user a refined top-k query that includes the missing tuples. A case study and experimental results show that our approach returns high quality explanations to users efficiently.

...read moreread less

87 citations

Journal Article•DOI•

Efficient Mining of Frequent Item Sets on Large Uncertain Databases

[...]

Liang Wang¹, D. W-L Cheung¹, Reynold Cheng¹, Sau Dan Lee¹, Xuan Yang¹ - Show less +1 more•Institutions (1)

University of Hong Kong¹

01 Dec 2012-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This paper proposes incremental mining algorithms, which enable Probabilistic Frequent Item set (PFI) results to be refreshed, and develops an approximate algorithm, which can efficiently and accurately discover frequent item sets in a large uncertain database.

...read moreread less

Abstract: The data handled in emerging applications like location-based services, sensor monitoring systems, and data integration, are often inexact in nature. In this paper, we study the important problem of extracting frequent item sets from a large uncertain database, interpreted under the Possible World Semantics (PWS). This issue is technically challenging, since an uncertain database contains an exponential number of possible worlds. By observing that the mining process can be modeled as a Poisson binomial distribution, we develop an approximate algorithm, which can efficiently and accurately discover frequent item sets in a large uncertain database. We also study the important issue of maintaining the mining result for a database that is evolving (e.g., by inserting a tuple). Specifically, we propose incremental mining algorithms, which enable Probabilistic Frequent Item set (PFI) results to be refreshed. This reduces the need of re-executing the whole mining algorithm on the new database, which is often more expensive and unnecessary. We examine how an existing algorithm that extracts exact item sets, as well as our approximate algorithm, can support incremental mining. All our approaches support both tuple and attribute uncertainty, which are two common uncertain database models. We also perform extensive evaluation on real and synthetic data sets to validate our approaches.

...read moreread less

86 citations

Book Chapter•DOI•

Interacting with Statistical Linked Data via OLAP Operations

[...]

Benedikt Kämpgen¹, Sean O'Riain², Andreas Harth¹•Institutions (2)

Karlsruhe Institute of Technology¹, National University of Ireland²

27 May 2012

TL;DR: This work investigates the problem of executing OLAP queries via SPARQL on an RDF store and defines projection, slice, dice and roll-up operations on single data cubes published as Linked Data reusing the RDF Data Cube vocabulary and shows how a nested set of operations lead to an OLAP query.

...read moreread less

Abstract: Online Analytical Processing (OLAP) promises an interface to analyse Linked Data containing statistics going beyond other interaction paradigms such as follow-your-nose browsers, faceted-search interfaces and query builders. Transforming statistical Linked Data into a star schema to populate a relational database and applying a common OLAP engine do not allow to optimise OLAP queries on RDF or to directly propagate changes of Linked Data sources to clients. Therefore, as a new way to interact with statistics published as Linked Data, we investigate the problem of executing OLAP queries via SPARQL on an RDF store. First, we define projection, slice, dice and roll-up operations on single data cubes published as Linked Data reusing the RDF Data Cube vocabulary and show how a nested set of operations lead to an OLAP query. Second, we show how to transform an OLAP query to a SPARQL query which generates all required tuples from the data cube. In a small experiment, we show the applicability of our OLAP-to-SPARQL mapping in answering a business question in the financial domain.

...read moreread less

81 citations

Proceedings Article•DOI•

Auto-parallelizing stateful distributed streaming applications

[...]

Scott Schneider¹, Martin Hirzel¹, Bugra Gedik², Kun-Lung Wu¹•Institutions (2)

IBM¹, Bilkent University²

19 Sep 2012

TL;DR: This paper presents a compiler and runtime system that automatically extract data parallelism for distributed stream processing, guaranteeing safety, even in the presence of stateful, selective, and user-defined operators.

...read moreread less

Abstract: Streaming applications transform possibly infinite streams of data and often have both high throughput and low latency requirements. They are comprised of operator graphs that produce and consume data tuples. The streaming programming model naturally exposes task and pipeline parallelism, enabling it to exploit parallel systems of all kinds, including large clusters. However, it does not naturally expose data parallelism, which must instead be extracted from streaming applications. This paper presents a compiler and runtime system that automatically extract data parallelism for distributed stream processing. Our approach guarantees safety, even in the presence of stateful, selective, and user-defined operators. When constructing parallel regions, the compiler ensures safety by considering an operator's selectivity, state, partitioning, and dependencies on other operators in the graph. The distributed runtime system ensures that tuples always exit parallel regions in the same order they would without data parallelism, using the most efficient strategy as identified by the compiler. Our experiments using 100 cores across 14 machines show linear scalability for standard parallel regions, and near linear scalability when tuples are shuffled across parallel regions.

...read moreread less

78 citations

Journal Article•DOI•

A blind reversible method for watermarking relational databases based on a time-stamping protocol

[...]

Mahmoud E. Farfoura¹, Shi-Jinn Horng¹, Jui-Lin Lai², Ray-Shine Run², Rong-Jian Chen², Muhammad Khurram Khan³ - Show less +2 more•Institutions (3)

National Taiwan University of Science and Technology¹, National United University², King Saud University³

01 Feb 2012-Expert Systems With Applications

TL;DR: An authentication protocol based on an efficient time-stamp protocol and a blind reversible watermarking method that ensures ownership protection in the field of relational database water marking are designed and proposed.

...read moreread less

Abstract: Highlights? An authentication protocol is designed for a reversible watermarking using time-stamp protocol. ? The prediction-error expansion on integers technique is used to achieve reversibility. ? The watermark is detected successfully even most of watermarked relation tuples are deleted. Digital watermarking technology has been adopted lately as an effective solution to protecting the copyright of digital assets from illicit copying. Reversible watermark, which is also called invertible watermark, or erasable watermark, helps to recover back the original data after the content has been authenticated. Such reversibility is highly desired in some sensitive database applications, e.g. in military and medical data. Permanent distortion is one of the main drawbacks of the entire irreversible relational database watermarking schemes. In this paper, we design an authentication protocol based on an efficient time-stamp protocol, and we propose a blind reversible watermarking method that ensures ownership protection in the field of relational database watermarking. Whereas previous techniques have been mainly concerned with introducing permanent errors into the original data, our approach ensures one hundred percent recovery of the original database relation after the owner-specific watermark has been detected and authenticated. In the proposed watermarking method, we utilize a reversible data-embedding technique called prediction-error expansion on integers to achieve reversibility. The detection of the watermark can be completed successfully even when 95% of a watermarked relation tuples are deleted. Our extensive analysis shows that the proposed scheme is robust against various forms of database attacks, including adding, deleting, shuffling or modifying tuples or attributes.

...read moreread less

Journal Article•DOI•

Optimal algorithms for crawling a hidden database in the web

[...]

Cheng Sheng¹, Nan Zhang², Yufei Tao³, Xin Jin²•Institutions (3)

The Chinese University of Hong Kong¹, George Washington University², KAIST³

01 Jul 2012

TL;DR: These algorithms are provably efficient, namely, they accomplish the task by performing only a small number of queries, even in the worst case, and theoretical results indicating that these algorithms are asymptotically optimal are established.

...read moreread less

Abstract: A hidden database refers to a dataset that an organization makes accessible on the web by allowing users to issue queries through a search interface. In other words, data acquisition from such a source is not by following static hyper-links. Instead, data are obtained by querying the interface, and reading the result page dynamically generated. This, with other facts such as the interface may answer a query only partially, has prevented hidden databases from being crawled effectively by existing search engines.This paper remedies the problem by giving algorithms to extract all the tuples from a hidden database. Our algorithms are provably efficient, namely, they accomplish the task by performing only a small number of queries, even in the worst case. We also establish theoretical results indicating that these algorithms are asymptotically optimal -- i.e., it is impossible to improve their efficiency by more than a constant factor. The derivation of our upper and lower bound results reveals significant insight into the characteristics of the underlying problem. Extensive experiments confirm the proposed techniques work very well on all the real datasets examined.

...read moreread less

Journal Article•DOI•

Computing Critical $k$ -Tuples in Power Networks

[...]

Kin Cheong Sou, Henrik Sandberg, Karl Henrik Johansson

12 Mar 2012-IEEE Transactions on Power Systems

TL;DR: This paper proposes an efficient and accurate approximate solution procedure for the considered problem based on solving a minimum-cut problem and enumerating all its optimal solutions and it is shown that the sparsest critical k -tuple problem can be formulated as a mixed integer linear programming (MILP) problem.

...read moreread less

Abstract: In this paper the problem of finding the sparsest (i.e., minimum cardinality) critical k-tuple including one arbitrarily specified measurement is considered. The solution to this problem can be used to identify weak points in the measurement set, or aid the placement of new meters. The critical k-tuple problem is a combinatorial generalization of the critical measurement calculation problem. Using topological network observability results, this paper proposes an efficient and accurate approximate solution procedure for the considered problem based on solving a minimum-cut (Min-Cut) problem and enumerating all its optimal solutions. It is also shown that the sparsest critical k -tuple problem can be formulated as a mixed integer linear programming (MILP) problem. This MILP problem can be solved exactly using available solvers such as CPLEX and Gurobi. A detailed numerical study is presented to evaluate the efficiency and the accuracy of the proposed Min-Cut and MILP calculations.

...read moreread less

Proceedings Article•DOI•

Factorised representations of query results: size bounds and readability

[...]

Dan Olteanu¹, Jakub Závodný¹•Institutions (1)

University of Oxford¹

26 Mar 2012

TL;DR: A representation system for relational data based on algebraic factorisation using distributivity of product over union and commutativity of product and union is introduced and two characterisations of conjunctive queries based on factorisations of their results whose nesting structure is defined by so-called factorisation trees.

...read moreread less

Abstract: We introduce a representation system for relational data based on algebraic factorisation using distributivity of product over union and commutativity of product and union.We give two characterisations of conjunctive queries based on factorisations of their results whose nesting structure is defined by so-called factorisation trees.The first characterisation concerns sizes of factorised representations. For any query, we derive a size bound that is asymptotically tight within our class of factorisations.We also characterise the queries by tight bounds on the readability of the provenance of result tuples and define syntactically the class of queries with bounded readability.

...read moreread less

Journal Article•DOI•

Continuous monitoring of skylines over uncertain data streams

[...]

Xiaofeng Ding¹, Xiang Lian², Lei Chen², Hai Jin¹•Institutions (2)

Huazhong University of Science and Technology¹, Hong Kong University of Science and Technology²

01 Feb 2012-Information Sciences

TL;DR: This paper proposes a novel sliding window skyline model where an uncertain tuple may take the probability to be in the skyline at a certain timestamp t, and proposes an efficient and effective approach, namely the candidate list approach, which maintains lists of candidates that might become skylines in future sliding windows.

...read moreread less

Journal Article•DOI•

Probabilistic databases with MarkoViews

[...]

Abhay Jha¹, Dan Suciu¹•Institutions (1)

University of Washington¹

01 Jul 2012

TL;DR: In this article, the authors propose a Markov Logic Network (MVDB) as a framework for representing complex correlations and for efficient query evaluation in probabilistic databases, which is equivalent to evaluating a Union of Conjunctive Query (UCQ) over a tuple-independent database.

...read moreread less

Abstract: Most of the work on query evaluation in probabilistic databases has focused on the simple tuple-independent data model, where tuples are independent random events. Several efficient query evaluation techniques exists in this setting, such as safe plans, algorithms based on OBDDs, tree-decomposition and a variety of approximation algorithms. However, complex data analytics tasks often require complex correlations, and query evaluation then is significantly more expensive, or more restrictive.In this paper, we propose MVDB as a framework both for representing complex correlations and for efficient query evaluation. An MVDB specifies correlations by views, called MarkoViews, on the probabilistic relations and declaring the weights of the view's outputs. An MVDB is a (very large) Markov Logic Network. We make two sets of contributions. First, we show that query evaluation on an MVDB is equivalent to evaluating a Union of Conjunctive Query(UCQ) over a tuple-independent database. The translation is exact (thus allowing the techniques developed for tuple independent databases to be carried over to MVDB), yet it is novel and quite non-obvious (some resulting probabilities may be negative!).This translation in itself though may not lead to much gain since the translated query gets complicated as we try to capture more correlations. Our second contribution is to propose a new query evaluation strategy that exploits offline compilation to speed up online query evaluation. Here we utilize and extend our prior work on compilation of UCQ. We validate experimentally our techniques on a large probabilistic database with MarkoViews inferred from the DBLP data.

...read moreread less

Posted Content•

Probabilistic Databases with MarkoViews

[...]

Abhay Jha¹, Dan Suciu¹•Institutions (1)

University of Washington¹

01 Aug 2012-arXiv: Databases

TL;DR: It is shown that query evaluation on an MVDB is equivalent to evaluating a Union of Conjunctive Query over a tuple-independent database, and a new query evaluation strategy that exploits offline compilation to speed up online query evaluation is proposed.

...read moreread less

Abstract: Most of the work on query evaluation in probabilistic databases has focused on the simple tuple-independent data model, where tuples are independent random events. Several efficient query evaluation techniques exists in this setting, such as safe plans, algorithms based on OBDDs, tree-decomposition and a variety of approximation algorithms. However, complex data analytics tasks often require complex correlations, and query evaluation then is significantly more expensive, or more restrictive. In this paper, we propose MVDB as a framework both for representing complex correlations and for efficient query evaluation. An MVDB specifies correlations by views, called MarkoViews, on the probabilistic relations and declaring the weights of the view's outputs. An MVDB is a (very large) Markov Logic Network. We make two sets of contributions. First, we show that query evaluation on an MVDB is equivalent to evaluating a Union of Conjunctive Query(UCQ) over a tuple-independent database. The translation is exact (thus allowing the techniques developed for tuple independent databases to be carried over to MVDB), yet it is novel and quite non-obvious (some resulting probabilities may be negative!). This translation in itself though may not lead to much gain since the translated query gets complicated as we try to capture more correlations. Our second contribution is to propose a new query evaluation strategy that exploits offline compilation to speed up online query evaluation. Here we utilize and extend our prior work on compilation of UCQ. We validate experimentally our techniques on a large probabilistic database with MarkoViews inferred from the DBLP data.

...read moreread less

Proceedings Article•DOI•

A dichotomy in the complexity of deletion propagation with functional dependencies

[...]

Benny Kimelfeld¹•Institutions (1)

IBM¹

21 May 2012

TL;DR: This paper generalizes a result by Cong et al., stating that deletion propagation is in polynomial time if keys are preserved by the view, and defines a view by a self-join-free conjunctive query (sjf-CQ) over a schema with functional dependencies.

...read moreread less

Abstract: A classical variant of the view-update problem is deletion propagation, where tuples from the database are deleted in order to realize a desired deletion of a tuple from the view. This operation may cause a (sometimes necessary) side effect---deletion of additional tuples from the view, besides the intentionally deleted one. The goal is to propagate deletion so as to maximize the number of tuples that remain in the view. In this paper, a view is defined by a self-join-free conjunctive query (sjf-CQ) over a schema with functional dependencies. A condition is formulated on the schema and view definition at hand, and the following dichotomy in complexity is established. If the condition is met, then deletion propagation is solvable in polynomial time by an extremely simple algorithm (very similar to the one observed by Buneman et al.). If the condition is violated, then the problem is NP-hard, and it is even hard to realize an approximation ratio that is better than some constant; moreover, deciding whether there is a side-effect-free solution is NP-complete. This result generalizes a recent result by Kimelfeld et al., who ignore functional dependencies. For the class of sjf-CQs, it also generalizes a result by Cong et al., stating that deletion propagation is in polynomial time if keys are preserved by the view.

...read moreread less

Journal Article•DOI•

Efficient and Progressive Algorithms for Distributed Skyline Queries over Uncertain Data

[...]

Xiaofeng Ding¹, Hai Jin¹•Institutions (1)

Huazhong University of Science and Technology¹

01 Aug 2012-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This paper proposes the notation of distributed skyline queries over uncertain data, and two communication- and computation-efficient algorithms are proposed to retrieve the qualified skylines from distributed local sites.

...read moreread less

Abstract: The skyline operator has received considerable attention from the database community, due to its importance in many applications including multicriteria decision making, preference answering, and so forth. In many applications where uncertain data are inherently exist, i.e., data collected from different sources in distributed locations are usually with imprecise measurements, and thus exhibit kind of uncertainty. Taking into account the network delay and economic cost associated with sharing and communicating large amounts of distributed data over an internet, an important problem in this scenario is to retrieve the global skyline tuples from all the distributed local sites with minimum communication cost. Based on the well-known notation of the probabilistic skyline query over centralized uncertain data, in this paper, we propose the notation of distributed skyline queries over uncertain data. Furthermore, two communication- and computation-efficient algorithms are proposed to retrieve the qualified skylines from distributed local sites. Extensive experiments have been conducted to verify the efficiency, the effectiveness and the progressiveness of our algorithms with both the synthetic and real data sets.

...read moreread less

Journal Article•DOI•

On the Complexity of View Update Analysis and Its Application to Annotation Propagation

[...]

Gao Cong¹, Wenfei Fan², Floris Geerts³, Jianzhong Li⁴, Jizhou Luo⁴ - Show less +1 more•Institutions (4)

Nanyang Technological University¹, University of Edinburgh², University of Antwerp³, Harbin Institute of Technology⁴

01 Mar 2012-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This paper investigates three problems identified in [1] for annotation propagation, namely, the view side-effect, source side- effect, and annotation placement problems, by considering several dichotomies and shows that key preserving views often simplify the propagation analysis.

...read moreread less

Abstract: This paper investigates three problems identified in [1] for annotation propagation, namely, the view side-effect, source side-effect, and annotation placement problems. Given annotations entered for a tuple or an attribute in a view, these problems ask what tuples or attributes in the source have to be annotated to produce the view annotations. As observed in [1], these problems are fundamental not only for data provenance but also for the management of view updates. For an annotation attached to a single existing tuple in a view, it has been shown that these problems are often intractable even for views defined in terms of simple SPJU queries [1]. We revisit these problems by considering several dichotomies: (1) views defined in various subclasses of SPJU, versus SPJU views under a practical key preserving condition; (2) annotations attached to existing tuples in a view versus annotations on tuples to be inserted into the view; and (3) a single-tuple annotation versus a group of annotations. We provide a complete picture of intractability and tractability for the three problems in all these settings. We show that key preserving views often simplify the propagation analysis. Indeed, some problems become tractable for certain key preserving views, as opposed to the intractability of their counterparts that are not key preserving. However, group annotations often make the analysis harder. In addition, the problems have quite diverse complexity when annotations are attached to existing tuples in a view and when they are entered for tuples to be inserted into the view.

...read moreread less

Journal Article•DOI•

Formalization of workflows for extracting bridge surveying goals from laser-scanned data

[...]

Pingbo Tang¹, Burcu Akinci²•Institutions (2)

Western Michigan University¹, Carnegie Mellon University²

01 Mar 2012-Automation in Construction

TL;DR: A case study is described to show the necessity and potential value of automating the manual data processing workflows being executed for extracting geometric data items (surveying goals) from 3D point clouds, and an approach for formalizing these workflows to enable such automation is presented.

...read moreread less

Proceedings Article•DOI•

Lookup Tables: Fine-Grained Partitioning for Distributed Databases

[...]

Aubrey Tatarowicz¹, Carlo Curino¹, Evan P. C. Jones¹, Samuel Madden¹•Institutions (1)

Massachusetts Institute of Technology¹

01 Apr 2012

TL;DR: This work presents the design of a data distribution layer that efficiently stores these tables and maintains them in the presence of inserts, deletes, and updates, and shows greater potential for further scale-out on Wikipedia, Twitter, and TPC-E workloads.

...read moreread less

Abstract: The standard way to get linear scaling in a distributed OLTP DBMS is to horizontally partition data across several nodes. Ideally, this partitioning will result in each query being executed at just one node, to avoid the overheads of distributed transactions and allow nodes to be added without increasing the amount of required coordination. For some applications, simple strategies, such as hashing on primary key, provide this property. Unfortunately, for many applications, including social networking and order-fulfillment, many-to-many relationships cause simple strategies to result in a large fraction of distributed queries. Instead, what is needed is a fine-grained partitioning, where related individual tuples (e.g., cliques of friends) are co-located together in the same partition. Maintaining such a fine-grained partitioning requires the database to store a large amount of metadata about which partition each tuple resides in. We call such metadata a lookup table, and present the design of a data distribution layer that efficiently stores these tables and maintains them in the presence of inserts, deletes, and updates. We show that such tables can provide scalability for several difficult to partition database workloads, including Wikipedia, Twitter, and TPC-E. Our implementation provides 40% to 300% better performance on these workloads than either simple range or hash partitioning and shows greater potential for further scale-out.

...read moreread less

Journal Article•DOI•

Data Cube Materialization and Mining over MapReduce

[...]

Arnab Nandi¹, Cong Yu², Philip Lewis Bohannon³, Raghu Ramakrishnan⁴•Institutions (4)

Ohio State University¹, Google², Facebook³, Microsoft⁴

01 Oct 2012-IEEE Transactions on Knowledge and Data Engineering

TL;DR: MR-Cube as discussed by the authors is a MapReduce-based framework for efficient cube computation and identification of interesting cube groups on holistic measures such as TOP-K, which can easily benefit from the recent advancement of parallel computing infrastructure.

...read moreread less

Abstract: Computing interesting measures for data cubes and subsequent mining of interesting cube groups over massive data sets are critical for many important analyses done in the real world. Previous studies have focused on algebraic measures such as SUM that are amenable to parallel computation and can easily benefit from the recent advancement of parallel computing infrastructure such as MapReduce. Dealing with holistic measures such as TOP-K, however, is nontrivial. In this paper, we detail real-world challenges in cube materialization and mining tasks on web-scale data sets. Specifically, we identify an important subset of holistic measures and introduce MR-Cube, a MapReduce-based framework for efficient cube computation and identification of interesting cube groups on holistic measures. We provide extensive experimental analyses over both real and synthetic data. We demonstrate that, unlike existing techniques which cannot scale to the 100 million tuple mark for our data sets, MR-Cube successfully and efficiently computes cubes with holistic measures over billion-tuple data sets.

...read moreread less

Journal Article•DOI•

Certain conjunctive query answering in first-order logic

[...]

Jef Wijsen¹•Institutions (1)

University of Mons¹

04 Jun 2012-ACM Transactions on Database Systems

TL;DR: This work obtains a decision procedure for first-order expressibility of CERTAINTY(q) when q is acyclic and without self-join, and shows that if CerTAINTY (q) is first-ordered expressible, its first- order definition, commonly called certain first-orders rewriting, can be constructed in a rather straightforward way.

...read moreread less

Abstract: Primary key violations provide a natural means for modeling uncertainty in the relational data model. A repair (or possible world) of a database is then obtained by selecting a maximal number of tuples without ever selecting two distinct tuples that have the same primary key value. For a Boolean query q, the problem CERTAINTY(q) takes as input a database db and asks whether q evaluates to true on every repair of db. We are interested in determining queries q for which CERTAINTY(q) is first-order expressible (and hence in the low complexity class AC°). For queries q in the class of conjunctive queries without self-join, we provide a necessary syntactic condition for first-order expressibility of CERTAINTY(q). For acyclic queries (in the sense of Beeri et al. l1983r), this necessary condition is also a sufficient condition. So we obtain a decision procedure for first-order expressibility of CERTAINTY(q) when q is acyclic and without self-join. We also show that if CERTAINTY(q) is first-order expressible, its first-order definition, commonly called certain first-order rewriting, can be constructed in a rather straightforward way.

...read moreread less

Journal Article•DOI•

Group skyline computation

[...]

Hyeonseung Im¹, Sungwoo Park¹•Institutions (1)

Pohang University of Science and Technology¹

01 Apr 2012-Information Sciences

TL;DR: A group skyline algorithm GDynamic is developed which is equivalent to a dynamic algorithm that fills a table of skyline groups that determines the dominance relation between two groups by comparing their aggregate values such as sums or averages of elements of individual dimensions.

...read moreread less

Journal Article•DOI•

Fundamentals of order dependencies

[...]

Jaroslaw Szlichta¹, Parke Godfrey¹, Jarek Gryz¹•Institutions (1)

York University¹

01 Jul 2012

TL;DR: It is proved that functional dependencies are subsumed by order dependencies and that the set of axioms for order dependencies is sound and complete.

...read moreread less

Abstract: Dependencies have played a significant role in database design for many years. They have also been shown to be useful in query optimization. In this paper, we discuss dependencies between lexicographically ordered sets of tuples. We introduce formally the concept of order dependency and present a set of axioms (inference rules) for them. We show how query rewrites based on these axioms can be used for query optimization. We present several interesting theorems that can be derived using the inference rules. We prove that functional dependencies are subsumed by order dependencies and that our set of axioms for order dependencies is sound and complete.

...read moreread less

Posted Content•

Fundamentals of Order Dependencies

[...]

Jaroslaw Szlichta¹, Parke Godfrey¹, Jarek Gryz¹•Institutions (1)

York University¹

01 Aug 2012-arXiv: Databases

TL;DR: In this article, the concept of order dependencies is introduced and a set of axioms (inference rules) for them are presented, which can be used for query rewrites.

...read moreread less

Book Chapter•DOI•

Classification: Advanced Methods

[...]

Jiawei Han¹, Micheline Kamber, Jian Pei•Institutions (1)

University of Illinois at Urbana–Champaign¹

01 Jan 2012

TL;DR: This chapter discusses the advanced techniques for data classification starting with Bayesian belief networks, which do not assume class conditional independence, and other approaches to classification, such as genetic algorithms, rough sets, and fuzzy logic techniques, are introduced.

...read moreread less

Abstract: This chapter discusses the advanced techniques for data classification starting with Bayesian belief networks, which do not assume class conditional independence. Bayesian belief networks allow class conditional independencies to be defined between subsets of variables. They provide a graphical model of causal relationships, on which learning can be performed. Trained Bayesian belief networks can be used for classification. Backpropagation is a neural network algorithm for classification that employs a method of gradient descent. It searches for a set of weights that can model the data so as to minimize the mean-squared distance between the network's class prediction and the actual class label of data tuples. Rules may be extracted from trained neural networks to help improve the interpretability of the learned network. In general terms, a neural network is a set of connected input/output units in which each connection has a weight associated with it. The weights are adjusted during the learning phase to help the network predict the correct class label of the input tuples. A more recent approach to classification known as support vector machines, a support vector machine transforms training data into a higher dimension, where it finds a hyperplane that separates the data by class using essential training tuples called support vectors. Pairs classification using frequent patterns, exploring relationships between attribute–value that occurs frequently in data is described. This methodology builds on research on frequent pattern mining. Lazy learners or instance-based methods of classification, such as nearest-neighbor classifiers and case-based reasoning classifiers, which store all of the training tuples in pattern space and wait until presented with a test tuple before performing generalization are also presented. Other approaches to classification, such as genetic algorithms, rough sets, and fuzzy logic techniques, are introduced. Multiclass classification, semi-supervised classification, active learning, and transfer learning are explored.

...read moreread less

Book Chapter•DOI•

Linda in space-time: an adaptive coordination model for mobile ad-hoc environments

[...]

Mirko Viroli¹, Danilo Pianini¹, Jacob Beal²•Institutions (2)

University of Bologna¹, BBN Technologies²

14 Jun 2012

TL;DR: This work defines a new coordination language that adds to the basic Linda primitives a small set of space-time constructs for linking coordination processes with their environment, and shows how this framework supports the global-level emergence of adaptive coordination policies.

...read moreread less

Abstract: We present a vision of distributed system coordination as a set of activities affecting the space-time fabric of interaction events. In the tuple space setting that we consider, coordination amounts to control of the spatial and temporal configuration of tuples spread across the network, which in turn drives the behaviour of situated agents. We therefore draw on prior work in spatial computing and distributed systems coordination, to define a new coordination language that adds to the basic Linda primitives a small set of space-time constructs for linking coordination processes with their environment. We show how this framework supports the global-level emergence of adaptive coordination policies, applying it to two example cases: crowd steering in a pervasive computing scenario and a gradient-based implementation of Linda primitives for mobile ad-hoc networks.

...read moreread less

Proceedings Article•DOI•

Same Queries, Different Data: Can We Predict Runtime Performance?

[...]

Adrian Daniel Popescu¹, Vuk Ercegovac², Andrey Balmin², Miguel Branco¹, Anastasia Ailamaki¹ - Show less +1 more•Institutions (2)

École Normale Supérieure¹, IBM²

01 Apr 2012

TL;DR: This work proposes a technique that predicts the runtime performance for a fixed set of queries running over varying input data sets by splitting each query into several segments where each segment's performance is estimated using machine learning models.

...read moreread less

Abstract: We consider MapReduce workloads that are produced by analytics applications. In contrast to ad hoc query workloads, analytics applications are comprised of fixed data flows that are run over newly arriving data sets or on different portions of an existing data set. Examples of such workloads include document analysis/indexing, social media analytics, and ETL (Extract Transform Load). Motivated by these workloads, we propose a technique that predicts the runtime performance for a fixed set of queries running over varying input data sets. Our prediction technique splits each query into several segments where each segment's performance is estimated using machine learning models. These per-segment estimates are plugged into a global analytical model to predict the overall query runtime. Our approach uses minimal statistics about the input data sets (e.g., tuple size, cardinality), which are complemented with historical information about prior query executions (e.g., execution time). We analyze the accuracy of predictions for several segment granularities on both standard analytical benchmarks such as TPC-DS [17], and on several real workloads. We obtain less than 25% prediction errors for 90% of predictions.

...read moreread less

Collapse