
Showing papers on "Tuple published in 2015"


Proceedings ArticleDOI
27 May 2015
TL;DR: Experimental results on both synthetic and real datasets show that BigDansing outperforms existing baseline systems up to more than two orders of magnitude without sacrificing the quality provided by the repair algorithms.
Abstract: Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system to tackle efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general purpose data processing platforms, ranging from DBMSs to MapReduce-like frameworks. A user-friendly programming interface allows users to express data quality rules both declaratively and procedurally, with no requirement of being aware of the underlying distributed platform. BigDansing translates these rules into a series of transformations that enable distributed computations and several optimizations, such as shared scans and specialized join operators. Experimental results on both synthetic and real datasets show that BigDansing outperforms existing baseline systems up to more than two orders of magnitude without sacrificing the quality provided by the repair algorithms.
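
The costly pairwise computation the abstract alludes to can be illustrated with a toy violation detector for a functional-dependency-style rule. This is only a hedged sketch of the naive pair enumeration that a system like BigDansing distributes and optimizes, not BigDansing's actual operators or API; the rule and column names are hypothetical.

```python
# Minimal sketch (not BigDansing's API): naive pairwise violation detection
# for a hypothetical rule "same zipcode implies same city".
from itertools import combinations

def detect_violations(tuples, lhs, rhs):
    """Return pairs of tuple ids that violate the FD lhs -> rhs."""
    violations = []
    for (i, t1), (j, t2) in combinations(enumerate(tuples), 2):
        if t1[lhs] == t2[lhs] and t1[rhs] != t2[rhs]:
            violations.append((i, j))
    return violations

rows = [
    {"zipcode": "10001", "city": "New York"},
    {"zipcode": "10001", "city": "NYC"},       # conflicts with the first row
    {"zipcode": "60601", "city": "Chicago"},
]
print(detect_violations(rows, "zipcode", "city"))  # [(0, 1)]
```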

146 citations


Proceedings ArticleDOI
03 Jun 2015
TL;DR: This work introduces FlashRelate, a synthesis engine that lets ordinary users extract structured relational data from spreadsheets without programming, and demonstrates its usefulness addressing the widespread problem of data trapped in corporate and government formats.
Abstract: With hundreds of millions of users, spreadsheets are one of the most important end-user applications. Spreadsheets are easy to use and allow users great flexibility in storing data. This flexibility comes at a price: users often treat spreadsheets as a poor man's database, leading to creative solutions for storing high-dimensional data. The trouble arises when users need to answer queries with their data. Data manipulation tools make strong assumptions about data layouts and cannot read these ad-hoc databases. Converting data into the appropriate layout requires programming skills or a major investment in manual reformatting. The effect is that a vast amount of real-world data is "locked-in" to a proliferation of one-off formats. We introduce FlashRelate, a synthesis engine that lets ordinary users extract structured relational data from spreadsheets without programming. Instead, users extract data by supplying examples of output relational tuples. FlashRelate uses these examples to synthesize a program in Flare. Flare is a novel extraction language that extends regular expressions with geometric constructs. An interactive user interface on top of FlashRelate lets end users extract data by point-and-click. We demonstrate that correct Flare programs can be synthesized in seconds from a small set of examples for 43 real-world scenarios. Finally, our case study demonstrates FlashRelate's usefulness addressing the widespread problem of data trapped in corporate and government formats.

136 citations


Journal ArticleDOI
TL;DR: In this article, the authors present an ABAC policy mining algorithm, which iterates over tuples in the given user-permission relation, uses selected tuples as seeds for constructing candidate rules, and attempts to generalize each candidate rule to cover additional tuples by replacing conjuncts in attribute expressions with constraints.
Abstract: Attribute-based access control (ABAC) provides a high level of flexibility that promotes security and information sharing. ABAC policy mining algorithms have potential to significantly reduce the cost of migration to ABAC, by partially automating the development of an ABAC policy from an access control list (ACL) policy or role-based access control (RBAC) policy with accompanying attribute data. This paper presents an ABAC policy mining algorithm. To the best of our knowledge, it is the first ABAC policy mining algorithm. Our algorithm iterates over tuples in the given user-permission relation, uses selected tuples as seeds for constructing candidate rules, and attempts to generalize each candidate rule to cover additional tuples in the user-permission relation by replacing conjuncts in attribute expressions with constraints. Our algorithm attempts to improve the policy by merging and simplifying candidate rules, and then it selects the highest-quality candidate rules for inclusion in the generated policy.
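
To make the seed-and-generalize loop concrete, here is a hedged Python sketch of the idea described in the abstract. It is not the paper's algorithm: rule-quality metrics, constraints relating user and resource attributes, and the merge/simplify phase are omitted, and all names (covers, mine_rules, the attribute dictionaries) are hypothetical.

```python
# Hedged sketch of the seed-and-generalize idea from the abstract, not the
# paper's exact algorithm. A "rule" maps attribute constraints to a permission.

def covers(rule, user_attrs, perm):
    """A rule covers a (user, permission) tuple if every constraint matches."""
    return rule["perm"] == perm and all(
        user_attrs.get(a) == v for a, v in rule["constraints"].items()
    )

def mine_rules(user_perm, users):
    """user_perm: set of (user, permission); users: user -> attribute dict."""
    rules, covered = [], set()
    for user, perm in sorted(user_perm):
        if (user, perm) in covered:
            continue
        # Seed rule: all attributes of the seed user, fully specific.
        rule = {"constraints": dict(users[user]), "perm": perm}
        # Greedy generalization: drop a constraint if the relaxed rule still
        # only grants permissions that appear in user_perm (no over-grant).
        for attr in list(rule["constraints"]):
            relaxed = {k: v for k, v in rule["constraints"].items() if k != attr}
            trial = {"constraints": relaxed, "perm": perm}
            grants = {(u, perm) for u, a in users.items() if covers(trial, a, perm)}
            if grants <= user_perm:
                rule = trial
        rules.append(rule)
        covered |= {(u, perm) for u, a in users.items() if covers(rule, a, perm)}
    return rules

users = {"alice": {"dept": "hr", "role": "manager"},
         "bob":   {"dept": "hr", "role": "clerk"}}
user_perm = {("alice", "approve_leave"), ("bob", "approve_leave")}
print(mine_rules(user_perm, users))  # generalizes to cover both users without over-granting
```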

107 citations


Journal ArticleDOI
TL;DR: The system, Graph Query By Example, automatically discovers a weighted hidden maximum query graph based on input query tuples, to capture a user’s query intent, and efficiently finds and ranks the top approximate matching answer graphs and answer tuples.
Abstract: We witness an unprecedented proliferation of knowledge graphs that record millions of entities and their relationships. While knowledge graphs are structure-flexible and content-rich, they are difficult to use. The challenge lies in the gap between their overwhelming complexity and the limited database knowledge of non-professional users. If writing structured queries over “simple” tables is difficult, complex graphs are only harder to query. As an initial step toward improving the usability of knowledge graphs, we propose to query such data by example entity tuples, without requiring users to form complex graph queries. Our system, Graph Query By Example (GQBE), automatically discovers a weighted hidden maximum query graph based on input query tuples, to capture a user’s query intent. It then efficiently finds and ranks the top approximate matching answer graphs and answer tuples. We conducted experiments and user studies on the large Freebase and DBpedia datasets and observed appealing accuracy and efficiency. Our system provides a complementary approach to the existing keyword-based methods, facilitating user-friendly graph querying. To the best of our knowledge, there was no such proposal in the past in the context of graphs.

105 citations


Proceedings ArticleDOI
01 Jul 2015
TL;DR: This paper introduces EVALution 1.0, a dataset designed for the training and the evaluation of Distributional Semantic Models (DSMs), which consists of almost 7.5K tuples, instantiating several semantic relations between word pairs.
Abstract: In this paper, we introduce EVALution 1.0, a dataset designed for the training and the evaluation of Distributional Semantic Models (DSMs). This version consists of almost 7.5K tuples, instantiating several semantic relations between word pairs (including hypernymy, synonymy, antonymy, meronymy). The dataset is enriched with a large amount of additional information (i.e. relation domain, word frequency, word POS, word semantic field, etc.) that can be used for either filtering the pairs or performing an in-depth analysis of the results. The tuples were extracted from a combination of ConceptNet 5.0 and WordNet 4.0, and subsequently filtered through automatic methods and crowdsourcing in order to ensure their quality. The dataset is freely downloadable. An extension in RDF format, including also scripts for data processing, is under development.

103 citations


Journal ArticleDOI
01 Oct 2015
TL;DR: The error-generation problem is surprisingly challenging, and in fact, NP-complete, and to provide a scalable solution, a correct and efficient greedy algorithm is developed that sacrifices completeness, but succeeds under very reasonable assumptions.
Abstract: We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation problem is surprisingly challenging, and in fact, NP-complete. To provide a scalable solution, we develop a correct and efficient greedy algorithm that sacrifices completeness, but succeeds under very reasonable assumptions. To scale to millions of tuples, the algorithm relies on several non-trivial optimizations, including a new symmetry property of data quality constraints. The trade-off between control and scalability is the main technical contribution of the paper.

69 citations


Proceedings ArticleDOI
27 May 2015
TL;DR: This work studies the problem of efficiently discovering top-k project-join queries which approximately contain the given example tuples in their output and extends the algorithms to incrementally produce results as soon as the user finishes typing/modifying a cell.
Abstract: An enterprise information worker is often aware of a few example tuples that should be present in the output of the query. Query discovery systems have been developed to discover project-join queries that contain the given example tuples in their output. However, they require the output to exactly contain all the example tuples and do not perform any ranking. To address this limitation, we study the problem of efficiently discovering top-k project-join queries which approximately contain the given example tuples in their output. We extend our algorithms to incrementally produce results as soon as the user finishes typing/modifying a cell. Our experiments on real-life and synthetic datasets show that our proposed solution is significantly more efficient compared with applying state-of-the-art algorithms.

66 citations


Journal ArticleDOI
TL;DR: This paper provides an interpretation of the extended Bonferroni mean (EBM) operator by assuming that some of the attributes A_i are related to a subset B_i of the set A \ {A_i}, and others have no relation with the remaining attributes.
Abstract: The classical Bonferroni mean, defined by Bonferroni in 1950, assumes a homogeneous relation among the attributes, i.e., each attribute A_i is related to the rest of the attributes A \ {A_i}, where A = {A_1, A_2, ..., A_n} denotes the attribute set. In this paper, we emphasize the importance of having an aggregation operator, which we will refer to as the extended Bonferroni mean (EBM) operator, to capture heterogeneous interrelationships among the attributes. We provide an interpretation of “heterogeneous interrelationship” by assuming that some of the attributes A_i are related to a subset B_i of the set A \ {A_i}, and others have no relation with the remaining attributes. We provide an interpretation of this operator as computing different aggregated values for a given set of inputs as the interrelationship pattern changes. We also investigate the behavior of the proposed EBM aggregation operator. Furthermore, to investigate a multiattribute group decision making (MAGDM) problem with linguistic information, we analyze the proposed EBM operator in the linguistic 2-tuple environment and develop three new linguistic aggregation operators: 2-tuple linguistic EBM, weighted 2-tuple linguistic EBM, and linguistic weighted 2-tuple linguistic EBM. A concept of a linguistic similarity measure of 2-tuple linguistic information is introduced. Subsequently, an MAGDM technique is developed, in which the attributes' weights are in the form of 2-tuple linguistic information and the experts' weight information is completely unknown. Finally, a practical example is presented to demonstrate the applicability of our results.
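
For reference, the classical Bonferroni mean that the paper extends can be computed directly from its definition. The sketch below uses my own notation and is only meant to ground the base formula; it does not reproduce the EBM or its 2-tuple linguistic variants.

```python
# Classical Bonferroni mean:
# BM^{p,q}(a_1..a_n) = ( (1/(n(n-1))) * sum_{i != j} a_i^p * a_j^q )^(1/(p+q)).
# The EBM restricts, per attribute, the inner sum to the related subset B_i.
def bonferroni_mean(values, p, q):
    n = len(values)
    total = sum(values[i] ** p * values[j] ** q
                for i in range(n) for j in range(n) if i != j)
    return (total / (n * (n - 1))) ** (1.0 / (p + q))

print(bonferroni_mean([0.2, 0.5, 0.9], p=1, q=1))  # lies between min and max of the inputs
```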

61 citations


Proceedings ArticleDOI
27 May 2015
TL;DR: This work presents QOCO, a novel query-oriented system for cleaning data with oracles, and shows that the problem of determining minimal interactions with oracle crowds to derive database edits for removing (adding) incorrect (missing) tuples from (to) the result of a query is NP-hard in general.
Abstract: As key decisions are often made based on information contained in a database, it is important for the database to be as complete and correct as possible. For this reason, many data cleaning tools have been developed to automatically resolve inconsistencies in databases. However, data cleaning tools provide only best-effort results and usually cannot eradicate all errors that may exist in a database. Even more importantly, existing data cleaning tools do not typically address the problem of determining what information is missing from a database. To overcome the limitations of existing data cleaning techniques, we present QOCO, a novel query-oriented system for cleaning data with oracles. Under this framework, incorrect (resp. missing) tuples are removed from (added to) the result of a query through edits that are applied to the underlying database, where the edits are derived by interacting with domain experts whom we model as oracle crowds. We show that the problem of determining minimal interactions with oracle crowds to derive database edits for removing (adding) incorrect (missing) tuples from (to) the result of a query is NP-hard in general and present heuristic algorithms that interact with oracle crowds. Finally, we implement our algorithms in our prototype system QOCO and show that it is effective and efficient through a comprehensive suite of experiments.

56 citations


Proceedings ArticleDOI
24 Jun 2015
TL;DR: DKG as discussed by the authors is an approach to key grouping that provides near-optimal load distribution for input streams with skewed value distribution, based on the simple observation that with such inputs the load balance is strongly driven by the most frequent values; it identifies such values and explicitly maps them to sub-streams together with groups of less frequent items.
Abstract: Key grouping is a technique used by stream processing frameworks to simplify the development of parallel stateful operators. Through key grouping, a stream of tuples is partitioned into several disjoint sub-streams depending on the values contained in the tuples themselves. Each operator instance, target of one sub-stream, is guaranteed to receive all the tuples containing a specific key value. A common solution to implement key grouping is through hash functions that, however, are known to cause load imbalances on the target operator instances when the input data stream is characterized by a skewed value distribution. In this paper we present DKG, a novel approach to key grouping that provides near-optimal load distribution for input streams with skewed value distribution. DKG starts from the simple observation that with such inputs the load balance is strongly driven by the most frequent values; it identifies such values and explicitly maps them to sub-streams together with groups of less frequent items to achieve a near-optimal load balance. We provide theoretical approximation bounds for the quality of the mapping derived by DKG and show, through both simulations and a running prototype, its impact on stream processing applications.
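
A hedged sketch of the heavy-hitter idea the abstract describes: explicitly place the most frequent keys on the least-loaded instances and hash the rest. This is not DKG itself (no frequency sketches, no approximation guarantees); the function names and the heavy_fraction threshold are hypothetical.

```python
# Frequent keys get an explicit mapping to the least-loaded worker; the long
# tail falls back to plain hashing. Illustrative only, not the DKG algorithm.
from collections import Counter

def build_mapping(sample_keys, n_workers, heavy_fraction=0.05):
    freq = Counter(sample_keys)
    threshold = heavy_fraction * len(sample_keys)
    load = [0.0] * n_workers
    explicit = {}
    # Place heavy hitters greedily on the currently least-loaded worker.
    for key, count in freq.most_common():
        if count < threshold:
            break
        w = min(range(n_workers), key=load.__getitem__)
        explicit[key] = w
        load[w] += count
    return explicit

def route(key, explicit, n_workers):
    return explicit.get(key, hash(key) % n_workers)

sample = ["a"] * 500 + ["b"] * 300 + list("cdefghij") * 25
mapping = build_mapping(sample, n_workers=4)
print(mapping, route("a", mapping, 4), route("z", mapping, 4))
```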

52 citations


Journal ArticleDOI
01 Nov 2015
TL;DR: This work shows a dichotomy for the complexity of resilience, which identifies previously unknown tractable families for deletion propagation with source side-effects, and extends this result to account for functional dependencies.
Abstract: Several research thrusts in the area of data management have focused on understanding how changes in the data affect the output of a view or standing query. Example applications are explaining query results, propagating updates through views, and anonymizing datasets. An important aspect of this analysis is the problem of deleting a minimum number of tuples from the input tables to make a given Boolean query false, which we refer to as "the resilience of a query." In this paper, we study the complexity of resilience for self-join-free conjunctive queries with arbitrary functional dependencies. The cornerstone of our work is the novel concept of triads, a simple structural property of a query that leads to the several dichotomy results we show in this paper. The concepts of triads and resilience bridge the connections between the problems of deletion propagation and causal responsibility, and allow us to substantially advance the known complexity results in these topics. Specifically, we show a dichotomy for the complexity of resilience, which identifies previously unknown tractable families for deletion propagation with source side-effects, and we extend this result to account for functional dependencies. Further, we identify a mistake in a previous dichotomy for causal responsibility, and offer a revised characterization based purely on the structural form of the query (presence or absence of triads). Finally, we extend the dichotomy for causal responsibility in two ways: (a) we account for functional dependencies in the input tables, and (b) we compute responsibility for sets of tuples specified via wildcards.

Proceedings ArticleDOI
29 Oct 2015
TL;DR: ScaleJoin is presented, an algorithmic construction for deterministic and parallel stream joins that not only guarantees deterministic, disjoint and skew-resilient parallelism, but also achieves higher throughput than state-of-the-art parallel stream joins.
Abstract: The inherently large and varying volumes of data generated to facilitate autonomous functionality in large scale cyber-physical systems demand near real-time processing of data streams, often as close to the sensing devices as possible. In this context, data streaming is imperative for data-intensive processing infrastructures. Stream joins, the streaming counterpart of database joins, compare tuples coming from different streams and constitute one of the most important and expensive data streaming operators. Dictated by the needs of big data streaming analytics, algorithmic implementations of stream joins have to be capable of efficiently processing bursty and rate-varying data streams in a deterministic and skew-resilient fashion. To leverage the design of modern multicore architectures, scalability and parallelism need to be addressed also in the algorithmic design. In this paper we present ScaleJoin, an algorithmic construction for deterministic and parallel stream joins that guarantees all the above properties, thus filling a gap in the existing state of the art. Key to the novelty of ScaleJoin is a new data structure, ScaleGate, and its lock-free implementation. ScaleGate facilitates concurrent data exchange and balances independent actions among processing threads; it also enables fine-grain parallelism while providing the necessary synchronization for deterministic processing. As a result, it allows ScaleJoin to run on an arbitrary number of processing threads that can evenly share the overall comparisons run in parallel and achieve high processing throughput and low processing latency. As we show, ScaleJoin not only guarantees deterministic, disjoint and skew-resilient parallelism, but also achieves higher throughput than state-of-the-art parallel stream joins.

Proceedings ArticleDOI
20 May 2015
TL;DR: It is shown that for any self-join-free Boolean conjunctive query q, it can be decided whether or not CERTAINTY(q) is in FO, P, or coNP-complete, and the complexity dichotomy is effective.
Abstract: A relational database is said to be uncertain if primary key constraints can possibly be violated. A repair (or possible world) of an uncertain database is obtained by selecting a maximal number of tuples without ever selecting two distinct tuples with the same primary key value. For any Boolean query q, CERTAINTY(q) is the problem that takes an uncertain database db as input, and asks whether q is true in every repair of db. The complexity of this problem has been particularly studied for q ranging over the class of self-join-free Boolean conjunctive queries. A research challenge is to determine, given q, whether CERTAINTY(q) belongs to complexity classes FO, P, or coNP-complete. In this paper, we combine existing techniques for studying the above complexity classification task. We show that for any self-join-free Boolean conjunctive query q, it can be decided whether or not CERTAINTY(q) is in FO. Further, for any self-join-free Boolean conjunctive query q, CERTAINTY(q) is either in P or coNP-complete, and the complexity dichotomy is effective. This settles a research question that has been open for ten years.

Proceedings ArticleDOI
20 May 2015
TL;DR: This paper proves that all γ-acyclic queries have polynomial time data complexity, and proves that, for every fragment FO^k, k ≥ 2, the combined complexity of FOMC (or WFOMC) is #P-complete.
Abstract: The FO Model Counting problem (FOMC) is the following: given a sentence Φ in FO and a number n, compute the number of models of Φ over a domain of size n; the Weighted variant (WFOMC) generalizes the problem by associating a weight to each tuple and defining the weight of a model to be the product of weights of its tuples. In this paper we study the complexity of the symmetric WFOMC, where all tuples of a given relation have the same weight. Our motivation comes from an important application, inference in Knowledge Bases with soft constraints, like Markov Logic Networks, but the problem is also of independent theoretical interest. We study both the data complexity and the combined complexity of FOMC and WFOMC. For the data complexity we prove the existence of an FO^3 formula for which FOMC is #P_1-complete, and the existence of a Conjunctive Query for which WFOMC is #P_1-complete. We also prove that all γ-acyclic queries have polynomial time data complexity. For the combined complexity, we prove that, for every fragment FO^k, k ≥ 2, the combined complexity of FOMC (or WFOMC) is #P-complete.

Journal ArticleDOI
TL;DR: This article presents a compiler and runtime system that automatically extracts data parallelism for general stream processing, showing linear scalability for parallel regions that are computation-bound and near-linear scalability when tuples are shuffled across parallel regions.
Abstract: Streaming applications process possibly infinite streams of data and often have both high throughput and low latency requirements. They are comprised of operator graphs that produce and consume data tuples. General streaming applications use stateful, selective, and user-defined operators. The stream programming model naturally exposes task and pipeline parallelism, enabling it to exploit parallel systems of all kinds, including large clusters. However, data parallelism must either be manually introduced by programmers, or extracted as an optimization by compilers. Previous data parallel optimizations did not apply to selective, stateful and user-defined operators. This article presents a compiler and runtime system that automatically extracts data parallelism for general stream processing. Data-parallelization is safe if the transformed program has the same semantics as the original sequential version. The compiler forms parallel regions while considering operator selectivity, state, partitioning, and graph dependencies. The distributed runtime system ensures that tuples always exit parallel regions in the same order they would without data parallelism, using the most efficient strategy as identified by the compiler. Our experiments using 100 cores across 14 machines show linear scalability for parallel regions that are computation-bound, and near linear scalability when tuples are shuffled across parallel regions.

Proceedings ArticleDOI
13 Apr 2015
TL;DR: This work proposes an on-demand strategy that only generates minimum forbidden tuples for validity checks as they are encountered, instead of generating all of them up front.
Abstract: Constraint handling is a challenging problem in combinatorial test generation. In general, there are two ways to handle constraints, i.e., constraint solving and forbidden tuples. In our earlier work, we proposed a constraint handling approach based on forbidden tuples for software product line systems consisting of only Boolean parameters. In this paper, we generalize this approach for general software systems that may consist of other types of parameter. The key idea of our approach is using the notion of minimum forbidden tuples to perform validity checks on both complete and partial tests. Furthermore, we propose an on-demand strategy that only generates minimum forbidden tuples for validity checks as they are encountered, instead of generating all of them up front. We implemented our generalized approach with and without the on-demand strategy in our combinatorial testing tool called ACTS. We performed experiments on 35 systems using ACTS and PICT. The results show that for these 35 systems, our generalized approach performed faster than PICT and the constraint solving-based approach in ACTS. For some large systems, the improvement on test generation time is up to two orders of magnitude.
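
The validity check that forbidden tuples enable can be shown in a few lines. This is a hedged sketch only: a (possibly partial) test is invalid if it already contains some forbidden tuple; deriving the minimum forbidden tuples from the constraints, and doing so on demand, is the paper's contribution and is not reproduced here. The parameters and values are hypothetical.

```python
# A (partial) test is invalid if it contains some forbidden tuple.
def violates(test, forbidden_tuples):
    """test: dict param -> value (may be partial).
    forbidden_tuples: iterable of dicts param -> value."""
    for ft in forbidden_tuples:
        if all(test.get(p) == v for p, v in ft.items()):
            return True
    return False

forbidden = [{"os": "ios", "browser": "edge"}]  # hypothetical constraint
print(violates({"os": "ios", "browser": "edge", "net": "wifi"}, forbidden))  # True
print(violates({"os": "ios"}, forbidden))                                    # False (partial test, not yet invalid)
```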

Posted Content
TL;DR: A theory of co-design is introduced that describes "design problems", defined as tuples of "functionality space", "implementation space", and "resources space", together with a feasibility relation that relates the three spaces.
Abstract: One of the challenges of modern engineering, and robotics in particular, is designing complex systems, composed of many subsystems, rigorously and with optimality guarantees. This paper introduces a theory of co-design that describes "design problems", defined as tuples of "functionality space", "implementation space", and "resources space", together with a feasibility relation that relates the three spaces. Design problems can be interconnected together to create "co-design problems", which describe possibly recursive co-design constraints among subsystems. A co-design problem induces a family of optimization problems of the type "find the minimal resources needed to implement a given functionality"; the solution is an antichain (Pareto front) of resources. A special class of co-design problems are Monotone Co-Design Problems (MCDPs), for which functionality and resources are complete partial orders and the feasibility relation is monotone and Scott continuous. The induced optimization problems are multi-objective, nonconvex, nondifferentiable, noncontinuous, and not even defined on continuous spaces; yet, there exists a complete solution. The antichain of minimal resources can be characterized as a least fixed point, and it can be computed using Kleene's algorithm. The computation needed to solve a co-design problem can be bounded by a function of a graph property that quantifies the interdependence of the subproblems. These results make us much more optimistic about the problem of designing complex systems in a rigorous way.

Proceedings ArticleDOI
01 Jul 2015
TL;DR: This work revisits the fundamental notion of a key in relational databases with NULLs, and investigates the notions of possible and certain keys, which are keys that hold in some or all possible worlds that can originate from an SQL table, respectively.
Abstract: Driven by the dominance of the relational model, the requirements of modern applications, and the veracity of data, we revisit the fundamental notion of a key in relational databases with NULLs. In SQL database systems primary key columns are NOT NULL by default. NULL columns may occur in unique constraints which only guarantee uniqueness for tuples which do not feature null markers in any of the columns involved, and therefore serve a different function than primary keys. We investigate the notions of possible and certain keys, which are keys that hold in some or all possible worlds that can originate from an SQL table, respectively. Possible keys coincide with the unique constraint of SQL, and thus provide a semantics for their syntactic definition in the SQL standard. Certain keys extend primary keys to include NULL columns, and thus form a sufficient and necessary condition to identify tuples uniquely, while primary keys are only sufficient for that purpose. In addition to basic characterization, axiomatization, and simple discovery approaches for possible and certain keys, we investigate the existence and construction of Armstrong tables, and describe an indexing scheme for enforcing certain keys. Our experiments show that certain keys with NULLs do occur in real-world databases, and that related computational problems can be solved efficiently. Certain keys are therefore semantically well-founded and able to maintain data quality in the form of Codd's entity integrity rule while handling the requirements of modern applications, that is, higher volumes of incomplete data from different formats.
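
The definitions of possible and certain keys translate directly into pairwise checks over a table with NULLs. The sketch below follows the possible-world reading given in the abstract (None stands for NULL); it is a brute-force illustration, not the paper's discovery or enforcement machinery.

```python
# Certain key: unique in every possible world -- every pair of rows must
# differ on some key column where both values are non-NULL.
# Possible key (SQL UNIQUE): unique in some world -- no two rows may agree on
# all key columns with all values non-NULL.
from itertools import combinations

def is_certain_key(rows, key):
    return all(
        any(r1[c] is not None and r2[c] is not None and r1[c] != r2[c] for c in key)
        for r1, r2 in combinations(rows, 2)
    )

def is_possible_key(rows, key):
    return not any(
        all(r1[c] is not None and r1[c] == r2[c] for c in key)
        for r1, r2 in combinations(rows, 2)
    )

rows = [{"id": 1, "email": None}, {"id": 2, "email": None}]
print(is_certain_key(rows, ["id"]), is_certain_key(rows, ["email"]))  # True False
print(is_possible_key(rows, ["email"]))                               # True
```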

Journal ArticleDOI
01 Aug 2015
TL;DR: The demonstration will center on the first, early prototype of DataSpread, a data exploration tool that holistically unifies databases and spreadsheets, and will give the attendees a sense of the enormous data exploration capabilities offered by unifying spreadsheets and databases.
Abstract: Spreadsheet software is often the tool of choice for ad-hoc tabular data management, processing, and visualization, especially on tiny data sets. On the other hand, relational database systems offer significant power, expressivity, and efficiency over spreadsheet software for data management, while lacking in the ease of use and ad-hoc analysis capabilities. We demonstrate DataSpread, a data exploration tool that holistically unifies databases and spreadsheets. It continues to offer a Microsoft Excel-based spreadsheet front-end, while in parallel managing all the data in a back-end database, specifically, PostgreSQL. DataSpread retains all the advantages of spreadsheets, including ease of use, ad-hoc analysis and visualization capabilities, and a schema-free nature, while also adding the advantages of traditional relational databases, such as scalability and the ability to use arbitrary SQL to import, filter, or join external or internal tables and have the results appear in the spreadsheet. DataSpread needs to reason about and reconcile differences in the notions of schema, addressing of cells and tuples, and the current "pane" (which exists in spreadsheets but not in traditional databases), and support data modifications at both the front-end and the back-end. Our demonstration will center on our first, early prototype of DataSpread, and will give the attendees a sense of the enormous data exploration capabilities offered by unifying spreadsheets and databases.

Proceedings ArticleDOI
27 May 2015
TL;DR: This work showcases AQ-K-slack, an adaptive, buffer-based disorder handling approach, which supports executing sliding window aggregate queries over out-of-order data streams in a quality-driven manner and dynamically adjusts the input buffer size at query runtime to minimize the result latency.
Abstract: Executing continuous queries over out-of-order data streams, where tuples are not ordered according to timestamps, is challenging, because high result accuracy and low result latency are two conflicting performance metrics. Although many applications allow trading exact query results for lower latency, they still expect the produced results to meet a certain quality requirement. However, none of the existing disorder handling approaches have considered minimizing the result latency while meeting user-specified requirements on the quality of query results. In this demonstration, we showcase AQ-K-slack, an adaptive, buffer-based disorder handling approach, which supports executing sliding window aggregate queries over out-of-order data streams in a quality-driven manner. By adapting techniques from the field of sampling-based approximate query processing and control theory, AQ-K-slack dynamically adjusts the input buffer size at query runtime to minimize the result latency, while respecting a user-specified threshold on relative errors in produced query results. We demonstrate a prototype stream processing system, which extends SAP Event Stream Processor with the implementation of AQ-K-slack. Through an interactive interface, the audience will learn the effect of different factors, such as the aggregate function, the window specification, the result error threshold, and stream properties, on the latency and the accuracy of query results. Moreover, they can experience the effectiveness of AQ-K-slack in obtaining user-desired latency vs. result accuracy trade-offs, compared to naive disorder handling approaches that make extreme trade-offs. For instance, by sacrificing 1% result accuracy, our system can reduce the result latency by 80% when compared to the state of the art.
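
For context, a plain K-slack buffer, the classical disorder-handling baseline that AQ-K-slack adapts, can be sketched as follows. The quality-driven, runtime adjustment of the buffer size that the demonstration showcases is not reproduced here, and the class and parameter names are hypothetical.

```python
# Plain K-slack: hold tuples until the high-watermark (max seen timestamp
# minus K) passes their timestamp, then emit them in timestamp order.
import heapq

class KSlackBuffer:
    def __init__(self, k):
        self.k = k           # slack in timestamp units
        self.max_ts = None
        self.heap = []       # min-heap ordered by timestamp

    def insert(self, ts, value):
        """Buffer one tuple; return the tuples that can now be safely emitted."""
        heapq.heappush(self.heap, (ts, value))
        self.max_ts = ts if self.max_ts is None else max(self.max_ts, ts)
        out = []
        while self.heap and self.heap[0][0] <= self.max_ts - self.k:
            out.append(heapq.heappop(self.heap))
        return out

buf = KSlackBuffer(k=2)
for ts, v in [(1, "a"), (3, "b"), (2, "c"), (6, "d")]:
    print(ts, "->", buf.insert(ts, v))
```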

Journal ArticleDOI
TL;DR: A crowdsourcing-based framework to evaluate teaching quality in the classroom, using a weighted average operator to aggregate information from students' questionnaires described by linguistic 2-tuple terms; it provides a strong tolerance for abnormal students and makes the evaluation more accurate.
Abstract: Crowdsourcing is widely used in various fields to collect goods and services from large groups of participants. Evaluating teaching quality by collecting feedback from experts or students after class is not only delayed but also inaccurate. In this paper, we present a crowdsourcing-based framework to evaluate teaching quality in the classroom, using a weighted average operator to aggregate information from students' questionnaires described by linguistic 2-tuple terms. We then define a crowd grade based on similarity degree to distinguish the contributions of different students and minimize abnormal students' impact on the evaluation. The crowd grade is updated at the end of each feedback round, which keeps the evaluation accurate. Moreover, a simulated case is shown to illustrate how to apply this framework to assess teaching quality in the classroom. Finally, we developed a prototype and carried out experiments on a series of real questionnaires and two sets of modified data. The results show that teachers can locate the weak points of teaching and, furthermore, identify abnormal students to improve the teaching quality. Meanwhile, our approach provides a strong tolerance for abnormal students, making the evaluation more accurate.

Journal ArticleDOI
01 Sep 2015
TL;DR: This paper introduces fast inequality join algorithms that put columns to be joined in sorted arrays, use permutation arrays to encode positions of tuples in one sorted array w.r.t. the other sorted array, and use space-efficient bit-arrays that enable optimizations for fast computation of the join results.
Abstract: Inequality joins, which join relational tables on inequality conditions, are used in various applications. While there have been a wide range of optimization methods for joins in database systems, from algorithms such as sort-merge join and band join, to various indices such as B+-tree, R*-tree and Bitmap, inequality joins have received little attention and queries containing such joins are usually very slow. In this paper, we introduce fast inequality join algorithms. We put columns to be joined in sorted arrays and we use permutation arrays to encode positions of tuples in one sorted array w.r.t. the other sorted array. In contrast to sort-merge join, we use space efficient bit-arrays that enable optimizations, such as Bloom filter indices, for fast computation of the join results. We have implemented a centralized version of these algorithms on top of PostgreSQL, and a distributed version on top of Spark SQL. We have compared against well known optimization techniques for inequality joins and show that our solution is more scalable and several orders of magnitude faster.
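
To make the role of sorting concrete, here is a deliberately simplified sketch for a single inequality predicate: sort one side once, then answer each probe with binary search. The paper's IEJoin goes further, using permutation arrays and bit-arrays to evaluate two inequality predicates at once; that machinery is not reproduced here.

```python
# Much-simplified illustration of the sorted-array idea for a *single*
# inequality predicate (r.x < s.x).
import bisect

def lt_join(r_vals, s_vals):
    """Return (i, j) pairs with r_vals[i] < s_vals[j]."""
    s_sorted = sorted((v, j) for j, v in enumerate(s_vals))
    keys = [v for v, _ in s_sorted]
    out = []
    for i, rv in enumerate(r_vals):
        # Every S tuple strictly greater than rv qualifies.
        for v, j in s_sorted[bisect.bisect_right(keys, rv):]:
            out.append((i, j))
    return out

print(lt_join([5, 8], [6, 9, 3]))  # [(0, 0), (0, 1), (1, 1)]
```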

Proceedings ArticleDOI
20 May 2015
TL;DR: In this paper, the authors propose a framework for why-not explanations, that is, explanations for why a tuple is missing from a query result, which can either be provided by the user, or it may be automatically derived from the data and/or schema.
Abstract: We propose a novel foundational framework for why-not explanations, that is, explanations for why a tuple is missing from a query result. Our why-not explanations leverage concepts from an ontology to provide high-level and meaningful reasons for why a tuple is missing from the result of a query. A key algorithmic problem in our framework is that of computing a most-general explanation for a why-not question, relative to an ontology, which can either be provided by the user, or it may be automatically derived from the data and/or schema. We study the complexity of this problem and associated problems, and present concrete algorithms for computing why-not explanations. In the case where an external ontology is provided, we first show that the problem of deciding the existence of an explanation to a why-not question is NP-complete in general. However, the problem is solvable in polynomial time for queries of bounded arity, provided that the ontology is specified in a suitable language, such as a member of the DL-Lite family of description logics, which allows for efficient concept subsumption checking. Furthermore, we show that a most-general explanation can be computed in polynomial time in this case. In addition, we propose a method for deriving a suitable (virtual) ontology from a database and/or a schema, and we present an algorithm for computing a most-general explanation to a why-not question, relative to such ontologies. This algorithm runs in polynomial time in the case when concepts are defined in a selection-free language, or if the underlying schema is fixed. Finally, we also study the problem of computing short most-general explanations, and we briefly discuss alternative definitions of what it means to be an explanation, and to be most general.

Journal ArticleDOI
TL;DR: This paper introduces CR-OLAP, a scalable cloud-based real-time OLAP system based on a new distributed index structure for OLAP, the distributed PDCR tree, and studies the use of parallel computing on scalable clouds to accelerate queries.

Journal ArticleDOI
TL;DR: A new robust database watermarking scheme whose originality rests on a semantic control of the data distortion and on the extension of quantization index modulation (QIM) to circular histograms of numerical attributes.
Abstract: In this paper, we present a new robust database watermarking scheme whose originality rests on a semantic control of the data distortion and on the extension of quantization index modulation (QIM) to circular histograms of numerical attributes. The semantic distortion control of the embedding process we propose relies on the identification of existing semantic links between values of attributes in a tuple by means of an ontology. By doing so, we avoid incoherent or very rare record occurrences which may bias data interpretation or betray the presence of the watermark. We then adapt QIM to database watermarking. Watermark embedding is conducted by modulating the relative angular position of the circular histogram center of mass of one numerical attribute. We theoretically demonstrate the robustness of our scheme against the most common attacks (i.e., tuple insertion and deletion). This makes it suitable for copyright protection, owner identification, or traitor tracing purposes. We further verify these theoretical limits experimentally within the framework of a medical database of more than half a million inpatient hospital stay records. Under the assumption imposed by the central limit theorem, experimental results fit the theory. We also compare our approach with two efficient schemes so as to prove its benefits.
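
The building block of the embedding, scalar quantization index modulation, can be sketched independently of the database setting. The code below shows plain QIM on a single value (thought of here as the center-of-mass angle); the circular-histogram construction, the semantic distortion control, and the robustness analysis are not reproduced, and all names are mine.

```python
# Plain scalar QIM: embed a bit by snapping the value onto one of two
# interleaved quantization lattices; detect by finding the closer lattice.
def qim_embed(value, bit, step):
    """Quantize onto the lattice associated with `bit` (dither = step/2)."""
    offset = 0.0 if bit == 0 else step / 2.0
    return round((value - offset) / step) * step + offset

def qim_detect(value, step):
    """Return the bit whose lattice is closest to `value`."""
    d0 = abs(value - qim_embed(value, 0, step))
    d1 = abs(value - qim_embed(value, 1, step))
    return 0 if d0 <= d1 else 1

angle = 0.37                                 # e.g. center-of-mass angle in radians
marked = qim_embed(angle, 1, step=0.2)
print(marked, qim_detect(marked, step=0.2))  # ~0.3, 1
```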

Proceedings ArticleDOI
23 Oct 2015
TL;DR: This work presents a novel approach for solving the field-sensitive points-to problem for Java with the means of a transitive-closure data-structure, and a pre-computed set of potentially matching load/store pairs to accelerate the fix-point calculation.
Abstract: Computing a precise points-to analysis for very large Java programs remains challenging despite the large body of research on points-to analysis. Any approach must solve an underlying dynamic graph reachability problem, for which the best algorithms have near-cubic worst-case runtime complexity, and, hence, previous work does not scale to programs with millions of lines of code. In this work, we present a novel approach for solving the field-sensitive points-to problem for Java with the means of (1) a transitive-closure data-structure, and (2) a pre-computed set of potentially matching load/store pairs to accelerate the fix-point calculation. Experimentation on Java benchmarks validates the superior performance of our approach over the standard context-free language reachability implementations. Our approach computes a points-to index for the OpenJDK with over 1.5 billion tuples in under a minute.

Journal ArticleDOI
01 Apr 2015
TL;DR: Experimental results demonstrate that FastRAQ provides range-aggregate query results within a time period two orders of magnitude lower than that of Hive, while the relative error is less than 3 percent within the given confidence interval.
Abstract: Range-aggregate queries apply a certain aggregate function on all tuples within given query ranges. Existing approaches to range-aggregate queries are insufficient to quickly provide accurate results in big data environments. In this paper, we propose FastRAQ, a fast approach to range-aggregate queries in big data environments. FastRAQ first divides big data into independent partitions with a balanced partitioning algorithm, and then generates a local estimation sketch for each partition. When a range-aggregate query request arrives, FastRAQ obtains the result directly by summarizing local estimates from all partitions. FastRAQ has O(1) time complexity for data updates and O(N/(P×B)) time complexity for range-aggregate queries, where N is the number of distinct tuples for all dimensions, P is the partition number, and B is the bucket number in the histogram. We implement the FastRAQ approach on the Linux platform, and evaluate its performance with about 10 billion data records. Experimental results demonstrate that FastRAQ provides range-aggregate query results within a time period two orders of magnitude lower than that of Hive, while the relative error is less than 3 percent within the given confidence interval.
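
The summarize-local-estimates pattern in the abstract can be shown with a toy example in which each partition simply scans its own tuples; the balanced partitioning algorithm and the histogram-based local sketches that give FastRAQ its complexity bounds are not reproduced here.

```python
# Each partition answers the range-aggregate locally; the results are summed.
def local_estimate(partition, lo, hi):
    return sum(v for v in partition if lo <= v <= hi)

def range_aggregate(partitions, lo, hi):
    return sum(local_estimate(p, lo, hi) for p in partitions)

partitions = [[1, 7, 12], [3, 25, 8], [14, 2]]
print(range_aggregate(partitions, lo=2, hi=13))  # 7+12+3+8+2 = 32
```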

Journal ArticleDOI
TL;DR: This work presents a novel method for watermarking relational databases for identification and proof of ownership based on the secure embedding of blind and multi-bit watermarks using Bacterial Foraging Algorithm (BFA).
Abstract: The main aspect of database protection is to prove the ownership of data, that is, to establish who originated it. This is of particular importance in the case of electronic data, as data sets are often modified and copied without proper citation or acknowledgement of the originating data set. We present a novel method for watermarking relational databases for identification and proof of ownership based on the secure embedding of blind and multi-bit watermarks using the Bacterial Foraging Algorithm (BFA). The feasibility of the BFA implementation is shown within the framework of the database watermarking application. The owner's identification is cryptographically secured and used as the embedded watermark. An improved hash partitioning approach, independent of the primary key of the database, is used to secure the ordering of the tuples. The strength of BFA is explored to make the technique robust, secure, and imperceptible. BFA is implemented to give nearly global optimal values bounded by data usability constraints and thus makes the database fragile to any attack. The parameters of BFA are tuned to reduce the execution time. BFA is experimentally proved to be a better solution than the Genetic Algorithm (GA). The proposed technique is experimentally proved to be resilient against malicious attacks.

Book ChapterDOI
18 May 2015
TL;DR: This paper proposes to authorize entries in tables to contain simple arithmetic constraints, replacing classical tuples of values by so-called smart tuples, and demonstrates that the smart table constraint is a highly promising general purpose tool for CP.
Abstract: Table Constraints are very useful for modeling combinatorial problems in Constraint Programming (CP). They are a universal mechanism for representing constraints, but unfortunately the size of their tables can grow exponentially with their arities. In this paper, we propose to authorize entries in tables to contain simple arithmetic constraints, replacing classical tuples of values by so-called smart tuples. Smart table constraints can thus be viewed as logical combinations of those simple arithmetic constraints. This new form of tuples allows us to encode compactly many constraints, including a dozen well-known global constraints. We show that, under a very reasonable assumption about the acyclicity of smart tuples, a Generalized Arc Consistency algorithm of low time complexity can be devised. Our experimental results demonstrate that the smart table constraint is a highly promising general purpose tool for CP.
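
A hedged illustration of what a smart tuple looks like as data, restricted to unary conditions against constants (the paper also allows constraints between columns): only the membership test is shown; the low-complexity GAC propagation algorithm is the paper's contribution and is not reproduced.

```python
# A "smart tuple" as a list of simple per-variable conditions:
# '*' accepts any value, otherwise a pair (operator, constant) must hold.
import operator

OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def matches(values, smart_tuple):
    for v, cond in zip(values, smart_tuple):
        if cond == "*":
            continue
        op, const = cond
        if not OPS[op](v, const):
            return False
    return True

# Constraint over (x, y, z): tuples where x = 3, y is free, z >= 5.
smart = [("=", 3), "*", (">=", 5)]
print(matches((3, 9, 7), smart), matches((3, 9, 2), smart))  # True False
```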

Posted Content
TL;DR: This paper addresses binary Visual Question Answering on abstract scenes as visual verification of concepts inquired in the questions by converting the question to a tuple that concisely summarizes the visual concept to be detected in the image.
Abstract: The complex compositional structure of language makes problems at the intersection of vision and language challenging. But language also provides a strong prior that can result in good superficial performance, without the underlying models truly understanding the visual content. This can hinder progress in pushing state of art in the computer vision aspects of multi-modal AI. In this paper, we address binary Visual Question Answering (VQA) on abstract scenes. We formulate this problem as visual verification of concepts inquired in the questions. Specifically, we convert the question to a tuple that concisely summarizes the visual concept to be detected in the image. If the concept can be found in the image, the answer to the question is "yes", and otherwise "no". Abstract scenes play two roles (1) They allow us to focus on the high-level semantics of the VQA task as opposed to the low-level recognition problems, and perhaps more importantly, (2) They provide us the modality to balance the dataset such that language priors are controlled, and the role of vision is essential. In particular, we collect fine-grained pairs of scenes for every question, such that the answer to the question is "yes" for one scene, and "no" for the other for the exact same question. Indeed, language priors alone do not perform better than chance on our balanced dataset. Moreover, our proposed approach matches the performance of a state-of-the-art VQA approach on the unbalanced dataset, and outperforms it on the balanced dataset.