
Showing papers on "Tuple" published in 2016


Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, the authors address binary visual question answering (VQA) on abstract scenes by converting the question to a tuple that concisely summarizes the visual concept to be detected in the image.
Abstract: The complex compositional structure of language makes problems at the intersection of vision and language challenging. But language also provides a strong prior that can result in good superficial performance, without the underlying models truly understanding the visual content. This can hinder progress in pushing the state of the art in the computer vision aspects of multi-modal AI. In this paper, we address binary Visual Question Answering (VQA) on abstract scenes. We formulate this problem as visual verification of concepts inquired about in the questions. Specifically, we convert the question to a tuple that concisely summarizes the visual concept to be detected in the image. If the concept can be found in the image, the answer to the question is "yes", and otherwise "no". Abstract scenes play two roles: (1) they allow us to focus on the high-level semantics of the VQA task as opposed to the low-level recognition problems, and, perhaps more importantly, (2) they provide us the modality to balance the dataset such that language priors are controlled and the role of vision is essential. In particular, we collect fine-grained pairs of scenes for every question, such that the answer to the question is "yes" for one scene and "no" for the other for the exact same question. Indeed, language priors alone do not perform better than chance on our balanced dataset. Moreover, our proposed approach matches the performance of a state-of-the-art VQA approach on the unbalanced dataset, and outperforms it on the balanced dataset.
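The tuple-based formulation lends itself to a very small sketch: assuming a question has already been parsed into a (primary object, relation, secondary object) triple and the abstract scene is represented as a set of such triples (both representations are illustrative assumptions, not the paper's actual parser or scene format), binary VQA reduces to checking the scene for the concept.

```python
# Hypothetical sketch: binary VQA as visual verification of a concept tuple.
# The tuple schema and the toy scene representation are illustrative assumptions.

def answer(question_tuple, scene_relations):
    """Return 'yes' if the concept tuple can be found in the abstract scene."""
    return "yes" if question_tuple in scene_relations else "no"

scene = {("dog", "next to", "tree"), ("boy", "holding", "ball")}
print(answer(("boy", "holding", "ball"), scene))   # yes
print(answer(("boy", "holding", "kite"), scene))   # no
```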

290 citations


Proceedings ArticleDOI
01 Jan 2016
TL;DR: This paper presents an inference attack, using real datasets, where an adversary leverages the probabilistic dependence between tuples to extract users’ sensitive information from differentially private query results (violating the DP guarantees), and introduces the notion of dependent differential privacy (DDP) that accounts for the dependence that exists between tuples and proposes a dependent perturbation mechanism (DPM) to achieve the privacy guarantees in DDP.
Abstract: Differential privacy (DP) is a widely accepted mathematical framework for protecting data privacy. Simply stated, it guarantees that the distribution of query results changes only slightly due to the modification of any one tuple in the database. This allows protection, even against powerful adversaries, who know the entire database except one tuple. For providing this guarantee, differential privacy mechanisms assume independence of tuples in the database – a vulnerable assumption that can lead to degradation in expected privacy levels especially when applied to real-world datasets that manifest natural dependence owing to various social, behavioral, and genetic relationships between users. In this paper, we make several contributions that not only demonstrate the feasibility of exploiting the above vulnerability but also provide steps towards mitigating it. First, we present an inference attack, using real datasets, where an adversary leverages the probabilistic dependence between tuples to extract users’ sensitive information from differentially private query results (violating the DP guarantees). Second, we introduce the notion of dependent differential privacy (DDP) that accounts for the dependence that exists between tuples and propose a dependent perturbation mechanism (DPM) to achieve the privacy guarantees in DDP. Finally, using a combination of theoretical analysis and extensive experiments involving different classes of queries (e.g., machine learning queries, graph queries) issued over multiple large-scale real-world datasets, we show that our DPM consistently outperforms state-of-the-art approaches in managing the privacy-utility tradeoffs for dependent data.
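As a rough illustration of why tuple dependence matters, the sketch below contrasts the standard Laplace mechanism with a dependent variant that simply inflates the noise scale by a dependence factor; the factor and the way it is applied are illustrative assumptions, not the paper's DPM calibration.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Standard DP: noise scaled to the effect of modifying a single tuple."""
    return true_answer + rng.laplace(scale=sensitivity / epsilon)

def dependent_perturbation(true_answer, sensitivity, epsilon, dependence_factor):
    """Illustrative DDP-style variant: if changing one tuple can drag along up to
    `dependence_factor` correlated tuples, inflate the noise scale accordingly."""
    return true_answer + rng.laplace(scale=dependence_factor * sensitivity / epsilon)

count = 1200  # e.g., number of users with some property
print(laplace_mechanism(count, sensitivity=1, epsilon=0.1))
print(dependent_perturbation(count, sensitivity=1, epsilon=0.1, dependence_factor=5))
```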

184 citations


Proceedings ArticleDOI
01 Aug 2016
TL;DR: This work develops neural network models for scoring tuples on arbitrary phrases and evaluates them by their ability to distinguish true held-out tuples from false ones and finds strong performance from a bilinear model using a simple additive architecture to model phrases.
Abstract: We enrich a curated resource of commonsense knowledge by formulating the problem as one of knowledge base completion (KBC). Most work in KBC focuses on knowledge bases like Freebase that relate entities drawn from a fixed set. However, the tuples in ConceptNet (Speer and Havasi, 2012) define relations between an unbounded set of phrases. We develop neural network models for scoring tuples on arbitrary phrases and evaluate them by their ability to distinguish true held-out tuples from false ones. We find strong performance from a bilinear model using a simple additive architecture to model phrases. We manually evaluate our trained model’s ability to assign quality scores to novel tuples, finding that it can propose tuples at the same quality level as medium-confidence tuples from ConceptNet.
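A minimal sketch of the kind of scorer described above: each phrase is represented by averaging its word vectors (the "simple additive architecture") and a tuple is scored with a relation-specific bilinear form. The toy vocabulary, relation set, and random parameters are assumptions for illustration, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
vocab = {w: rng.normal(size=dim) for w in "person drink water cup hold".split()}
relation_matrices = {"UsedFor": rng.normal(size=(dim, dim)),
                     "CapableOf": rng.normal(size=(dim, dim))}

def phrase_vector(phrase):
    """Additive architecture: a phrase is the average of its word vectors."""
    return np.mean([vocab[w] for w in phrase.split()], axis=0)

def score(phrase1, relation, phrase2):
    """Bilinear score u^T M_R v; higher should mean a more plausible tuple."""
    u, v = phrase_vector(phrase1), phrase_vector(phrase2)
    return float(u @ relation_matrices[relation] @ v)

print(score("cup", "UsedFor", "hold water"))
```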

181 citations


Proceedings Article
09 Jul 2016
TL;DR: A decade of progress on building Open IE extractors is described, which results in the latest extractor, OPENIE4, which is computationally efficient, outputs n-ary and nested relations, and also outputs relations mediated by nouns in addition to verbs.
Abstract: Open Information Extraction (Open IE) extracts textual tuples comprising relation phrases and argument phrases from within a sentence, without requiring a pre-specified relation vocabulary. In this paper we first describe a decade of our progress on building Open IE extractors, which results in our latest extractor, OPENIE4, which is computationally efficient, outputs n-ary and nested relations, and also outputs relations mediated by nouns in addition to verbs. We also identify several strengths of the Open IE paradigm, which enable it to be a useful intermediate structure for end tasks. We survey its use in both human-facing applications and downstream NLP tasks, including event schema induction, sentence similarity, text comprehension, learning word vector embeddings, and more.
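To make the n-ary output format concrete, here is a hand-written example of the kind of tuple an Open IE extractor produces from a sentence; it is illustrative only, not actual OPENIE4 output.

```python
# Hand-written illustration of Open IE style output (not actual OPENIE4 output):
# each extraction pairs a relation phrase with an ordered list of argument phrases.
sentence = "Barack Obama gave a speech to the UN in New York in 2015."
extractions = [
    ("gave", ["Barack Obama", "a speech", "to the UN", "in New York", "in 2015"]),  # n-ary
]
for rel, args in extractions:
    print(rel, args)
```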

175 citations


Proceedings ArticleDOI
14 Jun 2016
TL;DR: PrivTree is a histogram construction algorithm that adopts hierarchical decomposition but completely eliminates the dependency on a pre-defined h, and exploits a new analysis on the Laplace distribution, which enables it to use only a constant amount of noise in deciding whether a sub-domain should be split, without worrying about the recursion depth of splitting.
Abstract: Given a set D of tuples defined on a domain Omega, we study differentially private algorithms for constructing a histogram over Omega to approximate the tuple distribution in D. Existing solutions for the problem mostly adopt a hierarchical decomposition approach, which recursively splits Omega into sub-domains and computes a noisy tuple count for each sub-domain, until all noisy counts are below a certain threshold. This approach, however, requires that we (i) impose a limit h on the recursion depth in the splitting of Omega and (ii) set the noise in each count to be proportional to h. The choice of h is a serious dilemma: a small h makes the resulting histogram too coarse-grained, while a large h leads to excessive noise in the tuple counts used in deciding whether sub-domains should be split. Furthermore, h cannot be directly tuned based on D; otherwise, the choice of h itself reveals private information and violates differential privacy. To remedy the deficiency of existing solutions, we present PrivTree, a histogram construction algorithm that adopts hierarchical decomposition but completely eliminates the dependency on a pre-defined h. The core of PrivTree is a novel mechanism that (i) exploits a new analysis on the Laplace distribution and (ii) enables us to use only a constant amount of noise in deciding whether a sub-domain should be split, without worrying about the recursion depth of splitting. We demonstrate the application of PrivTree in modelling spatial data, and show that it can be extended to handle sequence data (where the decision in sub-domain splitting is not based on tuple counts but a more sophisticated measure). Our experiments on a variety of real datasets show that PrivTree considerably outperforms the states of the art in terms of data utility.
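A stripped-down sketch of the decaying-bias idea the abstract alludes to: each candidate sub-domain's count is biased downward with its depth before constant-scale Laplace noise is added, so the split test never depends on a pre-defined recursion limit. The constants, the 1-D domain, and the recursion structure are illustrative, not the paper's exact calibration.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_split_decision(count, depth, threshold=100.0, decay=20.0, noise_scale=10.0):
    """Bias the count down with depth, then add Laplace noise with a *constant*
    scale (independent of depth) and compare against a fixed threshold."""
    biased = count - depth * decay
    return biased + rng.laplace(scale=noise_scale) > threshold

def build_tree(points, lo, hi, depth=0):
    """Recursively split a 1-D domain [lo, hi) while the noisy test says to."""
    count = sum(lo <= p < hi for p in points)
    if hi - lo <= 1 or not noisy_split_decision(count, depth):
        return {"range": (lo, hi), "noisy_count": count + rng.laplace(scale=10.0)}
    mid = (lo + hi) / 2
    return {"range": (lo, hi),
            "children": [build_tree(points, lo, mid, depth + 1),
                         build_tree(points, mid, hi, depth + 1)]}

def count_leaves(node):
    return 1 if "children" not in node else sum(count_leaves(c) for c in node["children"])

data = list(rng.integers(0, 1024, size=5000))
tree = build_tree(data, 0, 1024)
print(count_leaves(tree))
```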

169 citations


Journal ArticleDOI
TL;DR: In this article, a data-driven control approach for building climate control is presented based on reinforcement learning, where the underlying sequential decision making problem is cast into a Markov decision problem, after which the control algorithm is detailed.

93 citations


Journal ArticleDOI
TL;DR: A tree-structured discrete graphical model is introduced that is used to select and label a set of non-overlapping regions in the image by a global optimization of a classification score, and the performance of the model can be improved by considering a proxy problem for learning the surface that allows better selection of the extremal regions.

87 citations


Journal ArticleDOI
TL;DR: An adaptive algorithm, called ADSUD, is proposed, which redefines the approximate global skyline probability, adaptively chooses local representative tuples based on the minimum probabilistic bounding rectangle, and uses a progressive pruning method together with a reuse mechanism to improve efficiency.
Abstract: Query processing over uncertain data has gained growing attention, because it is necessary to deal with uncertain data in many real-life applications. In this paper, we investigate skyline queries over uncertain data in distributed environments (DSUD queries), whose research is only in an early stage. The state-of-the-art algorithm, called the e-DSUD algorithm, is designed for processing this query. It has the desirable characteristics of progressiveness and minimum bandwidth consumption. However, it still needs to be improved in three respects. (1) Progressiveness: each time it returns at most one query result. (2) Efficiency: it incurs a significant amount of redundant I/O cost and numerous iterations, which cause a long total query time. (3) Universality: it is restricted to the case where local skyline tuples are incomparable. To address these concerns, we first present a detailed analysis of the e-DSUD algorithm and then develop an improved framework for the DSUD query, namely IDSUD. Based on the new framework, we propose an adaptive algorithm, called ADSUD, for the DSUD query. In the algorithm, we redefine the approximate global skyline probability and adaptively choose local representative tuples based on the minimum probabilistic bounding rectangle. Furthermore, we design a progressive pruning method and apply a reuse mechanism to improve efficiency. Extensive experiments verify that our algorithm has better overall performance than the e-DSUD algorithm.
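As background for the skyline machinery discussed above, the sketch below shows the standard dominance test and a naive skyline computation over certain data; the probabilistic and distributed aspects of ADSUD are not modeled here, and the "smaller is better" convention is an illustrative assumption.

```python
def dominates(t, s):
    """t dominates s if t is no worse in every dimension and strictly better in at
    least one (here 'smaller is better' is an illustrative convention)."""
    return all(a <= b for a, b in zip(t, s)) and any(a < b for a, b in zip(t, s))

def skyline(tuples):
    return [t for t in tuples if not any(dominates(s, t) for s in tuples if s != t)]

hotels = [(120, 2.0), (90, 5.0), (150, 0.5), (100, 1.5)]  # (price, distance)
print(skyline(hotels))   # (120, 2.0) is dominated by (100, 1.5)
```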

82 citations


Journal ArticleDOI
01 Nov 2016
TL;DR: A new on-line partitioning approach, called Clay, that supports both tree-based schemas and more complex "general" schemas with arbitrary foreign key relationships is presented and it is shown that it can generate partitioning schemes that enable the system to achieve up to 15× better throughput and 99% lower latency than existing approaches.
Abstract: Transaction processing database management systems (DBMSs) are critical for today's data-intensive applications because they enable an organization to quickly ingest and query new information. Many of these applications exceed the capabilities of a single server, and thus their database has to be deployed in a distributed DBMS. The key factor affecting such a system's performance is how the database is partitioned. If the database is partitioned incorrectly, the number of distributed transactions can be high. These transactions have to synchronize their operations over the network, which is considerably slower and leads to poor performance. Previous work on elastic database repartitioning has focused on a certain class of applications whose database schema can be represented in a hierarchical tree structure. But many applications cannot be partitioned in this manner, and thus are subject to distributed transactions that impede their performance and scalability. In this paper, we present a new on-line partitioning approach, called Clay, that supports both tree-based schemas and more complex "general" schemas with arbitrary foreign key relationships. Clay dynamically creates blocks of tuples to migrate among servers during repartitioning, placing no constraints on the schema but taking care to balance load and reduce the amount of data migrated. Clay achieves this goal by including in each block a set of hot tuples and other tuples co-accessed with these hot tuples. To evaluate our approach, we integrate Clay in a distributed, main-memory DBMS and show that it can generate partitioning schemes that enable the system to achieve up to 15× better throughput and 99% lower latency than existing approaches.
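A toy sketch of the block-building idea described above: start from a hot tuple and pull in tuples co-accessed with it, up to a size budget. The co-access graph and the budget are illustrative assumptions and do not reflect Clay's actual load- and cost-aware model.

```python
from collections import deque

def build_block(hot_tuple, co_access, budget=4):
    """Greedy BFS from a hot tuple over the co-access graph, bounded by a budget."""
    block, frontier = {hot_tuple}, deque([hot_tuple])
    while frontier and len(block) < budget:
        t = frontier.popleft()
        for neighbor in co_access.get(t, []):
            if neighbor not in block and len(block) < budget:
                block.add(neighbor)
                frontier.append(neighbor)
    return block

# tuple ids co-accessed by the same transactions (illustrative)
co_access = {"t1": ["t7", "t9"], "t7": ["t1", "t3"], "t9": ["t1"], "t3": ["t7"]}
print(build_block("t1", co_access))
```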

82 citations


Journal ArticleDOI
18 Mar 2016
TL;DR: The goal of the article is to bridge the difference between theoretical and practical approaches to answering queries over databases with nulls, and show that a slight modification of the three-valued semantics for relational calculus queries can provide the required certainty guarantees.
Abstract: The goal of the article is to bridge the difference between theoretical and practical approaches to answering queries over databases with nulls. Theoretical research has long ago identified the notion of correctness of query answering over incomplete data: one needs to find certain answers, which are true regardless of how incomplete information is interpreted. This serves as the notion of correctness of query answering, but carries a huge complexity tag. In practice, on the other hand, query answering must be very efficient, and to achieve this, SQL uses three-valued logic for evaluating queries on databases with nulls. Due to the complexity mismatch, the two approaches cannot coincide, but perhaps they are related in some way. For instance, does SQL always produce answers we can be certain about? This is not so: SQL's and certain answers semantics could be totally unrelated. We show, however, that a slight modification of the three-valued semantics for relational calculus queries can provide the required certainty guarantees. The key point of the new scheme is to fully utilize the three-valued semantics, and classify answers not into certain or noncertain, as was done before, but rather into certainly true, certainly false, or unknown. This yields relatively small changes to the evaluation procedure, which we consider at the level of both declarative (relational calculus) and procedural (relational algebra) queries. These new evaluation procedures give us certainty guarantees even for queries returning tuples with null values.
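The certainly-true / certainly-false / unknown classification rests on Kleene's three-valued logic; a minimal evaluator for it is sketched below. The numeric encoding of the three truth values and the NULL-aware comparison are my own choices; only the truth tables come from the standard logic.

```python
TRUE, UNKNOWN, FALSE = 1, 0, -1   # certainly true / unknown / certainly false

def and3(a, b): return min(a, b)      # Kleene conjunction
def or3(a, b):  return max(a, b)      # Kleene disjunction
def not3(a):    return -a             # Kleene negation

def eq3(x, y):
    """Comparison with nulls: unknown if either side is NULL (None here)."""
    if x is None or y is None:
        return UNKNOWN
    return TRUE if x == y else FALSE

# salary = 50000 AND bonus = 1000, where bonus is NULL
print(and3(eq3(50000, 50000), eq3(None, 1000)))   # 0  -> unknown
print(or3(eq3(50000, 50000), eq3(None, 1000)))    # 1  -> certainly true
```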

74 citations


Posted Content
TL;DR: In this article, a co-attention model is proposed to exploit a set of external off-the-shelf algorithms to achieve its goal, an approach that has something in common with the Neural Turing Machine.
Abstract: One of the most intriguing features of the Visual Question Answering (VQA) challenge is the unpredictability of the questions. Extracting the information required to answer them demands a variety of image operations from detection and counting, to segmentation and reconstruction. To train a method to perform even one of these operations accurately from {image,question,answer} tuples would be challenging, but to aim to achieve them all with a limited set of such training data seems ambitious at best. We propose here instead a more general and scalable approach which exploits the fact that very good methods to achieve these operations already exist, and thus do not need to be trained. Our method thus learns how to exploit a set of external off-the-shelf algorithms to achieve its goal, an approach that has something in common with the Neural Turing Machine. The core of our proposed method is a new co-attention model. In addition, the proposed approach generates human-readable reasons for its decision, and can still be trained end-to-end without ground truth reasons being given. We demonstrate the effectiveness on two publicly available datasets, Visual Genome and VQA, and show that it produces the state-of-the-art results in both cases.

Journal ArticleDOI
01 Jul 2016
TL;DR: This paper shows how to further speed up data deduplication by leveraging parallelism in a shared-nothing computing environment and proposes a distribution strategy, called Dis-Dedup, that minimizes the maximum workload across all worker nodes and provides strong theoretical guarantees.
Abstract: Data deduplication refers to the process of identifying tuples in a relation that refer to the same real world entity. The complexity of the problem is inherently quadratic with respect to the number of tuples, since a similarity value must be computed for every pair of tuples. To avoid comparing tuple pairs that are obviously non-duplicates, blocking techniques are used to divide the tuples into blocks and only tuples within the same block are compared. However, even with the use of blocking, data deduplication remains a costly problem for large datasets. In this paper, we show how to further speed up data deduplication by leveraging parallelism in a shared-nothing computing environment. Our main contribution is a distribution strategy, called Dis-Dedup, that minimizes the maximum workload across all worker nodes and provides strong theoretical guarantees. We demonstrate the effectiveness of our proposed strategy by performing extensive experiments on both synthetic datasets with varying block size distributions, as well as real world datasets.
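A small single-node sketch of the blocking step the abstract describes: tuples are grouped by a blocking key and only pairs within the same block are compared. The blocking key and the similarity test are illustrative; Dis-Dedup's actual contribution, the distribution of blocks across worker nodes, is not modeled here.

```python
from collections import defaultdict
from itertools import combinations

def dedup(tuples, blocking_key, similar):
    """Compare only pairs that share a blocking key, instead of all O(n^2) pairs."""
    blocks = defaultdict(list)
    for t in tuples:
        blocks[blocking_key(t)].append(t)
    return [(a, b) for block in blocks.values()
            for a, b in combinations(block, 2) if similar(a, b)]

people = [("John Smith", "NYC"), ("Jon Smith", "NYC"), ("Ann Lee", "LA")]
pairs = dedup(people,
              blocking_key=lambda t: (t[0][0], t[1]),                 # first initial + city
              similar=lambda a, b: a[1] == b[1] and a[0][:2] == b[0][:2])
print(pairs)
```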

Journal ArticleDOI
TL;DR: The frontier between tractability and intractability is precisely characterized for the following problems of interest in these settings: consistency checking, learnability, and deciding the informativeness of a tuple.
Abstract: We investigate the problem of learning join queries from user examples. The user is presented with a set of candidate tuples and is asked to label them as positive or negative examples, depending on whether or not she would like the tuples as part of the join result. The goal is to quickly infer an arbitrary n-ary join predicate across an arbitrary number m of relations while keeping the number of user interactions as minimal as possible. We assume no prior knowledge of the integrity constraints across the involved relations. Inferring the join predicate across multiple relations when the referential constraints are unknown may occur in several applications, such as data integration, reverse engineering of database queries, and schema inference. In such scenarios, the number of tuples involved in the join is typically large. We introduce a set of strategies that let us inspect the search space and aggressively prune what we call uninformative tuples, and we directly present to the user the informative ones—that is, those that allow the user to quickly find the goal query she has in mind. In this article, we focus on the inference of joins with equality predicates and also allow disjunctive join predicates and projection in the queries. We precisely characterize the frontier between tractability and intractability for the following problems of interest in these settings: consistency checking, learnability, and deciding the informativeness of a tuple. Next, we propose several strategies for presenting tuples to the user in a given order that allows minimization of the number of interactions. We show the efficiency of our approach through an experimental study on both benchmark and synthetic datasets.

Journal ArticleDOI
TL;DR: A unified framework to efficiently and privately process queries that require both search and compute operations to be performed on fully encrypted databases is constructed.
Abstract: Private query processing on encrypted databases allows users to obtain data from encrypted databases in such a way that the users’ sensitive data will be protected from exposure. Given an encrypted database, users typically submit queries similar to the following examples: 1) How many employees in an organization make over U.S. $100000? 2) What is the average age of factory workers suffering from leukemia? Answering the questions requires one to search and then compute over the relevant encrypted data sets in sequence. In this paper, we are interested in efficiently processing queries that require both operations to be performed on fully encrypted databases. One immediate solution is to use several special-purpose encryption schemes simultaneously; however, this approach is associated with a high computational cost for maintaining multiple encryption contexts. Another solution is to use a privacy homomorphic scheme. However, no secure solutions have been developed that satisfy the efficiency requirements. In this paper, we construct a unified framework to efficiently and privately process queries with search and compute operations. For this purpose, the first part of our work involves devising several underlying circuits as primitives for queries on encrypted data. Second, we apply two optimization techniques to improve the efficiency of these circuit primitives. One technique involves exploiting single-instruction-multiple-data (SIMD) techniques to accelerate the basic circuit operations. Unlike general SIMD approaches, our SIMD implementation can be applied even to a single basic operation. The other technique is to use a large integer ring (e.g., $\mathbb {Z}_{2^{t}}$ ) as a message space rather than a binary field. Even for an integer of $k$ bits with $k>t$ , addition can be performed using degree 1 circuits with lazy carry operations. Finally, we present various experiments performed by varying the considered parameters, such as the query type and the number of tuples.

Journal ArticleDOI
TL;DR: A decision-making approach is provided to solve the multi-attribute group decision making (MAGDM) problem with 2DLL assessments, based on two new 2-dimension linguistic aggregation operators, the 2DLWAA and 2DLOWAA operators.
Abstract: The 2-dimension linguistic information includes two common linguistic labels. One dimension is used for describing the evaluation result of alternatives provided by the decision maker, and the other is used for describing the self-assessment of the decision maker on the reliability of the given evaluation result. In order to deal with comparability and incomparability of 2-dimension linguistic labels (2DLLs), a 2-dimension linguistic lattice implication algebra (2DL-LIA) is constructed as a linguistic evaluation set with lattice structure. In this paper, a 2DLL is firstly represented by two 2-tuples in a 2DL-LIA for more precise computing and aggregating 2-dimension linguistic information. This model allows a continuous representation of 2DLL on its domain, therefore, it can represent any continuous 2-dimension linguistic information obtained in the aggregation process. Next, two new 2-dimension linguistic aggregation operators, including 2-dimension linguistic weighted arithmetic aggregation (2DLWAA) operator and 2-dimension linguistic ordered weighted arithmetic aggregation (2DLOWAA) operator, are developed, and then some desirable properties of the operators are studied. Subsequently, based on 2DLWAA and 2DLOWAA operators, a decision making approach is provided to solve multi-attribute group decision making (MAGDM) problem with 2DLL assessment. Finally, an illustrative example is provided to show the concrete steps of the developed decision making approach and to demonstrate the practicality and the flexibility of this proposal by comparing with existing decision making approaches.
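Behind each dimension of a 2DLL sits the classic 2-tuple linguistic representation; a minimal sketch of 2-tuple conversion and weighted arithmetic aggregation is below. The label set and weights are illustrative, and applying this independently per dimension is my simplification of the paper's 2DL-LIA machinery.

```python
labels = ["none", "very_low", "low", "medium", "high", "very_high", "perfect"]

def to_two_tuple(beta):
    """Delta: a numeric value in [0, len(labels)-1] -> (label index, symbolic translation)."""
    i = int(round(beta))
    return i, beta - i          # alpha roughly in [-0.5, 0.5)

def weighted_aggregate(two_tuples, weights):
    """2-tuple weighted arithmetic aggregation: average in the numeric domain, map back."""
    beta = sum((i + a) * w for (i, a), w in zip(two_tuples, weights)) / sum(weights)
    return to_two_tuple(beta)

assessments = [to_two_tuple(4), to_two_tuple(5), to_two_tuple(3)]  # high, very_high, medium
i, alpha = weighted_aggregate(assessments, weights=[0.5, 0.3, 0.2])
print(labels[i], round(alpha, 2))   # high 0.1
```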

Journal ArticleDOI
TL;DR: The article first provides evidence that the axiomatization it presents indeed characterizes the whole class of (synchronous) parallel algorithms, and then formally proves that parallel algorithms are captured by Abstract State Machines (ASMs).

Book ChapterDOI
05 Sep 2016
TL;DR: Compact-Table (CT), a bitwise algorithm to enforce Generalized Arc Consistency (GAC) on table constraints, is described and shows that CT outperforms state-of-the-art algorithms STR2, STR3, GAC4R, MDD4R and AC5-TC on standard benchmarks.
Abstract: In this paper, we describe Compact-Table (CT), a bitwise algorithm to enforce Generalized Arc Consistency (GAC) on table constraints. Although this algorithm is the default propagator for table constraints in or-tools and OscaR, two publicly available CP solvers, it has never been described so far. Importantly, CT has been recently improved further with the introduction of residues, resetting operations and a data-structure called reversible sparse bit-set, used to maintain tables of supports (following the idea of tabular reduction): tuples are invalidated incrementally on value removals by means of bit-set operations. The experimentation that we have conducted with OscaR shows that CT outperforms state-of-the-art algorithms STR2, STR3, GAC4R, MDD4R and AC5-TC on standard benchmarks.
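A tiny sketch of the bit-set bookkeeping described above, using Python integers as bit masks: the current table keeps one bit per tuple, each (variable, value) pair has a mask of its supporting tuples, removing a value clears its tuples with one bitwise operation, and a value keeps support only while its mask still intersects the current table. The reversible/trailing machinery and residues of a real CP solver are omitted.

```python
table = [(0, 1), (1, 0), (1, 1), (2, 1)]    # allowed (x, y) tuples

# masks[var][value] = bitmask of tuple indices supporting var = value
masks = {}
for idx, t in enumerate(table):
    for var, val in enumerate(t):
        masks.setdefault(var, {}).setdefault(val, 0)
        masks[var][val] |= 1 << idx

current = (1 << len(table)) - 1             # all tuples valid initially

def remove_value(var, val):
    """Invalidate every tuple that requires var = val (a single bitwise operation)."""
    global current
    current &= ~masks[var].get(val, 0)

def supported_values(var):
    """Values of var that still have at least one valid supporting tuple."""
    return [v for v, m in masks[var].items() if m & current]

remove_value(0, 1)                          # suppose value 1 was pruned from x's domain
print(supported_values(0))                  # [0, 2]
print(supported_values(1))                  # y values still supported: [1]
```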

Proceedings ArticleDOI
25 Aug 2016
TL;DR: A technique to detect forbidden tuples lazily during a greedy test case generation is proposed, which significantly reduces the number of required SAT solving calls and significantly improves the efficiency of constraint handling in the greedy combinatorial testing algorithm.
Abstract: Combinatorial testing aims at covering the interactions of parameters in a system under test, while some combinations may be forbidden by given constraints (forbidden tuples). In this paper, we illustrate that such forbidden tuples correspond to unsatisfiable cores, a widely understood notion in the SAT solving community. Based on this observation, we propose a technique to detect forbidden tuples lazily during greedy test case generation, which significantly reduces the number of required SAT solving calls. We further reduce the amount of time spent in SAT solving by essentially ignoring constraints while constructing each test case, but then “amending” it to obtain a test case that satisfies the constraints, again using unsatisfiable cores. Finally, to compensate for the disturbance caused by ignoring constraints, we implement an efficient approximate SAT checking function in the SAT solver Lingeling. Through experiments we verify that our approach significantly improves the efficiency of constraint handling in our greedy combinatorial testing algorithm.

Posted Content
TL;DR: This work addresses the limitations of prior work on OD discovery, and develops an efficient set-containment, lattice-driven OD discovery algorithm that uses the inference rules to prune the search space.
Abstract: Integrity constraints (ICs) provide a valuable tool for expressing and enforcing application semantics. However, formulating constraints manually requires domain expertise, is prone to human errors, and may be excessively time consuming, especially on large datasets. Hence, proposals for automatic discovery have been made for some classes of ICs, such as functional dependencies (FDs), and recently, order dependencies (ODs). ODs properly subsume FDs, as they can additionally express business rules involving order; e.g., an employee never has a higher salary while paying lower taxes compared with another employee. We address the limitations of prior work on OD discovery which has factorial complexity in the number of attributes, is incomplete (i.e., it does not discover valid ODs that cannot be inferred from the ones found) and is not concise (i.e., it can result in "redundant" discovery and overly large discovery sets). We improve significantly on complexity, offer completeness, and define a compact canonical form. This is based on a novel polynomial mapping to a canonical form for ODs, and a sound and complete set of axioms (inference rules) for canonical ODs. This allows us to develop an efficient set-containment, lattice-driven OD discovery algorithm that uses the inference rules to prune the search space. Our algorithm has exponential worst-case time complexity in the number of attributes and linear complexity in the number of tuples. We prove that it produces a complete, minimal set of ODs (i.e., minimal with regards to the canonical representation). Finally, using real and synthetic datasets, we experimentally show orders-of-magnitude performance improvements over the current state-of-the-art algorithm and demonstrate effectiveness of our techniques.

Journal ArticleDOI
01 Aug 2016
TL;DR: This work revisits the fundamental notion of a key in relational databases with NULL and investigates the notions of possible and certain keys, which are keys that hold in some or all possible worlds that originate from an SQL table, respectively.
Abstract: Driven by the dominance of the relational model and the requirements of modern applications, we revisit the fundamental notion of a key in relational databases with NULL. In SQL, primary key columns are NOT NULL, and UNIQUE constraints guarantee uniqueness only for tuples without NULL. We investigate the notions of possible and certain keys, which are keys that hold in some or all possible worlds that originate from an SQL table, respectively. Possible keys coincide with UNIQUE, thus providing a semantics for their syntactic definition in the SQL standard. Certain keys extend primary keys to include NULL columns and can uniquely identify entities whenever feasible, while primary keys may not. In addition to basic characterization, axiomatization, discovery, and extremal combinatorics problems, we investigate the existence and construction of Armstrong tables, and describe an indexing scheme for enforcing certain keys. Our experiments show that certain keys with NULLs occur in real-world data, and related computational problems can be solved efficiently. Certain keys are therefore semantically well founded and able to meet Codd's entity integrity rule while handling high volumes of incomplete data from different formats.
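Following the possible-worlds reading in the abstract, the check below treats a NULL as matchable with anything: a key set fails to be a possible key only if two tuples agree on it with equal non-NULL values, and fails to be a certain key if two tuples could be made to agree on it in some completion of the NULLs. The encoding (dicts with None for NULL) and the helper names are my own.

```python
from itertools import combinations

def strongly_agree(t1, t2, key):
    """Agree in every possible world: equal, non-NULL values on every key column."""
    return all(t1[c] is not None and t1[c] == t2[c] for c in key)

def weakly_agree(t1, t2, key):
    """Could agree in some possible world: per column, values equal or a NULL present."""
    return all(t1[c] is None or t2[c] is None or t1[c] == t2[c] for c in key)

def is_possible_key(table, key):
    return not any(strongly_agree(a, b, key) for a, b in combinations(table, 2))

def is_certain_key(table, key):
    return not any(weakly_agree(a, b, key) for a, b in combinations(table, 2))

table = [{"email": "a@x.org", "phone": None},
         {"email": "b@x.org", "phone": "555-01"},
         {"email": None,      "phone": "555-02"}]
print(is_possible_key(table, ["email"]))   # True: no two non-NULL emails clash
print(is_certain_key(table, ["email"]))    # False: the NULL email could be filled to clash
```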

08 Feb 2016
TL;DR: A tuple expansion procedure which reconstructs rich information from semantically poor SQL data types such as strings, integers, and floating point numbers is described, and this procedure is used as the foundation of a new user-guided outlier detection framework, dBoost, which relies on inference and statistical modeling of heterogeneous data to flag suspicious fields in database tuples.
Abstract: Rapidly developing areas of information technology are generating massive amounts of data. Human errors, sensor failures, and other unforeseen circumstances unfortunately tend to undermine the quality and consistency of these datasets by introducing outliers – data points that exhibit surprising behavior when compared to the rest of the data. Characterizing, locating, and in some cases eliminating these outliers offers interesting insight about the data under scrutiny and reinforces the confidence that one may have in conclusions drawn from otherwise noisy datasets. In this paper, we describe a tuple expansion procedure which reconstructs rich information from semantically poor SQL data types such as strings, integers, and floating point numbers. We then use this procedure as the foundation of a new user-guided outlier detection framework, dBoost, which relies on inference and statistical modeling of heterogeneous data to flag suspicious fields in database tuples. We show that this novel approach achieves good classification performance, both in traditional numerical datasets and in highly non-numerical contexts such as mostly textual datasets. Our implementation is publicly available, under version 3 of the GNU General Public License.
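A toy version of the tuple-expansion idea: semantically poor field values are blown up into richer descriptive features before any statistical modeling. The particular expansion rules below are illustrative assumptions, not dBoost's actual ones.

```python
from datetime import datetime, timezone

def expand(value):
    """Expand a raw field into descriptive sub-features (illustrative rules only)."""
    if isinstance(value, int) and 10**8 < value < 10**10:      # looks like a unix timestamp
        d = datetime.fromtimestamp(value, tz=timezone.utc)
        return {"year": d.year, "month": d.month, "weekday": d.weekday(), "hour": d.hour}
    if isinstance(value, str):
        return {"length": len(value), "digits": sum(c.isdigit() for c in value),
                "uppercase": sum(c.isupper() for c in value)}
    if isinstance(value, float):
        return {"sign": value >= 0, "magnitude": abs(value)}
    return {"raw": value}

row = (1467331200, "ERROR-42", -3.7)
print([expand(v) for v in row])
```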

01 Nov 2016
TL;DR: This thesis presents four assistants to help data explorers interrogate a database to discover its content: Claude, Blaeu, Ziggy and Raimond, which are an attempt to generalize semi-automatic exploration to text data.
Abstract: Data explorers interrogate a database to discover its content. Their aim is to get an overview of the data and discover interesting new facts. They have little to no knowledge of the data, and their requirements are often vague and abstract. How can such users write database queries? This thesis presents four assistants to help them through this task: Claude, Blaeu, Ziggy and Raimond. Each assistant focuses on a specific exploration task. Claude helps users analyze data warehouses, by highlighting the combinations of variables which influence a predefined measure of interest. Blaeu helps users build and refine queries, by allowing them to select and project clusters of tuples. Ziggy is a tuple characterization engine: its aim is to show what makes a selection of tuples unique, by highlighting the differences between those and the rest of the database. Finally, Raimond is an attempt to generalize semi-automatic exploration to text data, inspired by an industrial use case. For each system, we present a user model, that is, a formalized set of assumptions about the users’ goals. We then present practical methods to make recommendations. We either adapt existing algorithms from the machine learning literature or present our own. Next, we validate our approaches with experiments. We present use cases in which our systems led to discoveries, and we benchmark their speed, quality and robustness.

Proceedings ArticleDOI
01 May 2016
TL;DR: It is demonstrated that the underlying optimization problems are NP-HARD, and an algorithm for finding the approximately optimal list of rules to display when the user uses a smart drill-down is described.
Abstract: We present smart drill-down, an operator for interactively exploring a relational table to discover and summarize “interesting” groups of tuples. Each group of tuples is described by a rule. For instance, the rule (a, b, ★, 1000) tells us that there are a thousand tuples with value a in the first column and b in the second column (and any value in the third column). Smart drill-down presents an analyst with a list of rules that together describe interesting aspects of the table. The analyst can tailor the definition of interesting, and can interactively apply smart drill-down on an existing rule to explore that part of the table. We demonstrate that the underlying optimization problems are NP-HARD, and describe an algorithm for finding the approximately optimal list of rules to display when the user uses a smart drill-down, and a dynamic sampling scheme for efficiently interacting with large tables. Finally, we perform experiments on real datasets on our experimental prototype to demonstrate the usefulness of smart drill-down and study the performance of our algorithms.
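To make the rule notation concrete, here is a tiny function that counts how many tuples a rule with wildcards covers; the ★ wildcard and the (a, b, ★, count) shape come from the abstract's notation, everything else is illustrative.

```python
STAR = "*"   # stands in for the abstract's wildcard symbol

def covers(rule, tuple_):
    return all(r == STAR or r == v for r, v in zip(rule, tuple_))

def rule_count(rule, table):
    """The trailing number of a displayed rule is just its coverage in the table."""
    return sum(covers(rule, t) for t in table)

table = [("a", "b", "x"), ("a", "b", "y"), ("a", "c", "x")]
rule = ("a", "b", STAR)
print(rule + (rule_count(rule, table),))   # ('a', 'b', '*', 2)
```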

Journal ArticleDOI
TL;DR: In this paper, sieving algorithms for the shortest vector problem (SVP) on Euclidean lattices are generalized to use less memory by reducing tuples of vectors rather than pairs, and the resulting space and time requirements are analyzed as functions of the lattice dimension.
Abstract: Lattice sieving is asymptotically the fastest approach for solving the shortest vector problem (SVP) on Euclidean lattices. All known sieving algorithms for solving the SVP require space which (heuristically) grows as $2^{0.2075n+o(n)}$ , where $n$ is the lattice dimension. In high dimensions, the memory requirement becomes a limiting factor for running these algorithms, making them uncompetitive with enumeration algorithms, despite their superior asymptotic time complexity. We generalize sieving algorithms to solve SVP with less memory. We consider reductions of tuples of vectors rather than pairs of vectors as existing sieve algorithms do. For triples, we estimate that the space requirement scales as $2^{0.1887n+o(n)}$ . The naive algorithm for this triple sieve runs in time $2^{0.5661n+o(n)}$ . With appropriate filtering of pairs, we reduce the time complexity to $2^{0.4812n+o(n)}$ while keeping the same space complexity. We further analyze the effects of using larger tuples for reduction, and conjecture how this provides a continuous trade-off between the memory-intensive sieving and the asymptotically slower enumeration.

Proceedings ArticleDOI
11 Apr 2016
TL;DR: This work introduces the list recommendation problem and proposes a novel two-layered framework that builds upon existing CF algorithms to optimize a list's click probability and evaluates the approach using a novel adaptation of Inverse Propensity Scoring which facilitates off-policy estimation of the method's CTR and showcases its effectiveness in real-world settings.
Abstract: Most Collaborative Filtering (CF) algorithms are optimized using a dataset of isolated user-item tuples. However, in commercial applications recommended items are usually served as an ordered list of several items and not as isolated items. In this setting, inter-item interactions have an effect on the list's Click-Through Rate (CTR) that is unaccounted for using traditional CF approaches. Most CF approaches also ignore additional important factors like click propensity variation, item fatigue, etc. In this work, we introduce the list recommendation problem. We present useful insights gleaned from user behavior and consumption patterns from a large scale real world recommender system. We then propose a novel two-layered framework that builds upon existing CF algorithms to optimize a list's click probability. Our approach accounts for inter-item interactions as well as additional information such as item fatigue, trendiness patterns, contextual information etc. Finally, we evaluate our approach using a novel adaptation of Inverse Propensity Scoring (IPS) which facilitates off-policy estimation of our method's CTR and showcases its effectiveness in real-world settings.
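For readers unfamiliar with Inverse Propensity Scoring, the off-policy CTR estimate is, in its basic form, a propensity-weighted average of logged rewards on the lists the evaluated policy would also have shown. The snippet below is that textbook form, not the paper's specific adaptation, and the log format is an assumption.

```python
def ips_ctr(logged, new_policy):
    """logged: list of (context, shown_list, click, propensity) from the running system.
    new_policy: context -> the list the evaluated method would show."""
    weighted = [click / propensity
                for context, shown, click, propensity in logged
                if new_policy(context) == shown]
    return sum(weighted) / len(logged) if logged else 0.0

logs = [("u1", ["a", "b"], 1, 0.5), ("u2", ["c", "d"], 0, 0.25), ("u3", ["a", "b"], 0, 0.5)]
print(ips_ctr(logs, new_policy=lambda ctx: ["a", "b"]))   # ~0.67
```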

Posted Content
TL;DR: The proposed approach incorporates caching directly into LFTJ and can dynamically adjust the size of the cache, tying variable ordering to tree decomposition, which balances memory usage and repeated computation.
Abstract: Traditional algorithms for multiway join computation are based on rewriting the order of joins and combining results of intermediate subqueries. Recently, several approaches have been proposed for algorithms that are "worst-case optimal" wherein all relations are scanned simultaneously. An example is Veldhuizen's Leapfrog Trie Join (LFTJ). An important advantage of LFTJ is its small memory footprint, due to the fact that intermediate results are full tuples that can be dumped immediately. However, since the algorithm does not store intermediate results, recurring joins must be reconstructed from the source relations, resulting in excessive memory traffic. In this paper, we address this problem by incorporating caches into LFTJ. We do so by adopting recent developments on join optimization, tying variable ordering to tree decomposition. While the traditional usage of tree decomposition computes the result for each bag in advance, our proposed approach incorporates caching directly into LFTJ and can dynamically adjust the size of the cache. Consequently, our solution balances memory usage and repeated computation, as confirmed by our experiments over SNAP datasets.

Posted Content
TL;DR: This work proposes the Private Data Network, a federated database for querying over the collective data of mutually distrustful parties, and introduces a framework for executing PDN queries named SMCQL, which translates SQL statements into SMC primitives to compute query results over the union of its source databases.
Abstract: People and machines are collecting data at an unprecedented rate. Despite this newfound abundance of data, progress has been slow in sharing it for open science, business, and other data-intensive endeavors. Many such efforts are stymied by privacy concerns and regulatory compliance issues. For example, many hospitals are interested in pooling their medical records for research, but none may disclose arbitrary patient records to researchers or other healthcare providers. In this context we propose the Private Data Network (PDN), a federated database for querying over the collective data of mutually distrustful parties. In a PDN, each member database does not reveal its tuples to its peers nor to the query writer. Instead, the user submits a query to an honest broker that plans and coordinates its execution over multiple private databases using secure multiparty computation (SMC). Here, each database's query execution is oblivious, and its program counters and memory traces are agnostic to the inputs of others. We introduce a framework for executing PDN queries named SMCQL. This system translates SQL statements into SMC primitives to compute query results over the union of its source databases without revealing sensitive information about individual tuples to peer data providers or the honest broker. Only the honest broker and the querier receive the results of a PDN query. For fast, secure query evaluation, we explore a heuristics-driven optimizer that minimizes the PDN's use of secure computation and partitions its query evaluation into scalable slices.

Proceedings ArticleDOI
13 Jun 2016
TL;DR: This paper provides a theoretical analysis proving that LAS is an (ε, δ)-approximation of the optimal online load shedder and shows its performance through a practical evaluation based both on simulations and on a running prototype.
Abstract: Load shedding is a technique employed by stream processing systems to handle unpredictable spikes in the input load whenever available computing resources are not adequately provisioned. A load shedder drops tuples to keep the input load below a critical threshold and thus avoid tuple queuing and system thrashing. In this paper we propose Load-Aware Shedding (LAS), a novel load shedding solution that drops tuples with the aim of maintaining queuing times below a tunable threshold. Tuple execution durations are estimated at runtime using efficient sketch data structures. We provide a theoretical analysis proving that LAS is an (ε, δ)-approximation of the optimal online load shedder and show its performance through a practical evaluation based both on simulations and on a running prototype.
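A much-simplified sketch of the shedding loop: per-tuple execution durations are estimated online (a plain running average per tuple key stands in for the paper's sketch data structures), and a tuple is dropped whenever admitting it would push the estimated queueing time over the threshold. All class and parameter names are illustrative.

```python
from collections import defaultdict

class LoadShedder:
    def __init__(self, max_queue_time):
        self.max_queue_time = max_queue_time
        self.totals = defaultdict(float)   # running average per tuple key
        self.counts = defaultdict(int)     # (stand-in for the paper's sketches)
        self.pending_work = 0.0            # estimated seconds of queued work

    def estimate(self, key):
        c = self.counts[key]
        return self.totals[key] / c if c else 0.001   # optimistic default

    def offer(self, key):
        """Return True if the tuple is admitted, False if it is shed."""
        d = self.estimate(key)
        if self.pending_work + d > self.max_queue_time:
            return False
        self.pending_work += d
        return True

    def completed(self, key, measured_duration):
        """Feed back the measured execution duration; shrink the queue estimate."""
        self.totals[key] += measured_duration
        self.counts[key] += 1
        self.pending_work = max(0.0, self.pending_work - measured_duration)

shedder = LoadShedder(max_queue_time=0.5)
print(shedder.offer("heavy_op"))            # True (nothing queued yet)
shedder.completed("heavy_op", 0.4)
print(shedder.offer("heavy_op"))            # True; the next one is shed
print(shedder.offer("heavy_op"))            # False
```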

Proceedings ArticleDOI
09 May 2016
TL;DR: Important correctness properties for Custard, including stability (once an event has occurred, it has occurred forever) and safety (a query returns a finite set of tuples) are proved, thus demonstrating Custard's practical benefits.
Abstract: Norms provide a way to model the social architecture of a sociotechnical system (STS) and are thus crucial for understanding how such a system supports secure collaboration between principals, that is, autonomous parties such as humans and organizations. Accordingly, an important challenge is to compute the state of a norm instance at runtime in a sociotechnical system. Custard addresses this challenge by providing a relational syntax for schemas of important norm types along with their canonical lifecycles and providing a mapping from each schema to queries that compute instances of the schema in different lifecycle stages. In essence, Custard supports a norm-based abstraction layer over underlying information stores such as databases and event logs. Specifically, it supports deadlines; complex events, including those based on aggregation; and norms that reference other norms. We prove important correctness properties for Custard, including stability (once an event has occurred, it has occurred forever) and safety (a query returns a finite set of tuples). Our compiler generates SQL queries from Custard specifications. Writing out such SQL queries by hand is tedious and error-prone even for simple norms, thus demonstrating Custard's practical benefits.

Proceedings Article
12 Feb 2016
TL;DR: This work investigates the problem of learning description logic (DL) ontologies in Angluin et al.'s framework of exact learning via queries posed to an oracle and provides an almost complete classification for DLs that are fragments of EL with role inclusions and of DL-Lite.
Abstract: We investigate the problem of learning description logic (DL) ontologies in Angluin et al.'s framework of exact learning via queries posed to an oracle. We consider membership queries of the form "is a tuple a of individuals a certain answer to a data retrieval query q in a given ABox and the unknown target ontology?" and completeness queries of the form "does a hypothesis ontology entail the unknown target ontology?". Given a DL L and a data retrieval query language Q, we study polynomial learnability of ontologies in L using data retrieval queries in Q and provide an almost complete classification for DLs that are fragments of EL with role inclusions and of DL-Lite and for data retrieval queries that range from atomic queries and EL / ELI-instance queries to conjunctive queries. Some results are proved by non-trivial reductions to learning from subsumption examples.