
Showing papers on "Tuple published in 2015"


Proceedings ArticleDOI
27 May 2015
TL;DR: Experimental results on both synthetic and real datasets show that BigDansing outperforms existing baseline systems up to more than two orders of magnitude without sacrificing the quality provided by the repair algorithms.
Abstract: Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system to tackle efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general purpose data processing platforms, ranging from DBMSs to MapReduce-like frameworks. A user-friendly programming interface allows users to express data quality rules both declaratively and procedurally, with no requirement of being aware of the underlying distributed platform. BigDansing translates these rules into a series of transformations that enable distributed computations and several optimizations, such as shared scans and specialized join operators. Experimental results on both synthetic and real datasets show that BigDansing outperforms existing baseline systems up to more than two orders of magnitude without sacrificing the quality provided by the repair algorithms.
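
The costly pairwise computation the abstract alludes to can be illustrated with a toy violation detector for a functional-dependency-style rule. This is only a hedged sketch of the naive pair enumeration that a system like BigDansing distributes and optimizes, not BigDansing's actual operators or API; the rule and column names are hypothetical.

```python
# Minimal sketch (not BigDansing's API): naive pairwise violation detection
# for a hypothetical rule "same zipcode implies same city".
from itertools import combinations

def detect_violations(tuples, lhs, rhs):
    """Return pairs of tuple ids that violate the FD lhs -> rhs."""
    violations = []
    for (i, t1), (j, t2) in combinations(enumerate(tuples), 2):
        if t1[lhs] == t2[lhs] and t1[rhs] != t2[rhs]:
            violations.append((i, j))
    return violations

rows = [
    {"zipcode": "10001", "city": "New York"},
    {"zipcode": "10001", "city": "NYC"},       # conflicts with the first row
    {"zipcode": "60601", "city": "Chicago"},
]
print(detect_violations(rows, "zipcode", "city"))  # [(0, 1)]
```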

146 citations


Proceedings ArticleDOI
03 Jun 2015
TL;DR: This work introduces FlashRelate, a synthesis engine that lets ordinary users extract structured relational data from spreadsheets without programming, and demonstrates its usefulness addressing the widespread problem of data trapped in corporate and government formats.
Abstract: With hundreds of millions of users, spreadsheets are one of the most important end-user applications. Spreadsheets are easy to use and allow users great flexibility in storing data. This flexibility comes at a price: users often treat spreadsheets as a poor man's database, leading to creative solutions for storing high-dimensional data. The trouble arises when users need to answer queries with their data. Data manipulation tools make strong assumptions about data layouts and cannot read these ad-hoc databases. Converting data into the appropriate layout requires programming skills or a major investment in manual reformatting. The effect is that a vast amount of real-world data is "locked-in" to a proliferation of one-off formats. We introduce FlashRelate, a synthesis engine that lets ordinary users extract structured relational data from spreadsheets without programming. Instead, users extract data by supplying examples of output relational tuples. FlashRelate uses these examples to synthesize a program in Flare. Flare is a novel extraction language that extends regular expressions with geometric constructs. An interactive user interface on top of FlashRelate lets end users extract data by point-and-click. We demonstrate that correct Flare programs can be synthesized in seconds from a small set of examples for 43 real-world scenarios. Finally, our case study demonstrates FlashRelate's usefulness addressing the widespread problem of data trapped in corporate and government formats.

136 citations


Journal ArticleDOI
TL;DR: In this article, the authors present an ABAC policy mining algorithm, which iterates over tuples in the given user-permission relation, uses selected tuples as seeds for constructing candidate rules, and attempts to generalize each candidate rule to cover additional tuples by replacing conjuncts in attribute expressions with constraints.
Abstract: Attribute-based access control (ABAC) provides a high level of flexibility that promotes security and information sharing. ABAC policy mining algorithms have potential to significantly reduce the cost of migration to ABAC, by partially automating the development of an ABAC policy from an access control list (ACL) policy or role-based access control (RBAC) policy with accompanying attribute data. This paper presents an ABAC policy mining algorithm. To the best of our knowledge, it is the first ABAC policy mining algorithm. Our algorithm iterates over tuples in the given user-permission relation, uses selected tuples as seeds for constructing candidate rules, and attempts to generalize each candidate rule to cover additional tuples in the user-permission relation by replacing conjuncts in attribute expressions with constraints. Our algorithm attempts to improve the policy by merging and simplifying candidate rules, and then it selects the highest-quality candidate rules for inclusion in the generated policy.
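
To make the seed-and-generalize loop concrete, here is a hedged Python sketch of the idea described in the abstract. It is not the paper's algorithm: rule-quality metrics, constraints relating user and resource attributes, and the merge/simplify phase are omitted, and all names (covers, mine_rules, the attribute dictionaries) are hypothetical.

```python
# Hedged sketch of the seed-and-generalize idea from the abstract, not the
# paper's exact algorithm. A "rule" maps attribute constraints to a permission.

def covers(rule, user_attrs, perm):
    """A rule covers a (user, permission) tuple if every constraint matches."""
    return rule["perm"] == perm and all(
        user_attrs.get(a) == v for a, v in rule["constraints"].items()
    )

def mine_rules(user_perm, users):
    """user_perm: set of (user, permission); users: user -> attribute dict."""
    rules, covered = [], set()
    for user, perm in sorted(user_perm):
        if (user, perm) in covered:
            continue
        # Seed rule: all attributes of the seed user, fully specific.
        rule = {"constraints": dict(users[user]), "perm": perm}
        # Greedy generalization: drop a constraint if the relaxed rule still
        # only grants permissions that appear in user_perm (no over-grant).
        for attr in list(rule["constraints"]):
            relaxed = {k: v for k, v in rule["constraints"].items() if k != attr}
            trial = {"constraints": relaxed, "perm": perm}
            grants = {(u, perm) for u, a in users.items() if covers(trial, a, perm)}
            if grants <= user_perm:
                rule = trial
        rules.append(rule)
        covered |= {(u, perm) for u, a in users.items() if covers(rule, a, perm)}
    return rules

users = {"alice": {"dept": "hr", "role": "manager"},
         "bob":   {"dept": "hr", "role": "clerk"}}
user_perm = {("alice", "approve_leave"), ("bob", "approve_leave")}
print(mine_rules(user_perm, users))  # generalizes to cover both users without over-granting
```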

107 citations


Journal ArticleDOI
TL;DR: The system, Graph Query By Example, automatically discovers a weighted hidden maximum query graph based on input query tuples, to capture a user’s query intent, and efficiently finds and ranks the top approximate matching answer graphs and answer tuples.
Abstract: We witness an unprecedented proliferation of knowledge graphs that record millions of entities and their relationships. While knowledge graphs are structure-flexible and content-rich, they are difficult to use. The challenge lies in the gap between their overwhelming complexity and the limited database knowledge of non-professional users. If writing structured queries over “simple” tables is difficult, complex graphs are only harder to query. As an initial step toward improving the usability of knowledge graphs, we propose to query such data by example entity tuples, without requiring users to form complex graph queries. Our system, Graph Query By Example (GQBE), automatically discovers a weighted hidden maximum query graph based on input query tuples, to capture a user’s query intent. It then efficiently finds and ranks the top approximate matching answer graphs and answer tuples. We conducted experiments and user studies on the large Freebase and DBpedia datasets and observed appealing accuracy and efficiency. Our system provides a complementary approach to the existing keyword-based methods, facilitating user-friendly graph querying. To the best of our knowledge, there was no such proposal in the past in the context of graphs.

105 citations


Proceedings ArticleDOI
01 Jul 2015
TL;DR: This paper introduces EVALution 1.0, a dataset designed for the training and the evaluation of Distributional Semantic Models (DSMs), which consists of almost 7.5K tuples, instantiating several semantic relations between word pairs.
Abstract: In this paper, we introduce EVALution 1.0, a dataset designed for the training and the evaluation of Distributional Semantic Models (DSMs). This version consists of almost 7.5K tuples, instantiating several semantic relations between word pairs (including hypernymy, synonymy, antonymy, meronymy). The dataset is enriched with a large amount of additional information (i.e. relation domain, word frequency, word POS, word semantic field, etc.) that can be used for either filtering the pairs or performing an in-depth analysis of the results. The tuples were extracted from a combination of ConceptNet 5.0 and WordNet 4.0, and subsequently filtered through automatic methods and crowdsourcing in order to ensure their quality. The dataset is freely downloadable. An extension in RDF format, including also scripts for data processing, is under development.

103 citations


Journal ArticleDOI
01 Oct 2015
TL;DR: The error-generation problem is surprisingly challenging, and in fact, NP-complete, and to provide a scalable solution, a correct and efficient greedy algorithm is developed that sacrifices completeness, but succeeds under very reasonable assumptions.
Abstract: We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation problem is surprisingly challenging, and in fact, NP-complete. To provide a scalable solution, we develop a correct and efficient greedy algorithm that sacrifices completeness, but succeeds under very reasonable assumptions. To scale to millions of tuples, the algorithm relies on several non-trivial optimizations, including a new symmetry property of data quality constraints. The trade-off between control and scalability is the main technical contribution of the paper.

69 citations


Proceedings ArticleDOI
27 May 2015
TL;DR: This work studies the problem of efficiently discovering top-k project-join queries which approximately contain the given example tuples in their output and extends the algorithms to incrementally produce results as soon as the user finishes typing/modifying a cell.
Abstract: An enterprise information worker is often aware of a few example tuples that should be present in the output of the query. Query discovery systems have been developed to discover project-join queries that contain the given example tuples in their output. However, they require the output to exactly contain all the example tuples and do not perform any ranking. To address this limitation, we study the problem of efficiently discovering top-k project-join queries which approximately contain the given example tuples in their output. We extend our algorithms to incrementally produce results as soon as the user finishes typing/modifying a cell. Our experiments on real-life and synthetic datasets show that our proposed solution is significantly more efficient compared with applying state-of-the-art algorithms.

66 citations


Journal ArticleDOI
TL;DR: This paper provides an interpretation of the extended Bonferroni mean (EBM) operator by assuming that some of the attributes A_i are related to a subset B_i of the set A \ {A_i}, and others have no relation with the remaining attributes.
Abstract: The classical Bonferroni mean, defined by Bonferroni in 1950, assumes a homogeneous relation among the attributes, i.e., each attribute A_i is related to the rest of the attributes A \ {A_i}, where A = {A_1, A_2, ..., A_n} denotes the attribute set. In this paper, we emphasize the importance of having an aggregation operator, which we will refer to as the extended Bonferroni mean (EBM) operator, to capture heterogeneous interrelationships among the attributes. We provide an interpretation of “heterogeneous interrelationship” by assuming that some of the attributes A_i are related to a subset B_i of the set A \ {A_i}, and others have no relation with the remaining attributes. We provide an interpretation of this operator as computing different aggregated values for a given set of inputs as the interrelationship pattern changes. We also investigate the behavior of the proposed EBM aggregation operator. Furthermore, to investigate a multiattribute group decision making (MAGDM) problem with linguistic information, we analyze the proposed EBM operator in the linguistic 2-tuple environment and develop three new linguistic aggregation operators: 2-tuple linguistic EBM, weighted 2-tuple linguistic EBM, and linguistic weighted 2-tuple linguistic EBM. A concept of a linguistic similarity measure of 2-tuple linguistic information is introduced. Subsequently, an MAGDM technique is developed, in which the attributes' weights are in the form of 2-tuple linguistic information and the experts' weight information is completely unknown. Finally, a practical example is presented to demonstrate the applicability of our results.
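
For reference, the classical Bonferroni mean that the paper extends can be computed directly from its definition. The sketch below uses my own notation and is only meant to ground the base formula; it does not reproduce the EBM or its 2-tuple linguistic variants.

```python
# Classical Bonferroni mean:
# BM^{p,q}(a_1..a_n) = ( (1/(n(n-1))) * sum_{i != j} a_i^p * a_j^q )^(1/(p+q)).
# The EBM restricts, per attribute, the inner sum to the related subset B_i.
def bonferroni_mean(values, p, q):
    n = len(values)
    total = sum(values[i] ** p * values[j] ** q
                for i in range(n) for j in range(n) if i != j)
    return (total / (n * (n - 1))) ** (1.0 / (p + q))

print(bonferroni_mean([0.2, 0.5, 0.9], p=1, q=1))  # lies between min and max of the inputs
```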

61 citations


Proceedings ArticleDOI
27 May 2015
TL;DR: This work presents QOCO, a novel query-oriented system for cleaning data with oracles, and shows that the problem of determining minimal interactions with oracle crowds to derive database edits for removing (adding) incorrect (missing) tuples from (to) the result of a query is NP-hard in general.
Abstract: As key decisions are often made based on information contained in a database, it is important for the database to be as complete and correct as possible. For this reason, many data cleaning tools have been developed to automatically resolve inconsistencies in databases. However, data cleaning tools provide only best-effort results and usually cannot eradicate all errors that may exist in a database. Even more importantly, existing data cleaning tools do not typically address the problem of determining what information is missing from a database. To overcome the limitations of existing data cleaning techniques, we present QOCO, a novel query-oriented system for cleaning data with oracles. Under this framework, incorrect (resp. missing) tuples are removed from (added to) the result of a query through edits that are applied to the underlying database, where the edits are derived by interacting with domain experts whom we model as oracle crowds. We show that the problem of determining minimal interactions with oracle crowds to derive database edits for removing (adding) incorrect (missing) tuples from (to) the result of a query is NP-hard in general and present heuristic algorithms that interact with oracle crowds. Finally, we implement our algorithms in our prototype system QOCO and show that it is effective and efficient through a comprehensive suite of experiments.

56 citations


Proceedings ArticleDOI
24 Jun 2015
TL;DR: DKG as discussed by the authors is an approach to key grouping that provides near-optimal load distribution for input streams with skewed value distribution, based on the simple observation that with such inputs the load balance is strongly driven by the most frequent values; it identifies such values and explicitly maps them to sub-streams together with groups of less frequent items.
Abstract: Key grouping is a technique used by stream processing frameworks to simplify the development of parallel stateful operators. Through key grouping, a stream of tuples is partitioned into several disjoint sub-streams depending on the values contained in the tuples themselves. Each operator instance, target of one sub-stream, is guaranteed to receive all the tuples containing a specific key value. A common solution to implement key grouping is through hash functions that, however, are known to cause load imbalances on the target operator instances when the input data stream is characterized by a skewed value distribution. In this paper we present DKG, a novel approach to key grouping that provides near-optimal load distribution for input streams with skewed value distribution. DKG starts from the simple observation that with such inputs the load balance is strongly driven by the most frequent values; it identifies such values and explicitly maps them to sub-streams together with groups of less frequent items to achieve a near-optimal load balance. We provide theoretical approximation bounds for the quality of the mapping derived by DKG and show, through both simulations and a running prototype, its impact on stream processing applications.
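
A hedged sketch of the heavy-hitter idea the abstract describes: explicitly place the most frequent keys on the least-loaded instances and hash the rest. This is not DKG itself (no frequency sketches, no approximation guarantees); the function names and the heavy_fraction threshold are hypothetical.

```python
# Frequent keys get an explicit mapping to the least-loaded worker; the long
# tail falls back to plain hashing. Illustrative only, not the DKG algorithm.
from collections import Counter

def build_mapping(sample_keys, n_workers, heavy_fraction=0.05):
    freq = Counter(sample_keys)
    threshold = heavy_fraction * len(sample_keys)
    load = [0.0] * n_workers
    explicit = {}
    # Place heavy hitters greedily on the currently least-loaded worker.
    for key, count in freq.most_common():
        if count < threshold:
            break
        w = min(range(n_workers), key=load.__getitem__)
        explicit[key] = w
        load[w] += count
    return explicit

def route(key, explicit, n_workers):
    return explicit.get(key, hash(key) % n_workers)

sample = ["a"] * 500 + ["b"] * 300 + list("cdefghij") * 25
mapping = build_mapping(sample, n_workers=4)
print(mapping, route("a", mapping, 4), route("z", mapping, 4))
```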

52 citations


Journal ArticleDOI
01 Nov 2015
TL;DR: This work shows a dichotomy for the complexity of resilience, which identifies previously unknown tractable families for deletion propagation with source side-effects, and extends this result to account for functional dependencies.
Abstract: Several research thrusts in the area of data management have focused on understanding how changes in the data affect the output of a view or standing query. Example applications are explaining query results, propagating updates through views, and anonymizing datasets. An important aspect of this analysis is the problem of deleting a minimum number of tuples from the input tables to make a given Boolean query false, which we refer to as "the resilience of a query." In this paper, we study the complexity of resilience for self-join-free conjunctive queries with arbitrary functional dependencies. The cornerstone of our work is the novel concept of triads, a simple structural property of a query that leads to the several dichotomy results we show in this paper. The concepts of triads and resilience bridge the connections between the problems of deletion propagation and causal responsibility, and allow us to substantially advance the known complexity results in these topics. Specifically, we show a dichotomy for the complexity of resilience, which identifies previously unknown tractable families for deletion propagation with source side-effects, and we extend this result to account for functional dependencies. Further, we identify a mistake in a previous dichotomy for causal responsibility, and offer a revised characterization based purely on the structural form of the query (presence or absence of triads). Finally, we extend the dichotomy for causal responsibility in two ways: (a) we account for functional dependencies in the input tables, and (b) we compute responsibility for sets of tuples specified via wildcards.

Proceedings ArticleDOI
29 Oct 2015
TL;DR: ScaleJoin is presented, an algorithmic construction for deterministic and parallel stream joins that not only guarantees deterministic, disjoint and skew-resilient parallelism, but also achieves higher throughput than state-of-the-art parallel stream joins.
Abstract: The inherently large and varying volumes of data generated to facilitate autonomous functionality in large scale cyber-physical systems demand near real-time processing of data streams, often as close to the sensing devices as possible. In this context, data streaming is imperative for data-intensive processing infrastructures. Stream joins, the streaming counterpart of database joins, compare tuples coming from different streams and constitute one of the most important and expensive data streaming operators. Dictated by the needs of big data streaming analytics, algorithmic implementations of stream joins have to be capable of efficiently processing bursty and rate-varying data streams in a deterministic and skew-resilient fashion. To leverage the design of modern multicore architectures, scalability and parallelism need to be addressed also in the algorithmic design. In this paper we present ScaleJoin, an algorithmic construction for deterministic and parallel stream joins that guarantees all the above properties, thus filling a gap in the existing state of the art. Key to the novelty of ScaleJoin is a new data structure, ScaleGate, and its lock-free implementation. ScaleGate facilitates concurrent data exchange and balances independent actions among processing threads; it also enables fine-grain parallelism while providing the necessary synchronization for deterministic processing. As a result, it allows ScaleJoin to run on an arbitrary number of processing threads that can evenly share the overall comparisons run in parallel and achieve high processing throughput and low processing latency. As we show, ScaleJoin not only guarantees deterministic, disjoint and skew-resilient parallelism, but also achieves higher throughput than state-of-the-art parallel stream joins.

Proceedings ArticleDOI
20 May 2015
TL;DR: It is shown that for any self-join-free Boolean conjunctive query q, it can be decided whether or not CERTAINTY(q) is in FO, P, or coNP-complete, and the complexity dichotomy is effective.
Abstract: A relational database is said to be uncertain if primary key constraints can possibly be violated. A repair (or possible world) of an uncertain database is obtained by selecting a maximal number of tuples without ever selecting two distinct tuples with the same primary key value. For any Boolean query q, CERTAINTY(q) is the problem that takes an uncertain database db as input, and asks whether q is true in every repair of db. The complexity of this problem has been particularly studied for q ranging over the class of self-join-free Boolean conjunctive queries. A research challenge is to determine, given q, whether CERTAINTY(q) belongs to complexity classes FO, P, or coNP-complete. In this paper, we combine existing techniques for studying the above complexity classification task. We show that for any self-join-free Boolean conjunctive query q, it can be decided whether or not CERTAINTY(q) is in FO. Further, for any self-join-free Boolean conjunctive query q, CERTAINTY(q) is either in P or coNP-complete, and the complexity dichotomy is effective. This settles a research question that has been open for ten years.

Proceedings ArticleDOI
20 May 2015
TL;DR: This paper proves that all γ-acyclic queries have polynomial time data complexity, and proves that, for every fragment FO^k, k ≥ 2, the combined complexity of FOMC (or WFOMC) is #P-complete.
Abstract: The FO Model Counting problem (FOMC) is the following: given a sentence Φ in FO and a number n, compute the number of models of Φ over a domain of size n; the Weighted variant (WFOMC) generalizes the problem by associating a weight to each tuple and defining the weight of a model to be the product of weights of its tuples. In this paper we study the complexity of the symmetric WFOMC, where all tuples of a given relation have the same weight. Our motivation comes from an important application, inference in Knowledge Bases with soft constraints, like Markov Logic Networks, but the problem is also of independent theoretical interest. We study both the data complexity and the combined complexity of FOMC and WFOMC. For the data complexity we prove the existence of an FO^3 formula for which FOMC is #P_1-complete, and the existence of a Conjunctive Query for which WFOMC is #P_1-complete. We also prove that all γ-acyclic queries have polynomial time data complexity. For the combined complexity, we prove that, for every fragment FO^k, k ≥ 2, the combined complexity of FOMC (or WFOMC) is #P-complete.

Journal ArticleDOI
TL;DR: This article presents a compiler and runtime system that automatically extracts data parallelism for general stream processing, showing linear scalability for parallel regions that are computation-bound and near-linear scalability when tuples are shuffled across parallel regions.
Abstract: Streaming applications process possibly infinite streams of data and often have both high throughput and low latency requirements. They are comprised of operator graphs that produce and consume data tuples. General streaming applications use stateful, selective, and user-defined operators. The stream programming model naturally exposes task and pipeline parallelism, enabling it to exploit parallel systems of all kinds, including large clusters. However, data parallelism must either be manually introduced by programmers, or extracted as an optimization by compilers. Previous data parallel optimizations did not apply to selective, stateful and user-defined operators. This article presents a compiler and runtime system that automatically extracts data parallelism for general stream processing. Data-parallelization is safe if the transformed program has the same semantics as the original sequential version. The compiler forms parallel regions while considering operator selectivity, state, partitioning, and graph dependencies. The distributed runtime system ensures that tuples always exit parallel regions in the same order they would without data parallelism, using the most efficient strategy as identified by the compiler. Our experiments using 100 cores across 14 machines show linear scalability for parallel regions that are computation-bound, and near linear scalability when tuples are shuffled across parallel regions.

Proceedings ArticleDOI
13 Apr 2015
TL;DR: This work proposes an on-demand strategy that only generates minimum forbidden tuples for validity checks as they are encountered, instead of generating all of them up front.
Abstract: Constraint handling is a challenging problem in combinatorial test generation. In general, there are two ways to handle constraints, i.e., constraint solving and forbidden tuples. In our earlier work, we proposed a constraint handling approach based on forbidden tuples for software product line systems consisting of only Boolean parameters. In this paper, we generalize this approach for general software systems that may consist of other types of parameter. The key idea of our approach is using the notion of minimum forbidden tuples to perform validity checks on both complete and partial tests. Furthermore, we propose an on-demand strategy that only generates minimum forbidden tuples for validity checks as they are encountered, instead of generating all of them up front. We implemented our generalized approach with and without the on-demand strategy in our combinatorial testing tool called ACTS. We performed experiments on 35 systems using ACTS and PICT. The results show that for these 35 systems, our generalized approach performed faster than PICT and the constraint solving-based approach in ACTS. For some large systems, the improvement on test generation time is up to two orders of magnitude.
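
The validity check that forbidden tuples enable can be shown in a few lines. This is a hedged sketch only: a (possibly partial) test is invalid if it already contains some forbidden tuple; deriving the minimum forbidden tuples from the constraints, and doing so on demand, is the paper's contribution and is not reproduced here. The parameters and values are hypothetical.

```python
# A (partial) test is invalid if it contains some forbidden tuple.
def violates(test, forbidden_tuples):
    """test: dict param -> value (may be partial).
    forbidden_tuples: iterable of dicts param -> value."""
    for ft in forbidden_tuples:
        if all(test.get(p) == v for p, v in ft.items()):
            return True
    return False

forbidden = [{"os": "ios", "browser": "edge"}]  # hypothetical constraint
print(violates({"os": "ios", "browser": "edge", "net": "wifi"}, forbidden))  # True
print(violates({"os": "ios"}, forbidden))                                    # False (partial test, not yet invalid)
```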

Posted Content
TL;DR: A theory of co-design is introduced that describes "design problems", defined as tuples of "functionality space", "implementation space", and "resources space", together with a feasibility relation that relates the three spaces.
Abstract: One of the challenges of modern engineering, and robotics in particular, is designing complex systems, composed of many subsystems, rigorously and with optimality guarantees. This paper introduces a theory of co-design that describes "design problems", defined as tuples of "functionality space", "implementation space", and "resources space", together with a feasibility relation that relates the three spaces. Design problems can be interconnected together to create "co-design problems", which describe possibly recursive co-design constraints among subsystems. A co-design problem induces a family of optimization problems of the type "find the minimal resources needed to implement a given functionality"; the solution is an antichain (Pareto front) of resources. A special class of co-design problems are Monotone Co-Design Problems (MCDPs), for which functionality and resources are complete partial orders and the feasibility relation is monotone and Scott continuous. The induced optimization problems are multi-objective, nonconvex, nondifferentiable, noncontinuous, and not even defined on continuous spaces; yet, there exists a complete solution. The antichain of minimal resources can be characterized as a least fixed point, and it can be computed using Kleene's algorithm. The computation needed to solve a co-design problem can be bounded by a function of a graph property that quantifies the interdependence of the subproblems. These results make us much more optimistic about the problem of designing complex systems in a rigorous way.

Proceedings ArticleDOI
01 Jul 2015
TL;DR: This work revisits the fundamental notion of a key in relational databases with NULLs, and investigates the notions of possible and certain keys, which are keys that hold in some or all possible worlds that can originate from an SQL table, respectively.
Abstract: Driven by the dominance of the relational model, the requirements of modern applications, and the veracity of data, we revisit the fundamental notion of a key in relational databases with NULLs. In SQL database systems primary key columns are NOT NULL by default. NULL columns may occur in unique constraints which only guarantee uniqueness for tuples which do not feature null markers in any of the columns involved, and therefore serve a different function than primary keys. We investigate the notions of possible and certain keys, which are keys that hold in some or all possible worlds that can originate from an SQL table, respectively. Possible keys coincide with the unique constraint of SQL, and thus provide a semantics for their syntactic definition in the SQL standard. Certain keys extend primary keys to include NULL columns, and thus form a sufficient and necessary condition to identify tuples uniquely, while primary keys are only sufficient for that purpose. In addition to basic characterization, axiomatization, and simple discovery approaches for possible and certain keys, we investigate the existence and construction of Armstrong tables, and describe an indexing scheme for enforcing certain keys. Our experiments show that certain keys with NULLs do occur in real-world databases, and that related computational problems can be solved efficiently. Certain keys are therefore semantically well-founded and able to maintain data quality in the form of Codd's entity integrity rule while handling the requirements of modern applications, that is, higher volumes of incomplete data from different formats.
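
The definitions of possible and certain keys translate directly into pairwise checks over a table with NULLs. The sketch below follows the possible-world reading given in the abstract (None stands for NULL); it is a brute-force illustration, not the paper's discovery or enforcement machinery.

```python
# Certain key: unique in every possible world -- every pair of rows must
# differ on some key column where both values are non-NULL.
# Possible key (SQL UNIQUE): unique in some world -- no two rows may agree on
# all key columns with all values non-NULL.
from itertools import combinations

def is_certain_key(rows, key):
    return all(
        any(r1[c] is not None and r2[c] is not None and r1[c] != r2[c] for c in key)
        for r1, r2 in combinations(rows, 2)
    )

def is_possible_key(rows, key):
    return not any(
        all(r1[c] is not None and r1[c] == r2[c] for c in key)
        for r1, r2 in combinations(rows, 2)
    )

rows = [{"id": 1, "email": None}, {"id": 2, "email": None}]
print(is_certain_key(rows, ["id"]), is_certain_key(rows, ["email"]))  # True False
print(is_possible_key(rows, ["email"]))                               # True
```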

Journal ArticleDOI
01 Aug 2015
TL;DR: The demonstration will center on the first, early prototype of DataSpread, a data exploration tool that holistically unifies databases and spreadsheets, and will give the attendees a sense of the enormous data exploration capabilities offered by unifying spreadsheets and databases.
Abstract: Spreadsheet software is often the tool of choice for ad-hoc tabular data management, processing, and visualization, especially on tiny data sets. On the other hand, relational database systems offer significant power, expressivity, and efficiency over spreadsheet software for data management, while lacking in the ease of use and ad-hoc analysis capabilities. We demonstrate DataSpread, a data exploration tool that holistically unifies databases and spreadsheets. It continues to offer a Microsoft Excel-based spreadsheet front-end, while in parallel managing all the data in a back-end database, specifically, PostgreSQL. DataSpread retains all the advantages of spreadsheets, including ease of use, ad-hoc analysis and visualization capabilities, and a schema-free nature, while also adding the advantages of traditional relational databases, such as scalability and the ability to use arbitrary SQL to import, filter, or join external or internal tables and have the results appear in the spreadsheet. DataSpread needs to reason about and reconcile differences in the notions of schema, addressing of cells and tuples, and the current "pane" (which exists in spreadsheets but not in traditional databases), and support data modifications at both the front-end and the back-end. Our demonstration will center on our first, early prototype of DataSpread, and will give the attendees a sense of the enormous data exploration capabilities offered by unifying spreadsheets and databases.

Proceedings ArticleDOI
27 May 2015
TL;DR: This work showcases AQ-K-slack, an adaptive, buffer-based disorder handling approach, which supports executing sliding window aggregate queries over out-of-order data streams in a quality-driven manner and dynamically adjusts the input buffer size at query runtime to minimize the result latency.
Abstract: Executing continuous queries over out-of-order data streams, where tuples are not ordered according to timestamps, is challenging, because high result accuracy and low result latency are two conflicting performance metrics. Although many applications allow trading exact query results for lower latency, they still expect the produced results to meet a certain quality requirement. However, none of the existing disorder handling approaches have considered minimizing the result latency while meeting user-specified requirements on the quality of query results. In this demonstration, we showcase AQ-K-slack, an adaptive, buffer-based disorder handling approach, which supports executing sliding window aggregate queries over out-of-order data streams in a quality-driven manner. By adapting techniques from the field of sampling-based approximate query processing and control theory, AQ-K-slack dynamically adjusts the input buffer size at query runtime to minimize the result latency, while respecting a user-specified threshold on relative errors in produced query results. We demonstrate a prototype stream processing system, which extends SAP Event Stream Processor with the implementation of AQ-K-slack. Through an interactive interface, the audience will learn the effect of different factors, such as the aggregate function, the window specification, the result error threshold, and stream properties, on the latency and the accuracy of query results. Moreover, they can experience the effectiveness of AQ-K-slack in obtaining user-desired latency vs. result accuracy trade-offs, compared to naive disorder handling approaches that make extreme trade-offs. For instance, by sacrificing 1% result accuracy, our system can reduce the result latency by 80% when compared to the state of the art.
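
For context, a plain K-slack buffer, the classical disorder-handling baseline that AQ-K-slack adapts, can be sketched as follows. The quality-driven, runtime adjustment of the buffer size that the demonstration showcases is not reproduced here, and the class and parameter names are hypothetical.

```python
# Plain K-slack: hold tuples until the high-watermark (max seen timestamp
# minus K) passes their timestamp, then emit them in timestamp order.
import heapq

class KSlackBuffer:
    def __init__(self, k):
        self.k = k           # slack in timestamp units
        self.max_ts = None
        self.heap = []       # min-heap ordered by timestamp

    def insert(self, ts, value):
        """Buffer one tuple; return the tuples that can now be safely emitted."""
        heapq.heappush(self.heap, (ts, value))
        self.max_ts = ts if self.max_ts is None else max(self.max_ts, ts)
        out = []
        while self.heap and self.heap[0][0] <= self.max_ts - self.k:
            out.append(heapq.heappop(self.heap))
        return out

buf = KSlackBuffer(k=2)
for ts, v in [(1, "a"), (3, "b"), (2, "c"), (6, "d")]:
    print(ts, "->", buf.insert(ts, v))
```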

Journal ArticleDOI
TL;DR: A crowdsourcing-based framework to evaluate teaching quality in the classroom, using a weighted average operator to aggregate information from students' questionnaires described by linguistic 2-tuple terms; it provides a strong tolerance for abnormal students and makes the evaluation more accurate.
Abstract: Crowdsourcing is widely used in various fields to collect goods and services from large groups of participants. Evaluating teaching quality by collecting feedback from experts or students after class is not only delayed but also inaccurate. In this paper, we present a crowdsourcing-based framework to evaluate teaching quality in the classroom, using a weighted average operator to aggregate information from students' questionnaires described by linguistic 2-tuple terms. We then define a crowd grade based on similarity degree to distinguish the contributions of different students and minimize abnormal students' impact on the evaluation. The crowd grade is updated at the end of each feedback round, which keeps the evaluation accurate. Moreover, a simulated case is shown to illustrate how to apply this framework to assess teaching quality in the classroom. Finally, we developed a prototype and carried out experiments on a series of real questionnaires and two sets of modified data. The results show that teachers can locate the weak points of teaching and, furthermore, identify abnormal students to improve the teaching quality. Meanwhile, our approach provides a strong tolerance for abnormal students, making the evaluation more accurate.

Journal ArticleDOI
01 Sep 2015
TL;DR: This paper introduces fast inequality join algorithms that put columns to be joined in sorted arrays, use permutation arrays to encode positions of tuples in one sorted array w.r.t. the other sorted array, and use space-efficient bit-arrays that enable optimizations for fast computation of the join results.
Abstract: Inequality joins, which join relational tables on inequality conditions, are used in various applications. While there have been a wide range of optimization methods for joins in database systems, from algorithms such as sort-merge join and band join, to various indices such as B+-tree, R*-tree and Bitmap, inequality joins have received little attention and queries containing such joins are usually very slow. In this paper, we introduce fast inequality join algorithms. We put columns to be joined in sorted arrays and we use permutation arrays to encode positions of tuples in one sorted array w.r.t. the other sorted array. In contrast to sort-merge join, we use space efficient bit-arrays that enable optimizations, such as Bloom filter indices, for fast computation of the join results. We have implemented a centralized version of these algorithms on top of PostgreSQL, and a distributed version on top of Spark SQL. We have compared against well known optimization techniques for inequality joins and show that our solution is more scalable and several orders of magnitude faster.
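
To make the role of sorting concrete, here is a deliberately simplified sketch for a single inequality predicate: sort one side once, then answer each probe with binary search. The paper's IEJoin goes further, using permutation arrays and bit-arrays to evaluate two inequality predicates at once; that machinery is not reproduced here.

```python
# Much-simplified illustration of the sorted-array idea for a *single*
# inequality predicate (r.x < s.x).
import bisect

def lt_join(r_vals, s_vals):
    """Return (i, j) pairs with r_vals[i] < s_vals[j]."""
    s_sorted = sorted((v, j) for j, v in enumerate(s_vals))
    keys = [v for v, _ in s_sorted]
    out = []
    for i, rv in enumerate(r_vals):
        # Every S tuple strictly greater than rv qualifies.
        for v, j in s_sorted[bisect.bisect_right(keys, rv):]:
            out.append((i, j))
    return out

print(lt_join([5, 8], [6, 9, 3]))  # [(0, 0), (0, 1), (1, 1)]
```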

Proceedings ArticleDOI
20 May 2015
TL;DR: In this paper, the authors propose a framework for why-not explanations, that is, explanations for why a tuple is missing from a query result, which can either be provided by the user, or it may be automatically derived from the data and/or schema.
Abstract: We propose a novel foundational framework for why-not explanations, that is, explanations for why a tuple is missing from a query result. Our why-not explanations leverage concepts from an ontology to provide high-level and meaningful reasons for why a tuple is missing from the result of a query. A key algorithmic problem in our framework is that of computing a most-general explanation for a why-not question, relative to an ontology, which can either be provided by the user, or it may be automatically derived from the data and/or schema. We study the complexity of this problem and associated problems, and present concrete algorithms for computing why-not explanations. In the case where an external ontology is provided, we first show that the problem of deciding the existence of an explanation to a why-not question is NP-complete in general. However, the problem is solvable in polynomial time for queries of bounded arity, provided that the ontology is specified in a suitable language, such as a member of the DL-Lite family of description logics, which allows for efficient concept subsumption checking. Furthermore, we show that a most-general explanation can be computed in polynomial time in this case. In addition, we propose a method for deriving a suitable (virtual) ontology from a database and/or a schema, and we present an algorithm for computing a most-general explanation to a why-not question, relative to such ontologies. This algorithm runs in polynomial time in the case when concepts are defined in a selection-free language, or if the underlying schema is fixed. Finally, we also study the problem of computing short most-general explanations, and we briefly discuss alternative definitions of what it means to be an explanation, and to be most general.

Journal ArticleDOI
TL;DR: This paper introduces CR-OLAP, a scalable cloud-based real-time OLAP system based on a new distributed index structure for OLAP, the distributed PDCR tree, and studies the use of parallel computing on scalable clouds to accelerate queries.

Journal ArticleDOI
TL;DR: A new robust database watermarking scheme whose originality rests on a semantic control of the data distortion and on the extension of quantization index modulation (QIM) to circular histograms of numerical attributes.
Abstract: In this paper, we present a new robust database watermarking scheme whose originality rests on a semantic control of the data distortion and on the extension of quantization index modulation (QIM) to circular histograms of numerical attributes. The semantic distortion control of the embedding process we propose relies on the identification of existing semantic links between values of attributes in a tuple by means of an ontology. By doing so, we avoid incoherent or very rare record occurrences which may bias data interpretation or betray the presence of the watermark. We then adapt QIM to database watermarking. Watermark embedding is conducted by modulating the relative angular position of the circular histogram center of mass of one numerical attribute. We theoretically demonstrate the robustness of our scheme against the most common attacks (i.e., tuple insertion and deletion). This makes it suitable for copyright protection, owner identification, or traitor tracing purposes. We further verify these theoretical limits experimentally within the framework of a medical database of more than half a million inpatient hospital stay records. Under the assumption imposed by the central limit theorem, experimental results fit the theory. We also compare our approach with two efficient schemes so as to prove its benefits.
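
The building block of the embedding, scalar quantization index modulation, can be sketched independently of the database setting. The code below shows plain QIM on a single value (thought of here as the center-of-mass angle); the circular-histogram construction, the semantic distortion control, and the robustness analysis are not reproduced, and all names are mine.

```python
# Plain scalar QIM: embed a bit by snapping the value onto one of two
# interleaved quantization lattices; detect by finding the closer lattice.
def qim_embed(value, bit, step):
    """Quantize onto the lattice associated with `bit` (dither = step/2)."""
    offset = 0.0 if bit == 0 else step / 2.0
    return round((value - offset) / step) * step + offset

def qim_detect(value, step):
    """Return the bit whose lattice is closest to `value`."""
    d0 = abs(value - qim_embed(value, 0, step))
    d1 = abs(value - qim_embed(value, 1, step))
    return 0 if d0 <= d1 else 1

angle = 0.37                                 # e.g. center-of-mass angle in radians
marked = qim_embed(angle, 1, step=0.2)
print(marked, qim_detect(marked, step=0.2))  # ~0.3, 1
```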

Proceedings ArticleDOI
23 Oct 2015
TL;DR: This work presents a novel approach for solving the field-sensitive points-to problem for Java with the means of a transitive-closure data-structure, and a pre-computed set of potentially matching load/store pairs to accelerate the fix-point calculation.
Abstract: Computing a precise points-to analysis for very large Java programs remains challenging despite the large body of research on points-to analysis. Any approach must solve an underlying dynamic graph reachability problem, for which the best algorithms have near-cubic worst-case runtime complexity, and, hence, previous work does not scale to programs with millions of lines of code. In this work, we present a novel approach for solving the field-sensitive points-to problem for Java with the means of (1) a transitive-closure data-structure, and (2) a pre-computed set of potentially matching load/store pairs to accelerate the fix-point calculation. Experimentation on Java benchmarks validates the superior performance of our approach over the standard context-free language reachability implementations. Our approach computes a points-to index for the OpenJDK with over 1.5 billion tuples in under a minute.

Journal ArticleDOI
01 Apr 2015
TL;DR: Experimental results demonstrate that FastRAQ provides range-aggregate query results within a time period two orders of magnitude lower than that of Hive, while the relative error is less than 3 percent within the given confidence interval.
Abstract: Range-aggregate queries apply a certain aggregate function on all tuples within given query ranges. Existing approaches to range-aggregate queries are insufficient to quickly provide accurate results in big data environments. In this paper, we propose FastRAQ, a fast approach to range-aggregate queries in big data environments. FastRAQ first divides big data into independent partitions with a balanced partitioning algorithm, and then generates a local estimation sketch for each partition. When a range-aggregate query request arrives, FastRAQ obtains the result directly by summarizing local estimates from all partitions. FastRAQ has O(1) time complexity for data updates and O(N/(P×B)) time complexity for range-aggregate queries, where N is the number of distinct tuples for all dimensions, P is the partition number, and B is the bucket number in the histogram. We implement the FastRAQ approach on the Linux platform, and evaluate its performance with about 10 billion data records. Experimental results demonstrate that FastRAQ provides range-aggregate query results within a time period two orders of magnitude lower than that of Hive, while the relative error is less than 3 percent within the given confidence interval.
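
The summarize-local-estimates pattern in the abstract can be shown with a toy example in which each partition simply scans its own tuples; the balanced partitioning algorithm and the histogram-based local sketches that give FastRAQ its complexity bounds are not reproduced here.

```python
# Each partition answers the range-aggregate locally; the results are summed.
def local_estimate(partition, lo, hi):
    return sum(v for v in partition if lo <= v <= hi)

def range_aggregate(partitions, lo, hi):
    return sum(local_estimate(p, lo, hi) for p in partitions)

partitions = [[1, 7, 12], [3, 25, 8], [14, 2]]
print(range_aggregate(partitions, lo=2, hi=13))  # 7+12+3+8+2 = 32
```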

Journal ArticleDOI
TL;DR: This work presents a novel method for watermarking relational databases for identification and proof of ownership based on the secure embedding of blind and multi-bit watermarks using Bacterial Foraging Algorithm (BFA).
Abstract: The main aspect of database protection is to prove the ownership of data, that is, to establish who originated it. This is of particular importance in the case of electronic data, as data sets are often modified and copied without proper citation or acknowledgement of the originating data set. We present a novel method for watermarking relational databases for identification and proof of ownership based on the secure embedding of blind and multi-bit watermarks using the Bacterial Foraging Algorithm (BFA). The feasibility of the BFA implementation is shown within the framework of the database watermarking application. The owner's identification is cryptographically secured and used as the embedded watermark. An improved hash partitioning approach, independent of the primary key of the database, is used to secure the ordering of the tuples. The strength of BFA is explored to make the technique robust, secure, and imperceptible. BFA is implemented to give nearly global optimal values bounded by data usability constraints and thus makes the database fragile to any attack. The parameters of BFA are tuned to reduce the execution time. BFA is experimentally proved to be a better solution than the Genetic Algorithm (GA). The proposed technique is experimentally proved to be resilient against malicious attacks.

Book ChapterDOI
18 May 2015
TL;DR: This paper proposes to authorize entries in tables to contain simple arithmetic constraints, replacing classical tuples of values by so-called smart tuples, and demonstrates that the smart table constraint is a highly promising general purpose tool for CP.
Abstract: Table Constraints are very useful for modeling combinatorial problems in Constraint Programming (CP). They are a universal mechanism for representing constraints, but unfortunately the size of their tables can grow exponentially with their arities. In this paper, we propose to authorize entries in tables to contain simple arithmetic constraints, replacing classical tuples of values by so-called smart tuples. Smart table constraints can thus be viewed as logical combinations of those simple arithmetic constraints. This new form of tuples allows us to encode compactly many constraints, including a dozen well-known global constraints. We show that, under a very reasonable assumption about the acyclicity of smart tuples, a Generalized Arc Consistency algorithm of low time complexity can be devised. Our experimental results demonstrate that the smart table constraint is a highly promising general purpose tool for CP.
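
A hedged illustration of what a smart tuple looks like as data, restricted to unary conditions against constants (the paper also allows constraints between columns): only the membership test is shown; the low-complexity GAC propagation algorithm is the paper's contribution and is not reproduced.

```python
# A "smart tuple" as a list of simple per-variable conditions:
# '*' accepts any value, otherwise a pair (operator, constant) must hold.
import operator

OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def matches(values, smart_tuple):
    for v, cond in zip(values, smart_tuple):
        if cond == "*":
            continue
        op, const = cond
        if not OPS[op](v, const):
            return False
    return True

# Constraint over (x, y, z): tuples where x = 3, y is free, z >= 5.
smart = [("=", 3), "*", (">=", 5)]
print(matches((3, 9, 7), smart), matches((3, 9, 2), smart))  # True False
```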

Posted Content
TL;DR: This paper addresses binary Visual Question Answering on abstract scenes as visual verification of concepts inquired in the questions by converting the question to a tuple that concisely summarizes the visual concept to be detected in the image.
Abstract: The complex compositional structure of language makes problems at the intersection of vision and language challenging. But language also provides a strong prior that can result in good superficial performance, without the underlying models truly understanding the visual content. This can hinder progress in pushing state of art in the computer vision aspects of multi-modal AI. In this paper, we address binary Visual Question Answering (VQA) on abstract scenes. We formulate this problem as visual verification of concepts inquired in the questions. Specifically, we convert the question to a tuple that concisely summarizes the visual concept to be detected in the image. If the concept can be found in the image, the answer to the question is "yes", and otherwise "no". Abstract scenes play two roles (1) They allow us to focus on the high-level semantics of the VQA task as opposed to the low-level recognition problems, and perhaps more importantly, (2) They provide us the modality to balance the dataset such that language priors are controlled, and the role of vision is essential. In particular, we collect fine-grained pairs of scenes for every question, such that the answer to the question is "yes" for one scene, and "no" for the other for the exact same question. Indeed, language priors alone do not perform better than chance on our balanced dataset. Moreover, our proposed approach matches the performance of a state-of-the-art VQA approach on the unbalanced dataset, and outperforms it on the balanced dataset.