Showing papers on "Tuple published in 2018"

PDF

Open Access

Posted Content•

Learned Cardinalities: Estimating Correlated Joins with Deep Learning

[...]

Andreas Kipf¹, Thomas Kipf², Bernhard Radke¹, Viktor Leis¹, Peter Boncz³, Alfons Kemper¹ - Show less +2 more•Institutions (3)

Technische Universität München¹, University of Amsterdam², Centrum Wiskunde & Informatica³

03 Sep 2018-arXiv: Databases

TL;DR: In this article, a multi-set convolutional network (MSCN) is proposed for cardinality estimation in relational query plans, which employs set semantics to capture query features and true cardinalities.

...read moreread less

Abstract: We describe a new deep learning approach to cardinality estimation MSCN is a multi-set convolutional network, tailored to representing relational query plans, that employs set semantics to capture query features and true cardinalities MSCN builds on sampling-based estimation, addressing its weaknesses when no sampled tuples qualify a predicate, and in capturing join-crossing correlations Our evaluation of MSCN using a real-world dataset shows that deep learning significantly enhances the quality of cardinality estimation, which is the core problem in query optimization

...read moreread less

210 citations

Proceedings Article•DOI•

Distributed representations of tuples for entity resolution

[...]

Muhammad Ebraheem¹, Saravanan Thirumuruganathan¹, Shafiq Joty², Mourad Ouzzani¹, Nan Tang¹ - Show less +1 more•Institutions (2)

Qatar Computing Research Institute¹, Nanyang Technological University²

01 Jul 2018

TL;DR: This work proposes a locality sensitive hashing (LSH) based blocking approach that takes all attributes of a tuple into consideration and produces much smaller blocks, compared with traditional methods that consider only a few attributes.

...read moreread less

Abstract: Despite the efforts in 70+ years in all aspects of entity resolution (ER), there is still a high demand for democratizing ER - by reducing the heavy human involvement in labeling data, performing feature engineering, tuning parameters, and defining blocking functions. With the recent advances in deep learning, in particular distributed representations of words (a.k.a. word embeddings), we present a novel ER system, called DeepER, that achieves good accuracy, high efficiency, as well as ease-of-use (i.e., much less human efforts). We use sophisticated composition methods, namely uni- and bi-directional recurrent neural networks (RNNs) with long short term memory (LSTM) hidden units, to convert each tuple to a distributed representation (i.e., a vector), which can in turn be used to effectively capture similarities between tuples. We consider both the case where pre-trained word embeddings are available as well the case where they are not; we present ways to learn and tune the distributed representations that are customized for a specific ER task under different scenarios. We propose a locality sensitive hashing (LSH) based blocking approach that takes all attributes of a tuple into consideration and produces much smaller blocks, compared with traditional methods that consider only a few attributes. We evaluate our algorithms on multiple datasets (including benchmarks, biomedical data, as well as multi-lingual data) and the extensive experimental results show that DeepER outperforms existing solutions.

...read moreread less

199 citations

Journal Article•DOI•

Picture 2-tuple linguistic aggregation operators in multiple attribute decision making

[...]

Guiwu Wei¹, Guiwu Wei², Fuad E. Alsaadi¹, Tasawar Hayat³, Tasawar Hayat¹, Ahmed Alsaedi¹ - Show less +2 more•Institutions (3)

King Abdulaziz University¹, Sichuan Normal University², Quaid-i-Azam University³

01 Feb 2018

TL;DR: This paper utilizes arithmetic and geometric operations to develop several picture 2-tuple linguistic aggregation operators and utilizes these operators to develop some approaches to solving thePicture 2-Tuple linguistic multiple attribute decision-making problems.

...read moreread less

Abstract: In this paper, we investigate the multiple attribute decision-making problems with picture 2-tuple linguistic information. Then, we utilize arithmetic and geometric operations to develop several picture 2-tuple linguistic aggregation operators. The prominent characteristic of these proposed operators is studied. Then, we have utilized these operators to develop some approaches to solving the picture 2-tuple linguistic multiple attribute decision-making problems. Finally, a practical example for enterprise resource planning (ERP) system selection is given to verify the developed approach and to demonstrate its practicality and effectiveness.

...read moreread less

146 citations

Proceedings Article•

Dynamic Neural Program Embeddings for Program Repair

[...]

Ke Wang¹, Zhendong Su¹, Rishabh Singh²•Institutions (2)

University of California, Davis¹, Microsoft²

15 Feb 2018

TL;DR: A novel semantic program embedding that is learned from program execution traces is proposed, showing that program states expressed as sequential tuples of live variable values not only captures program semantics more precisely, but also offer a more natural fit for Recurrent Neural Networks to model.

...read moreread less

Abstract: Neural program embeddings have shown much promise recently for a variety of program analysis tasks, including program synthesis, program repair, fault localization, etc. However, most existing program embeddings are based on syntactic features of programs, such as raw token sequences or abstract syntax trees. Unlike images and text, a program has an unambiguous semantic meaning that can be difﬁcult to capture by only considering its syntax(i.e. syntactically similar programs can exhibit vastly different run-time behavior), which makes syntax-based program embeddings fundamentally limited. This paper proposes a novel semantic program embedding that is learned from program execution traces. Our key insight is that program states expressed as sequential tuples of live variable values not only captures program semantics more precisely, but also offer a more natural ﬁt for Recurrent Neural Networks to model. We evaluate different syntactic and semantic program embeddings on predicting the types of errors that students make in their submissions to an introductory programming class and two exercises on the CodeHunt education platform. Evaluation results show that our new semantic program embedding signiﬁcantly outperforms the syntactic program embeddings based on token sequences and abstract syntax trees. In addition, we augment a search-based program repair system with the predictions obtained from our semantic embedding, and show that search efﬁciency is also signiﬁcantly improved.

...read moreread less

102 citations

Posted Content•

Neural Open Information Extraction

[...]

Lei Cui¹, Furu Wei¹, Ming Zhou¹•Institutions (1)

Microsoft¹

11 May 2018-arXiv: Computation and Language

TL;DR: The authors proposed a neural Open Information Extraction (Open IE) approach with an encoder-decoder framework, which learns highly confident arguments and relation tuples bootstrapped from a state-of-the-art Open IE system.

...read moreread less

Abstract: Conventional Open Information Extraction (Open IE) systems are usually built on hand-crafted patterns from other NLP tools such as syntactic parsing, yet they face problems of error propagation. In this paper, we propose a neural Open IE approach with an encoder-decoder framework. Distinct from existing methods, the neural Open IE approach learns highly confident arguments and relation tuples bootstrapped from a state-of-the-art Open IE system. An empirical study on a large benchmark dataset shows that the neural Open IE system significantly outperforms several baselines, while maintaining comparable computational efficiency.

...read moreread less

97 citations

Proceedings Article•DOI•

Online Multi-Target Tracking with Tensor-Based High-Order Graph Matching

[...]

Zongwei Zhou¹, Junliang Xing¹, Mengdan Zhang¹, Weiming Hu¹•Institutions (1)

Chinese Academy of Sciences¹

01 Aug 2018

TL;DR: A dual-direction unit $\ell_{1}$-norm constrained tensor power iteration algorithm is proposed and a deep pair-wise appearance similarity metric based on object mask is presented in this paper where just the features from true target region are utilized.

...read moreread less

Abstract: In this paper we formulate multi-target tracking (MTT) as a high-order graph matching problem and propose a $\ell_{1}$ -norm tensor power iteration solution. Concretely, the search for trajectory-observation correspondences in MTT task is cast as a hypergraph matching problem to maximize a multi-linear objective function over all permutations of the associations. This function is defined by a tensor representing the affinity between association tuples where pair-wise similarities, motion consistency and spatial structural information can be embedded expediently. To solve the matching problem, a dual-direction unit $\ell_{1}$ -norm constrained tensor power iteration algorithm is proposed. Additionally, as measuring the appearance affinity with features extracted from the rectangle patch, which is adopted in most methods, has a weak discrimination when bounding boxes overlap each other heavily, we present a deep pair-wise appearance similarity metric based on object mask in this paper where just the features from true target region are utilized. Experimental evaluation shows that our approach achieves an accuracy comparable to state-of-the-art online trackers. The source code of the proposed approach will be released to facilitate further studies on the MTT problem.

...read moreread less

89 citations

Journal Article•DOI•

Distributed representations of tuples for entity resolution

[...]

EbraheemMuhammad, ThirumuruganathanSaravanan, JotyShafiq, OuzzaniMourad, TangNan - Show less +1 more

01 Jul 2018

TL;DR: This research presents a novel approach called "SmartLabeling™", which automates the very labor-intensive and therefore time-heavy and expensive process of manually cataloging and labeling entity resolution data.

...read moreread less

87 citations

Proceedings Article•DOI•

RStream: marrying relational algebra with streaming for efficient graph mining on a single machine

[...]

Kai Wang¹, Zhiqiang Zuo², John Thorpe¹, Tien Quang Nguyen³, Guoqing Harry Xu¹ - Show less +1 more•Institutions (3)

University of California, Los Angeles¹, Nanjing University², Facebook³

08 Oct 2018

TL;DR: RStream is the first single-machine, out-of-core mining system that leverages disk support to store intermediate data and demonstrates that RStream outperforms all of them, running on a 10-node cluster, e.g., by at least a factor of 1.7×, and can process large graphs on an inexpensive machine.

...read moreread less

Abstract: Graph mining is an important category of graph algorithms that aim to discover structural patterns such as cliques and motifs in a graph. While a great deal of work has been done recently on graph computation such as PageRank, systems support for scalable graph mining is still limited. Existing mining systems such as Arabesque focus on distributed computing and need large amounts of compute and memory resources.We built RStream, the first single-machine, out-of-core mining system that leverages disk support to store intermediate data. At its core are two innovations: (1) a rich programming model that exposes relational algebra for developers to express a wide variety of mining tasks; and (2) a runtime engine that implements relational algebra efficiently with tuple streaming. A comparison between RStream and four state-of-the-art distributed mining/Datalog systems--Arabesque, ScaleMine, DistGraph, and BigDatalog -- demonstrates that RStream outperforms all of them, running on a 10-node cluster, e.g., by at least a factor of 1.7×, and can process large graphs on an inexpensive machine.

...read moreread less

81 citations

Proceedings Article•DOI•

Neural Open Information Extraction

[...]

Lei Cui¹, Furu Wei¹, Ming Zhou¹•Institutions (1)

Microsoft¹

15 Jul 2018

TL;DR: A neural Open IE approach with an encoder-decoder framework that learns highly confident arguments and relation tuples bootstrapped from a state-of-the-art Open IE system.

...read moreread less

80 citations

Proceedings Article•DOI•

Model-free control for distributed stream data processing using deep reinforcement learning

[...]

Teng Li¹, Zhiyuan Xu¹, Jian Tang¹, Yanzhi Wang¹•Institutions (1)

Syracuse University¹

01 Feb 2018

TL;DR: This paper proposes to leverage emerging Deep Reinforcement Learning (DRL) for enabling model-free control in DSDPSs and presents a novel and highly effective DRL-based control framework, which minimizes average end-to-end tuple processing time.

...read moreread less

Abstract: In this paper, we focus on general-purpose Distributed Stream Data Processing Systems (DSDPSs), which deal with processing of unbounded streams of continuous data at scale distributedly in real or near-real time. A fundamental problem in a DSDPS is the scheduling problem (i.e., assigning workload to workers/machines) with the objective of minimizing average end-to-end tuple processing time. A widely-used solution is to distribute workload evenly over machines in the cluster in a round-robin manner, which is obviously not efficient due to lack of consideration for communication delay. Model-based approaches (such as queueing theory) do not work well either due to the high complexity of the system environment.We aim to develop a novel model-free approach that can learn to well control a DSDPS from its experience rather than accurate and mathematically solvable system models, just as a human learns a skill (such as cooking, driving, swimming, etc). Specifically, we, for the first time, propose to leverage emerging Deep Reinforcement Learning (DRL) for enabling model-free control in DSDPSs; and present design, implementation and evaluation of a novel and highly effective DRL-based control framework, which minimizes average end-to-end tuple processing time by jointly learning the system environment via collecting very limited runtime statistics data and making decisions under the guidance of powerful Deep Neural Networks (DNNs). To validate and evaluate the proposed framework, we implemented it based on a widely-used DSDPS, Apache Storm, and tested it with three representative applications: continuous queries, log stream processing and word count (stream version). Extensive experimental results show 1) Compared to Storm's default scheduler and the state-of-the-art model-based method, the proposed framework reduces average tuple processing by 33.5% and 14.0% respectively on average. 2) The proposed framework can quickly reach a good scheduling solution during online learning, which justifies its practicability for online control in DSDPSs.

...read moreread less

60 citations

Journal Article•DOI•

Some operators of intuitionistic uncertain 2-tuple linguistic variables and application to multi-attribute group decision making with heterogeneous relationship among attributes

[...]

Gao-Feng Yu¹, Deng-Feng Li¹, Jin-Ming Qiu, Xiao-Xue Zheng¹•Institutions (1)

Fuzhou University¹

01 Jan 2018-Journal of Intelligent and Fuzzy Systems

Proceedings Article•DOI•

Efficient k-Regret Query Algorithm with Restriction-free Bound for any Dimensionality

[...]

Min Xie¹, Raymond Chi-Wing Wong¹, Jian Li², Cheng Long³, Ashwin Lall⁴ - Show less +1 more•Institutions (4)

Hong Kong University of Science and Technology¹, Tsinghua University², Queen's University Belfast³, Denison University⁴

27 May 2018

TL;DR: A novel algorithm, called SPHERE, is proposed, whose upper bound on the maximum regret ratio is asymptotically optimal and restriction-free for any dimensionality, the best-known result in the literature.

...read moreread less

Abstract: Extracting interesting tuples from a large database is an important problem in multi-criteria decision making. Two representative queries were proposed in the literature: top- k queries and skyline queries. A top- k query requires users to specify their utility functions beforehand and then returns k tuples to the users. A skyline query does not require any utility function from users but it puts no control on the number of tuples returned to users. Recently, a k-regret query was proposed and received attention from the community because it does not require any utility function from users and the output size is controllable, and thus it avoids those deficiencies of top- k queries and skyline queries. Specifically, it returns k tuples that minimize a criterion called the maximum regret ratio . In this paper, we present the lower bound of the maximum regret ratio for the k -regret query. Besides, we propose a novel algorithm, called SPHERE, whose upper bound on the maximum regret ratio is asymptotically optimal and restriction-free for any dimensionality, the best-known result in the literature. We conducted extensive experiments to show that SPHERE performs better than the state-of-the-art methods for the k -regret query.

...read moreread less

Journal Article•DOI•

Scalable Graph Processing Frameworks: A Taxonomy and Open Challenges

[...]

Safiollah Heidari¹, Yogesh Simmhan², Rodrigo N. Calheiros¹, Rajkumar Buyya¹•Institutions (2)

University of Melbourne¹, Indian Institute of Science²

12 Jun 2018-ACM Computing Surveys

TL;DR: A taxonomy of graph processing systems is proposed and existing systems are mapped to this classification, which captures the diversity in programming and computation models, runtime aspects of partitioning and communication, both for in-memory and distributed frameworks.

...read moreread less

Abstract: The world is becoming a more conjunct place and the number of data sources such as social networks, online transactions, web search engines, and mobile devices is increasing even more than had been predicted. A large percentage of this growing dataset exists in the form of linked data, more generally, graphs, and of unprecedented sizes. While today's data from social networks contain hundreds of millions of nodes connected by billions of edges, inter-connected data from globally distributed sensors that forms the Internet of Things can cause this to grow exponentially larger. Although analyzing these large graphs is critical for the companies and governments that own them, big data tools designed for text and tuple analysis such as MapReduce cannot process them efficiently. So, graph distributed processing abstractions and systems are developed to design iterative graph algorithms and process large graphs with better performance and scalability. These graph frameworks propose novel methods or extend previous methods for processing graph data. In this article, we propose a taxonomy of graph processing systems and map existing systems to this classification. This captures the diversity in programming and computation models, runtime aspects of partitioning and communication, both for in-memory and distributed frameworks. Our effort helps to highlight key distinctions in architectural approaches, and identifies gaps for future research in scalable graph systems.

...read moreread less

Proceedings Article•

Learned Cardinalities: Estimating Correlated Joins with Deep Learning

[...]

Andreas Kipf¹, Thomas Kipf², Bernhard Radke¹, Viktor Leis¹, Peter Boncz³, Alfons Kemper¹ - Show less +2 more•Institutions (3)

Technische Universität München¹, University of Amsterdam², Centrum Wiskunde & Informatica³

03 Sep 2018

TL;DR: This work describes a new deep learning approach to cardinality estimation that builds on sampling-based estimation, addressing its weaknesses when no sampled tuples qualify a predicate, and in capturing join-crossing correlations.

...read moreread less

Abstract: We describe a new deep learning approach to cardinality estimation. MSCN is a multi-set convolutional network, tailored to representing relational query plans, that employs set semantics to capture query features and true cardinalities. MSCN builds on sampling-based estimation, addressing its weaknesses when no sampled tuples qualify a predicate, and in capturing join-crossing correlations. Our evaluation of MSCN using a real-world dataset shows that deep learning signiicantly enhances the quality of cardinality estimation, which is the core problem in query optimization.

...read moreread less

Proceedings Article•DOI•

Incremental View Maintenance with Triple Lock Factorization Benefits

[...]

Milos Nikolic¹, Dan Olteanu¹•Institutions (1)

University of Oxford¹

27 May 2018

TL;DR: F-IVM as mentioned in this paper is a unified incremental view maintenance (IVM) approach for a variety of tasks, including gradient computation for learning linear regression models over joins, matrix chain multiplication, and factorized evaluation of conjunctive queries.

...read moreread less

Abstract: We introduce F-IVM, a unified incremental view maintenance (IVM) approach for a variety of tasks, including gradient computation for learning linear regression models over joins, matrix chain multiplication, and factorized evaluation of conjunctive queries. F-IVM is a higher-order IVM algorithm that reduces the maintenance of the given task to the maintenance of a hierarchy of increasingly simpler views. The views are functions mapping keys, which are tuples of input data values, to payloads, which are elements from a task-specific ring. Whereas the computation over the keys is the same for all tasks, the computation over the payloads depends on the task. F-IVM achieves efficiency by factorizing the computation of the keys, payloads, and updates. We implemented F-IVM as an extension of DBToaster. We show in a range of scenarios that it can outperform classical first-order IVM, DBToaster's fully recursive higher-order IVM, and plain recomputation by orders of magnitude while using less memory.

...read moreread less

Proceedings Article•DOI•

Computing Optimal Repairs for Functional Dependencies

[...]

Ester Livshits¹, Benny Kimelfeld¹, Sudeepa Roy²•Institutions (2)

Technion – Israel Institute of Technology¹, Duke University²

27 May 2018

TL;DR: In this paper, the complexity of computing an optimal repair of an inconsistent database, in the case where integrity constraints are functional dependencies (FDs), was investigated. And the authors established a dichotomy in complexity of finding a "most probable database" that satisfies a set of FDs with a single attribute on the left hand side.

...read moreread less

Abstract: We investigate the complexity of computing an optimal repair of an inconsistent database, in the case where integrity constraints are Functional Dependencies (FDs). We focus on two types of repairs: an optimal subset repair (optimal S-repair) that is obtained by a minimum number of tuple deletions, and an optimal update repair (optimal U-repair) that is obtained by a minimum number of value (cell) updates. For computing an optimal S-repair, we present a polynomial-time algorithm that succeeds on certain sets of FDs and fails on others. We prove the following about the algorithm. When it succeeds, it can also incorporate weighted tuples and duplicate tuples. When it fails, the problem is NP-hard, and in fact, APX-complete (hence, cannot be approximated better than some constant). Thus, we establish a dichotomy in the complexity of computing an optimal S-repair. We present general analysis techniques for the complexity of computing an optimal U-repair, some based on the dichotomy for S-repairs. We also draw a connection to a past dichotomy in the complexity of finding a "most probable database" that satisfies a set of FDs with a single attribute on the left hand side; the case of general FDs was left open, and we show how our dichotomy provides the missing generalization and thereby settles the open problem.

...read moreread less

Journal Article•DOI•

SkinnerDB: regret-bounded query evaluation via reinforcement learning

[...]

Immanuel Trummer¹, Samuel Moseley¹, Deepak Maram¹, Saehan Jo¹, Joseph Antonakakis¹ - Show less +1 more•Institutions (1)

Cornell University¹

01 Aug 2018

TL;DR: This work presents SkinnerDB, a novel database management system that is designed from the ground up for reliable optimization and robust performance, and it is claimed that its execution strategies are the first to provide comparable formal guarantees.

...read moreread less

Abstract: Robust query optimization becomes illusory in the presence of correlated predicates or user-defined functions. Occasionally, the query optimizer will choose join orders whose execution time is by many orders of magnitude higher than necessary. We present SkinnerDB, a novel database management system that is designed from the ground up for reliable optimization and robust performance.SkinnerDB implements several adaptive query processing strategies based on reinforcement learning. We divide the execution of a query into small time periods in which different join orders are executed. Thereby, we converge to optimal join orders with regret bounds, meaning that the expected difference between actual execution time and time for an optimal join order is bounded. To the best of our knowledge, our execution strategies are the first to provide comparable formal guarantees. SkinnerDB can be used as a layer on top of any existing database management system. We use optimizer hints to force existing systems to try out different join orders, carefully restricting execution time per join order and data batch via timeouts. We choose timeouts according to an iterative scheme that balances execution time over different timeouts to guarantee bounded regret. Alternatively, SkinnerDB can be used as a standalone, featuring an execution engine that is tailored to the requirements of join order learning. In particular, we use a specialized multi-way join algorithm and a concise tuple representation to facilitate fast switches between join orders. In our demonstration, we let participants experiment with different query types and databases. We visualize the learning process and compare against baselines.

...read moreread less

Journal Article•DOI•

Cost-effective data annotation using game-based crowdsourcing

[...]

Jingru Yang¹, Ju Fan¹, Zhewei Wei¹, Guoliang Li², Tongyu Liu¹, Xiaoyong Du¹ - Show less +2 more•Institutions (2)

Renmin University of China¹, Tsinghua University²

01 Sep 2018

TL;DR: This paper focuses on the labeling rule generation problem that aims to generate high-quality rules to largely reduce the labeling cost while preserving quality, and generates candidate rules and devise a game-based crowdsourcing approach to select high- quality rules by considering coverage and precision.

...read moreread less

Abstract: Large-scale data annotation is indispensable for many applications, such as machine learning and data integration. However, existing annotation solutions either incur expensive cost for large datasets or produce noisy results. This paper introduces a cost-effective annotation approach, and focuses on the labeling rule generation problem that aims to generate high-quality rules to largely reduce the labeling cost while preserving quality. To address the problem, we first generate candidate rules, and then devise a game-based crowdsourcing approach CROWDGAME to select high-quality rules by considering coverage and precision. CROWDGAME employs two groups of crowd workers: one group answers rule validation tasks (whether a rule is valid) to play a role of rule generator, while the other group answers tuple checking tasks (whether the annotated label of a data tuple is correct) to play a role of rule refuter. We let the two groups play a two-player game: rule generator identifies high-quality rules with large coverage and precision, while rule refuter tries to refute its opponent rule generator by checking some tuples that provide enough evidence to reject rules covering the tuples. This paper studies the challenges in CROWDGAME. The first is to balance the trade-off between coverage and precision. We define the loss of a rule by considering the two factors. The second is rule precision estimation. We utilize Bayesian estimation to combine both rule validation and tuple checking tasks. The third is to select crowdsourcing tasks to fulfill the game-based framework for minimizing the loss. We introduce a minimax strategy and develop efficient task selection algorithms. We conduct experiments on entity matching and relation extraction, and the results show that our method outperforms state-of-the-art solutions.

...read moreread less

Proceedings Article•DOI•

Joining Extractions of Regular Expressions

[...]

Dominik D. Freydenberger¹, Benny Kimelfeld², Liat Peterfreund²•Institutions (2)

Loughborough University¹, Technion – Israel Institute of Technology²

27 May 2018

TL;DR: It is shown that the lower bounds (NP-completeness and W[1]-hardness) from the relational world also hold in the setting; in particular, hardness hits already single-character text.

...read moreread less

Abstract: Regular expressions with capture variables, also known as "regex formulas,'' extract relations of spans (interval positions) from text. These relations can be further manipulated via the relational Algebra as studied in the context of "document spanners," Fagin et al.'s formal framework for information extraction. We investigate the complexity of querying text by Conjunctive Queries (CQs) and Unions of CQs (UCQs) on top of regex formulas. Such queries have been investigated in prior work on document spanners, but little is known about the (combined) complexity of their evaluation. We show that the lower bounds (NP-completeness and W[1]-hardness) from the relational world also hold in our setting; in particular, hardness hits already single-character text. Yet, the upper bounds from the relational world do not carry over. Unlike the relational world, acyclic CQs, and even gamma-acyclic CQs, are hard to compute. The source of hardness is that it may be intractable to instantiate the relation defined by a regex formula, simply because it has an exponential number of tuples. Yet, we are able to establish general upper bounds. In particular, UCQs can be evaluated with polynomial delay, provided that every CQ has a bounded number of atoms (while unions and projection can be arbitrary). Furthermore, UCQ evaluation is solvable with FPT (Fixed-Parameter Tractable) delay when the parameter is the size of the UCQ.

...read moreread less

Posted Content•

Graphene: Semantically-Linked Propositions in Open Information Extraction

[...]

Matthias Cetto¹, Christina Niklaus¹, André Freitas, Siegfried Handschuh²•Institutions (2)

University of St. Gallen¹, University of Passau²

30 Jul 2018-arXiv: Computation and Language

TL;DR: In a comparative evaluation, the reference implementation Graphene outperforms state-of-the-art Open IE systems in the construction of correct n-ary predicate-argument structures and it is shown that existing Open IE approaches can benefit from the transformation process of this framework.

...read moreread less

Abstract: We present an Open Information Extraction (IE) approach that uses a two-layered transformation stage consisting of a clausal disembedding layer and a phrasal disembedding layer, together with rhetorical relation identification. In that way, we convert sentences that present a complex linguistic structure into simplified, syntactically sound sentences, from which we can extract propositions that are represented in a two-layered hierarchy in the form of core relational tuples and accompanying contextual information which are semantically linked via rhetorical relations. In a comparative evaluation, we demonstrate that our reference implementation Graphene outperforms state-of-the-art Open IE systems in the construction of correct n-ary predicate-argument structures. Moreover, we show that existing Open IE approaches can benefit from the transformation process of our framework.

...read moreread less

Journal Article•DOI•

Answering FO+MOD Queries under Updates on Bounded Degree Databases

[...]

Christoph Berkholz¹, Jens Keppeler¹, Nicole Schweikardt¹•Institutions (1)

Humboldt University of Berlin¹

22 Aug 2018-ACM Transactions on Database Systems

TL;DR: The query evaluation problem for fixed queries over fully dynamic databases, where tuples can be inserted or deleted, is investigated and a data structure is constructed that allows to answer a Boolean FO+MOD query and to compute the size of the result of a non-Boolean query within constant time after every database update.

...read moreread less

Abstract: We investigate the query evaluation problem for fixed queries over fully dynamic databases, where tuples can be inserted or deleted. The task is to design a dynamic algorithm that immediately reports the new result of a fixed query after every database update.We consider queries in first-order logic (FO) and its extension with modulo-counting quantifiers (FO+MOD) and show that they can be efficiently evaluated under updates, provided that the dynamic database does not exceed a certain degree bound.In particular, we construct a data structure that allows us to answer a Boolean FO+MOD query and to compute the size of the result of a non-Boolean query within constant time after every database update. Furthermore, after every database update, we can update the data structure in constant time such that afterwards we are able to test within constant time for a given tuple whether or not it belongs to the query result, to enumerate all tuples in the new query result, and to enumerate the difference between the old and the new query result with constant delay between the output tuples. The preprocessing time needed to build the data structure is linear in the size of the database.Our results extend earlier work on the evaluation of first-order queries on static databases of bounded degree and rely on an effective Hanf normal form for FO+MOD recently obtained by Heimberg, Kuske, and Schweikardt (LICS 2016).

...read moreread less

Journal Article•DOI•

Trajectory Mining Using Uncertain Sensor Data

[...]

Muhammad Muzammal¹, Moneeb Gohar², Arif Ur Rahman², Qiang Qu¹, Awais Ahmad³, Gwanggil Jeon⁴ - Show less +2 more•Institutions (4)

Chinese Academy of Sciences¹, Bahria University², Yeungnam University³, Incheon National University⁴

01 Jan 2018-IEEE Access

TL;DR: The results show that the trajectories could be modeled and worked as probabilistic data and that the results could be computed efficiently using dynamic programming.

...read moreread less

Abstract: Trajectory mining is an interesting data mining problem. Traditionally, it is either assumed that the time-ordered location data recorded as trajectories are either deterministic or that the uncertainty, e.g., due to equipment or technological limitations, is removed by incorporating some pre-processing routines. Thus, the trajectories are processed as deterministic paths of mobile object location data. However, it is important to understand that the transformation from uncertain to deterministic trajectory data may result in the loss of information about the level of confidence in the recorded events. Probabilistic databases offer ways to model uncertainties using possible world semantics. In this paper, we consider uncertain sensor data and transform this to probabilistic trajectory data using pre-processing routines. Next, we model this data as tuple level uncertain data and propose dynamic programming-based algorithms to mine interesting trajectories. A comprehensive empirical study is performed to evaluate the effectiveness of the approach. The results show that the trajectories could be modeled and worked as probabilistic data and that the results could be computed efficiently using dynamic programming.

...read moreread less

Journal Article•DOI•

Towards Max-Min Fair Resource Allocation for Stream Big Data Analytics in Shared Clouds

[...]

Yuxuan Jiang¹, Zhe Huang², Danny H. K. Tsang¹•Institutions (2)

Hong Kong University of Science and Technology¹, Princeton University²

01 Mar 2018-IEEE Transactions on Big Data

TL;DR: Simulations show that the proposed resource allocation scheme remarkably improves the max-min fairness in utilities of the topology throughput, and is low in computational complexity.

...read moreread less

Abstract: Distributed stream big data analytics platforms have emerged to tackle the continuously generated data streams. In stream big data analytics, the data processing workflow is abstracted as a directed graph referred to as a topology. Data are read from the storage and processed tuple by tuple, and these processing results are updated dynamically. The performance of a topology is evaluated by its throughput. This paper proposes an efficient resource allocation scheme for a heterogeneous stream big data analytics cluster shared by multiple topologies, in order to achieve max-min fairness in the utilities of the throughput for all the topologies. We first formulate a novel resource allocation problem, which is a mixed 0-1 integer program. The NP-hardness of the problem is rigorously proven. To tackle this problem, we transform the non-convex constraint to several linear constraints using linearization and reformulation techniques. Based on the analysis of the problem-specific structure and characteristics, we propose an approach that iteratively solves the continuous problem with a fixed set of discrete variables optimally, and updates the discrete variables heuristically. Simulations show that our proposed resource allocation scheme remarkably improves the max-min fairness in utilities of the topology throughput, and is low in computational complexity.

...read moreread less

Journal Article•DOI•

Julia subtyping: a rational reconstruction

[...]

Francesco Zappa Nardelli¹, Julia Belyakova², Artem Pelenitsyn², Benjamin Chung³, Jeff Bezanson, Jan Vitek² - Show less +2 more•Institutions (3)

French Institute for Research in Computer Science and Automation¹, Czech Technical University in Prague², Northeastern University³

24 Oct 2018

TL;DR: This paper provides the first formal definition of Julia's subtype relation and motivates its design and validates the specification empirically with an implementation of the definition that is compared against the existing Julia implementation on a collection of real-world programs.

...read moreread less

Abstract: Programming languages that support multiple dispatch rely on an expressive notion of subtyping to specify method applicability. In these languages, type annotations on method declarations are used to select, out of a potentially large set of methods, the one that is most appropriate for a particular tuple of arguments. Julia is a language for scientific computing built around multiple dispatch and an expressive subtyping relation. This paper provides the first formal definition of Julia's subtype relation and motivates its design. We validate our specification empirically with an implementation of our definition that we compare against the existing Julia implementation on a collection of real-world programs. Our subtype implementation differs on 122 subtype tests out of 6,014,476. The first 120 differences are due to a bug in Julia that was fixed once reported; the remaining 2 are under discussion.

...read moreread less

Journal Article•DOI•

Graded modalities in Strategy Logic

[...]

Benjamin Aminof¹, Vadim Malvone², Aniello Murano², Sasha Rubin²•Institutions (2)

Vienna University of Technology¹, University of Naples Federico II²

01 Aug 2018-Information & Computation

TL;DR: This paper introduces Graded Strategy Logic, an extension of SL by graded quantifiers over tuples of strategy variables, and proves that the model-checking problem of Graded SL is decidable, which is not harder than merely checking for the existence of such an equilibrium.

...read moreread less

Abstract: Strategy Logic (SL) is a logical formalism for strategic reasoning in multi-agent systems. Its main feature is that it has variables for strategies that are associated to specific agents using a binding operator. In this paper we introduce Graded Strategy Logic ( Graded SL), an extension of SL by graded quantifiers over tuples of strategy variables, i.e., “there exist at least g different tuples ( x 1 , . . . , x n ) of strategies” where g is a cardinal from the set N ∪ { ℵ 0 , ℵ 1 , 2 ℵ 0 } . We prove that the model-checking problem of Graded SL is decidable. We then turn to the complexity of fragments of Graded SL. When the g's are restricted to finite cardinals, written Graded N SL, the complexity of model-checking is no harder than for SL, i.e., it is non-elementary in the quantifier-block rank. We illustrate our formalism by showing how to count the number of different strategy profiles that are Nash equilibria (NE). By analysing the structure of the specific formulas involved, we conclude that the important problem of checking for the existence of a unique NE can be solved in 2 ExpTime , which is not harder than merely checking for the existence of such an equilibrium.

...read moreread less

Patent•

Neural paraphrase generator

[...]

Jochen L. Leidner¹, Plachouras Vasileios², Fabio Petroni¹•Institutions (2)

Thomson Reuters¹, Reuters²

15 Nov 2018

TL;DR: In this article, a neural paraphrase generator receives a sequence of tuples comprising a source sequence of words, each tuple comprising word data element and structured tag element representing a linguistic attribute about the word data elements.

...read moreread less

Abstract: A neural paraphrase generator receives a sequence of tuples comprising a source sequence of words, each tuple comprising word data element and structured tag element representing a linguistic attribute about the word data element. An RNN encoder receives a sequence of vectors representing a source sequence of words, and RNN decoder predicts a probability of a target sequence of words representing a target output sentence based on a recurrent state in the decoder. An input composition component includes a word embedding matrix and a tag embedding matrix transforms the input sequence of tuples into a sequence of vectors. An output decomposition component outputs a target sequence of tuples representing predicted words and structured tag elements, the probability of each single tuple from the output is predicted based on a recurrent state of the decoder.

...read moreread less

Journal Article•DOI•

Effective and complete discovery of bidirectional order dependencies via set-based axioms

[...]

Jaroslaw Szlichta¹, Parke Godfrey², Lukasz Golab³, Mehdi Kargar⁴, Divesh Srivastava⁵ - Show less +1 more•Institutions (5)

University of Ontario Institute of Technology¹, York University², University of Waterloo³, Ryerson University⁴, AT&T Labs⁵

28 Jun 2018

TL;DR: This work presents an efficient biddirectional OD discovery algorithm enabled by a novel polynomial mapping to a canonical form, and a sound and complete set of axioms for canonical bidirectional ODs to prune the search space, and proves that it produces a complete and minimal set of bidirectionAL ODs.

...read moreread less

Abstract: Integrity constraints (ICs) are useful for expressing and enforcing application semantics. Formulating ICs manually, however, requires domain expertise, is prone to human error, and can be exceedingly time-consuming. Thus, methods for automatic discovery have been developed for some classes of ICs, such as functional dependencies (FDs), and recently, order dependencies (ODs). ODs properly subsume FDs and can express business rules involving order; e.g., an employee who pays higher taxes has a higher salary than another employee. Bidirectional ODs further allow different ordering directions, ascending and descending, as in SQL’s order-by; e.g., a student with an alphabetically lower letter grade has a higher percentage grade than another student. We address the limitations of prior work on automatic OD discovery, which has factorial complexity, is incomplete, and is not concise. We present an efficient bidirectional OD discovery algorithm enabled by a novel polynomial mapping to a canonical form, and a sound and complete set of axioms for canonical bidirectional ODs to prune the search space. Our algorithm has exponential worst-case time complexity in the number of attributes and linear complexity in the number of tuples. We prove that it produces a complete and minimal set of bidirectional ODs, and we experimentally show orders of magnitude performance improvements over the prior state-of-the-art methodologies.

...read moreread less

Proceedings Article•DOI•

Estimating Rule Quality for Knowledge Base Completion with the Relationship between Coverage Assumption

[...]

Kaja Zupanc¹, Jesse Davis²•Institutions (2)

University of Ljubljana¹, Katholieke Universiteit Leuven²

10 Apr 2018

TL;DR: This work proposes a novel score function for evaluating the quality of a first-order rule learned from a knowledge base, and attempts to include information about the tuples not in the KB when evaluating thequality of a potential rule.

...read moreread less

Abstract: Currently, there are many large, automatically constructed knowledge bases (KBs). One interesting task is learning from a knowledge base to generate new knowledge either in the form of inferred facts or rules that define regularities. One challenge for learning is that KBs are necessarily open world: we cannot assume anything about the truth values of tuples not included in the KB. When a KB only contains facts (i.e., true statements), which is typically the case, we lack negative examples, which are often needed by learning algorithms. To address this problem, we propose a novel score function for evaluating the quality of a first-order rule learned from a KB. Our metric attempts to include information about the tuples not in the KB when evaluating the quality of a potential rule. Empirically, we find that our metric results in more precise predictions than previous approaches.

...read moreread less

Proceedings Article•DOI•

Open Information Extraction with Meta-pattern Discovery in Biomedical Literature

[...]

Xuan Wang¹, Yu Zhang¹, Qi Li¹, Yinyin Chen¹, Jiawei Han¹ - Show less +1 more•Institutions (1)

University of Illinois at Urbana–Champaign¹

15 Aug 2018

TL;DR: A novel framework CPIE (Clause+Pattern-guided Information Extraction) is proposed that incorporates clause extraction and meta-pattern discovery to extract structured relation tuples with little supervision and shows great potential in effectively dealing with real-world biomedical literature with complicated sentence structures and rich information.

...read moreread less

Abstract: Biomedical open information extraction (BioOpenIE) is a novel paradigm to automatically extract structured information from unstructured text with no or little supervision. It does not require any pre-specified relation types but aims to extract all the relation tuples from the corpus. A major challenge for open information extraction (OpenIE) is that it produces massive surface-name formed relation tuples that cannot be directly used for downstream applications. We propose a novel framework CPIE (Clause+Pattern-guided Information Extraction) that incorporates clause extraction and meta-pattern discovery to extract structured relation tuples with little supervision. Compared with previous OpenIE methods, CPIE produces massive but more structured output that can be directly used for downstream applications. We first detect short clauses from input sentences. Then we extract quality textual patterns and perform synonymous pattern grouping to identify relation types. Last, we obtain the corresponding relation tuples by matching each quality pattern in the text. Experiments show that CPIE achieves the highest precision in comparison with state-of-the-art OpenIE baselines, and also keeps the distinctiveness and simplicity of the extracted relation tuples. CPIE shows great potential in effectively dealing with real-world biomedical literature with complicated sentence structures and rich information.

...read moreread less

Proceedings Article•DOI•

Efficient Generation of Reliable Estimated Linguistic Summaries

[...]

Grégory Smits¹, Pierre Nerzic¹, Olivier Pivert¹, Marie-Jeanne Lesot²•Institutions (2)

University of Rennes¹, Paris-Sorbonne University²

08 Jul 2018

TL;DR: It is shown that reliable summaries can be very efficiently estimated based on these statistics only and without any costly data access and is provided the first linguistic summarization approach whose processing time does not depend on the size of the dataset.

...read moreread less

Abstract: Summarizing data with linguistic statements is a crucial and topical issue that has been largely addressed by the soft computing community. The goal of summarization is to generate statements that linguistically describe the properties observed in a dataset. This paper addresses the issue of efficiently extracting these summaries and rendering them to the final user, in the case where the data to be summarized are stored in a relational data base: it proposes a novel strategy that leverages the statistics about the data distribution maintained by the database system. This paper shows that reliable summaries can be very efficiently estimated based on these statistics only and without any costly data access. Additionally, it proposes a visualization of the set of extracted summaries that offers a fruitful interactive exploration tool to the user. Experiments performed on two real data bases show the relevance and efficiency of the proposed approach: with a negligible loss of accuracy, we provide the first linguistic summarization approach whose processing time does not depend on the size of the dataset. The generation of estimated linguistic summaries takes less than one second even for dataset containing millions of tuples.

...read moreread less

Collapse