Showing papers on "Tuple published in 2006"

PDF

Open Access

Proceedings Article•DOI•

Integrating compression and execution in column-oriented database systems

[...]

Daniel J. Abadi¹, Samuel Madden¹, Miguel Ferreira¹•Institutions (1)

27 Jun 2006

TL;DR: This paper shows how compression schemes not traditionally used in row-oriented DBMSs can be applied to column-oriented systems and evaluates a set of compression schemes and shows that the best scheme depends not only on the properties of the data but also on the nature of the query workload.

...read moreread less

Abstract: Column-oriented database system architectures invite a re-evaluation of how and when data in databases is compressed. Storing data in a column-oriented fashion greatly increases the similarity of adjacent records on disk and thus opportunities for compression. The ability to compress many adjacent tuples at once lowers the per-tuple cost of compression, both in terms of CPU and space overheads.In this paper, we discuss how we extended C-Store (a column-oriented DBMS) with a compression sub-system. We show how compression schemes not traditionally used in row-oriented DBMSs can be applied to column-oriented systems. We then evaluate a set of compression schemes and show that the best scheme depends not only on the properties of the data but also on the nature of the query workload.

...read moreread less

663 citations

Proceedings Article•DOI•

A Primitive Operator for Similarity Joins in Data Cleaning

[...]

Surajit Chaudhuri¹, Venkatesh Ganti¹, Raghav Kaushik¹•Institutions (1)

Microsoft¹

03 Apr 2006

TL;DR: This paper proposes a new primitive operator which can be used as a foundation to implement similarity joins according to a variety of popular string similarity functions, and notions of similarity which go beyond textual similarity.

...read moreread less

Abstract: Data cleaning based on similarities involves identification of "close" tuples, where closeness is evaluated using a variety of similarity functions chosen to suit the domain and application. Current approaches for efficiently implementing such similarity joins are tightly tied to the chosen similarity function. In this paper, we propose a new primitive operator which can be used as a foundation to implement similarity joins according to a variety of popular string similarity functions, and notions of similarity which go beyond textual similarity. We then propose efficient implementations for this operator. In an experimental evaluation using real datasets, we show that the implementation of similarity joins using our operator is comparable to, and often substantially better than, previous customized implementations for particular similarity functions.

...read moreread less

621 citations

Journal Article•DOI•

A new version of 2-tuple fuzzy linguistic representation model for computing with words

[...]

Jin-Hsien Wang, Jongyun Hao¹•Institutions (1)

United States Naval Academy¹

01 Nov 2006-IEEE Transactions on Fuzzy Systems

TL;DR: A new (proportional) 2-tuple fuzzy linguistic representation model for computing with words (CW), which is based on the concept of "symbolic proportion," which provides an opportunity to describe the initial linguistic information by members of a "continuous" linguistic scale domain which does not necessarily require the ordered linguistic terms of a linguistic variable being equidistant.

...read moreread less

Abstract: In this paper, we provide a new (proportional) 2-tuple fuzzy linguistic representation model for computing with words (CW), which is based on the concept of "symbolic proportion." This concept motivates us to represent the linguistic information by means of 2-tuples, which are composed by two proportional linguistic terms. For clarity and generality, we first study proportional 2-tuples under ordinal contexts. Then, under linguistic contexts and based on canonical characteristic values (CCVs) of linguistic labels, we define many aggregation operators to handle proportional 2-tuple linguistic information in a computational stage for CW without any loss of information. Our approach for this proportional 2-tuple fuzzy linguistic representation model deals with linguistic labels, which do not have to be symmetrically distributed around a medium label and without the traditional requirement of having "equal distance" between them. Moreover, this new model not only provides a space to allow a "continuous" interpolation of a sequence of ordered linguistic labels, but also provides an opportunity to describe the initial linguistic information by members of a "continuous" linguistic scale domain which does not necessarily require the ordered linguistic terms of a linguistic variable being equidistant. Meanwhile, under the assumption of equally informative (which is defined by a condition based on the concept of CCV), we show that our model reduces to Herrera and Mart/spl inodot//spl acute/nez's (translational) 2-tuple fuzzy linguistic representation model.

...read moreread less

467 citations

Proceedings Article•DOI•

Effective keyword search in relational databases

[...]

Fang Liu¹, Clement Yu¹, Weiyi Meng², Abdur Chowdhury•Institutions (2)

University of Illinois at Chicago¹, Binghamton University²

27 Jun 2006

TL;DR: This paper proposes a novel IR ranking strategy for effective keyword search and is the first that conducts comprehensive experiments on search effectiveness using a real world database and a set of keyword queries collected by a major search company.

...read moreread less

Abstract: With the amount of available text data in relational databases growing rapidly, the need for ordinary users to search such information is dramatically increasing. Even though the major RDBMSs have provided full-text search capabilities, they still require users to have knowledge of the database schemas and use a structured query language to search information. This search model is complicated for most ordinary users. Inspired by the big success of information retrieval (IR) style keyword search on the web, keyword search in relational databases has recently emerged as a new research topic. The differences between text databases and relational databases result in three new challenges: (1) Answers needed by users are not limited to individual tuples, but results assembled from joining tuples from multiple tables are used to form answers in the form of tuple trees. (2) A single score for each answer (i.e. a tuple tree) is needed to estimate its relevance to a given query. These scores are used to rank the most relevant answers as high as possible. (3) Relational databases have much richer structures than text databases. Existing IR strategies to rank relational outputs are not adequate. In this paper, we propose a novel IR ranking strategy for effective keyword search. We are the first that conducts comprehensive experiments on search effectiveness using a real world database and a set of keyword queries collected by a major search company. Experimental results show that our strategy is significantly better than existing strategies. Our approach can be used both at the application level and be incorporated into a RDBMS to support keyword-based search in relational databases.

...read moreread less

378 citations

Journal Article•DOI•

LIME: A coordination model and middleware supporting mobility of hosts and agents

[...]

Amy L. Murphy¹, Gian Pietro Picco², Gruia-Catalin Roman³•Institutions (3)

University of Lugano¹, Polytechnic University of Milan², Washington University in St. Louis³

01 Jul 2006-ACM Transactions on Software Engineering and Methodology

TL;DR: The model underlying LIME is illustrated, a formal semantic characterization for the operations it makes available to the application developer is provided, its current design and implementation is presented, and lessons learned are discussed in developing applications that involve physical mobility.

...read moreread less

Abstract: LIME (Linda in a mobile environment) is a model and middleware supporting the development of applications that exhibit the physical mobility of hosts, logical mobility of agents, or both. LIME adopts a coordination perspective inspired by work on the Linda model. The context for computation, represented in Linda by a globally accessible persistent tuple space, is refined in LIME to transient sharing of the identically named tuple spaces carried by individual mobile units. Tuple spaces are also extended with a notion of location and programs are given the ability to react to specified states. The resulting model provides a minimalist set of abstractions that facilitates the rapid and dependable development of mobile applications. In this article we illustrate the model underlying LIME, provide a formal semantic characterization for the operations it makes available to the application developer, present its current design and implementation, and discuss lessons learned in developing applications that involve physical mobility.

...read moreread less

284 citations

Proceedings Article•DOI•

Continuous monitoring of top-k queries over sliding windows

[...]

Kyriakos Mouratidis¹, Spiridon Bakiras², Dimitris Papadias²•Institutions (2)

Singapore Management University¹, Hong Kong University of Science and Technology²

27 Jun 2006

TL;DR: This paper presents two processing techniques: the first one computes the new answer of a query whenever some of the current top-k points expire; the second one partially pre-computes the future changes in the result, achieving better running time at the expense of slightly higher space requirements.

...read moreread less

Abstract: Given a dataset P and a preference function f, a top-k query retrieves the k tuples in P with the highest scores according to f. Even though the problem is well-studied in conventional databases, the existing methods are inapplicable to highly dynamic environments involving numerous long-running queries. This paper studies continuous monitoring of top-k queries over a fixed-size window W of the most recent data. The window size can be expressed either in terms of the number of active tuples or time units. We propose a general methodology for top-k monitoring that restricts processing to the sub-domains of the workspace that influence the result of some query. To cope with high stream rates and provide fast answers in an on-line fashion, the data in W reside in main memory. The valid records are indexed by a grid structure, which also maintains book-keeping information. We present two processing techniques: the first one computes the new answer of a query whenever some of the current top-k points expire; the second one partially pre-computes the future changes in the result, achieving better running time at the expense of slightly higher space requirements. We analyze the performance of both algorithms and evaluate their efficiency through extensive experiments. Finally, we extend the proposed framework to other query types and a different data stream model.

...read moreread less

261 citations

Journal Article•

Combinatorial aspects of covering arrays

[...]

Charles J. Colbourn¹•Institutions (1)

Arizona State University¹

01 Jan 2006-Le Matematiche

TL;DR: A combinatorial view of covering arrays is adopted, encompassing basic bounds, direct construction, recursive constructions, algorithmic methods, and applications.

...read moreread less

Abstract: Covering arrays generalize orthogonal arrays by requiring that t -tuples be covered, but not requiring that the appearance of t -tuples be balanced. Their uses in screening experiments has found application in software testing, hardware testing, and a variety of fields in which interactions among factors are to be identified. Here a combinatorial view of covering arrays is adopted, encompassing basic bounds, direct constructions, recursive constructions, algorithmic methods, and applications.

...read moreread less

211 citations

Proceedings Article•DOI•

Clean Answers over Dirty Databases: A Probabilistic Approach

[...]

Periklis Andritsos, Ariel Fuxman¹, Renée J. Miller¹•Institutions (1)

University of Toronto¹

03 Apr 2006

TL;DR: This work rewrite queries over a database containing duplicates to return each answer with the probability that the answer is in the clean database, and experimentally study the performance of the rewritten queries.

...read moreread less

Abstract: The detection of duplicate tuples, corresponding to the same real-world entity, is an important task in data integration and cleaning. While many techniques exist to identify such tuples, the merging or elimination of duplicates can be a difficult task that relies on ad-hoc and often manual solutions. We propose a complementary approach that permits declarative query answering over duplicated data, where each duplicate is associated with a probability of being in the clean database. We rewrite queries over a database containing duplicates to return each answer with the probability that the answer is in the clean database. Our rewritten queries are sensitive to the semantics of duplication and help a user understand which query answers are most likely to be present in the clean database. The semantics that we adopt is independent of the way the probabilities are produced, but is able to effectively exploit them during query answering. In the absence of external knowledge that associates each database tuple with a probability, we offer a technique, based on tuple summaries, that automates this task. We experimentally study the performance of our rewritten queries. Our studies show that the rewriting does not introduce a significant overhead in query execution time. This work is done in the context of the ConQuer project at the University of Toronto, which focuses on the efficient management of inconsistent and dirty databases.

...read moreread less

200 citations

Proceedings Article•

Stochastic Relational Models for Discriminative Link Prediction

[...]

Kai Yu, Wei Chu¹, Shipeng Yu², Volker Tresp², Zhao Xu² - Show less +1 more•Institutions (2)

Columbia University¹, Siemens²

04 Dec 2006

TL;DR: A Gaussian process (GP) framework, stochastic relational models (SRM), for learning social, physical, and other relational phenomena where interactions between entities are observed is introduced and extensions of SRM to general relational learning tasks are discussed.

...read moreread less

Abstract: We introduce a Gaussian process (GP) framework, stochastic relational models (SRM), for learning social, physical, and other relational phenomena where interactions between entities are observed. The key idea is to model the stochastic structure of entity relationships (i.e., links) via a tensor interaction of multiple GPs, each defined on one type of entities. These models in fact define a set of nonparametric priors on infinite dimensional tensor matrices, where each element represents a relationship between a tuple of entities. By maximizing the marginalized likelihood, information is exchanged between the participating GPs through the entire relational network, so that the dependency structure of links is messaged to the dependency of entities, reflected by the adapted GP kernels. The framework offers a discriminative approach to link prediction, namely, predicting the existences, strengths, or types of relationships based on the partially observed linkage network as well as the attributes of entities (if given). We discuss properties and variants of SRM and derive an efficient learning algorithm. Very encouraging experimental results are achieved on a toy problem and a user-movie preference link prediction task. In the end we discuss extensions of SRM to general relational learning tasks.

...read moreread less

186 citations

Proceedings Article•DOI•

Window-aware load shedding for aggregation queries over data streams

[...]

Nesime Tatbul¹, Stan Zdonik¹•Institutions (1)

Brown University¹

01 Sep 2006

TL;DR: A new type of drop operator is introduced, called a "Window Drop", which logically divides the input stream into windows and probabilistically decides which windows to drop, and always delivers subsets of original query answers with minimal degradation in result quality.

...read moreread less

Abstract: Data stream management systems may be subject to higher input rates than their resources can handle. When overloaded, the system must shed load in order to maintain low-latency query results. In this paper, we describe a load shedding technique for queries consisting of one or more aggregate operators with sliding windows. We introduce a new type of drop operator, called a "Window Drop". This operator is aware of the window properties (i.e., window size and window slide) of its downstream aggregate operators in the query plan. Accordingly, it logically divides the input stream into windows and probabilistically decides which windows to drop. This decision is further encoded into tuples by marking the ones that are disallowed from starting new windows. Unlike earlier approaches, our approach preserves integrity of windows throughout a query plan, and always delivers subsets of original query answers with minimal degradation in result quality.

...read moreread less

178 citations

Patent•

Processing XML data stream(s) using continuous queries in a data stream management system

[...]

Zhen Hua Liu¹, Shailendra Mishra¹, Muralidhar Krishnaprasad¹•Institutions (1)

Business International Corporation¹

17 Nov 2006

TL;DR: In this article, a computer is programmed to accept queries over streams of, data structured as per a predetermined syntax (e.g. defined in XML), and the computer is further programmed to execute such queries continually (or periodically) on data streams of tuples containing structured data that conform to the same predetermined syntax.

...read moreread less

Abstract: A computer is programmed to accept queries over streams of, data structured as per a predetermined syntax (e.g. defined in XML). The computer is further programmed to execute such queries continually (or periodically) on data streams of tuples containing structured data that conform to the same predetermined syntax. In many embodiments, the computer includes an engine that exclusively processes only structured data, quickly and efficiently. The computer invokes the structured data engine in two different ways depending on the embodiment: (a) directly on encountering a structured data operator, or (b) indirectly by parsing operands within the structured data operator which contain path expressions, creating a new source to supply scalar data extracted from structured data, and generating additional trees of operators that are natively supported, followed by invoking the structured data engine only when the structured data operator in the query cannot be fully implemented by natively supported operators.

...read moreread less

Journal Article•

Authentication of outsourced databases using signature aggregation and chaining

[...]

Maithili Narasimha¹, Gene Tsudik¹•Institutions (1)

University of California, Irvine¹

01 Jan 2006-Lecture Notes in Computer Science

TL;DR: This work analyzes the new approach for various base query types and compares it with Authenticated Data Structures and points out some possible security flaws in the approach suggested in the recent work of [15].

...read moreread less

Abstract: Database outsourcing is an important emerging trend which involves data owners delegating their data management needs to an external service provider. Since a service provider is almost never fully trusted, security and privacy of outsourced data are important concerns. A core security requirement is the integrity and authenticity of outsourced databases. Whenever someone queries a hosted database, the results must be demonstrably authentic (with respect to the actual data owner) to ensure that the data has not been tampered with. Furthermore, the results must carry a proof of completeness which will allow the querier to verify that the server has not omitted any valid tuples that match the query predicate. Notable prior work ([4][9][15]) focused on various types of Authenticated Data Structures. Another prior approach involved the use of specialized digital signature schemes. In this paper, we extend the state-of-the-art to provide both authenticity and completeness guarantees of query replies. Our work analyzes the new approach for various base query types and compares it with Authenticated Data Structures. We also point out some possible security flaws in the approach suggested in the recent work of [15].

...read moreread less

Proceedings Article•DOI•

SaLSa: computing the skyline without scanning the whole sky

[...]

Ilaria Bartolini¹, Paolo Ciaccia¹, Marco Patella¹•Institutions (1)

University of Bologna¹

06 Nov 2006

TL;DR: SaLSa (Sort and Limit Skyline algorithm), which exploits the sorting machinery of a relational engine to order tuples so that only a subset of them needs to be examined for computing the skyline result.

...read moreread less

Abstract: Skyline queries compute the set of Pareto-optimal tuples in a relation, ie those tuples that are not dominated by any other tuple in the same relation. Although several algorithms have been proposed for efficiently evaluating skyline queries, they either require to extend the relational server with specialized access methods (which is not always feasible) or have to perform the dominance tests on all the tuples in order to determine the result. In this paper we introduce SaLSa (Sort and Limit Skyline algorithm), which exploits the sorting machinery of a relational engine to order tuples so that only a subset of them needs to be examined for computing the skyline result. This makes SaLSa particularly attractive when skyline queries are executed on top of systems that do not understand skyline semantics or when the skyline logic runs on clients with limited power and/or bandwidth.

...read moreread less

Journal Article•DOI•

Probabilistic information retrieval approach for ranking of database query results

[...]

Surajit Chaudhuri¹, Gautam Das², Vagelis Hristidis³, Gerhard Weikum⁴•Institutions (4)

Microsoft¹, University of Texas at Arlington², Florida International University³, Max Planck Society⁴

01 Sep 2006-ACM Transactions on Database Systems

TL;DR: This work presents methodologies to tackle the problem of ranking the answers to a database query when many tuples are returned, by adapting and applying principles of probabilistic models from information retrieval for structured data.

...read moreread less

Abstract: We investigate the problem of ranking the answers to a database query when many tuples are returned In particular, we present methodologies to tackle the problem for conjunctive and range queries, by adapting and applying principles of probabilistic models from information retrieval for structured data Our solution is domain independent and leverages data and workload statistics and correlations We evaluate the quality of our approach with a user survey on a real database Furthermore, we present and experimentally evaluate algorithms to efficiently retrieve the top ranked results, which demonstrate the feasibility of our ranking system

...read moreread less

Journal Article•DOI•

Constraint Models for the Covering Test Problem

[...]

Brahim Hnich¹, Steven Prestwich², Evgeny Selensky, Barbara M. Smith²•Institutions (2)

İzmir University of Economics¹, University College Cork²

01 Jul 2006-Constraints - An International Journal

TL;DR: This paper develops constraint programming models of the problem of finding an optimal covering array and shows that compound variables, representing tuples of variables in the original model, allow the constraints of this problem to be represented more easily and hence propagate better.

...read moreread less

Abstract: Covering arrays can be applied to the testing of software, hardware and advanced materials, and to the effects of hormone interaction on gene expression. In this paper we develop constraint programming models of the problem of finding an optimal covering array. Our models exploit global constraints, multiple viewpoints and symmetry-breaking constraints. We show that compound variables, representing tuples of variables in our original model, allow the constraints of this problem to be represented more easily and hence propagate better. With our best integrated model, we are able to either prove the optimality of existing bounds or find new optimal solutions, for arrays of moderate size. Local search on a SAT-encoding of the model is able to find improved solutions and bounds for larger problems.

...read moreread less

Journal Article•DOI•

A fragile watermarking scheme for detecting malicious modifications of database relations

[...]

Huiping Guo¹, Yingjiu Li², Anyi Liu¹, Sushil Jajodia¹•Institutions (2)

George Mason University¹, Singapore Management University²

01 May 2006-Information Sciences

TL;DR: A novel fragile watermarking scheme is proposed to detect malicious modifications of database relations and shows that the modifications can be detected and localized with high probability, which is demonstrated by the experimental results.

...read moreread less

Proceedings Article•DOI•

Compiled Query Execution Engine using JVM

[...]

Jun Rao¹, Hamid Pirahesh¹, Chandrasekaran Mohan¹, Guy M. Lohman¹•Institutions (1)

IBM¹

03 Apr 2006

TL;DR: Both an interpreted and a compiled query execution engine are developed in a relational, Java-based, in-memory database prototype, and experimental results show that, despite both engines benefiting from JIT, the compiled engine runs on average about twice as fast as the interpreted one, and significantly faster than an in- memory database prototype.

...read moreread less

Abstract: A conventional query execution engine in a database system essentially uses a SQL virtual machine (SVM) to interpret a dataflow tree in which each node is associated with a relational operator. During query evaluation, a single tuple at a time is processed and passed among the operators. Such a model is popular because of its efficiency for pipelined processing. However, since each operator is implemented statically, it has to be very generic in order to deal with all possible queries. Such generality tends to introduce significant runtime inefficiency, especially in the context of memory-resident systems, because the granularity of data commercial system, using SVM. processing (a tuple) is too small compared with the associated overhead. Another disadvantage in such an engine is that each operator code is compiled statically, so query-specific optimization cannot be applied. To improve runtime efficiency, we propose a compiled execution engine, which, for a given query, generates new query-specific code on the fly, and then dynamically compiles and executes the code. The Java platform makes our approach particularly interesting for several reasons: (1) modern Java Virtual Machines (JVM) have Just- In-Time (JIT) compilers that optimize code at runtime based on the execution pattern, a key feature that SVMs lack; (2) because of Javas continued popularity, JVMs keep improving at a faster pace than SVMs, allowing us to exploit new advances in the Java runtime in the future; (3) Java is a dynamic language, which makes it convenient to load a piece of new code on the fly. In this paper, we develop both an interpreted and a compiled query execution engine in a relational, Java-based, in-memory database prototype, and perform an experimental study. Our experimental results on the TPC-H data set show that, despite both engines benefiting from JIT, the compiled engine runs on average about twice as fast as the interpreted one, and significantly faster than an in-memory

...read moreread less

Journal Article•DOI•

Graph kernels and Gaussian processes for relational reinforcement learning

[...]

Kurt Driessens¹, Jan Ramon², Thomas Gärtner•Institutions (2)

University of Waikato¹, Katholieke Universiteit Leuven²

01 Sep 2006

TL;DR: This paper investigates the use of Gaussian processes to approximate the Q-values of state-action pairs in a relational setting and proposes graph kernels as a covariance function between state- action pairs.

...read moreread less

Abstract: RRL is a relational reinforcement learning system based on Q-learning in relational state-action spaces. It aims to enable agents to learn how to act in an environment that has no natural representation as a tuple of constants. For relational reinforcement learning, the learning algorithm used to approximate the mapping between state-action pairs and their so called Q(uality)-value has to be very reliable, and it has to be able to handle the relational representation of state-action pairs. In this paper we investigate the use of Gaussian processes to approximate the Q-values of state-action pairs. In order to employ Gaussian processes in a relational setting we propose graph kernels as a covariance function between state-action pairs. The standard prediction mechanism for Gaussian processes requires a matrix inversion which can become unstable when the kernel matrix has low rank. These instabilities can be avoided by employing QR-factorization. This leads to better and more stable performance of the algorithm and a more efficient incremental update mechanism. Experiments conducted in the blocks world and with the Tetris game show that Gaussian processes with graph kernels can compete with, and often improve on, regression trees and instance based regression as a generalization algorithm for RRL.

...read moreread less

Book Chapter•DOI•

Window specification over data streams

[...]

Kostas Patroumpas¹, Timos Sellis¹•Institutions (1)

National Technical University of Athens¹

26 Mar 2006

TL;DR: This paper describes a formal framework for expressing windows in continuous queries over data streams, and proposes formal definitions for the windowed analogs of typical relational operators, such as join, union or aggregation.

...read moreread less

Abstract: Several query languages have been proposed for managing data streams in modern monitoring applications. Continuous queries expressed in these languages usually employ windowing constructs in order to extract finite portions of the potentially unbounded stream. Explicitly or not, window specifications rely on ordering. Usually, timestamps are attached to all tuples flowing into the system as a means to provide ordered access to data items. Several window types have been implemented in stream prototype systems, but a precise definition of their semantics is still lacking. In this paper, we describe a formal framework for expressing windows in continuous queries over data streams. After classifying windows according to their basic characteristics, we give algebraic expressions for the most significant window types commonly appearing in applications. As an essential step towards a stream algebra, we then propose formal definitions for the windowed analogs of typical relational operators, such as join, union or aggregation, and we identify several properties useful to query optimization.

...read moreread less

Journal Article•DOI•

Efficient classification across multiple database relations: a CrossMine approach

[...]

Xiaoxin Yin¹, Jiawei Han¹, Jiong Yang², Philip S. Yu•Institutions (2)

University of Illinois at Urbana–Champaign¹, Case Western Reserve University²

01 Jun 2006-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This paper proposes a new approach, called CrossMine, which includes a set of novel and powerful methods for multirelational classification, including 1) tuple ID propagation, an efficient and flexible method for virtually joining relations, which enables convenient search among different relations, and new definitions for predicates and decision-tree nodes.

...read moreread less

Abstract: Relational databases are the most popular repository for structured data, and is thus one of the richest sources of knowledge in the world. In a relational database, multiple relations are linked together via entity-relationship links. Multirelational classification is the procedure of building a classifier based on information stored in multiple relations and making predictions with it. Existing approaches of inductive logic programming (recently, also known as relational mining) have proven effective with high accuracy in multirelational classification. Unfortunately, most of them suffer from scalability problems with regard to the number of relations in databases. In this paper, we propose a new approach, called CrossMine, which includes a set of novel and powerful methods for multirelational classification, including 1) tuple ID propagation, an efficient and flexible method for virtually joining relations, which enables convenient search among different relations, 2) new definitions for predicates and decision-tree nodes, which involve aggregated information to provide essential statistics for classification, and 3) a selective sampling method for improving scalability with regard to the number of tuples. Based on these techniques, we propose two scalable and accurate methods for multirelational classification: CrossMine-Rule, a rule-based method and CrossMine-Tree, a decision-tree-based method. Our comprehensive experiments on both real and synthetic data sets demonstrate the high scalability and accuracy of the CrossMine approach

...read moreread less

Proceedings Article•DOI•

Extending RDBMSs To Support Sparse Datasets Using An Interpreted Attribute Storage Format

[...]

Jennifer L. Beckmann¹, Alan Halverson¹, Rajasekar Krishnamurthy², Jeffrey F. Naughton¹•Institutions (2)

University of Wisconsin-Madison¹, IBM²

03 Apr 2006

TL;DR: This paper argues that the proper way to handle sparse data is not to use a vertical schema, but rather to extend the RDBMS tuple storage format to allow the representation of sparse attributes as interpreted fields, and shows that the interpreted storage approach dominates in query efficiency and ease-of-use over the current horizontal storage and vertical schema approaches over a wide range of queries.

...read moreread less

Abstract: "Sparse" data, in which relations have many attributes that are null for most tuples, presents a challenge for relational database management systems. If one uses the normal "horizontal" schema to store such data sets in any of the three leading commercial RDBMS, the result is tables that occupy vast amounts of storage, most of which is devoted to nulls. If one attempts to avoid this storage blowup by using a "vertical" schema, the storage utilization is indeed better, but query performance is orders of magnitude slower for certain classes of queries. In this paper, we argue that the proper way to handle sparse data is not to use a vertical schema, but rather to extend the RDBMS tuple storage format to allow the representation of sparse attributes as interpreted fields. The addition of interpreted storage allows for efficient and transparent querying of sparse data, uniform access to all attributes, and schema scalability. We show, through an implementation in PostgreSQL, that the interpreted storage approach dominates in query efficiency and ease-of-use over the current horizontal storage and vertical schema approaches over a wide range of queries and sparse data sets.

...read moreread less

Journal Article•

Data Fusion in Three Steps: Resolving Schema, Tuple, and Value Inconsistencies.

[...]

Felix Naumann, Alexander Bilke, Jens Bleiholder, Melanie Weis

01 Jan 2006-IEEE Data(base) Engineering Bulletin

Proceedings Article•DOI•

Revision Processing in a Stream Processing Engine: A High-Level Design

[...]

Esther Ryvkina¹, Anurag S. Maskey¹, Mitch Cherniack¹, Stanley B. Zdonik²•Institutions (2)

Brandeis University¹, Brown University²

03 Apr 2006

TL;DR: This work states that any stream processing engine should process revision inputs by generating revision outputs that correct previous query results, and knows of no stream processing system that presently has this capability.

...read moreread less

Abstract: Data stream processing systems have become ubiquitous in academic [1, 2, 5, 6] and commercial [11] sectors, with application areas that include financial services, network traffic analysis, battlefield monitoring and traffic control [3]. The append-only model of streams implies that input data is immutable and therefore always correct. But in practice, streaming data sources often contend with noise (e.g., embedded sensors) or data entry errors (e.g., financial data feeds) resulting in erroneous inputs and therefore, erroneous query results. Many data stream sources (e.g., commercial ticker feeds) issue "revision tuples" (revisions) that amend previously issued tuples (e.g. erroneous share prices). Ideally, any stream processing engine should process revision inputs by generating revision outputs that correct previous query results. We know of no stream processing system that presently has this capability.

...read moreread less

Answering Vague Queries in Fuzzy DL-Lite

[...]

Umberto Straccia¹•Institutions (1)

National Research Council¹

01 Jan 2006

TL;DR: This paper shows how to compute efficiently the top-k answers of a complex query (i.e. conjunctive queries) over a huge set of instances.

...read moreread less

Abstract: Fuzzy Description Logics (fuzzy DLs) allow to describe structured knowledge with vague concepts. Unlike classical DLs, in fuzzy DLs an answer is a set of tuples ranked according to the degree they satisfy the query. In this paper, we consider fuzzy DL-Lite. We show how to compute efficiently the top-k answers of a complex query (i.e. conjunctive queries) over a huge set of instances.

...read moreread less

Proceedings Article•DOI•

Towards robust indexing for ranked queries

[...]

Dong Xin¹, Chen Chen¹, Jiawei Han¹•Institutions (1)

University of Illinois at Urbana–Champaign¹

01 Sep 2006

TL;DR: This paper studies the sequentially layered indexing problem where tuples are put into multiple consecutive layers and any top-k query can be answered by at most at most three layers of tuples, and proposes a new criterion for building the layered index.

...read moreread less

Abstract: Top-k query asks for k tuples ordered according to a specific ranking function that combines the values from multiple participating attributes. The combined score function is usually linear. To efficiently answer top-k queries, preprocessing and indexing the data have been used to speed up the run time performance. Many indexing methods allow the online query algorithms progressively retrieve the data and stop at a certain point. However, in many cases, the number of data accesses is sensitive to the query parameters (i.e., linear weights in the score functions).In this paper, we study the sequentially layered indexing problem where tuples are put into multiple consecutive layers and any top-k query can be answered by at most k layers of tuples. We propose a new criterion for building the layered index. A layered index is robust if for any k, the number of tuples in the top k layers is minimal in comparison with all the other alternatives. The robust index guarantees the worst case performance for arbitrary query parameters. We derive a necessary and sufficient condition for robust index. The problem is shown solvable within O(ndlog n) (where d is the number of dimensions, and n is the number of tuples). To reduce the high complexity of the exact solution, we develop an approximate approach, which has time complexity O(2dn(log n)r(d)-1), where r(d) = ⌈d/2⌉ + ⌊d/2⌋ ⌈d/2⌉. Our experimental results show that our proposed method outperforms the best known previous methods.

...read moreread less

Proceedings Article•DOI•

Supporting ad-hoc ranking aggregates

[...]

Chengkai Li¹, Kevin Chen-Chuan Chang¹, Ihab F. Ilyas²•Institutions (2)

University of Illinois at Urbana–Champaign¹, University of Waterloo²

27 Jun 2006

TL;DR: This paper presents a principled framework for efficient processing of ad-hoc top-k (ranking) aggregate queries, which provide the k groups with the highest aggregates as results, and addresses the challenges in realizing the framework and implementing new query operators, enabling efficient group-aware and rank-aware query plans.

...read moreread less

Abstract: This paper presents a principled framework for efficient processing of ad-hoc top-k (ranking) aggregate queries, which provide the k groups with the highest aggregates as results. Essential support of such queries is lacking in current systems, which process the queries in a naive materialize-group-sort scheme that can be prohibitively inefficient. Our framework is based on three fundamental principles. The Upper-Bound Principle dictates the requirements of early pruning, and the Group-Ranking and Tuple-Ranking Principles dictate group-ordering and tuple-ordering requirements. They together guide the query processor toward a provably optimal tuple schedule for aggregate query processing. We propose a new execution framework to apply the principles and requirements. We address the challenges in realizing the framework and implementing new query operators, enabling efficient group-aware and rank-aware query plans. The experimental study validates our framework by demonstrating orders of magnitude performance improvement in the new query plans, compared with the traditional plans.

...read moreread less

Proceedings Article•DOI•

Class Association Rule Mining with Chi-Squared Test Using Genetic Network Programming

[...]

Kaoru Shimada¹, Kotaro Hirasawa¹, Jinglu Hu¹•Institutions (1)

Waseda University¹

01 Jan 2006

TL;DR: An efficient algorithm for important class association rule mining using genetic network programming (GNP) and measuring the significance of the association via the chi-squared test to present a classifier using these extracted rules.

...read moreread less

Abstract: An efficient algorithm for important class association rule mining using genetic network programming (GNP) is proposed. GNP is one of the evolutionary optimization techniques, which uses directed graph structures as genes. Instead of generating a large number of candidate rules, the method can obtain a sufficient number of important association rules for classification. The proposed method measures the significance of the association via the chi-squared test. Therefore, all the extracted important rules can be used for classification directly. In addition, the method suits class association rule mining from dense databases, where many frequently occurring items are found in each tuple. Users can define conditions of extracting important class association rules. In this paper, we describe an algorithm for class association rule mining with chi-squared test using GNP and present a classifier using these extracted rules.

...read moreread less

Top-down parameter-free clustering of high-dimensional categorical data

[...]

Eugenio Cesario¹, Giuseppe Manco¹, Riccardo Ortale¹•Institutions (1)

Indian Council of Agricultural Research¹

01 Jan 2006

TL;DR: It is shown how a suitable notion of cluster homogeneity can be defined in the context of high-dimensional categorical data, from which an effective instance of the proposed clustering scheme immediately follows.

...read moreread less

Abstract: A parameter-free, fully-automatic approach to clustering high-dimensional categorical data is proposed. The technique is based on a two-phase iterative procedure, which attempts to improve the overall quality of the whole partition. In the first phase, cluster assignments are given, and a new cluster is added to the partition by identifying and splitting a low-quality cluster. In the second phase, the number of clusters is fixed, and an attempt to optimize cluster assignments is done. On the basis of such features, the algorithm attempts to improve the overall quality of the whole partition and finds clusters in the data, whose number is naturally established on the basis of the inherent features of the underlying data set rather than being previously specified. Furthermore, the approach is parametric to the notion of cluster quality: Here, a cluster is defined as a set of tuples exhibiting a sort of homogeneity. We show how a suitable notion of cluster homogeneity can be defined in the context of high-dimensional categorical data, from which an effective instance of the proposed clustering scheme immediately follows. Experiments on both synthetic and real data prove that the devised algorithm scales linearly and achieves nearly optimal results in terms of compactness and separation.

...read moreread less

Book Chapter•DOI•

Generalized arc consistency for positive table constraints

[...]

Christophe Lecoutre¹, Radoslaw Szymanek²•Institutions (2)

Artois University¹, University College Cork²

25 Sep 2006

TL;DR: A new algorithm to establish Generalized Arc Consistency on positive table constraints, i.e. constraints defined in extension by a set of allowed tuples, that can be easily grafted to any generic GAC algorithm and admits on some instances a behaviour quadratic in the arity of the constraints.

...read moreread less

Abstract: In this paper, we propose a new algorithm to establish Generalized Arc Consistency (GAC) on positive table constraints, i.e. constraints defined in extension by a set of allowed tuples. Our algorithm visits the lists of valid and allowed tuples in an alternative fashion when looking for a support (i.e. a tuple that is both allowed and valid). It is then able to jump over sequences of valid tuples containing no allowed tuple and over sequences of allowed tuples containing no valid tuple. Our approach, that can be easily grafted to any generic GAC algorithm, admits on some instances a behaviour quadratic in the arity of the constraints whereas classical approaches, i.e. approaches that focus on either valid or allowed tuples, admit an exponential behaviour. We show the effectiveness of this approach, both theoretically and experimentally.

...read moreread less

Book Chapter•DOI•

Multi-dimensional aggregation for temporal data

[...]

Michael H. Böhlen¹, Johann Gamper¹, Christian S. Jensen²•Institutions (2)

Free University of Bozen-Bolzano¹, Aalborg University²

26 Mar 2006

TL;DR: The paper reports on an implementation of the new aggregation operator and on an empirical study that indicates that the operator scales to large data sets and is competitive with respect to other temporal aggregation algorithms.

...read moreread less

Abstract: Business Intelligence solutions, encompassing technologies such as multi-dimensional data modeling and aggregate query processing, are being applied increasingly to non-traditional data. This paper extends multi-dimensional aggregation to apply to data with associated interval values that capture when the data hold. In temporal databases, intervals typically capture the states of reality that the data apply to, or capture when the data are, or were, part of the current database state. This paper proposes a new aggregation operator that addresses several challenges posed by interval data. First, the intervals to be associated with the result tuples may not be known in advance, but depend on the actual data. Such unknown intervals are accommodated by allowing result groups that are specified only partially. Second, the operator contends with the case where an interval associated with data expresses that the data holds for each point in the interval, as well as the case where the data holds only for the entire interval, but must be adjusted to apply to sub-intervals. The paper reports on an implementation of the new operator and on an empirical study that indicates that the operator scales to large data sets and is competitive with respect to other temporal aggregation algorithms.

...read moreread less

Collapse