Showing papers on "Tuple published in 2007"

PDF

Open Access

Proceedings Article•

Open information extraction from the web

[...]

Michele Banko¹, Michael Cafarella¹, Stephen Soderland¹, Matt Broadhead¹, Oren Etzioni¹ - Show less +1 more•Institutions (1)

University of Washington¹

06 Jan 2007

TL;DR: Open Information Extraction (OIE) as mentioned in this paper is a new extraction paradigm where the system makes a single data-driven pass over its corpus and extracts a large set of relational tuples without requiring any human input.

...read moreread less

Abstract: Traditionally, Information Extraction (IE) has focused on satisfying precise, narrow, pre-specified requests from small homogeneous corpora (e.g., extract the location and time of seminars from a set of announcements). Shifting to a new domain requires the user to name the target relations and to manually create new extraction rules or hand-tag new training examples. This manual labor scales linearly with the number of target relations. This paper introduces Open IE (OIE), a new extraction paradigm where the system makes a single data-driven pass over its corpus and extracts a large set of relational tuples without requiring any human input. The paper also introduces TEXTRUNNER, a fully implemented, highly scalable OIE system where the tuples are assigned a probability and indexed to support efficient extraction and exploration via user queries. We report on experiments over a 9,000,000 Web page corpus that compare TEXTRUNNER with KNOWITALL, a state-of-the-art Web IE system. TEXTRUNNER achieves an error reduction of 33% on a comparable set of extractions. Furthermore, in the amount of time it takes KNOWITALL to perform extraction for a handful of pre-specified relations, TEXTRUNNER extracts a far broader set of facts reflecting orders of magnitude more relations, discovered on the fly. We report statistics on TEXTRUNNER's 11,000,000 highest probability tuples, and show that they contain over 1,000,000 concrete facts and over 6,500,000 more abstract assertions.

...read moreread less

1,574 citations

Proceedings Article•DOI•

Top-k Query Processing in Uncertain Databases

[...]

Mohamed A. Soliman¹, Ihab F. Ilyas¹, K. Chen-Chuan Chang²•Institutions (2)

University of Waterloo¹, University of Illinois at Urbana–Champaign²

15 Apr 2007

TL;DR: A framework that encapsulates a state space model and efficient query processing techniques to tackle the challenges of uncertain data settings is constructed and it is proved that the techniques are optimal in terms of the number of accessed tuples and materialized search states.

...read moreread less

Abstract: Top-k processing in uncertain databases is semantically and computationally different from traditional top-k processing. The interplay between score and uncertainty makes traditional techniques inapplicable. We introduce new probabilistic formulations for top-k queries. Our formulations are based on "marriage" of traditional top-k semantics and possible worlds semantics. In the light of these formulations, we construct a framework that encapsulates a state space model and efficient query processing techniques to tackle the challenges of uncertain data settings. We prove that our techniques are optimal in terms of the number of accessed tuples and materialized search states. Our experiments show the efficiency of our techniques under different data distributions with orders of magnitude improvement over naive materialization of possible worlds.

...read moreread less

455 citations

Journal Article•DOI•

Data integration with uncertainty

[...]

Xin Luna Dong¹, Alon Halevy², Cong Yu³•Institutions (3)

University of Washington¹, Google², University of Michigan³

23 Sep 2007

TL;DR: The concept of probabilistic schema mappings is introduced and it is shown that there are two possible semantics for such mappings: by-table semantics assumes that there exists a correct mapping but the author does not know what it is; by-tuple semantics assuming that the correct mapping may depend on the particular tuple in the source data.

...read moreread less

Abstract: This paper reports our first set of results on managing uncertainty in data integration. We posit that data-integration systems need to handle uncertainty at three levels, and do so in a principled fashion. First, the semantic mappings between the data sources and the mediated schema may be approximate because there may be too many of them to be created and maintained or because in some domains (e.g., bioinformatics) it is not clear what the mappings should be. Second, queries to the system may be posed with keywords rather than in a structured form. Third, the data from the sources may be extracted using information extraction techniques and so may yield imprecise data. As a first step to building such a system, we introduce the concept of probabilistic schema mappings and analyze their formal foundations. We show that there are two possible semantics for such mappings: by-table semantics assumes that there exists a correct mapping but we don't know what it is; by-tuple semantics assumes that the correct mapping may depend on the particular tuple in the source data. We present the query complexity and algorithms for answering queries in the presence of approximate schema mappings, and we describe an algorithm for efficiently computing the top-k answers to queries in such a setting.

...read moreread less

366 citations

Proceedings Article•DOI•

Finding Top-k Min-Cost Connected Trees in Databases

[...]

Bolin Ding¹, J. Xu Yu¹, Shan Wang¹, Lu Qin², Xiao Zhang², Xuemin Lin³ - Show less +2 more•Institutions (3)

The Chinese University of Hong Kong¹, Renmin University of China², University of New South Wales³

15 Apr 2007

TL;DR: This paper proposes a novel parameterized solution, with l as a parameter, to find the optimal GST-1, in time complexity O(3ln + 2l ((l + logn)n + m), where n and m are the numbers of nodes and edges in graph G, which can handle graphs with a large number of nodes.

...read moreread less

Abstract: It is widely realized that the integration of database and information retrieval techniques will provide users with a wide range of high quality services. In this paper, we study processing an l-keyword query, p1, p1, ..., pl, against a relational database which can be modeled as a weighted graph, G(V, E). Here V is a set of nodes (tuples) and E is a set of edges representing foreign key references between tuples. Let Vi ⊆ V be a set of nodes that contain the keyword pi. We study finding top-k minimum cost connected trees that contain at least one node in every subset Vi, and denote our problem as GST-k When k = 1, it is known as a minimum cost group Steiner tree problem which is NP-complete. We observe that the number of keywords, l, is small, and propose a novel parameterized solution, with l as a parameter, to find the optimal GST-1, in time complexity O(3ln + 2l ((l + logn)n + m)), where n and m are the numbers of nodes and edges in graph G. Our solution can handle graphs with a large number of nodes. Our GST-1 solution can be easily extended to support GST-k, which outperforms the existing GST-k solutions over both weighted undirected/directed graphs. We conducted extensive experimental studies, and report our finding.

...read moreread less

357 citations

Proceedings Article•DOI•

Representing and Querying Correlated Tuples in Probabilistic Databases

[...]

Prithviraj Sen¹, Amol Deshpande¹•Institutions (1)

University of Maryland, College Park¹

15 Apr 2007

TL;DR: This work develops an efficient strategy for query evaluation over Probabilistic databases by casting the query processing problem as an inference problem in an appropriately constructed probabilistic graphical model and presents several optimizations specific to probabilism databases that enable efficient query evaluation.

...read moreread less

Abstract: Probabilistic databases have received considerable attention recently due to the need for storing uncertain data produced by many real world applications. The widespread use of probabilistic databases is hampered by two limitations: (1) current probabilistic databases make simplistic assumptions about the data (e.g., complete independence among tuples) that make it difficult to use them in applications that naturally produce correlated data, and (2) most probabilistic databases can only answer a restricted subset of the queries that can be expressed using traditional query languages. We address both these limitations by proposing a framework that can represent not only probabilistic tuples, but also correlations that may be present among them. Our proposed framework naturally lends itself to the possible world semantics thus preserving the precise query semantics extant in current probabilistic databases. We develop an efficient strategy for query evaluation over such probabilistic databases by casting the query processing problem as an inference problem in an appropriately constructed probabilistic graphical model. We present several optimizations specific to probabilistic databases that enable efficient query evaluation. We validate our approach by presenting an experimental evaluation that illustrates the effectiveness of our techniques at answering various queries using real and synthetic datasets.

...read moreread less

270 citations

Proceedings Article•

The boundary between privacy and utility in data publishing

[...]

Vibhor Rastogi, Dan Suciu, Sungho Hong

23 Sep 2007

TL;DR: This paper proves an almost crisp separation of the case when a useful anonymization algorithm is possible from when it is not, based on the attacker's prior knowledge, and improves the privacy/utility tradeoff of previously known algorithms with privacy/UTility guarantees such as FRAPP.

...read moreread less

Abstract: We consider the privacy problem in data publishing: given a database instance containing sensitive information "anonymize" it to obtain a view such that, on one hand attackers cannot learn any sensitive information from the view, and on the other hand legitimate users can use it to compute useful statistics. These are conflicting goals. In this paper we prove an almost crisp separation of the case when a useful anonymization algorithm is possible from when it is not, based on the attacker's prior knowledge. Our definition of privacy is derived from existing literature and relates the attacker's prior belief for a given tuple t, with the posterior belief for the same tuple. Our definition of utility is based on the error bound on the estimates of counting queries. The main result has two parts. First we show that if the prior beliefs for some tuples are large then there exists no useful anonymization algorithm. Second, we show that when the prior is bounded for all tuples then there exists an anonymization algorithm that is both private and useful. The anonymization algorithm that forms our positive result is novel, and improves the privacy/utility tradeoff of previously known algorithms with privacy/utility guarantees such as FRAPP.

...read moreread less

263 citations

Proceedings Article•DOI•

Materialization Strategies in a Column-Oriented DBMS

[...]

Daniel J. Abadi¹, Daniel S. Myers¹, David J. DeWitt², Samuel Madden¹•Institutions (2)

Massachusetts Institute of Technology¹, University of Wisconsin-Madison²

15 Apr 2007

TL;DR: This paper describes a variety of strategies for tuple construction and intermediate result representations and provides a systematic evaluation of these strategies.

...read moreread less

Abstract: There has been renewed interest in column-oriented database architectures in recent years. For read-mostly query workloads such as those found in data warehouse and decision support applications, "column-stores" have been shown to perform particularly well relative to "row-stores" In order for column-stores to be readily adopted as a replacement for row-stores, however, they must present the same interface to client applications as do row stores, which implies that they must output row-store-style tuples. Thus, the input columns stored on disk must be converted to rows at some point in the query plan, but the optimal point at which to do the conversion is not obvious. This problem can be considered as the opposite of the projection problem in row-store systems: while row-stores need to determine where in query plans to place projection operators to make tuples narrower, column-stores need to determine when to combine single-column projections into wider tuples. This paper describes a variety of strategies for tuple construction and intermediate result representations and provides a systematic evaluation of these strategies.

...read moreread less

246 citations

Proceedings Article•

The design of ESSENCE: a constraint language for specifying combinatorial problems

[...]

Alan M. Frisch¹, Matthew Grum¹, Christopher Jefferson², Bernadette Martínez Hernández¹, Ian Miguel³ - Show less +1 more•Institutions (3)

University of York¹, University of Oxford², University of St Andrews³

06 Jan 2007

TL;DR: Essence is a formal language for specifying combinatorial problems in a manner similar to natural rigorous specifications that use a mixture of natural language and discrete mathematics.

...read moreread less

Abstract: ESSENCE is a new formal language for specifying combinatorial problems in a manner similar to natural rigorous specifications that use a mixture of natural language and discrete mathematics. ESSENCE provides a high level of abstraction, much of which is the consequence of the provision of decision variables whose values can be combinatorial objects, such as tuples, sets, multisets, relations, partitions and functions. ESSENCE also allows these combinatorial objects to be nested to arbitrary depth, thus providing, for example, sets of partitions, sets of sets of partitions, and so forth. Therefore, a problem that requires finding a complex combinatorial object can be directly specified by using a decision variable whose type is precisely that combinatorial object.

...read moreread less

215 citations

Book Chapter•DOI•

On acyclic conjunctive queries and constant delay enumeration

[...]

Guillaume Bagan¹, Arnaud Durand², Etienne Grandjean¹•Institutions (2)

University of Caen Lower Normandy¹, University of Paris²

11 Sep 2007

TL;DR: The enumeration complexity of the natural extension of acyclic conjunctive queries with disequalities is studied and it is shown that for each query of free-connex treewidth bounded by some constant k, enumeration of results can be done with O(|M|k+1) precomputation steps and constant delay.

...read moreread less

Abstract: We study the enumeration complexity of the natural extension of acyclic conjunctive queries with disequalities. In this language, a number of NP-complete problems can be expressed. We first improve a previous result of Papadimitriou and Yannakakis by proving that such queries can be computed in time c.|M|ċ|ϕ(M)| where M is the structure, ϕ(M) is the result set of the query and c is a simple exponential in the size of the formula ϕ. A consequence of our method is that, in the general case, tuples of such queries can be enumerated with a linear delay between two tuples. We then introduce a large subclass of acyclic formulas called CCQ≠ and prove that the tuples of a CCQ≠ query can be enumerated with a linear time precomputation and a constant delay between consecutive solutions. Moreover, under the hypothesis that the multiplication of two n×n boolean matrices cannot be done in time O(n2), this leads to the following dichotomy for acyclic queries: either such a query is in CCQ≠ or it cannot be enumerated with linear precomputation and constant delay. Furthermore we prove that testing whether an acyclic formula is in CCQ≠ can be performed in polynomial time. Finally, the notion of free-connex treewidth of a structure is defined. We show that for each query of free-connex treewidth bounded by some constant k, enumeration of results can be done with O(|M|k+1) precomputation steps and constant delay.

...read moreread less

174 citations

Proceedings Article•DOI•

Sketching probabilistic data streams

[...]

Graham Cormode¹, Minos Garofalakis²•Institutions (2)

AT&T Labs¹, Yahoo!²

11 Jun 2007

TL;DR: These algorithms offer strong randomized estimation guarantees while using only sublinear space in the size of the stream(s), and rely on novel, concise streaming sketch synopses that extend conventional sketching ideas to the probabilistic streams setting.

...read moreread less

Abstract: The management of uncertain, probabilistic data has recently emerged as a useful paradigm for dealing with the inherent unreliabilities of several real-world application domains, including data cleaning, information integration, and pervasive, multi-sensor computing. Unlike conventional data sets, a set of probabilistic tuples defines a probability distribution over an exponential number of possible worlds (i.e., "grounded", deterministic databases). This "possibleworlds" interpretation allows for clean query semantics but also raises hard computational problems for probabilistic database query processors. To further complicate matters, in many scenarios (e.g., large-scale process and environmental monitoring using multiple sensor modalities), probabilistic data tuples arrive and need to be processed in a streaming fashion; that is, using limited memory and CPU resources and without the benefit of multiple passes over a static probabilistic database. Such probabilistic data streams raise a host of new research challenges for stream-processing engines that, to date, remain largely unaddressed. In this paper, we propose the first space- and time-efficient algorithms for approximating complex aggregate queries (including, the number of distinct values and join/self-join sizes) over probabilistic data streams. Following the possible-worlds semantics, such aggregates essentially define probability distributions over the space of possible aggregation results, and our goal is to characterize such distributions through efficient approximations of their key moments (such as expectation and variance). Our algorithms offer strong randomized estimation guarantees while using only sublinear space in the size of the stream(s), and rely on novel, concise streaming sketch synopses that extend conventional sketching ideas to the probabilistic streams setting. Our experimental results verify the effectiveness of our approach.

...read moreread less

170 citations

Journal Article•DOI•

Thoughts on k-anonymization

[...]

M. Ercan Nergiz¹, Chris Clifton¹•Institutions (1)

Purdue University¹

01 Dec 2007

TL;DR: The main finding of the paper will be that metrics may behave differently through different algorithms and may not show correlations with some applications’ accuracy on output data.

...read moreread less

Abstract: k-Anonymity is a method for providing privacy protection by ensuring that data cannot be traced to an individual. In a k-anonymous dataset, any identifying information occurs in at least k tuples. To achieve optimal and practical k-anonymity, recently, many different kinds of algorithms with various assumptions and restrictions have been proposed with different metrics to measure quality. This paper evaluates a family of clustering-based algorithms that are more flexible and even attempts to improve precision by ignoring the restrictions of user-defined Domain Generalization Hierarchies. The evaluation of the new approaches with respect to cost metrics shows that metrics may behave differently with different algorithms and may not correlate with some applications' accuracy on output data.

...read moreread less

Book Chapter•DOI•

Complexity of consistent query answering in databases under cardinality-based and incremental repair semantics

[...]

Andrei Lopatenko¹, Leopoldo Bertossi²•Institutions (2)

Free University of Bozen-Bolzano¹, Carleton University²

10 Jan 2007

TL;DR: In this paper, the authors considered the cardinality-based repair semantics for consistent query answering in a dynamic setting, where a consistent database is affected by a sequence of updates, which may make it inconsistent.

...read moreread less

Abstract: A database D may be inconsistent wrt a given set IC of integrity constraints. Consistent Query Answering (CQA) is the problem of computing from D the answers to a query that are consistent wrt IC. Consistent answers are invariant under all the repairs of D, i.e. the consistent instances that minimally depart from D. Three classes of repair have been considered in the literature: those that minimize set-theoretically the set of tuples in the symmetric difference; those that minimize the changes of attribute values, and those that minimize the cardinality of the set of tuples in the symmetric difference. The latter class has not been systematically investigated. In this paper we obtain algorithmic and complexity theoretic results for CQA under this cardinality-based repair semantics. We do this in the usual, static setting, but also in a dynamic framework where a consistent database is affected by a sequence of updates, which may make it inconsistent. We also establish comparative results with the other two kinds of repairs in the dynamic case.

...read moreread less

Proceedings Article•DOI•

Query-Aware Sampling for Data Streams

[...]

T. Johnson¹, S. Muthukrishnan², V. Shkapenyuk², O. Spatscheck³•Institutions (3)

AT&T Labs¹, Rutgers University², AT&T³

17 Apr 2007

TL;DR: This paper shows how to perform query-aware sampling (semantic sampling) which works in general, and presents methods for analyzing any given query, choosing sampling methods judiciously, and reconciling the sampling methods required by different queries in a query set.

...read moreread less

Abstract: Data stream management systems are useful when large volumes of data need to be processed in real time. Examples include monitoring network traffic, monitoring financial transactions, and analyzing large scale scientific data feeds. These applications have varying data rates and often show bursts of high activity that overload the system, often during the most critical instants (e.g., network attacks, financial spikes) for analysis. Therefore, load shedding is necessary to preserve the stability of the system, gracefully degrade its performance and extract answers. Existing methods for load shedding in a general purpose data stream query system use random sampling of tuples, essentially independent of the query. While this technique is acceptable for some queries, the results may be meaningless or even incorrect for other queries, lit principle, a number of different query-dependent sampling methods exist, but they work only for particular queries. In this paper, we show how to perform query-aware sampling (semantic sampling) which works in general. We present methods for analyzing any given query, choosing sampling methods judiciously, and reconciling the sampling methods required by different queries in a query set. We conclude with experiments on a highspeed data stream that demonstrate with different query sets that our method produces accurate results while decreasing the load significantly.

...read moreread less

Patent•

Systems and methods for presenting results of geographic text searches

[...]

John R. Frank

06 Aug 2007

TL;DR: In this article, a set of document-location tuples from a corpus of documents is obtained from a web application, each location having associated cartographic display attributes and displaying a visual representation of the domain identified by the domain identifier.

...read moreread less

Abstract: Under one aspect, an interface program stored on a computer-readable medium causes a computer system to perform the functions of: accepting search criteria from a user, the search criteria including a free-text query and a domain identifier, the domain identifier identifying a domain in a metric vector space; obtaining a set of document-location tuples from a corpus of documents, each document-location tuple satisfying the search criteria from the user, each location having associated cartographic display attributes; displaying a visual representation of the domain identified by the domain identifier, the visual representation of the domain having an average spatial scale; selecting a subset of the set of document-location tuples based on the cartographic display attributes and on the average spatial scale of the visual representation of the domain; and displaying a plurality of visual indicators representing the selected subset of document-location tuples

...read moreread less

Patent•

System and methods for providing statstically interesting geographical information based on queries to a geographic search engine

[...]

John R. Frank

12 Jun 2007

TL;DR: In this article, the authors present a set of document-location tuples from a corpus of documents, wherein each document contains information that is responsive to the free text query entry and contains location-related information that refers to a location within the domain.

...read moreread less

Abstract: Under one aspect, an interface program stored on a computer-readable medium causes a computer system with a display device to perform the functions of: accepting search criteria from a user, the search criteria including a domain identifier identifying a domain and a free text query entry; in response to accepting said search criteria from the user, receiving a set of document-location tuples from a corpus of documents, wherein each document: (a) contains information that is responsive to the free text query entry; and (b) contains location-related information that refers to a location within the domain; requesting and receiving a result from an additional query based at least in part on the domain identifier, the result not being a document-location tuple; and displaying a visual representation of at least a subset of the document-location tuples and a visual representation of the result of the additional query on the display device

...read moreread less

Proceedings Article•DOI•

Object Distinction: Distinguishing Objects with Identical Names

[...]

Xiaoxin Yin¹, Jiawei Han¹, Philip S. Yu²•Institutions (2)

University of Illinois at Urbana–Champaign¹, IBM²

15 Apr 2007

TL;DR: A general object distinction methodology called DISTINCT is developed, which combines two complementary measures for relational similarity: set resemblance of neighbor tuples and random walk probability, and uses SVM to weigh different types of linkages without manually labeled training data.

...read moreread less

Abstract: Different people or objects may share identical names in the real world, which causes confusion in many applications. It is a nontrivial task to distinguish those objects, especially when there is only very limited information associated with each of them. In this paper, we develop a general object distinction methodology called DISTINCT, which combines two complementary measures for relational similarity: set resemblance of neighbor tuples and random walk probability, and uses SVM to weigh different types of linkages without manually labeled training data. Experiments show that DISTINCT can accurately distinguish different objects with identical names in real databases.

...read moreread less

Proceedings Article•DOI•

Efficient Keyword Search Across Heterogeneous Relational Databases

[...]

Mayssam Sayyadian¹, Hieu LeKhac², AnHai Doan¹, Luis Gravano³•Institutions (3)

University of Wisconsin-Madison¹, University of Illinois at Urbana–Champaign², Columbia University³

15 Apr 2007

TL;DR: Kite combines schema matching and structure discovery techniques to find approximate foreign-key joins across heterogeneous databases and exploits the joins - discovered automatically across the databases - to enable fast and effective querying over the distributed data.

...read moreread less

Abstract: Keyword search is a familiar and potentially effective way to find information of interest that is "locked" inside relational databases. Current work has generally assumed that answers for a keyword query reside within a single database. Many practical settings, however, require that we combine tuples from multiple databases to obtain the desired answers. Such databases are often autonomous and heterogeneous in their schemas and data. This paper describes Kite, a solution to the keyword-search problem over heterogeneous relational databases. Kite combines schema matching and structure discovery techniques to find approximate foreign-key joins across heterogeneous databases. Such joins are critical for producing query results that span multiple databases and relations. Kite then exploits the joins - discovered automatically across the databases - to enable fast and effective querying over the distributed data. Our extensive experiments over real-world data sets show that (1) our query processing algorithms are efficient and (2) our approach manages to produce high-quality query results spanning multiple heterogeneous databases, with no need for human reconciliation of the different databases.

...read moreread less

Journal Article•DOI•

Genetic learning of accurate and compact fuzzy rule based systems based on the 2-tuples linguistic representation

[...]

Rafael Alcalá¹, Jesús Alcalá-Fdez¹, Francisco Herrera¹, José Otero²•Institutions (2)

University of Granada¹, University of Oviedo²

01 Jan 2007-International Journal of Approximate Reasoning

TL;DR: This work proposes a new method to obtain linguistic fuzzy systems by means of an evolutionary learning of the data base a priori (number of labels and lateral displacements) and a simple rule generation method to quickly learn the associated rule base.

...read moreread less

Proceedings Article•DOI•

Reverse Query Processing

[...]

Carsten Binnig¹, Donald Kossmann², Eric Lo²•Institutions (2)

Heidelberg University¹, ETH Zurich²

15 Apr 2007

TL;DR: Reverse query processing (RQP) as discussed by the authors is a technique to generate test databases for testing database applications (e.g., OLAP or business objects) is a daunting task in practice.

...read moreread less

Abstract: Generating databases for testing database applications (e.g., OLAP or business objects) is a daunting task in practice. There are a number of commercial tools to automatically generate test databases. These tools take a database schema (table layouts plus integrity constraints) and table sizes as input in order to generate new tuples. However, the databases generated by these tools are not adequate for testing a database application. If an application query is executed against such a synthetic database, then the result of that application query is likely to be empty or contain weird results, such as a report on the performance of a sales person that contains negative sales. To solve this problem, this paper proposes a new technique called reverse query processing (RQP). RQP gets a query and a result as input and returns a possible database instance that could have produced that result for that query. RQP also has other applications; most notably, testing the performance of DBMS and debugging SQL queries.

...read moreread less

Proceedings Article•DOI•

Supporting Streaming Updates in an Active Data Warehouse

[...]

Neoklis Polyzotis¹, Spiros Skiadopoulos², Panos Vassiliadis³, Alkis Simitsis⁴, N. E. Frantzell¹ - Show less +1 more•Institutions (4)

University of California, Santa Cruz¹, University of Peloponnese², University of Ioannina³, National and Kapodistrian University of Athens⁴

15 Apr 2007

TL;DR: This paper proposes a specialized join algorithm, termed mesh join (MeshJoin), that compensates for the difference in the access cost of the two join inputs by relying entirely on fast sequential scans of R, and sharing the I/O cost of accessing R across multiple tuples of S.

...read moreread less

Abstract: Active data warehousing has emerged as an alternative to conventional warehousing practices in order to meet the high demand of applications for up-to-date information. In a nutshell, an active warehouse is refreshed on-line and thus achieves a higher consistency between the stored information and the latest data updates. The need for on-line warehouse refreshment introduces several challenges in the implementation of data warehouse transformations, with respect to their execution time and their overhead to the warehouse processes. In this paper, we focus on a frequently encountered operation in this context, namely, the join of a fast stream S of source updates with a disk-based relation R, under the constraint of limited memory. This operation lies at the core of several common transformations, such as, surrogate key assignment, duplicate detection or identification of newly inserted tuples. We propose a specialized join algorithm, termed mesh join (MeshJoin), that compensates for the difference in the access cost of the two join inputs by (a) relying entirely on fast sequential scans of R, and (b) sharing the I/O cost of accessing R across multiple tuples of S. We detail the Mesh Join algorithm and develop a systematic cost model that enables the tuning of Mesh Join for two objectives: maximizing throughput under a specific memory budget or minimizing memory consumption for a specific throughput. We present an experimental study that validates the performance of Mesh Join on synthetic and real-life data. Our results verify the scalability of Mesh-Join to fast streams and large relations, and demonstrate its numerous advantages over existing join algorithms.

...read moreread less

Proceedings Article•DOI•

Keyword search on relational data streams

[...]

Alexander Markowetz¹, Yin Yang¹, Dimitris Papadias¹•Institutions (1)

Hong Kong University of Science and Technology¹

11 Jun 2007

TL;DR: This paper proposes keyword search on relational data streams (S-KWS) as an effective way for querying in such intricate and dynamic environments and proposes techniques that utilize the operator mesh for efficient query processing.

...read moreread less

Abstract: Increasing monitoring of transactions, environmental parameters, homeland security, RFID chips and interactions of online users rapidly establishes new data sources and application scenarios. In this paper, we propose keyword search on relational data streams (S-KWS) as an effective way for querying in such intricate and dynamic environments. Compared to conventional query methods, S-KWS has several benefits. First, it allows search for combinations of interesting terms without a-priori knowledge of the data streams in which they appear. Second, it hides the schema from the user and allows it to change, without the need for query re-writing. Finally, keyword queries are easy to express. Our contributions are summarized as follows. (i) We provide formal semantics for S-KWS, addressing the temporal validity and order of results. (ii) We propose an efficient algorithm for generating operator trees, applicable to arbitrary schemas. (iii) We integrate these trees into an operator mesh that shares common expressions. (iv) We develop techniques that utilize the operator mesh for efficient query processing. The techniques adapt dynamically to changes in the schema and input characteristics. Finally, (v) we present methods for purging expired tuples, minimizing either CPU, or memory requirements.

...read moreread less

Proceedings Article•

Ad-hoc top-k query answering for data streams

[...]

Gautam Das¹, Dimitrios Gunopulos², Nick Koudas³, Nikos Sarkas³•Institutions (3)

University of Texas at Arlington¹, University of California, Riverside², University of Toronto³

23 Sep 2007

TL;DR: This paper introduces query evaluation strategies that operate on top of an arrangement data structure that are able to guarantee efficient evaluation for ad-hoc queries and designs and analyzes algorithms for incrementally maintaining a data set organized in an arrangement representation under streaming updates.

...read moreread less

Abstract: A top-k query retrieves the k highest scoring tuples from a data set with respect to a scoring function defined on the attributes of a tuple. The efficient evaluation of top-k queries has been an active research topic and many different instantiations of the problem, in a variety of settings, have been studied. However, techniques developed for conventional, centralized or distributed databases are not directly applicable to highly dynamic environments and on-line applications, like data streams. Recently, techniques supporting top-k queries on data streams have been introduced. Such techniques are restrictive however, as they can only efficiently report top-k answers with respect to a pre-specified (as opposed to ad-hoc) set of queries. In this paper we introduce a novel geometric representation for the top-k query problem that allows us to raise this restriction. Utilizing notions of geometric arrangements, we design and analyze algorithms for incrementally maintaining a data set organized in an arrangement representation under streaming updates. We introduce query evaluation strategies that operate on top of an arrangement data structure that are able to guarantee efficient evaluation for ad-hoc queries. The performance of our core technique is augmented by incorporating tuple pruning strategies, minimizing the number of tuples that need to be stored and manipulated. This results in a main memory indexing technique supporting both efficient incremental updates and the evaluation of ad-hoc top-k queries. A thorough experimental study evaluates the efficiency of the proposed technique.

...read moreread less

Journal Article•DOI•

First-order queries on structures of bounded degree are computable with constant delay

[...]

Arnaud Durand¹, Etienne Grandjean²•Institutions (2)

University of Paris¹, University of Caen Lower Normandy²

01 Aug 2007-ACM Transactions on Computational Logic

TL;DR: In this paper, the complexity of first-order queries over d-degree-bounded structures is considered as a global process, and it is shown that queries on such structures can be evaluated in total time f(vφv) where s is the structure, S is the formula, φ is the result of the query and f, g are some fixed functions.

...read moreread less

Abstract: A relational structure is d-degree-bounded, for some integer d, if each element of the domain belongs to at most d tuples. In this paper, we revisit the complexity of the evaluation problem of not necessarily Boolean first-order (FO) queries over d-degree-bounded structures. Query evaluation is considered here as a dynamical process. We prove that any FO query on d-degree-bounded structures belongs to the complexity class constant-Delaylin, that is, can be computed by an algorithm that has two separate parts: it has a precomputation step of time linear in the size of the structure and then, it outputs all solutions (i.e., tuples that satisfy the formula) one by one with a constant delay (i.e., depending on the size of the formula only) between each. Seen as a global process, this implies that queries on d-degree-bounded structures can be evaluated in total time f(vφv).(vSv p vφ(S)v) and space g(vφv).vSv where S is the structure, φ is the formula, φ(S) is the result of the query and f, g are some fixed functions.Among other things, our results generalize a result of Seese on the data complexity of the model-checking problem for d-degree-bounded structures. Besides, the originality of our approach compared to related results is that it does not rely on the Hanf's model-theoretic technique and is simple and informative since it essentially rests on a quantifier elimination method.

...read moreread less

Journal Article•DOI•

Incremental Evaluation of Sliding-Window Queries over Data Streams

[...]

Thanaa M. Ghanem¹, Moustafa A. Hammad, Mohamed F. Mokbel, Walid G. Aref, Ahmed K. Elmagarmid - Show less +1 more•Institutions (1)

Purdue University¹

01 Jan 2007-IEEE Transactions on Knowledge and Data Engineering

TL;DR: Several optimization techniques over thenegative tuples approach are presented that aim to reduce the overhead of processing negative tuples while avoiding the output delay of the query answer.

...read moreread less

Abstract: Two research efforts have been conducted to realize sliding-window queries in data stream management systems, namely, query revaluation and incremental evaluation. In the query reevaluation method, two consecutive windows are processed independently of each other. On the other hand, in the incremental evaluation method, the query answer for a window is obtained incrementally from the answer of the preceding window. In this paper, we focus on the incremental evaluation method. Two approaches have been adopted for the incremental evaluation of sliding-window queries, namely, the input-triggered approach and the negative tuples approach. In the input-triggered approach, only the newly inserted tuples flow in the query pipeline and tuple expiration is based on the timestamps of the newly inserted tuples. On the other hand, in the negative tuples approach, tuple expiration is separated from tuple insertion where a tuple flows in the pipeline for every inserted or expired tuple. The negative tuples approach avoids the unpredictable output delays that result from the input-triggered approach. However, negative tuples double the number of tuples through the query pipeline, thus reducing the pipeline bandwidth. Based on a detailed study of the incremental evaluation pipeline, we classify the incremental query operators into two classes according to whether an operator can avoid the processing of negative tuples or not. Based on this classification, we present several optimization techniques over the negative tuples approach that aim to reduce the overhead of processing negative tuples while avoiding the output delay of the query answer. A detailed experimental study, based on a prototype system implementation, shows the performance gains over the input-triggered approach of the negative tuples approach when accompanied with the proposed optimizations

...read moreread less

Book Chapter•DOI•

A compression algorithm for large arity extensional constraints

[...]

George Katsirelos¹, Toby Walsh¹•Institutions (1)

NICTA¹

23 Sep 2007

TL;DR: A more compact representation for a set of tuples is introduced, which allows a potentially exponential reduction in the space needed to represent the satisfying tuples and exponential reduce in the time needed to enforce GAC.

...read moreread less

Abstract: We present an algorithm for compressing table constraints representing allowed or disallowed tuples. This type of constraint is used for example in configuration problems, where the satisfying tuples are read from a database. The arity of these constraints may be large. A generic GAC algorithm for such a constraint requires time exponential in the arity of the constraint to maintain GAC, but Bessiere and Regin showed in [1] that for the case of allowed tuples, GAC can be enforced in time proportional to the number of allowed tuples, using the algorithm GAC-Schema. We introduce a more compact representation for a set of tuples, which allows a potentially exponential reduction in the space needed to represent the satisfying tuples and exponential reduction in the time needed to enforce GAC. We show that this representation can be constructed from a decision tree that represents the original tuples and demonstrate that it does in practice produce a significantly shorter description of the constraint. We also show that this representation can be efficiently used in existing algorithms and can be used to improve GAC-Schema further. Finally, we show that this method can be used to improve the complexity of enforcing GAC on a table constraint defined in terms of forbidden tuples.

...read moreread less

Proceedings Article•DOI•

The shunt: an FPGA-based accelerator for network intrusion prevention

[...]

Nicholas Weaver¹, Vern Paxson¹, Jose M. Gonzalez¹•Institutions (1)

International Computer Science Institute¹

18 Feb 2007

TL;DR: The Shunt is developed, a in-line, FPGA-based IPS ac-celerator coupled to a host PC to handle both cache management and higher level IPS analysis, based on a novel series of caches.

...read moreread less

Abstract: The sophistication and complexity of analysis performed by today's network intrusion prevention systems (IPSs) benefits greatly from implementation using general-purpose CPUs. Yet the performance of such CPUs increasingly lags behind that necessary to process today's high-rate traffic streams. A key observation, however, is that much of the traffic comprising a high-volume stream can, after some initial analysis, be qualified as "likely uninteresting." To this end, we have developed an in-line, FPGA-based IPS ac-celerator, the Shunt, using the NetFPGA2 platform. The Shunt functions as the forwarding device used by the IPS; it alone processes the bulk of the traffic, offloading the memory bus and leaving the CPU free to inspect the subset of the traffic deemed germane for security analysis. To do so, the Shunt maintains several large state tables indexed by packet header fields, including IP/TCP flags, source and destination IP addresses, and connection tuples. The tables yield decision values the element makes on a packet-by-packet basis: forward the packet, drop it, or divert it through the IPS. By manipulating table entries, the IPS can specify the traffic it wishes to examine, directly block malicious traffic, and "cut through" traffic streams once it has had an opportunity to "vet" them, all on a fine-grained basis. We base our design on a novel series of caches, with a "fail safe" miss policy, coupled to a host PC to handle both cache management and higher level IPS analysis. The design requires only 2 MB of SRAM for its extensive caches, and can sup-port four Gbps Ethernets on a single Virtex 2 Pro 30.

...read moreread less

Journal Article•DOI•

Branch-and-bound processing of ranked queries

[...]

Yufei Tao¹, Vagelis Hristidis², Dimitris Papadias³, Yannis Papakonstantinou⁴•Institutions (4)

City University of Hong Kong¹, Florida International University², Hong Kong University of Science and Technology³, University of California, San Diego⁴

01 May 2007-Information Systems

TL;DR: This paper proposes a simple yet powerful technique for processing ranked queries based on multi-dimensional access methods and branch-and-bound search, and confirms the superiority of the proposed methods with a detailed experimental study.

...read moreread less

Proceedings Article•

Materialized views in probabilistic databases: for information exchange and query optimization

[...]

Christopher Ré¹, Dan Suciu¹•Institutions (1)

University of Washington¹

23 Sep 2007

TL;DR: This paper proposes an alternative approach to materializing probabilistic views, by giving conditions under which a view can be represented by a block-independent disjoint (BID) table, and proposes a novel partial representation that can represent all views but may not define a unique probability distribution.

...read moreread less

Abstract: Views over probabilistic data contain correlations between tuples, and the current approach is to capture these correlations using explicit lineage. In this paper we propose an alternative approach to materializing probabilistic views, by giving conditions under which a view can be represented by a block-independent disjoint (BID) table. Not all views can be represented as BID tables and so we propose a novel partial representation that can represent all views but may not define a unique probability distribution. We then give conditions on when a query's value on a partial representation will be uniquely defined. We apply our theory to two applications: query processing using views and information exchange using views. In query processing on probabilistic data, we can ignore the lineage and use materialized views to more efficiently answer queries. By contrast, if the view has explicit lineage, the query evaluation must reprocess the lineage to compute the query resulting in dramatically slower execution. The second application is information exchange when we do not wish to disclose the entire lineage, which otherwise may result in shipping the entire database. The paper contains several theoretical results that completely solve the problem of deciding whether a conjunctive view can be represented as a BID and whether a query on a partial representation is uniquely determined. We validate our approach experimentally showing that representable views exist in real and synthetic workloads and show over three magnitudes of improvement in query processing versus a lineage based approach.

...read moreread less

Proceedings Article•DOI•

Enforcing access control over data streams

[...]

Barbara Carminati¹, Elena Ferrari¹, Kian-Lee Tan²•Institutions (2)

University of Insubria¹, National University of Singapore²

20 Jun 2007

TL;DR: This paper designs a set of novel secure operators, that basically filter out tuples/attributes from results of the corresponding (non-secure) operators that are not accessible according to the specified access control policies.

...read moreread less

Abstract: Access control is an important component of any computational system. However, it is only recently that mechanisms to guard against unauthorized access for streaming data have been proposed. In this paper, we study how to enforce the role-based access control model proposed by us in [5]. We design a set of novel secure operators, that basically filter out tuples/attributes from results of the corresponding (non-secure) operators that are not accessible according to the specified access control policies. We further develop an access control mechanism to enforce the access control policies based on these operators. We show that our method is secure according to the specified policies.

...read moreread less

Journal Article•DOI•

Top-Down Parameter-Free Clustering of High-Dimensional Categorical Data

[...]

Eugenio Cesario¹, Giuseppe Manco¹, Riccardo Ortale¹•Institutions (1)

Indian Council of Agricultural Research¹

01 Dec 2007-IEEE Transactions on Knowledge and Data Engineering

TL;DR: In this article, a parameter-free, fully-automatic approach to clustering high-dimensional categorical data is proposed, which is based on a two-phase iterative procedure, which attempts to improve the overall quality of the whole partition.

...read moreread less

Abstract: A parameter-free, fully-automatic approach to clustering high-dimensional categorical data is proposed. The technique is based on a two-phase iterative procedure, which attempts to improve the overall quality of the whole partition. In the first phase, cluster assignments are given, and a new cluster is added to the partition by identifying and splitting a low-quality cluster. In the second phase, the number of clusters is fixed, and an attempt to optimize cluster assignments is done. On the basis of such features, the algorithm attempts to improve the overall quality of the whole partition and finds clusters in the data, whose number is naturally established on the basis of the inherent features of the underlying data set rather than being previously specified. Furthermore, the approach is parametric to the notion of cluster quality: Here, a cluster is defined as a set of tuples exhibiting a sort of homogeneity. We show how a suitable notion of cluster homogeneity can be defined in the context of high-dimensional categorical data, from which an effective instance of the proposed clustering scheme immediately follows. Experiments on both synthetic and real data prove that the devised algorithm scales linearly and achieves nearly optimal results in terms of compactness and separation.

...read moreread less

Collapse