Showing papers in "ACM Transactions on Database Systems in 2007"
••
TL;DR: This work treats the problem as an optimization problem where, given a workload of queries, a stratified random sample of the original data is selected such that the error in answering the workload queries using the sample is minimized.
Abstract: The ability to approximately answer aggregation queries accurately and efficiently is of great benefit for decision support and data mining tools. In contrast to previous sampling-based studies, we treat the problem as an optimization problem where, given a workload of queries, we select a stratified random sample of the original data such that the error in answering the workload queries using the sample is minimized. A key novelty of our approach is that we can tailor the choice of samples to be robust, even for workloads that are “similar” but not necessarily identical to the given workload. Finally, our techniques recognize the importance of taking into account the variance in the data distribution in a principled manner. We show how our solution can be implemented on a database system, and present results of extensive experiments on Microsoft SQL Server that demonstrate the superior quality of our method compared to previous work.
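As a concrete illustration of the variance-aware idea, the following minimal Python sketch allocates a sample budget across strata by Neyman allocation (budget proportional to stratum size times standard deviation); it is a generic textbook construction under toy assumptions, not the paper's workload-driven algorithm.

```python
import random
from statistics import pstdev

def neyman_allocation(strata, budget):
    """Split a total sample budget across strata in proportion to
    N_h * sigma_h (Neyman allocation): high-variance strata get more
    of the budget, which minimizes the variance of the estimate."""
    weights = [len(s) * (pstdev(s) or 1e-9) for s in strata]
    total = sum(weights)
    return [max(1, round(budget * w / total)) for w in weights]

def stratified_sample(strata, budget):
    sizes = neyman_allocation(strata, budget)
    return [random.sample(s, min(k, len(s))) for s, k in zip(strata, sizes)]

# Toy data: a constant stratum needs almost no samples; a noisy one needs many.
strata = [[10.0] * 1000,
          [random.gauss(50, 5) for _ in range(1000)],
          [random.gauss(0, 100) for _ in range(1000)]]
print(neyman_allocation(strata, budget=100))
```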
243 citations
••
TL;DR: The goal of this survey is to describe the algorithms within each component in detail, comparing and contrasting competing methods, thereby enabling further analysis and experimentation with each component and allowing the best algorithms for a particular situation to be built piecemeal, or, even better, enabling an optimizer to choose which algorithms to use.
Abstract: A variety of techniques for performing a spatial join are reviewed. Instead of just summarizing the literature and presenting each technique in its entirety, distinct components of the different techniques are described and each is decomposed into an overall framework for performing a spatial join. A typical spatial join technique consists of the following components: partitioning the data, performing internal-memory spatial joins on subsets of the data, and checking if the full polygons intersect. Each technique is decomposed into these components and each component addressed in a separate section so as to compare and contrast similar aspects of each technique. The goal of this survey is to describe the algorithms within each component in detail, comparing and contrasting competing methods, thereby enabling further analysis and experimentation with each component and allowing the best algorithms for a particular situation to be built piecemeal, or, even better, enabling an optimizer to choose which algorithms to use.
229 citations
••
TL;DR: This work shows that the standard hash join algorithm for disk-oriented databases (i.e., GRACE) spends over 80% of its user time stalled on CPU cache misses, and explores the use of prefetching to improve its cache performance.
Abstract: Hash join algorithms suffer from extensive CPU cache stalls. This article shows that the standard hash join algorithm for disk-oriented databases (i.e., GRACE) spends over 80% of its user time stalled on CPU cache misses, and explores the use of CPU cache prefetching to improve its cache performance. Applying prefetching to hash joins is complicated by the data dependencies, multiple code paths, and inherent randomness of hashing. We present two techniques, group prefetching and software-pipelined prefetching, that overcome these complications. These schemes achieve 1.29–4.04X speedups for the join phase and 1.37–3.49X speedups for the partition phase over GRACE and simple prefetching approaches. Moreover, compared with previous cache-aware approaches (i.e., cache partitioning), the schemes are at least 36% faster on large relations and do not require exclusive use of the CPU cache to be effective. Finally, comparing the elapsed real times when disk I/Os are in the picture, our cache prefetching schemes achieve 1.12–1.84X speedups for the join phase and 1.06–1.60X speedups for the partition phase over the GRACE hash join algorithm.
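For context, the baseline the article starts from is the GRACE partitioned hash join; the sketch below shows only that partition/build/probe structure in Python, since group and software-pipelined prefetching rely on cache-prefetch instructions that have no Python counterpart. All names are illustrative.

```python
from collections import defaultdict

def grace_hash_join(R, S, key_r, key_s, n_parts=8):
    """Baseline GRACE-style hash join: partition both inputs by hash,
    then build-and-probe within each partition pair.  The article's
    prefetching techniques target the build and probe loops."""
    parts_r, parts_s = defaultdict(list), defaultdict(list)
    for r in R:                        # partition phase
        parts_r[hash(key_r(r)) % n_parts].append(r)
    for s in S:
        parts_s[hash(key_s(s)) % n_parts].append(s)
    out = []
    for p in range(n_parts):           # join phase: build, then probe
        table = defaultdict(list)
        for r in parts_r[p]:
            table[key_r(r)].append(r)
        for s in parts_s[p]:
            out.extend((r, s) for r in table.get(key_s(s), ()))
    return out

print(grace_hash_join([(1, 'a'), (2, 'b')], [(2, 'x'), (3, 'y')],
                      key_r=lambda r: r[0], key_s=lambda s: s[0]))
```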
188 citations
••
TL;DR: The core of the methodology is a novel concept of “probabilistically constrained rectangle”, which permits effective pruning/validation of nonqualifying/qualifying data and a new index structure called the U-tree for minimizing the query overhead.
Abstract: In an uncertain database, every object o is associated with a probability density function, which describes the likelihood that o appears at each position in a multidimensional workspace. This article studies two types of range retrieval fundamental to many analytical tasks. Specifically, a nonfuzzy query returns all the objects that appear in a search region r_q with at least a certain probability t_q. On the other hand, given an uncertain object q, fuzzy search retrieves the set of objects that are within distance e_q from q with no less than probability t_q. The core of our methodology is a novel concept of “probabilistically constrained rectangle”, which permits effective pruning/validation of nonqualifying/qualifying data. We develop a new index structure called the U-tree for minimizing the query overhead. Our algorithmic findings are accompanied by a thorough theoretical analysis, which reveals valuable insight into the problem characteristics and mathematically confirms the efficiency of our solutions. We verify the effectiveness of the proposed techniques with extensive experiments.
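The nonfuzzy predicate itself is easy to state operationally. A brute-force Monte Carlo check of it follows, purely to pin down the semantics that the probabilistically constrained rectangles and the U-tree are designed to avoid evaluating; the sampler and region below are toy assumptions.

```python
import random

def appears_with_prob(sample_fn, region, t_q, n=20000):
    """Monte Carlo version of the nonfuzzy range predicate: does an
    uncertain object, given as a sampler from its pdf, fall inside
    `region` with probability at least t_q?  This brute force is
    exactly what the article's index machinery prunes away."""
    hits = sum(region(sample_fn()) for _ in range(n))
    return hits / n >= t_q

obj = lambda: (random.gauss(0, 1), random.gauss(0, 1))   # uncertain 2D object
r_q = lambda p: -1 <= p[0] <= 1 and -1 <= p[1] <= 1      # search region
print(appears_with_prob(obj, r_q, t_q=0.3))              # ~0.47 hit rate -> True
```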
170 citations
••
TL;DR: A new algorithm is proposed, designed to minimize the number of object accesses, the computational cost, and the memory requirements of top-k search with monotone aggregate functions, and is shown to be orders of magnitude faster.
Abstract: A top-k query combines different rankings of the same set of objects and returns the k objects with the highest combined score according to an aggregate function. We bring to light some key observations, which impose two phases that any top-k algorithm, based on sorted accesses, should go through. Based on them, we propose a new algorithm, which is designed to minimize the number of object accesses, the computational cost, and the memory requirements of top-k search with monotone aggregate functions. We provide an analysis for its cost and show that it is always no worse than the baseline “no random accesses” algorithm in terms of computations, accesses, and memory required. As a side contribution, we perform a space analysis, which indicates the memory requirements of top-k algorithms that only perform sorted accesses. For the case where the required space exceeds the available memory, we propose disk-based variants of our algorithm. We propose and optimize a multiway top-k join operator, with certain advantages over evaluation trees of binary top-k join operators. Finally, we define and study the computation of top-k cubes and the implementation of roll-up and drill-down operations in such cubes. Extensive experiments with synthetic and real data show that, compared to previous techniques, our method accesses fewer objects, while being orders of magnitude faster.
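As a reference point, the baseline “no random accesses” (NRA) strategy mentioned above can be sketched for complete, score-descending rankings and a monotone SUM aggregate; this is the classic bound-maintenance loop the article improves upon, not the proposed algorithm.

```python
def nra_topk(lists, k):
    """Baseline NRA over complete rankings (all lists rank the same
    objects, equal lengths) with a SUM aggregate.  Maintains per-object
    lower bounds (seen scores) and upper bounds (seen scores plus the
    current list bottoms) and stops when no object outside the current
    top k can still overtake it."""
    seen = {}                                    # obj -> {list index: score}
    for depth in range(len(lists[0])):
        bottoms = [l[depth][1] for l in lists]   # score at the current depth
        for i, l in enumerate(lists):            # one sorted access per list
            obj, score = l[depth]
            seen.setdefault(obj, {})[i] = score
        lower = lambda sc: sum(sc.values())
        upper = lambda sc: sum(sc.get(i, bottoms[i]) for i in range(len(lists)))
        ranked = sorted(seen.items(), key=lambda kv: lower(kv[1]), reverse=True)
        if len(ranked) >= k:
            kth = lower(ranked[k - 1][1])
            if all(upper(sc) <= kth for _, sc in ranked[k:]) and sum(bottoms) <= kth:
                break                            # nothing can overtake the top k
    return [obj for obj, _ in ranked[:k]]

L1 = [('a', 0.9), ('b', 0.8), ('c', 0.1)]        # two rankings of the same objects
L2 = [('b', 0.9), ('a', 0.5), ('c', 0.2)]
print(nra_topk([L1, L2], k=1))                   # ['b'] with combined score 1.7
```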
117 citations
••
TL;DR: This work gives a formal definition for what it means for a schema mapping M′ to be an inverse of a schema mapping M for a class S of source instances, and shows how to construct a global inverse when one exists.
Abstract: A schema mapping is a specification that describes how data structured under one schema (the source schema) is to be transformed into data structured under a different schema (the target schema). Although the notion of an inverse of a schema mapping is important, the exact definition of an inverse mapping is somewhat elusive. This is because a schema mapping may associate many target instances with each source instance, and many source instances with each target instance. Based on the notion that the composition of a mapping and its inverse is the identity, we give a formal definition for what it means for a schema mapping M′ to be an inverse of a schema mapping M for a class S of source instances. We call such an inverse an S-inverse. A particular case of interest arises when S is the class of all source instances, in which case an S-inverse is a global inverse. We focus on the important and practical case of schema mappings specified by source-to-target tuple-generating dependencies, and uncover a rich theory. When S is specified by a set of dependencies with a finite chase, we show how to construct an S-inverse when one exists. In particular, we show how to construct a global inverse when one exists. Given M and M′, we show how to define the largest class S such that M′ is an S-inverse of M.
106 citations
••
TL;DR: This article studies how to efficiently mine the complete set of coherent closed quasi-cliques from large dense graph databases, which is an especially challenging task due to the fact that the downward-closure property no longer holds.
Abstract: Due to the ability of graphs to represent more generic and more complicated relationships among different objects, graph mining has played a significant role in data mining, attracting increasing attention in the data mining community. In addition, frequent coherent subgraphs can provide valuable knowledge about the underlying internal structure of a graph database, and mining frequently occurring coherent subgraphs from large dense graph databases has witnessed several applications and received considerable attention in the graph mining community recently. In this article, we study how to efficiently mine the complete set of coherent closed quasi-cliques from large dense graph databases, which is an especially challenging task due to the fact that the downward-closure property no longer holds. By fully exploring some properties of quasi-cliques, we propose several novel optimization techniques which can prune the unpromising and redundant subsearch spaces effectively. Meanwhile, we devise an efficient closure checking scheme to facilitate the discovery of closed quasi-cliques only. Since large databases cannot be held in main memory, we also design an out-of-core solution with efficient index structures for mining coherent closed quasi-cliques from large dense graph databases. We call this Cocaina. Thorough performance study shows that Cocaina is very efficient and scalable for large dense graph databases.
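The source of the difficulty is visible in the quasi-clique condition itself. Under the common degree-based definition (assumed here; the article's coherence notion adds further requirements), membership is easy to test but not downward-closed:

```python
import math

def is_quasi_clique(adj, S, gamma):
    """Degree-based γ-quasi-clique test: each vertex of S must be
    adjacent to at least ceil(gamma * (|S| - 1)) other vertices of S.
    The condition is not downward-closed: a superset can qualify while
    a subset fails, which is why Apriori-style pruning breaks down."""
    S = set(S)
    need = math.ceil(gamma * (len(S) - 1))
    return all(len((adj[v] & S) - {v}) >= need for v in S)

adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(is_quasi_clique(adj, {1, 2, 3}, gamma=1.0))     # True: a triangle
print(is_quasi_clique(adj, {1, 2, 3, 4}, gamma=0.7))  # False: vertex 4 is too sparse
```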
100 citations
••
TL;DR: A similarity retrieval framework which incorporates both of the aspects of similarity retrieval into a single unified model and shows that for any dissimilarity measure, the “amount” of triangle inequality can be changed to obtain an approximate or full metric which can be used for MAM-based retrieval.
Abstract: In multimedia systems we usually need to retrieve database (DB) objects based on their similarity to a query object, while the similarity assessment is provided by a measure which defines a (dis)similarity score for every pair of DB objects. In most existing applications, the similarity measure is required to be a metric, where the triangle inequality is utilized to speed up the search for relevant objects by use of metric access methods (MAMs), for example, the M-tree. Recent research has shown, however, that nonmetric measures are more appropriate for similarity modeling due to their robustness and the ease with which they model made-to-measure similarity. Unfortunately, due to the lack of triangle inequality, the nonmetric measures cannot be directly utilized by MAMs. From another point of view, some sophisticated similarity measures could be available in a black-box nonanalytic form (e.g., as an algorithm or even a hardware device), where no information about their topological properties is provided, so we have to consider them as nonmetric measures as well. From yet another point of view, the concept of similarity measuring itself is inherently imprecise and we often prefer fast but approximate retrieval over an exact but slower one. To date, the mentioned aspects of similarity retrieval have been solved separately, that is, exact versus approximate search or metric versus nonmetric search. In this article we introduce a similarity retrieval framework which incorporates both of these aspects into a single unified model. Based on the framework, we show that for any dissimilarity measure (either a metric or nonmetric) we are able to change the “amount” of triangle inequality, and so obtain an approximate or full metric which can be used for MAM-based retrieval. Due to the varying “amount” of triangle inequality, the measure is modified in a way suitable for either an exact but slower or an approximate but faster retrieval. Additionally, we introduce the TriGen algorithm aimed at constructing the desired modification of any black-box distance automatically, using just a small fraction of the database.
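The “amount” of triangle inequality can be tuned by applying a concave modifier to the raw distances. The sketch below searches for the weakest power-law modifier d^p that keeps sampled triangle violations under a budget; it mirrors the spirit of TriGen under toy assumptions, not its actual modifier families or search procedure.

```python
import random

def violation_rate(dist, objs, p, trials=2000):
    """Fraction of sampled triplets violating the triangle inequality
    after the concave modifier f(d) = d**p is applied (p in (0, 1]).
    Smaller p makes any nonnegative symmetric measure 'more metric',
    at the price of retrieval precision."""
    bad = 0
    for _ in range(trials):
        a, b, c = random.sample(objs, 3)
        x, y, z = sorted((dist(a, b) ** p, dist(b, c) ** p, dist(a, c) ** p))
        bad += z > x + y
    return bad / trials

def tune_modifier(dist, objs, max_error=0.01, steps=20):
    """Binary-search the largest exponent whose violation rate stays
    under max_error -- the spirit of TriGen's search for the cheapest
    sufficient modification (its real modifier bases differ)."""
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        mid = (lo + hi) / 2
        if violation_rate(dist, objs, mid) <= max_error:
            lo = mid                   # metric enough: try a weaker modifier
        else:
            hi = mid
    return lo

objs = [random.uniform(0, 10) for _ in range(50)]
sqdist = lambda a, b: (a - b) ** 2     # squared distance: a classic nonmetric
print(tune_modifier(sqdist, objs))     # converges near 0.5, where it turns metric
```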
92 citations
••
TL;DR: The authors' experiments with large data sets from two scientific domains show that multi-resolution, parallelizable bitmap indexes occupy an acceptable amount of storage while improving range query performance by roughly a factor of 10, compared to a single-resolution bitmap index of reasonable size.
Abstract: The unique characteristics of scientific data and queries cause traditional indexing techniques to perform poorly on scientific workloads, occupy excessive space, or both. Refinements of bitmap indexes have been proposed previously as a solution to this problem. In this article, we describe the difficulties we encountered in deploying bitmap indexes with scientific data and queries from two real-world domains. In particular, previously proposed methods of binning, encoding, and compressing bitmap vectors either were quite slow for processing the large-range query conditions our scientists used, or required excessive storage space. Nor could the indexes easily be built or used on parallel platforms. In this article, we show how to solve these problems through the use of multi-resolution, parallelizable bitmap indexes, which support a fine-grained trade-off between storage requirements and query performance. Our experiments with large data sets from two scientific domains show that multi-resolution, parallelizable bitmap indexes occupy an acceptable amount of storage while improving range query performance by roughly a factor of 10, compared to a single-resolution bitmap index of reasonable size.
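The storage/recheck trade-off at stake already appears in a single-resolution binned bitmap index, sketched below with Python ints as bitsets; a range query ORs the fully covered bins and rechecks only the rows in the edge bins. This is illustrative only; deployed indexes compress the bitmaps and, per the article, maintain several resolutions in parallel.

```python
from bisect import bisect_right

class BinnedBitmapIndex:
    """Equality-encoded, binned bitmap index: one bitmap per value bin,
    each bitmap stored as a Python int used as a bitset."""

    def __init__(self, values, bin_edges):
        self.values = values
        self.edges = bin_edges                       # ascending boundaries
        self.bitmaps = [0] * (len(bin_edges) + 1)
        for row, v in enumerate(values):
            self.bitmaps[bisect_right(self.edges, v)] |= 1 << row

    def range_query(self, lo, hi):
        b_lo, b_hi = bisect_right(self.edges, lo), bisect_right(self.edges, hi)
        hits = 0
        for b in range(b_lo + 1, b_hi):              # fully covered bins
            hits |= self.bitmaps[b]
        for b in {b_lo, b_hi}:                       # edge bins: recheck rows
            bm, row = self.bitmaps[b], 0
            while bm:
                if bm & 1 and lo <= self.values[row] <= hi:
                    hits |= 1 << row
                bm >>= 1
                row += 1
        return [r for r in range(len(self.values)) if hits >> r & 1]

idx = BinnedBitmapIndex([5, 12, 7, 30, 18], bin_edges=[10, 20])
print(idx.range_query(6, 19))                        # [1, 2, 4] -> values 12, 7, 18
```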
91 citations
••
TL;DR: A novel geometric approach is presented which reduces monitoring the value of a function to a set of constraints applied locally on each of the streams, which enables monitoring of arbitrary threshold functions over distributed data streams in an efficient manner.
Abstract: Monitoring data streams in a distributed system is the focus of much research in recent years. Most of the proposed schemes, however, deal with monitoring simple aggregated values, such as the frequency of appearance of items in the streams. More involved challenges, such as the important task of feature selection (e.g., by monitoring the information gain of various features), still require very high communication overhead using naive, centralized algorithms. We present a novel geometric approach which reduces monitoring the value of a function (vis-a-vis a threshold) to a set of constraints applied locally on each of the streams. The constraints are used to locally filter out data increments that do not affect the monitoring outcome, thus avoiding unnecessary communication. As a result, our approach enables monitoring of arbitrary threshold functions over distributed data streams in an efficient manner. We present experimental results on real-world data which demonstrate that our algorithms are highly scalable, and considerably reduce communication load in comparison to centralized algorithms.
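The flavor of the local constraints can be sketched as a per-node "safe zone" test: the global average lies in the convex hull of the nodes' drift vectors, and the hull is covered by balls that each node can check alone. In the toy version below, random sampling stands in for the analytic bound on the function over the ball.

```python
import numpy as np

def node_is_safe(f, threshold, ref, drift, samples=500):
    """Local test in the geometric spirit: a node stays silent while
    f keeps the same side of the threshold everywhere on the ball
    whose diameter is the segment [ref, drift].  Sampling is used
    here for illustration; the article bounds f analytically."""
    center = (ref + drift) / 2
    radius = np.linalg.norm(drift - ref) / 2
    side = f(ref) > threshold
    for _ in range(samples):
        d = np.random.randn(ref.size)
        p = center + radius * np.random.rand() * d / (np.linalg.norm(d) + 1e-12)
        if (f(p) > threshold) != side:
            return False        # ball crosses the threshold: synchronize
    return True

f = lambda v: float(v @ v)      # a nonlinear function of the statistics vector
ref = np.array([1.0, 1.0])      # reference vector from the last synchronization
print(node_is_safe(f, threshold=5.0, ref=ref, drift=np.array([1.2, 0.9])))
```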
78 citations
••
TL;DR: This editorial analyzes from a variety of perspectives the controversial issue of single-blind versus double-blind reviewing, and proposes a double-blind policy for TODS that attempts to minimize the costs while retaining the core benefit of fairness that double-blind reviewing provides.
Abstract: This editorial analyzes from a variety of perspectives the controversial issue of single-blind versus double-blind reviewing. In single-blind reviewing, the reviewer is unknown to the author, but the identity of the author is known to the reviewer. Double-blind reviewing is more symmetric: The identity of the author and the reviewer are not revealed to each other. We first examine the significant scholarly literature regarding blind reviewing. We then list six benefits claimed for double-blind reviewing and 21 possible costs. To compare these benefits and costs, we propose a double-blind policy for TODS that attempts to minimize the costs while retaining the core benefit of fairness that double-blind reviewing provides, and evaluate that policy against each of the listed benefits and costs. Following that is a general discussion considering several questions: What does this have to do with TODS, does bias exist in computer science, and what is the appropriate decision procedure? We explore the “knobs” a policy design can manipulate to fine-tune a double-blind review policy. This editorial ends with a specific decision.
••
TL;DR: The issues involved in composing mappings given by embedded dependencies are studied; it is shown that full and second-order dependencies that are not limited to be source-to-target are not closed under composition, and that determining whether the composition can be given by these kinds of dependencies is undecidable.
Abstract: Composition of mappings between schemas is essential to support schema evolution, data exchange, data integration, and other data management tasks. In many applications, mappings are given by embedded dependencies. In this article, we study the issues involved in composing such mappings. Our algorithms and results extend those of Fagin et al. [2004], who studied the composition of mappings given by several kinds of constraints. In particular, they proved that full source-to-target tuple-generating dependencies (tgds) are closed under composition, but embedded source-to-target tgds are not. They introduced a class of second-order constraints, SO tgds, that is closed under composition and has desirable properties for data exchange. We study constraints that need not be source-to-target and we concentrate on obtaining (first-order) embedded dependencies. As part of this study, we also consider full dependencies and second-order constraints that arise from Skolemizing embedded dependencies. For each of the three classes of mappings that we study, we provide: (a) an algorithm that attempts to compute the composition; and (b) sufficient conditions on the input mappings which guarantee that the algorithm will succeed. In addition, we give several negative results. In particular, we show that full and second-order dependencies that are not limited to be source-to-target are not closed under composition (for the latter, under the additional restriction that no new function symbols are introduced). Furthermore, we show that determining whether the composition can be given by these kinds of dependencies is undecidable.
••
TL;DR: This article describes the construction of a generic natural language query interface to an XML database that can accept a large class of English sentences as a query, which can be quite complex and include aggregation, nesting, and value joins, among other things.
Abstract: We describe the construction of a generic natural language query interface to an XML database. Our interface can accept a large class of English sentences as a query, which can be quite complex and include aggregation, nesting, and value joins, among other things. This query is translated, potentially after reformulation, into an XQuery expression. The translation is based on mapping grammatical proximity of natural language parsed tokens in the parse tree of the query sentence to proximity of corresponding elements in the XML data to be retrieved. Iterative search in the form of followup queries is also supported. Our experimental assessment, through a user study, demonstrates that this type of natural language interface is good enough to be usable now, with no restrictions on the application domain.
••
TL;DR: The first adaptive packed-memory array (APMA), which automatically adjusts to the input pattern, is given, which has four times fewer element moves per insertion than the traditional PMA and running times that are more than seven times faster.
Abstract: The packed-memory array (PMA) is a data structure that maintains a dynamic set of N elements in sorted order in a Θ(N)-sized array. The idea is to intersperse Θ(N) empty spaces or gaps among the elements so that only a small number of elements need to be shifted around on an insert or delete. Because the elements are stored physically in sorted order in memory or on disk, the PMA can be used to support extremely efficient range queries. Specifically, the cost to scan L consecutive elements is O(1 + L/B) memory transfers. This article gives the first adaptive packed-memory array (APMA), which automatically adjusts to the input pattern. Like the traditional PMA, any pattern of updates costs only O(log² N) amortized element moves and O(1 + (log² N)/B) amortized memory transfers per update. However, the APMA performs even better on many common input distributions, achieving only O(log N) amortized element moves and O(1 + (log N)/B) amortized memory transfers. The article analyzes sequential inserts, where the insertions are to the front of the APMA; hammer inserts, where the insertions “hammer” on one part of the APMA; random inserts, where the insertions are after random elements in the APMA; and bulk inserts, where for constant α ∈ [0, 1], N^α elements are inserted after random elements in the APMA. The article then gives simulation results that are consistent with the asymptotic bounds. For sequential insertions of roughly 1.4 million elements, the APMA has four times fewer element moves per insertion than the traditional PMA and running times that are more than seven times faster.
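The mechanics are easy to show in miniature: sorted elements with gaps, local shifting on insert, and an even respread when the array gets too dense. The sketch below is a single-window simplification; the real PMA/APMA rebalance over a hierarchy of windows with per-level density thresholds, and the APMA additionally places gaps adaptively by insert pattern.

```python
import bisect

class SimplePMA:
    """Stripped-down packed-memory array: keep elements sorted in an
    array with gaps, shift locally on insert, respread evenly on
    overflow of the density threshold."""

    def __init__(self, capacity=8, max_density=0.75):
        self.slots = [None] * capacity
        self.n = 0
        self.max_density = max_density

    def insert(self, x):
        if self.n + 1 > self.max_density * len(self.slots):
            self._respread(2 * len(self.slots))          # grow and re-gap
        occ = [i for i, v in enumerate(self.slots) if v is not None]
        k = bisect.bisect_left([self.slots[i] for i in occ], x)
        i = occ[k] if k < len(occ) else len(self.slots)  # slot of x's successor
        g = next((j for j in range(i, len(self.slots))
                  if self.slots[j] is None), None)
        if g is not None:                # shift successor run right into the gap
            self.slots[i + 1:g + 1] = self.slots[i:g]
            self.slots[i] = x
        else:                            # no gap to the right: use one on the left
            g = max(j for j in range(i) if self.slots[j] is None)
            self.slots[g:i - 1] = self.slots[g + 1:i]
            self.slots[i - 1] = x
        self.n += 1

    def _respread(self, new_cap):
        elems = [v for v in self.slots if v is not None]
        self.slots = [None] * new_cap
        step = new_cap / max(1, len(elems))              # spread evenly
        for k, v in enumerate(elems):
            self.slots[int(k * step)] = v

pma = SimplePMA()
for v in [5, 1, 9, 3, 7, 2, 8]:
    pma.insert(v)
print([v for v in pma.slots if v is not None])           # [1, 2, 3, 5, 7, 8, 9]
```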
••
TL;DR: This article proposes a transformational architecture that is based upon two novel primitive operations, called merging and reduction, that help refine a configuration, treating indexes and materialized views in a unified way, as well as succinctly explain the refinement process to DBAs.
Abstract: Physical database design tools rely on a DBA-provided workload to pick an “optimal” set of indexes and materialized views. Such tools allow either creating a new such configuration or adding new structures to existing ones. However, these tools do not provide adequate support for the incremental and flexible refinement of existing physical structures. Although such refinements are often very valuable for DBAs, a completely manual approach to refinement can lead to infeasible solutions (e.g., excessive use of space). In this article, we focus on the important problem of physical design refinement and propose a transformational architecture that is based upon two novel primitive operations, called merging and reduction. These operators help refine a configuration, treating indexes and materialized views in a unified way, as well as succinctly explain the refinement process to DBAs.
••
TL;DR: The VSol estimator is based on inverse strings, which make the performance of selectivity estimation independent of the number of strings; experiments show that VSol is effective for large skewed databases of short strings.
Abstract: Approximate queries on string data are important due to the prevalence of such data in databases and various conventions and errors in string data. We present the VSol estimator, a novel technique for estimating the selectivity of approximate string queries. The VSol estimator is based on inverse strings and makes the performance of the selectivity estimator independent of the number of strings. To get inverse strings we decompose all database strings into overlapping substrings of length q (q-grams) and then associate each q-gram with its inverse string: the IDs of all strings that contain the q-gram. We use signatures to compress inverse strings, and clustering to group similar signatures. We study our technique analytically and experimentally. The space complexity of our estimator only depends on the number of neighborhoods in the database and the desired estimation error. The time to estimate the selectivity is independent of the number of database strings and linear with respect to the length of query string. We give a detailed empirical performance evaluation of our solution for synthetic and real-world datasets. We show that VSol is effective for large skewed databases of short strings.
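Uncompressed inverse strings are straightforward to build and query. The sketch below uses q-gram intersection to bound the number of candidate matches; VSol's actual contributions (signature compression, clustering) are omitted, and plain substring matching stands in for approximate matching.

```python
from collections import defaultdict

def build_inverse_strings(strings, q=2):
    """Map each q-gram to the IDs of the strings containing it -- the
    'inverse strings' on which the estimator is built."""
    inv = defaultdict(set)
    for sid, s in enumerate(strings):
        for i in range(len(s) - q + 1):
            inv[s[i:i + q]].add(sid)
    return inv

def estimate_matches(inv, query, q=2):
    """Upper-bound count for substring matching: a string can contain
    `query` only if it contains every q-gram of the query."""
    grams = [query[i:i + q] for i in range(len(query) - q + 1)]
    ids = set.intersection(*(inv.get(g, set()) for g in grams)) if grams else set()
    return len(ids)

db = ["smith", "smyth", "smithe", "jones"]
inv = build_inverse_strings(db)
print(estimate_matches(inv, "smit"))   # 2: "smith" and "smithe"
```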
••
TL;DR: This article presents fundamental building blocks to make the choice of execution plans for text-centric tasks in an informed, cost-based way, and adapts results from random-graph theory and statistics to develop a rigorous cost model for the execution plans.
Abstract: Text is ubiquitous and, not surprisingly, many important applications rely on textual data for a variety of tasks. As a notable example, information extraction applications derive structured relations from unstructured text; as another example, focused crawlers explore the Web to locate pages about specific topics. Execution plans for text-centric tasks follow two general paradigms for processing a text database: either we can scan, or “crawl,” the text database or, alternatively, we can exploit search engine indexes and retrieve the documents of interest via carefully crafted queries constructed in task-specific ways. The choice between crawl- and query-based execution plans can have a substantial impact on both execution time and output “completeness” (e.g., in terms of recall). Nevertheless, this choice is typically ad hoc and based on heuristics or plain intuition. In this article, we present fundamental building blocks to make the choice of execution plans for text-centric tasks in an informed, cost-based way. Towards this goal, we show how to analyze query- and crawl-based plans in terms of both execution time and output completeness. We adapt results from random-graph theory and statistics to develop a rigorous cost model for the execution plans. Our cost model reflects the fact that the performance of the plans depends on fundamental task-specific properties of the underlying text databases. We identify these properties and present efficient techniques for estimating the associated parameters of the cost model. We also present two optimization approaches for text-centric tasks that rely on the cost-model parameters and select efficient execution plans. Overall, our optimization approaches help build efficient execution plans for a task, resulting in significant efficiency and output completeness benefits. We complement our results with a large-scale experimental evaluation for three important text-centric tasks and over multiple real-life data sets.
••
TL;DR: This article identifies and addresses the barriers to realizing a unified framework for optimizing top-k queries in middlewares, and develops efficient search schemes over the resulting space of algorithms for identifying the optimal one.
Abstract: This article studies optimizing top-k queries in middlewares. While many assorted algorithms have been proposed, none is generally applicable to a wide range of possible scenarios. Existing algorithms lack both the “generality” to support a wide range of access scenarios and the systematic “adaptivity” to account for runtime specifics. To fill this critical gap, we take a cost-based optimization approach: By runtime search over a space of algorithms, cost-based optimization is general across a wide range of access scenarios, yet adaptive to the specific access costs at runtime. While such optimization has been taken for granted for relational queries from early on, it has been clearly lacking for ranked queries. In this article, we thus identify and address the barriers to realizing such a unified framework. As the first barrier, we need to define a “comprehensive” space encompassing all possibly optimal algorithms to search over. As the second barrier and a conflicting goal, such a space should also be “focused” enough to enable efficient search. For SQL queries that are explicitly composed of relational operators, such a space, by definition, consists of schedules of relational operators (or “query plans”). In contrast, top-k queries do not have logical tasks, such as relational operators. We thus define the logical tasks of top-k queries as building blocks to identify a comprehensive and focused space for top-k queries. We then develop efficient search schemes over such space for identifying the optimal algorithm. Our study indicates that our framework not only unifies, but also outperforms existing algorithms specifically designed for their scenarios.
••
TL;DR: The notion of an extended wavelet coefficient as a flexible, efficient storage method for wavelet coefficients over multimeasure data is introduced and novel algorithms for constructing effective (optimal or near-optimal) extended wavelet-coefficient synopses under a given storage constraint are proposed.
Abstract: Several studies have demonstrated the effectiveness of the Haar wavelet decomposition as a tool for reducing large amounts of data down to compact wavelet synopses that can be used to obtain fast, accurate approximate answers to user queries. Although originally designed for minimizing the overall mean-squared (i.e., L2-norm) error in the data approximation, recently proposed methods also enable the use of Haar wavelets in minimizing other error metrics, such as the relative error in data value reconstruction, which is arguably the most important for approximate query answers. Relatively little attention, however, has been paid to the problem of using wavelet synopses as an approximate query answering tool over complex tabular datasets containing multiple measures, such as those typically found in real-life OLAP applications. Existing decomposition approaches will either operate on each measure individually, or treat all measures as a vector of values and process them simultaneously. As we demonstrate in this article, these existing individual or combined storage approaches for the wavelet coefficients of different measures can easily lead to suboptimal storage utilization, resulting in drastically reduced accuracy for approximate query answers. To address this problem, in this work, we introduce the notion of an extended wavelet coefficient as a flexible, efficient storage method for wavelet coefficients over multimeasure data. We also propose novel algorithms for constructing effective (optimal or near-optimal) extended wavelet-coefficient synopses under a given storage constraint, for both sum-squared error and relative-error norms. Experimental results with both real-life and synthetic datasets validate our approach, demonstrating that our techniques consistently obtain significant gains in approximation accuracy compared to existing solutions.
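For a single measure, the pipeline the article builds on is standard: orthonormal Haar transform, keep the B largest coefficients, reconstruct on demand. A minimal sketch follows; the extended coefficients for multimeasure data are the article's addition and are not shown.

```python
import heapq

def haar(data):
    """Orthonormal Haar transform of a power-of-two-length vector.
    With this normalization, keeping the B largest-magnitude
    coefficients is the L2-optimal synopsis for a single measure."""
    a, out, r2 = list(data), [], 2 ** 0.5
    while len(a) > 1:
        out = [(a[i] - a[i + 1]) / r2 for i in range(0, len(a), 2)] + out
        a = [(a[i] + a[i + 1]) / r2 for i in range(0, len(a), 2)]
    return a + out                     # [overall, coarse ... fine details]

def inverse_haar(c):
    a, k, r2 = c[:1], 1, 2 ** 0.5
    while k < len(c):
        a = [x for s, t in zip(a, c[k:2 * k])
             for x in ((s + t) / r2, (s - t) / r2)]
        k *= 2
    return a

def synopsis(data, B):
    """Keep only the B largest coefficients, zeroing the rest."""
    c = haar(data)
    keep = set(heapq.nlargest(B, range(len(c)), key=lambda i: abs(c[i])))
    return [v if i in keep else 0.0 for i, v in enumerate(c)]

data = [2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0]
approx = inverse_haar(synopsis(data, B=4))    # lossy reconstruction from 4 of 8
print([round(v, 2) for v in approx])
```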
••
TL;DR: This article uses the extended Hamming scheme, which is only 3-wise independent, to generate estimators that significantly outperform state-of-the-art solutions for two problems, namely, size of spatial joins and selectivity estimation.
Abstract: The exact computation of aggregate queries, like the size of join of two relations, usually requires large amounts of memory (constrained in data-streaming) or communication (constrained in distributed computation) and large processing times. In this situation, approximation techniques with provable guarantees, like sketches, are one possible solution. The performance of sketches depends crucially on the ability to generate particular pseudo-random numbers. In this article we investigate both theoretically and empirically the problem of generating k-wise independent pseudo-random numbers and, in particular, that of generating 3- and 4-wise independent pseudo-random numbers that are fast range-summable (i.e., they can be summed in sublinear time). Our specific contributions are: (a) we provide a thorough comparison of the various pseudo-random number generating schemes; (b) we study both theoretically and empirically the fast range-summation property of 3- and 4-wise independent generating schemes; (c) we provide algorithms for the fast range-summation of two 3-wise independent schemes, BCH and extended Hamming; and (d) we show convincing theoretical and empirical evidence that the extended Hamming scheme performs as well as any 4-wise independent scheme for estimating the size of join of two relations using AMS sketches, even though it is only 3-wise independent. We use this scheme to generate estimators that significantly outperform state-of-the-art solutions for two problems, namely, size of spatial joins and selectivity estimation.
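The estimation task these generators feed is the AMS sketch for join sizes. Below is a minimal median-of-means version in which a seeded RNG stands in for the fast range-summable 3-/4-wise independent ±1 families (BCH, extended Hamming) that the article actually studies.

```python
import random

def ams_join_size(R, S, n_estimators=64, group=8, seed=7):
    """Basic AMS-sketch estimate of the join size on one attribute:
    each estimator assigns a ±1 value xi(a) to every key and multiplies
    the signed frequency sums of the two relations; the product has
    expectation sum_a f_R(a) * f_S(a), the exact join size."""
    rng = random.Random(seed)
    keys = set(R) | set(S)
    ests = []
    for _ in range(n_estimators):
        xi = {k: rng.choice((-1, 1)) for k in keys}
        ests.append(sum(xi[k] for k in R) * sum(xi[k] for k in S))
    means = sorted(sum(ests[i:i + group]) / group
                   for i in range(0, len(ests), group))
    return means[len(means) // 2]      # median of means

R = [1] * 50 + [2] * 10 + [3]
S = [1] * 5 + [2] * 20 + [4]
print(ams_join_size(R, S), "exact:", 50 * 5 + 10 * 20)
```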
••
TL;DR: This work is devoted to an expressiveness study of node-selecting queries with proven theoretical and practical applicability, especially in the field of query evaluation against XML streams.
Abstract: Node-selecting queries over trees lie at the core of several important XML languages for the web, such as the node-selection language XPath, the query language XQuery, and the transformation language XSLT. The main syntactic constructs of such queries are the backward predicates, for example, ancestor and preceding, and the forward predicates, for example, descendant and following. Forward predicates are included in the depth-first, left-to-right preorder relation associated with the input tree, whereas backward predicates are included in the inverse of this preorder relation. This work is devoted to an expressiveness study of node-selecting queries with proven theoretical and practical applicability, especially in the field of query evaluation against XML streams. The main question it answers positively is whether, for each input query with forward and backward predicates, there exists an equivalent forward-only output query. This question is then positively answered for input and output queries of varying structural complexity, using LOGLIN and PSPACE reductions. Various existing applications based on the results of this work are reported, including query optimization and streamed evaluation.
••
TL;DR: The algorithms, implementation, and performance evaluation are presented, showing that CLIDE is a viable on-line tool.
Abstract: The CLIDE system assists the owners of sources that participate in Web service-based data publishing systems in publishing a restricted set of parameterized queries over the schemas of their sources and packaging them as WSDL services. The sources may be relational databases, which naturally have a schema, or ad hoc information/application systems, in which case the owner publishes a virtual schema. CLIDE allows information clients to pose queries over the published schema and utilizes prior work on answering queries using views to answer queries that can be computed by combining and processing the results of one or more Web service calls. These queries are called feasible. Contrary to prior work, where infeasible queries are rejected without explanatory feedback, leading the user into a frustrating trial-and-error cycle, CLIDE features a query formulation interface, which extends the QBE-like query builder of Microsoft's SQL Server with a color scheme that guides the user toward formulating feasible queries. CLIDE guarantees that the suggested query edit actions are complete (i.e., each feasible query can be built by following only suggestions), rapidly convergent (the suggestions are tuned to lead to the closest feasible completions of the query), and suitably summarized (at each interaction step, only a minimal number of actions needed to preserve completeness are suggested). We present the algorithms, implementation, and performance evaluation showing that CLIDE is a viable on-line tool.
••
TL;DR: This article proposes a new efficient approach, the nested relational approach, based on the nested relational algebra, and reports on experimental work confirming that existing approaches have difficulties dealing with nonaggregate subqueries while the nested relational approach offers better performance.
Abstract: Most research work on optimization of nested queries focuses on aggregate subqueries. In this article, we show that existing approaches are not adequate for nonaggregate subqueries, especially for those having multiple subqueries and certain comparison operators. We then propose a new efficient approach, the nested relational approach, based on the nested relational algebra. The nested relational approach treats all subqueries in a uniform manner, being able to deal with nested queries of any type and any level. We report on experimental work that confirms that existing approaches have difficulties dealing with nonaggregate subqueries, and that the nested relational approach offers better performance. We also discuss algebraic optimization rules for further optimizing the nested relational approach and the issue of integrating it into relational database systems.
••
TL;DR: This article formally defines pseudoconstraints using a probabilistic model and provides a statistical test to identify pseudoconstraints in a database, and presents an automatic method for detecting cycle pseudoconstraints from a relational database.
Abstract: In this article, we introduce pseudoconstraints, a novel data mining pattern aimed at identifying rare events in databases. First, we formally define pseudoconstraints using a probabilistic model and provide a statistical test to identify pseudoconstraints in a database. Then, we focus on a specific class of pseudoconstraints, named cycle pseudoconstraints, which often occur in databases. We define cycle pseudoconstraints in the context of the ER model and present an automatic method for detecting cycle pseudoconstraints from a relational database. Finally, we present an experiment to show cycle pseudoconstraints “at work” on real data.
••
TL;DR: In experiments with an image database of handwritten digits and a time-series database, the proposed method outperforms existing state-of-the-art non-Euclidean indexing methods, meaning that it provides significantly better tradeoffs between efficiency and retrieval accuracy.
Abstract: A common problem in many types of databases is retrieving the most similar matches to a query object. Finding these matches in a large database can be too slow to be practical, especially in domains where objects are compared using computationally expensive similarity (or distance) measures. Embedding methods can significantly speed up retrieval by mapping objects into a vector space, where distances can be measured rapidly using a Minkowski metric. In this article we present a novel way to improve embedding quality. In particular, we propose to construct embeddings that use a query-sensitive distance measure for the target space of the embedding. This distance measure is used to compare those vectors that the query and database objects are mapped to. The term “query-sensitive” means that the distance measure changes, depending on the current query object. We demonstrate theoretically that using a query-sensitive distance measure increases the modeling power of embeddings and allows them to capture more of the structure of the original space. We also demonstrate experimentally that query-sensitive embeddings can significantly improve retrieval performance. In experiments with an image database of handwritten digits and a time-series database, the proposed method outperforms existing state-of-the-art non-Euclidean indexing methods, meaning that it provides significantly better tradeoffs between efficiency and retrieval accuracy.
••
TL;DR: This article shows how to use “survival analysis” techniques in general, and Cox's proportional hazards regression in particular, to model database changes over time and predict when to update each content summary, and exploits the change model to devise update schedules.
Abstract: Large amounts of (often valuable) information are stored in web-accessible text databases. “Metasearchers” provide unified interfaces to query multiple such databases at once. For efficiency, metasearchers rely on succinct statistical summaries of the database contents to select the best databases for each query. So far, database selection research has largely assumed that databases are static, so the associated statistical summaries do not evolve over time. However, databases are rarely static and the statistical summaries that describe their contents need to be updated periodically to reflect content changes. In this article, we first report the results of a study showing how the content summaries of 152 real web databases evolved over a period of 52 weeks. Then, we show how to use “survival analysis” techniques in general, and Cox's proportional hazards regression in particular, to model database changes over time and predict when we should update each content summary. Finally, we exploit our change model to devise update schedules that keep the summaries up to date by contacting databases only when needed, and then we evaluate the quality of our schedules experimentally over real web databases.
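The scheduling idea can be sketched with a proportional-hazards-shaped change model: scale a baseline change rate by exp(coefficients · features) and recontact a database once the estimated staleness probability crosses a threshold. All numbers below are toy assumptions; the article fits the model with Cox regression rather than hard-coding it.

```python
import math

def update_due(weeks_since_update, features, coef, base_rate, p_stale=0.5):
    """Staleness test with the shape of Cox's model: change rate =
    baseline rate * exp(coef . features).  The exponential baseline
    and all parameter values here are toy stand-ins for fitted ones.
    Recontact once P(summary already stale) exceeds p_stale."""
    rate = base_rate * math.exp(sum(c * f for c, f in zip(coef, features)))
    p_unchanged = math.exp(-rate * weeks_since_update)
    return 1 - p_unchanged > p_stale

# Toy feature: the database's historical change frequency.
for week in (10, 20, 30):
    print(week, update_due(week, features=[1.2], coef=[0.8], base_rate=0.02))
```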
••
TL;DR: This work proves formally that the partial preaggregation method (PP) yields the same results as the F method, and provides analytical and experimental results on the accuracy and computational benefits of the PP method.
Abstract: Given an OLAP query expressed over multiple source OLAP databases, we study the problem of estimating the resulting OLAP target database. The problem arises when it is not possible to derive the result from a single database. The method we use is linear indirect estimation, commonly used for statistical estimation. We examine two obvious computational methods for computing such a target database, called the full cross-product (F) and preaggregation (P) methods. We study the accuracy and computational cost of these methods. While the F method provides a more accurate estimate, it is more expensive computationally than P. Our contribution is in proposing a third, new method, called the partial preaggregation method (PP), which is significantly less expensive than F, but just as accurate. We prove formally that the PP method yields the same results as the F method, and provide analytical and experimental results on the accuracy and computational benefits of the PP method.