scispace - formally typeset
Search or ask a question

Showing papers in "arXiv: Databases in 1999"


Posted Content
TL;DR: This paper proposes three cost-based heuristic algorithms: Volcano-SH and Volcano-RU, which are based on simple modifications to the Volcano search strategy, and a greedy heuristic that incorporates novel optimizations that improve efficiency greatly.
Abstract: Complex queries are becoming commonplace, with the growing use of decision support systems. These complex queries often have a lot of common sub-expressions, either within a single query, or across multiple such queries run as a batch. Multi-query optimization aims at exploiting common sub-expressions to reduce evaluation cost. Multi-query optimization has hither-to been viewed as impractical, since earlier algorithms were exhaustive, and explore a doubly exponential search space. In this paper we demonstrate that multi-query optimization using heuristics is practical, and provides significant benefits. We propose three cost-based heuristic algorithms: Volcano-SH and Volcano-RU, which are based on simple modifications to the Volcano search strategy, and a greedy heuristic. Our greedy heuristic incorporates novel optimizations that improve efficiency greatly. Our algorithms are designed to be easily added to existing optimizers. We present a performance study comparing the algorithms, using workloads consisting of queries from the TPC-D benchmark. The study shows that our algorithms provide significant benefits over traditional optimization, at a very acceptable overhead in optimization time.

336 citations


Posted Content
TL;DR: A comparison of five, representative query languages for XML, highlighting their common features and differences is presented.
Abstract: XML is becoming the most relevant new standard for data representation and exchange on the WWW. Novel languages for extracting and restructuring the XML content have been proposed, some in the tradition of database query languages (i.e. SQL, OQL), others more closely inspired by XML. No standard for XML query language has yet been decided, but the discussion is ongoing within the World Wide Web Consortium and within many academic institutions and Internet-related major companies. We present a comparison of five, representative query languages for XML, highlighting their common features and differences.

208 citations


Posted Content
TL;DR: The next-generation astronomy digital archives will cover most of the sky at fine resolution in many wavelengths, from X-rays, through ultraviolet, optical, and infrared, and the archives will be stored at diverse geographical locations.
Abstract: The next-generation astronomy digital archives will cover most of the universe at fine resolution in many wave-lengths, from X-rays to ultraviolet, optical, and infrared. The archives will be stored at diverse geographical locations. One of the first of these projects, the Sloan Digital Sky Survey (SDSS) will create a 5-wavelength catalog over 10,000 square degrees of the sky (see this http URL). The 200 million objects in the multi-terabyte database will have mostly numerical attributes, defining a space of 100+ dimensions. Points in this space have highly correlated distributions. The archive will enable astronomers to explore the data interactively. Data access will be aided by a multidimensional spatial index and other indices. The data will be partitioned in many ways. Small tag objects consisting of the most popular attributes speed up frequent searches. Splitting the data among multiple servers enables parallel, scalable I/O and applies parallel processing to the data. Hashing techniques allow efficient clustering and pair-wise comparison algorithms that parallelize nicely. Randomly sampled subsets allow debugging otherwise large queries at the desktop. Central servers will operate a data pump that supports sweeping searches that touch most of the data. The anticipated queries require special operators related to angular distances and complex similarity tests of object properties, like shapes, colors, velocity vectors, or temporal behaviors. These issues pose interesting data management challenges.

170 citations


Posted Content
TL;DR: Terabytes of "Internet unfriendly" geo-spatial images were scrubbed and edited into hundreds of millions of “Internet friendly” image tiles and loaded into a SQL data warehouse, demonstrating that general-purpose relational database technology can manage large scale image repositories and shows that web browsers can be a good geo- Spatial image presentation system.
Abstract: The TerraServer stores aerial, satellite, and topographic images of the earth in a SQL database available via the Internet. It is the world's largest online atlas, combining five terabytes of image data from the United States Geological Survey (USGS) and SPIN-2. This report describes the system-redesign based on our experience over the last year. It also reports usage and operations results over the last year -- over 2 billion web hits and over 20 Terabytes of imagry served over the Internet. Internet browsers provide intuitive spatial and text interfaces to the data. Users need no special hardware, software, or knowledge to locate and browse imagery. This paper describes how terabytes of "Internet unfriendly" geo-spatial images were scrubbed and edited into hundreds of millions of "Internet friendly" image tiles and loaded into a SQL data warehouse. Microsoft TerraServer demonstrates that general-purpose relational database technology can manage large scale image repositories, and shows that web browsers can be a good geospatial image presentation system.

75 citations


Posted Content
TL;DR: In this paper, the authors present an algorithm based on the ''System R''-style query optimization algorithm that does not rely on these assumptions and chooses the plan of the least expected cost instead of the plan with a fixed value of the parameters.
Abstract: We identify two unreasonable, though standard, assumptions made by database query optimizers that can adversely affect the quality of the chosen evaluation plans. One assumption is that it is enough to optimize for the expected case---that is, the case where various parameters (like available memory) take on their expected value. The other assumption is that the parameters are constant throughout the execution of the query. We present an algorithm based on the ``System R''-style query optimization algorithm that does not rely on these assumptions. The algorithm we present chooses the plan of the least expected cost instead of the plan of least cost given some fixed value of the parameters. In execution environments that exhibit a high degree of variability, our techniques should result in better performance.

64 citations


Posted Content
TL;DR: This paper answers the question of data checkpointing by providing a necessary and sufficient condition suited for database systems and establishes a bridge between the data object/transaction model and the process/message-passing model.
Abstract: Whether it is for audit or for recovery purposes, data checkpointing is an important problem of distributed database systems. Actually, transactions establish dependence relations on data checkpoints taken by data object managers. So, given an arbitrary set of data checkpoints (including at least a single data checkpoint from a data manager, and at most a data checkpoint from each data manager), an important question is the following one: ``Can these data checkpoints be members of a same consistent global checkpoint?''. This paper answers this question by providing a necessary and sufficient condition suited for database systems. Moreover, to show the usefulness of this condition, two {\em non-intrusive} data checkpointing protocols are derived from this condition. It is also interesting to note that this paper, by exhibiting ``correspondences'', establishes a bridge between the data object/transaction model and the process/message-passing model.

2 citations