
Showing papers by "Peter Buneman published in 2004"


Journal ArticleDOI
TL;DR: This article develops an archiving technique that is efficient in its use of space and preserves the continuity of elements through versions of the database, something not provided by traditional minimum-edit-distance diff approaches; the technique merges all versions into one hierarchy and timestamps each element.
Abstract: Archiving is important for scientific data, where it is necessary to record all past versions of a database in order to verify findings based upon a specific version. Much scientific data is held in a hierarchical format and has a key structure that provides a canonical identification for each element of the hierarchy. In this article, we exploit these properties to develop an archiving technique that is both efficient in its use of space and preserves the continuity of elements through versions of the database, something that is not provided by traditional minimum-edit-distance diff approaches. The approach also uses timestamps. All versions of the data are merged into one hierarchy where an element appearing in multiple versions is stored only once along with a timestamp. By identifying the semantic continuity of elements and merging them into one data structure, our technique is capable of providing meaningful change descriptions. The archive also allows us to easily answer certain temporal queries, such as retrieval of any specific version from the archive and finding the history of an element. This is in contrast with approaches that store a sequence of deltas, where such operations may require undoing a large number of changes or significant reasoning with the deltas. A suite of experiments also demonstrates that our archive does not incur any significant space overhead when contrasted with diff approaches. Another useful property of our approach is that we use XML format to represent hierarchical data and the resulting archive is also in XML. Hence, XML tools can be directly applied on our archive. In particular, we apply an XML compressor on our archive, and our experiments show that our compressed archive outperforms compressed diff-based repositories in space efficiency. We also show how we can extend our archiving tool to an external memory archiver for higher scalability and describe various index structures that can further improve the efficiency of some temporal queries on our archive.
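The version-merging idea can be illustrated with a small sketch. The following Python code is my own illustration, not taken from the article: it assumes elements are identified by canonical keys, timestamps whole elements with the set of versions in which they appear, and simply keeps the latest value for an element (the paper handles changing values more carefully). All function and field names here (merge_version, retrieve, "versions", "children") are hypothetical.

```python
# Sketch of a merged archive for keyed, hierarchical data.
# archive: key -> {"value": ..., "versions": set of version numbers, "children": nested archive}

def merge_version(archive, snapshot, version):
    """Merge one snapshot (key -> {"value", "children"}) into the archive."""
    for key, node in snapshot.items():
        entry = archive.setdefault(key, {"value": None, "versions": set(), "children": {}})
        entry["value"] = node.get("value")   # simplification: keep only the latest value
        entry["versions"].add(version)       # timestamp: this element exists in this version
        merge_version(entry["children"], node.get("children", {}), version)
    return archive

def retrieve(archive, version):
    """Reconstruct a single version from the merged archive."""
    return {key: {"value": e["value"], "children": retrieve(e["children"], version)}
            for key, e in archive.items() if version in e["versions"]}

# Two versions of a tiny hierarchical database, merged into one structure.
v1 = {"geneA": {"value": "x", "children": {}}}
v2 = {"geneA": {"value": "x", "children": {}},
      "geneB": {"value": "y", "children": {}}}

archive = {}
merge_version(archive, v1, 1)
merge_version(archive, v2, 2)
assert retrieve(archive, 1) == v1
assert retrieve(archive, 2) == v2
```

Because each element is stored once with its set of versions, retrieving a version or tracing an element's history is a single pass over the merged structure rather than a replay of deltas, which is the property the article emphasizes.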

158 citations


Proceedings ArticleDOI
21 Jun 2004
TL;DR: This talk describes some of the challenges that digital curation poses for database research and the progress that has been made on them: data integration, database archiving, annotation, and provenance. The last three are topics that the author and his colleagues have been working on.
Abstract: The United Kingdom has recently created a Digital Curation Centre whose purpose is to provide advice on, develop tools for, and conduct research on all aspects of digital curation. But what is digital curation, and why is it interesting to database researchers? Ask around, and you are likely to find two kinds of people involved in digital curation - at least they call themselves curators and use computers. Moreover, on the face of it, they have almost nothing else in common.

An archivist (A) does the digital equivalent of putting documents in boxes. A is dealing with data generated by other people and is concerned with appraisal - the selection of which documents to preserve; indexing and classification - the choice of which document to put into which box; and preservation - ensuring that the documents are preserved for posterity. A finds computers extremely useful because all kinds of "digital objects" may be archived, and the internet provides easy access to digital objects.

A scientist (B) does the digital equivalent of publishing a textbook or compendium. B might be a biologist and is publishing data that results from B's experiments or has been collected as a result of B's research into the literature. B's concerns are with the organization and integration of data collected from other sources, with the process of annotating this data, and with publishing and presenting the data. B finds computers and the internet useful because it is easy to add recent data - one doesn't have to wait for the next paper edition to appear - one can build rather rich representations of the data, and it is easy to publish the data in a form that is accessible to readers. In fact, B is likely to use some form of database technology.

What do A and B have to do with each other? Quite a lot - and much of it depends on database technology or presents challenges for database research. In building up a catalog of archived data, A is already doing something like B, but perhaps with more stable data. B also needs to be concerned with archival issues. Because B has traditionally been more concerned with publishing than preservation, there are now a number of endangered data sets that are potentially important for longitudinal or historical studies.

In this talk I shall describe some of the challenges for database research and the progress that has been made on them. They include:

- Data integration. What has database technology delivered? What can we expect it to deliver? And what is wishful thinking?
- Database archiving. Digital objects are typically fixed, but how does A archive B's data, which may change daily or more often? Should A create a new archive after every update?
- Annotation. This is data that is sometimes attached to a database after it has been designed and populated. What does B need in order to attach annotations to databases over which B has no control?
- Provenance. This is loosely related to annotation. It is something that A will describe, but it is equally important to B. Suppose I point to some element in a database and ask you where it has come from. In B's domain that element has been repeatedly copied from database to database, so even describing the provenance may be a complicated task. Can one do better than provide the transaction logs of all the databases involved?

The first of these is a major topic of concern to a large number of database researchers. I can only provide a brief and opinionated summary.
The last three are topics that I and my colleagues have been working on. I shall describe our progress.

4 citations


Book ChapterDOI
01 Jan 2004
TL;DR: This chapter is motivated by the question “are there any clean mathematical principles behind the design of query languages?” and presents a calculus, itself a language, that could be (and was) used as an internal representation for various user languages.
Abstract: This chapter is motivated by the question "are there any clean mathematical principles behind the design of query languages?" One can hardly blame the reader of various recent standards for asking this question. The authors try to sketch what such a mathematical framework could be. One of the classifying principles they use extensively is that of languages being organised around type systems, with language primitives corresponding to constructors and deconstructors for each type. There is some value in casting the concepts in as general a form as possible; hence the use of the language of category theory for describing them. Once the semantic framework is discussed, the chapter presents a calculus, itself a language, that could be (and was) used as an internal representation for various user languages. The discussion is relevant to all kinds of data models: relational, object-relational, object-oriented, and semi-structured.
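To make the constructor/deconstructor idea concrete, here is a small Python sketch of my own, not taken from the chapter: a set type whose primitives are the constructors empty, singleton, and union, and whose deconstructor is a structural-recursion operator (often written ext(f) in this line of work). The names and the join example are illustrative assumptions.

```python
# Constructors for the set type.
def empty():
    return frozenset()

def singleton(x):
    return frozenset([x])

def union(s, t):
    return s | t

# Deconstructor: structural recursion over sets ("flat map").
def ext(f):
    """Apply f (element -> set) to every element of a set and union the results."""
    def apply(s):
        result = empty()
        for x in s:
            result = union(result, f(x))
        return result
    return apply

# A query built only from these primitives: join r and s on matching keys.
r = frozenset([("a", 1), ("b", 2)])
s = frozenset([("a", 10), ("b", 20)])

join = ext(lambda x:
           ext(lambda y:
               singleton((x[0], x[1], y[1])) if x[0] == y[0] else empty()
               )(s)
           )(r)
# join == frozenset({("a", 1, 10), ("b", 2, 20)})
```

The point of such a calculus is that queries like the join above are just nested uses of the type's primitives, which is what makes it suitable as an internal representation for several different user-level query languages.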

3 citations