Showing papers on "Tuple published in 1999"

PDF

Open Access

Proceedings Article•

Relational Databases for Querying XML Documents: Limitations and Opportunities

[...]

Jayavel Shanmugasundaram, Kristin Tufte, Chun Zhang, Gang He, David J. DeWitt, Jeffrey F. Naughton - Show less +2 more

07 Sep 1999

TL;DR: It turns out that the relational approach can handle most (but not all) of the semantics of semi-structured queries over XML data, but is likely to be effective only in some cases.

...read moreread less

Abstract: XML is fast emerging as the dominant standard for representing data in the World Wide Web. Sophisticated query engines that allow users to effectively tap the data stored in XML documents will be crucial to exploiting the full power of XML. While there has been a great deal of activity recently proposing new semistructured data models and query languages for this purpose, this paper explores the more conservative approach of using traditional relational database engines for processing XML documents conforming to Document Type Descriptors (DTDs). To this end, we have developed algorithms and implemented a prototype system that converts XML documents to relational tuples, translates semi-structured queries over XML documents to SQL queries over tables, and converts the results to XML. We have qualitatively evaluated this approach using several real DTDs drawn from diverse domains. It turns out that the relational approach can handle most (but not all) of the semantics of semi-structured queries over XML data, but is likely to be effective only in some cases. We identify the causes for these limitations and propose certain extensions to the relational model that would make it more appropriate for processing queries over XML documents.

...read moreread less

1,111 citations

Proceedings Article•DOI•

Packet classification using tuple space search

[...]

V. Srinivasan¹, Subhash Suri¹, George Varghese¹•Institutions (1)

Washington University in St. Louis¹

30 Aug 1999

TL;DR: The Pruned Tuple Space search is the only scheme known to us that allows fast updates and fast search times, and an optimal algorithm is described, called Rectangle Search, for two-dimensional filters.

...read moreread less

Abstract: Routers must perform packet classification at high speeds to efficiently implement functions such as firewalls and QoS routing. Packet classification requires matching each packet against a database of filters (or rules), and forwarding the packet according to the highest priority filter. Existing filter schemes with fast lookup time do not scale to large filter databases. Other more scalable schemes work for 2-dimensional filters, but their lookup times degrade quickly with each additional dimension. While there exist good hardware solutions, our new schemes are geared towards software implementation.We introduce a generic packet classification algorithm, called Tuple Space Search (TSS). Because real databases typically use only a small number of distinct field lengths, by mapping filters to tuples even a simple linear search of the tuple space can provide significant speedup over naive linear search over the filters. Each tuple is maintained as a hash table that can be searched in one memory access. We then introduce techniques for further refining the search of the tuple space, and demonstrate their effectiveness on some firewall databases. For example, a real database of 278 filters had a tuple space of 41 which our algorithm prunes to 11 tuples. Even as we increased the filter database size from 1K to 100K (using a random two-dimensional filter generation model), the number of tuples grew from 53 to only 186, and the pruned tuples only grew from 1 to 4. Our Pruned Tuple Space search is also the only scheme known to us that allows fast updates and fast search times. We also show a lower bound on the general tuple space search problem, and describe an optimal algorithm, called Rectangle Search, for two-dimensional filters.

...read moreread less

604 citations

Journal Article•DOI•

Tane: An Efficient Algorithm for Discovering Functional and Approximate Dependencies

[...]

Yka Huhtala, Juha Kärkkäinen, Pasi Porkka, Hannu Toivonen

01 Jan 1999-The Computer Journal

TL;DR: TANE is an efficient algorithm for finding functional dependencies from large databases based on partitioning the set of rows with respect to their attribute values, which makes testing the validity of functional dependencies fast even for a large number of tuples.

...read moreread less

Abstract: The discovery of functional dependencies from relations is an important database analysis technique. We present TANE, an efficient algorithm for finding functional dependencies from large databases. TANE is based on partitioning the set of rows with respect to their attribute values, which makes testing the validity of functional dependencies fast even for a large number of tuples. The use of partitions also makes the discovery of approximate functional dependencies easy and efficient and the erroneous or exceptional rows can be identified easily. Experiments show that T ANE is fast in practice. For benchmark databases the running times are improved by several orders of magnitude over previously published results. The algorithm is also applicable to much larger datasets than the previous methods.

...read moreread less

602 citations

Proceedings Article•

Evaluating Top-k Selection Queries

[...]

Surajit Chaudhuri¹, Luis Gravano²•Institutions (2)

Microsoft¹, Columbia University²

07 Sep 1999

TL;DR: This paper studies how to determine a range query to evaluate a top-k query by exploiting the statistics available to a relational DBMS, and the impact of the quality of these statistics on the retrieval eciency of the resulting scheme.

...read moreread less

Abstract: In many applications, users specify target values for certain attributes, without requiring exact matches to these values in return. Instead, the result to such queries is typically a rank of the \top k" tuples that best match the given attribute values. In this paper, we study the advantages and limitations of processing a top-k query by translating it into a single range query that traditional relational DBMSs can process eciently. In particular, we study how to determine a range query to evaluate a top-k query by exploiting the statistics available to a relational DBMS, and the impact of the quality of these statistics on the retrieval eciency of the resulting scheme.

...read moreread less

328 citations

Book•

Incremental maintenance of views with duplicates

[...]

Timothy G. Griffin, Leonid Libkin

01 Jun 1999

TL;DR: An algorithm that propagates changes from base relations to materialized views is presented, based on reasoning about equivalence of bag-valued expressions, and it is proved that it is correct and preserves a certain notion of minimality that ensures that no unnecessary tuples are computed.

...read moreread less

Abstract: We study the problem of efficient maintenance of materialized views that may contain duplicates. This problem is particularly important when queries against such views involve aggregate functions, which need duplicates to produce correct results. Unlike most work on the view maintenance problem that is based on an algorithmic approach, our approach is algebraic and based on equational reasoning. This approach has a number of advantages: it is robust and easily extendible to new language constructs, it produces output that can be used by query optimizers, and it simplifies correctness proofs.We use a natural extension of the relational algebra operations to bags (multisets) as our basic language. We present an algorithm that propagates changes from base relations to materialized views. This algorithm is based on reasoning about equivalence of bag-valued expressions. We prove that it is correct and preserves a certain notion of minimality that ensures that no unnecessary tuples are computed. Although it is generally only a heuristic that computing changes to the view rather than recomputing the view from scratch is more efficient, we prove results saying that under normal circumstances one should expect, the change propagation algorithm to be significantly faster and more space efficient than complete recomputing of the view. We also show that our approach interacts nicely with aggregate functions, allowing their correct evaluation on views that change.

...read moreread less

223 citations

Journal Article•DOI•

Query optimization in the presence of limited access patterns

[...]

Daniela Florescu¹, Alon Y. Levy², Ioana Manolescu¹, Dan Suciu³•Institutions (3)

French Institute for Research in Computer Science and Automation¹, University of Washington², AT&T Labs³

01 Jun 1999

TL;DR: A theoretical and experimental analysis of the resulting search space and a novel query optimization algorithm that is designed to perform well under the different conditions that may arise are described.

...read moreread less

Abstract: We consider the problem of query optimization in the presence of limitations on access patterns to the data (i.e., when one must provide values for one of the attributes of a relation in order to obtain tuples). We show that in the presence of limited access patterns we must search a space of annotated query plans, where the annotations describe the inputs that must be given to the plan. We describe a theoretical and experimental analysis of the resulting search space and a novel query optimization algorithm that is designed to perform well under the different conditions that may arise. The algorithm searches the set of annotated query plans, pruning invalid and non-viable plans as early as possible in the search space, and it also uses a best-first search strategy in order to produce a first complete plan early in the search. We describe experiments to illustrate the performance of our algorithm.

...read moreread less

184 citations

Patent•

Method and apparatus for parallel execution of SQL-from within user defined functions

[...]

Navin Kabra¹, Jignesh M. Patel², Jie-Bing Yu¹, Biswadeep Nag¹, Jian-Jun Chen¹ - Show less +1 more•Institutions (2)

NCR Corporation¹, Teradata²

22 Dec 1999

TL;DR: In this paper, a C++ class (hereinafter referred to as "dispatcher") is proposed to take an SQL query and start parallel execution of the query, which is optimized and parallelized.

...read moreread less

Abstract: A method, apparatus, and an article of manufacture for parallel execution of SQL operations from within user defined functions. One or more embodiments of the invention provide the user defined function (UDF) with a C++ class (hereinafter referred to as “dispatcher”) that can take an SQL query and start parallel execution of the query. The query is optimized and parallelized. The dispatcher executes the query, sets up the communication links between the various operators in the query, and ensures that all the results are sent back to the data-server that originated the query request. Further, the dispatcher merges the results of the parallel execution and produces a single stream of tuples that is fed to the calling UDF. To provide the single stream to the calling UDF, one or more embodiments of the invention utilize a class that provides the UDF with a simple and easy-to-use interface to access the results of the nested SQL execution.

...read moreread less

167 citations

Proceedings Article•DOI•

Scalable trigger processing

[...]

Eric N. Hanson¹, C. Carnes, L. Huang, M. Konyala, L. Noronha, Srinivasan Parthasarathy, J.B. Park, A. Vernon - Show less +4 more•Institutions (1)

University of Florida¹

23 Mar 1999

TL;DR: This paper proposes a way to develop a truly scalable trigger system with a trigger cache to use the main memory effectively, and a memory-conserving selection predicate index based on the use of unique expression formats called expression signatures.

...read moreread less

Abstract: Current database trigger systems have extremely limited scalability. This paper proposes a way to develop a truly scalable trigger system. Scalability to large numbers of triggers is achieved with a trigger cache to use the main memory effectively, and a memory-conserving selection predicate index based on the use of unique expression formats called expression signatures. A key observation is that if a very large number of triggers are created, many will have the same structure, except for the appearance of different constant values. When a trigger is created, tuples are added to special relations created for expression signatures to hold the trigger's constants. These tables can be augmented with a database index or main-memory index structure to serve as a predicate index. The design presented also uses a number of types of concurrency to achieve scalability, including token (tuple)-level, condition-level, rule action-level and data-level concurrency.

...read moreread less

141 citations

Journal Article•

Lower Bounds for Linear Satisfiability Problems

[...]

Jeff Erickson

01 Jan 1999-Chicago Journal of Theoretical Computer Science

TL;DR: An Ω(ndr/2e) lower bound is proved for the following problem: for some fixed linear equation in r variables, given n real numbers, do any r of them satisfy the equation?

...read moreread less

Abstract: We prove an Ω(ndr/2e) lower bound for the following problem: For some fixed linear equation in r variables, given n real numbers, do any r of them satisfy the equation? Our lower bound holds in a restricted linear decision tree model, in which each decision is based on the sign of an arbitrary linear combination of r or fewer inputs. In this model, our lower bound is as large as possible. Previously, this lower bound was known only for a few special cases and only in more specialized models of computation. Our lower bound follows from an adversary argument. We show that for any algorithm, there is a input that contains Ω(ndr/2e) “critical” r-tuples, which have the following important property. None of the critical tuples satisfies the equation; however, if the algorithm does not directly test each critical tuple, then the adversary can modify the input, in a way that is undetectable to the algorithm, so that some untested tuple does satisfy the equation. A key step in the proof is the introduction of formal infinitesimals into the adversary input. A theorem of Tarski implies that if we can construct a single input containing infinitesimals that is hard for every algorithm, then for every decision tree algorithm there exists a corresponding real-valued input which is hard for that algorithm. An extended abstract of this paper can be found in [Eri95].

...read moreread less

75 citations

Proceedings Article•DOI•

Tuple centres for the coordination of Internet agents

[...]

Andrea Omicini¹, Franco Zambonelli•Institutions (1)

University of Bologna¹

28 Feb 1999

TL;DR: The paper presents the TUCSON coordination model for Internet applications based on network-aware (possibly mobile) agents, based on the notion of tuple centre, an enhanced tuple space whose behaviour can be extended according to the application needs.

...read moreread less

Abstract: The paper presents the TUCSON coordination model for Internet applications based on network-aware (possibly mobile) agents. The model is based on the notion of tuple centre, an enhanced tuple space whose behaviour can be extended according to the application needs. Everv node of a TUCSON environment provides its local communication space, made up of a multiplicity of independently-programmable tuple centres. This makes it possible to embed global system properties into the space of components’ interaction, thus enabling flexible cooperation over space and time between agents, and permitting to easily face many issues critical to Internet applications, such as heterogeneity and dynamicity of the execution environments.

...read moreread less

74 citations

Journal Article•DOI•

Fast joins using join indices

[...]

Zhe Li¹, Kenneth A. Ross¹•Institutions (1)

Columbia University¹

01 Apr 1999

TL;DR: Two new algorithms, “Jive join” and “Slam join,” are proposed for computing the join of two relations using a join index, which perform significantly better than Valduriez's algorithm, the TID join algorithm, and hash join algorithms.

...read moreread less

Abstract: Two new algorithms, “Jive join” and “Slam join,” are proposed for computing the join of two relations using a join index. The algorithms are duals: Jive join range-partitions input relation tuple ids and then processes each partition, while Slam join forms ordered runs of input relation tuple ids and then merges the results. Both algorithms make a single sequential pass through each input relation, in addition to one pass through the join index and two passes through a temporary file, whose size is half that of the join index. Both algorithms require only that the number of blocks in main memory is of the order of the square root of the number of blocks in the smaller relation. By storing intermediate and final join results in a vertically partitioned fashion, our algorithms need to manipulate less data in memory at a given time than other algorithms. The algorithms are resistant to data skew and adaptive to memory fluctuations. Selection conditions can be incorporated into the algorithms. Using a detailed cost model, the algorithms are analyzed and compared with competing algorithms. For large input relations, our algorithms perform significantly better than Valduriez's algorithm, the TID join algorithm, and hash join algorithms. An experimental study is also conducted to validate the analytical results and to demonstrate the performance characteristics of each algorithm in practice.

...read moreread less

Patent•

Query optimization through the use of multi-column statistics to avoid the problems of non-indexed column correlation

[...]

Thomas A. Beavin¹, Balakrishna R. Iyer¹, Akira Shibamiya¹, Hong Sang Tie¹, Min Wang¹ - Show less +1 more•Institutions (1)

IBM¹

26 Mar 1999

TL;DR: In this article, a multi-column linear quantile statistic is collected by dividing the data of multiple columns into sub-ranges where each sub-range has approximately an even distribution of data, and determining a frequency and cardinality of each subrange.

...read moreread less

Abstract: The system, method, and program of this invention collects multi-column statistics, by a database management system, to reflect a relationship among multiple columns of a table in a relational database. These statistics are stored in the system catalog, and are used during query optimization to obtain an estimate of the number of qualifying rows when a query has predicates on multiple columns of a table. A multi-column linear quantile statistic is collected by dividing the data of multiple columns into sub-ranges where each sub-range has approximately an even distribution of data, and determining a frequency and cardinality of each sub-range. A multi-column polygonal quantile statistic is collected by dividing the data of multiple columns into sub-spaces where each sub-space contains approximately the same number of tuples, and determining a frequency and cardinality of each sub-space. The system catalog is accessed for the stored multi-column linear quantile statistic for a query having a single range predicate and at least one equal predicate to determine the selectivity value for the predicates of the query. The system catalog is accessed for the stored multi-column polygonal quantile statistic for a query having more than one range predicate. These statistics are used in various ways to determine the selectivity value for the predicates of the query.

...read moreread less

Dissertation•

A Pi-Calculus Based Approach to Software Composition

[...]

Markus Lumpe

01 Jan 1999

TL;DR: The L-Calculus is presented, a variant of the Pi-calculus in which agents communicate by passing extensible, labeled records, or so-called "forms", rather than tuples, which makes it much easier to model compositional abstractions than it is possible in the plain Pi- Calculus.

...read moreread less

Abstract: Present-day applications are increasingly required to be flexible, or "open" in a variety of ways. By flexibility we mean that these applications have to be portable (to different hardware and software platforms), interoperable (with other applications), extendible (to new functionality), configurable (to individual users' or clients' needs), and maintainable. These kinds of flexibility are currently best supported by component-oriented software technology: components, by means of abstraction, support portability, interoperability, and maintainability. Extendibility and configurability are supported by different forms of binding technology, or "glue": application parts, or even whole applications can be created by composing software components; applications stay flexible by allowing components to be replaced or reconfigured, possibly at runtime. This thesis develops a formal language for software composition that is based on the Pi-calculus. More precisely, we present the L-calculus, a variant of the Pi-calculus in which agents communicate by passing extensible, labeled records, or so-called "forms", rather than tuples. This approach makes it much easier to model compositional abstractions than it is possible in the plain Pi-calculus, since the contents of communication are now independent of position, agents are more naturally polymorphic since communication forms can be easily extended, and environmental arguments can be passed implicitly. The L-calculus is developed in three stages: (i) we analyse whether the Pi-calculus is suitable to model composition abstractions, (ii) driven by the insights we got using the Pi-calculus, we de ne a new calculus that has better support for software composition (e.g., provides support for inherently extensible software construction), and (iii), we de ne a first-order type system with subtype polymorphism and sound record concatenation that allows us to check statically an agent system in order to prevent the occurrences of run-time errors. We conclude with defining a first Java-based composition system and Piccola, a prototype composition language based on the L-calculus. The composition system provides support for integrating arbitrary compositional abstractions using both Piccola and standard bridging technologies like RMI and CORBA. Furthermore, the composition systems maintains a composition library that provides components in a uniform way.

...read moreread less

Proceedings Article•DOI•

Incremental multidimensional scaling method for database visualization

[...]

Wojciech Basalaj¹•Institutions (1)

University of Cambridge¹

25 Mar 1999-electronic imaging

TL;DR: The incremental Multidimensional Scaling method presented here uses cluster analysis techniques to assess the structural significance of groups of data objects and creates an opportunity to ignore dissimilarities between closely associated objects, thus greatly reducing input size.

...read moreread less

Abstract: A collection of entity descriptions may be conveniently represented by a set of tuples or a set of objects with appropriate attributes. The utility of relational and object databases is based on this premise. Methods of multivariate analysis can naturally be applied to such a representation. Multidimensional Scaling deserves particular attention because of its suitability for visualization. The advantage of using Multidimensional Scaling is its generality. Provided that one can judge or calculate the dissimilarity between any pair of data objects, this method can be applied. This makes it invariant to the number and types of object attributes. To take advantage of this method for visualizing large collections of data, however, its inherent computational complexity needs to be alleviated. This is particularly the case for least squares scaling, which involves numerical minimization of a loss function; on the other hand the technique gives better configurations than analytical classical scaling. Numerical optimization requires selection of a convergence criterion, i.e. deciding when to stop. A common solution is to stop after a predetermined number of iterations has been performed. Such an approach, while guaranteed to terminate, may prematurely abort the optimization. The incremental Multidimensional Scaling method presented here solves these problems. It uses cluster analysis techniques to assess the structural significance of groups of data objects. This creates an opportunity to ignore dissimilarities between closely associated objects, thus greatly reducing input size. To detect convergence it maintains a compact representation of all intermediate optimization results. This method has been applied to the analysis of database tables.

...read moreread less

Dissertation•DOI•

Components, Scripts, and Glue: A conceptual framework for software composition

[...]

Jean-Guy Schneider

01 Oct 1999

TL;DR: This thesis presents a conceptual framework for componentbased software development incorporating the notions of components and frameworks, software architectures, glue, as well as scripting and coordination, which allows for an algebraic view of software composition.

...read moreread less

Abstract: The last decade has shown that object-oriented technology alone is not enough to cope with the rapidly changing requirements of present-day applications. Typically, objectoriented methods do not lead to designs that make a clear separation between computational and compositional aspects. Component-based systems, on the other hand, achieve flexibility by clearly separating the stable parts of systems (i.e. the components) from the specification of their composition. Components are black-box entities that encapsulate services behind well-defined interfaces. The essential point is that components are not used in isolation, but according to a software architecture which determines the interfaces that components may have and the rules governing their composition. A component, therefore, cannot be separated from a component framework. Naturally, it is not enough to have components and frameworks, but one needs a way to plug components together. However, one of the main problems with existing languages and systems is that there is no generally accepted definition of how components can be composed. In this thesis, we argue that the flexibility and adaptability needed for component-based applications to cope with changing requirements can be substantially enhanced if we do not only think in terms of components, but also in terms of architectures, scripts, and glue. Therefore, we present a conceptual framework for componentbased software development incorporating the notions of components and frameworks, software architectures, glue, as well as scripting and coordination, which allows for an algebraic view of software composition. Furthermore, we define the FORM calculus, an offspring of the asynchronous Pi-calculus, as a formal foundation for a composition language that makes the ideas of the conceptual framework concrete. The FORM calculus replaces the tuple communication of the Pi-calculus by the communication of forms (or extensible records). This approach overcomes the problem of position-dependent arguments, since the contents of communications are now independent of positions and, therefore, makes it easier to define flexible and extensible abstractions. We use the FORM calculus to define a (meta-level) framework for concurrent, objectoriented programming and show that common object-oriented programming abstractions such as instance variables and methods, different method dispatch strategies as well as synchronization are most easily modelled when class metaobjects are explicitly reified as first-class entities and when a compositional view of object-oriented abstractions is adopted. Finally, we show that both, polymorphic form extension and restriction are the basic composition mechanisms for forms and illustrate that they are the key concepts for defining extensible and adaptable, hence reusable higher-level compositional abstractions.

...read moreread less

Standard models under polynomial positivity conditions

[...]

Sandra Pott

01 Jan 1999

TL;DR: In this article, the authors developed standard models for commuting tuples of bounded linear operators on a Hilbert space under certain polynomial positivity con- ditions, generalizing the work of V. Muller and F.-H. Vasilescu in (6), (14).

...read moreread less

Abstract: We develop standard models for commuting tuples of bounded linear operators on a Hilbert space under certain polynomial positivity con- ditions, generalizing the work of V. Muller and F.-H. Vasilescu in (6), (14). As a consequence of the model, we prove a von Neumann-type inequal- ity for such tuples. Up to similarity, we obtain the existence of in a certain sense "unitary" dilations.

...read moreread less

Book Chapter•DOI•

Heuristic Measures of Interestingness

[...]

Robert J. Hilderman¹, Howard J. Hamilton¹•Institutions (1)

University of Regina¹

15 Sep 1999

TL;DR: This paper presents and empirically compares sixteen heuristic measures that evaluate the structure of a summary to assign a single real-valued index that represents its interestingness relative to other summaries generated from the same database.

...read moreread less

Abstract: The tuples in a generalized relation (i.e., a summary generated from a database) are unique, and therefore, can be considered to be a population with a structure that can be described by some probability distribution. In this paper, we present and empirically compare sixteen heuristic measures that evaluate the structure of a summary to assign a single real-valued index that represents its interestingness relative to other summaries generated from the same database. The heuristics are based upon well-known measures of diversity, dispersion, dominance, and inequality used in several areas of the physical, social, ecological, management, information, and computer sciences. Their use for ranking summaries generated from databases is a new application area. All sixteen heuristics rank less complex summaries (i.e., those with few tuples and/or few non-ANY attributes) as most interesting. We demonstrate that for sample data sets, the order in which some of the measures rank summaries is highly correlated.

...read moreread less

Book Chapter•DOI•

In Search of the Lost Schema

[...]

Stéphane Grumbach¹, Giansalvatore Mecca•Institutions (1)

French Institute for Research in Computer Science and Automation¹

10 Jan 1999

TL;DR: Depending upon the encoding of empty sets, two polynomial on-line algorithms are proposed for solving the schema finding problem, and it is proved that with a high probability, both algorithms find the schema after examining a fixed number of tuples, thus leading in practice to a linear time behavior with respect to the database size for wrapping the data.

...read moreread less

Abstract: We study the problem of rediscovering the schema of nested relations that have been encoded as strings for storage purposes. We consider various classes of encoding functions, and consider the markup encodings, which allow to find the schema without knowledge of the encoding function, under reasonable assumptions on the input data. Depending upon the encoding of empty sets, we propose two polynomial on-line algorithms (with different buffer size) solving the schema finding problem. We also prove that with a high probability, both algorithms find the schema after examining a fixed number of tuples, thus leading in practice to a linear time behavior with respect to the database size for wrapping the data. Finally, we show that the proposed techniques are well-suited for practical applications, such as structuring and wrapping HTML pages and Web sites.

...read moreread less

Patent•

Sampling over joins for database systems

[...]

Surajit Chaudhuri¹, Rajeev Motwani¹, Vivek Narasayya¹•Institutions (1)

Microsoft¹

15 Mar 1999

TL;DR: In this article, a database server supports weighted and unweighted sampling of records or tuples in accordance with desired sampling semantics such as with replacement (WR), without replacement (WoR), or independent coin flips (CF) semantics, for example.

...read moreread less

Abstract: A database server supports weighted and unweighted sampling of records or tuples in accordance with desired sampling semantics such as with replacement (WR), without replacement (WoR), or independent coin flips (CF) semantics, for example. The database server may perform such sampling sequentially not only to sample non-materialized records such as those produced as a stream by a pipeline in a query tree for example, but also to sample records, whether materialized or not, in a single pass. The database server also supports sampling over a join of two relations of records or tuples without requiring the computation of the full join and without requiring the materialization of both relations and/or indexes on the join attribute values of both relations.

...read moreread less

Journal Article•

Dealing with semantic heterogeneity during data integration

[...]

Zoubida Kedad, Elisabeth Métais

01 Jan 1999-Lecture Notes in Computer Science

TL;DR: In this article, semantic operators allow a linguistic flexibility in the queries, e.g. two tuples with the values red and vermilion could match in a semantic join on the color attribute.

...read moreread less

Abstract: Multi-sources information systems, such as data warehouse systems, involve heterogeneous sources. In this paper, we deal with the semantic heterogeneity of the data instances. Problems may occur when confronting sources, each time different level of denominations have been used for the same value, e.g. vermilion in one source, and red in an other. We propose to manage this semantic heterogeneity by using a linguistic dictionary. Semantic operators allow a linguistic flexibility in the queries, e.g. two tuples with the values red and vermilion could match in a semantic join on the color attribute. A particularity of our approach is it states the scope of the flexibility by defining classes of equivalent values by the mean of priority nodes. They are used as parameters for allowing the user to define the scope of the flexibility in a very natural manner, without specifying any distance.

...read moreread less

Journal Article•DOI•

Data Mining in Large Databases Using Domain Generalization Graphs

[...]

Robert J. Hilderman¹, Howard J. Hamilton¹, Nick Cercone²•Institutions (2)

University of Regina¹, University of Waterloo²

01 Nov 1999

TL;DR: This work presents serial and parallel versions of the Multi-Attribute Generalization algorithm for traversing the generalization state space described by joining the domain generalization graphs for multiple attributes, and presents the interestingness of the resulting summaries using measures based upon variance and relative entropy.

...read moreread less

Abstract: Attribute-oriented generalization summarizes the information in a relational database by repeatedly replacing specific attribute values with more general concepts according to user-defined concept hierarchies. We introduce domain generalization graphs for controlling the generalization of a set of attributes and show how they are constructed. We then present serial and parallel versions of the Multi-Attribute Generalization algorithm for traversing the generalization state space described by joining the domain generalization graphs for multiple attributes. Based upon a generate-and-test approach, the algorithm generates all possible summaries consistent with the domain generalization graphs. Our experimental results show that significant speedups are possible by partitioning path combinations from the DGGs across multiple processors. We also rank the interestingness of the resulting summaries using measures based upon variance and relative entropy. Our experimental results also show that these measures provide an effective basis for analyzing summary data generated from relational databases. Variance appears more useful because it tends to rank the less complex summaries (i.e., those with few attributes and/or tuples) as more interesting.

...read moreread less

Journal Article•DOI•

On linearly rigid tuples

[...]

Karl Strambach, Helmut Völklein

01 Jan 1999-Crelle's Journal

Journal Article•DOI•

Data Mining via Discretization, Generalization and Rough Set Feature Selection

[...]

Xiaohua Hu, Nick Cercone¹•Institutions (1)

University of Waterloo¹

01 Feb 1999-Knowledge and Information Systems

TL;DR: A prototype knowledge discovery system DBROUGH-II has been constructed by integrating discretization, generalization, rough set feature selection and a variety of data mining algorithms, demonstrating that different kinds of knowledge rules, such as characteristic rules, discriminant rules, maximal generalized classification rules, and data evolution regularities, can be discovered efficiently and effectively.

...read moreread less

Abstract: We present a data mining method which integrates discretization, generalization and rough set feature selection. Our method reduces the data horizontally and vertically. In the first phase, discretization and generalization are integrated. Numeric attributes are discretized into a few intervals. The primitive values of symbolic attributes are replaced by high level concepts and some obvious superfluous or irrelevant symbolic attributes are also eliminated. The horizontal reduction is done by merging identical tuples after substituting an attribute value by its higher level value in a pre- defined concept hierarchy for symbolic attributes, or the discretization of continuous (or numeric) attributes. This phase greatly decreases the number of tuples we consider further in the database(s). In the second phase, a novel context- sensitive feature merit measure is used to rank features, a subset of relevant attributes is chosen, based on rough set theory and the merit values of the features. A reduced table is obtained by removing those attributes which are not in the relevant attributes subset and the data set is further reduced vertically without changing the interdependence relationships between the classes and the attributes. Finally, the tuples in the reduced relation are transformed into different knowledge rules based on different knowledge discovery algorithms. Based on these principles, a prototype knowledge discovery system DBROUGH-II has been constructed by integrating discretization, generalization, rough set feature selection and a variety of data mining algorithms. Tests on a telecommunication customer data warehouse demonstrates that different kinds of knowledge rules, such as characteristic rules, discriminant rules, maximal generalized classification rules, and data evolution regularities, can be discovered efficiently and effectively.

...read moreread less

Patent•

Method and system for incremental database maintenance

[...]

Inderpal Singh Mumick¹, Himanshu Gupta¹•Institutions (1)

AT&T¹

20 May 1999

TL;DR: In this paper, the authors present a method and system for incrementally maintaining a database having at least one materialized view based on a table, which is updated by applying the higher-level change table to the materialised view using a refresh operation, which has two parameters, a join condition and an update function specification.

...read moreread less

Abstract: The present invention is a method and system for incrementally maintaining a database having at least one materialized view based on at least one table. When changes to the table are received, a change table based on the received changes is generated. The generated change table is propagated upwards to form a higher-level change table and the materialized view is updated by applying the higher-level change table to the materialized view using a refresh operation. In one aspect, the change table includes a plurality of tuples representing the changes and the materialized view includes a plurality of tuples. The refresh operation has two parameters, a join condition and an update function specification. The materialized view is updated by finding all tuples in the materialized view that match the tuple in the change table, using the join condition, for each tuple in the change table and updating each found tuple in the materialized view by performing operations indicated by the update function specification.

...read moreread less

Proceedings Article•DOI•

On the semantics of tuple-based coordination models

[...]

Andrea Omicini¹•Institutions (1)

University of Bologna¹

28 Feb 1999

TL;DR: This paper aims at providing a conceptual framework for coordination, as well as an operational framework for the semantic characterisation of coordination models and languages, to deal with the intrinsic unformalisability of interactive systems.

...read moreread less

Abstract: The emergence of coordination models and languages for the design and development of today multi-component software systems calls for a precise understanding and definition of what coordination is, what coordination models and languages are. and what they are meant to. In this paper, we aim at providing a conceptual framework for coordination, as well as an operational framework for the semantic characterisation of coordination models and languages. The main go+ of this framework are (i) to deal with the intrinsic unformalisability of interactive systems, and (ii) to be simple yet expressive enough to work as a clean and effective specification for the implementation of a coordinated system. The effectiveness of the framework defined is shown by applying it to the general description of tuple-based coordination models. The expressiveness of the corresponding operational framework is then exploited for the full operational characterisation of a logic tuple-based coordination model.

...read moreread less

Proceedings Article•DOI•

Cell tuple based spatio-temporal data model: an object oriented approach

[...]

Ale Raza¹, Wolfgang Kainz¹•Institutions (1)

International Institute of Minnesota¹

01 Nov 1999

TL;DR: Some of the limitations of the earlier object-based models are discussed and a novel approach for spatio-temporal data modelling is presented by extending the approach proposed by Worboys and the cell tuple structure for cell complexes introduced by Brisson.

...read moreread less

Abstract: Research on TGIS has been addressing various aspects of time in a GIS. A number of issues and barriers have been identified in the design and implementation of a TGIS. Some of the issues are application-dependent while others are more fundamental and are relevant for any generic TGIS. One of the fundamental enigmas and impediments in designing a generic TGIS is the spatiotemporal data model. Application specific modelling will be more efficient if it is based on a generic model. Incorporating time in object-based data models increases the complexity of the data structure and has been a challenging task for many designers. Complexity may be reduced by employing object-oriented concepts and relying on a solid mathematical basis. This paper discusses some of the limitations of the earlier object-based models and presents a novel approach for spatio-temporal data modelling by extending the approach proposed by Worboys and the cell tuple structure for cell complexes introduced by Brisson. The approach presented in this paper is based on cell complexes for representing space.

...read moreread less

Book Chapter•DOI•

On Data Summaries Based on Gradual Rules

[...]

Patrick Bosc, Olivier Pivert, Laurent Ughetto

25 May 1999

TL;DR: This paper focuses on the extraction from databases of linguistic summaries, using so-called fuzzy gradual rules, which encode statements of the form "the younger the employees, the smaller their bonus".

...read moreread less

Abstract: With the increasing size of databases, the extraction of data summaries becomes more and more useful The use of fazzy sets seems interesting in order to extract linguistic summaries, ie, statements from the natural language, containing gradual properties, which are meaningful for human operators This paper focuses on the extraction from databases of linguistic summaries, using so-called fuzzy gradual rules, which encode statements of the form "the younger the employees, the smaller their bonus" The summaries considered here are more on the relations between labels of the attributes than on the data themselves The first idea is to extract all the rules which are not in contradiction with tuples of a given relation Then, the interest of these rules is questioned For instance, some of them can reveal potential incoherence, while other are not really informative It is then shown that in some cases, interesting information can be extracted from these rules Last, some properties the final set of rules should verify are outlined

...read moreread less

Journal Article•DOI•

A dictionary-based approach for gene annotation.

[...]

Lior Pachter¹, Serafim Batzoglou, Valentin I. Spitkovsky, Eric Banks, Eric S. Lander, Daniel J. Kleitman, Bonnie Berger - Show less +3 more•Institutions (1)

Massachusetts Institute of Technology¹

01 Jan 1999-Journal of Computational Biology

TL;DR: A fast and fully automated dictionary-based approach to gene annotation and exon prediction, using dictionaries from the nonredundant protein OWL database and the dbEST database to find the longest matches at every position in an input sequence to the database sequences.

...read moreread less

Abstract: This paper describes a fast and fully automated dictionary-based approach to gene annotation and exon prediction. Two dictionaries are constructed, one from the nonredundant protein OWL database and the other from the dbEST database. These dictionaries are used to obtain O (1) time lookups of tuples in the dictionaries (4 tuples for the OWL database and 11 tuples for the dbEST database). These tuples can be used to rapidly find the longest matches at every position in an input sequence to the database sequences. Such matches provide very useful information pertaining to locating common segments between exons, alternative splice sites, and frequency data of long tuples for statistical purposes. These dictionaries also provide the basis for both homology determination, and statistical approaches to exon prediction.

...read moreread less

Book•

Updating derived relations: detecting irrelevant and autonomously computable updates

[...]

José A. Blakeley, Neil Coburn, Per-Ake Larson

01 Jun 1999

TL;DR: Enough and necessary conditions are given for detecting when an update of a base relation cannot affect a derived relation (an irrelevant update), and when a derived relations can be correctly updated using no data other than the derived relation itself and the given update operation (an autonomously computable update).

...read moreread less

Abstract: Consider a database containing not only base relations but also stored derived relations (also called materialized or concrete views). When a base relation is updated, it may also be necessary to update some of the derived relations. This paper gives sufficient and necessary conditions for detecting when an update of a base relation cannot affect a derived relation (an irrelevant update), and for detecting when a derived relation can be correctly updated using no data other than the derived relation itself and the given update operation (an autonomously computable update). The class of derived relations considered is restricted to those defined by PSJ-expressions, that is, any relational algebra expressions constructed from an arbitrary number of project, select and join operations (but containing no self-joins). The class of update operations consists of insertions, deletions, and modifications, where the set of tuples to be deleted or modified is specified by a selection condition on attributes of the relation being updated.

...read moreread less

Book Chapter•DOI•

An Approach to Classify Semi-structured Objects

[...]

Elisa Bertino¹, Giovanna Guerrini², Isabella Merlo², Marco Mesiti³•Institutions (3)

University of Milan¹, University of Genoa², Telcordia Technologies³

14 Jun 1999

TL;DR: The notion of weak membership of an object in a class is introduced, and two measures, the conformity and the heterogeneity degrees, are exploited by the classification algorithm to identify the most appropriate class in which an object can be classified, among the ones of which it is a weak member.

...read moreread less

Abstract: Several advanced applications, such as those dealing with the Web, need to handle data whose structure is not known a-priori. Such requirement severely limits the applicability of traditional database techniques, that are based on the fact that the structure of data (e.g. the database schema) is known before data are entered into the database. Moreover, in traditional database systems, whenever a data item (e.g. a tuple, an object, and so on) is entered, the application specifies the collection (e.g. relation, class, and so on) the data item belongs to. Collections are the basis for handling queries and indexing and therefore a proper classification of data items in collections is crucial. In this paper, we address this issue in the context of an extended object-oriented data model. We propose an approach to classify objects, created without specifying the class they belong to, in the most appropriate class of the schema, that is, the class closest to the object state. In particular, we introduce the notion of weak membership of an object in a class, and define two measures, the conformity and the heterogeneity degrees, exploited by our classification algorithm to identify the most appropriate class in which an object can be classified, among the ones of which it is a weak member.

...read moreread less