Topic

Tuple

About: Tuple is a research topic. Over its lifetime, 6,513 publications have been published within this topic, receiving 146,057 citations. The topic is also known as: tuple & ordered tuplet.
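For concreteness, here is a minimal sketch of the tuple concept in Python (an illustration only; the papers below use tuples in relational, functional-programming, and streaming settings):

# A tuple is an ordered, fixed-size grouping of values, possibly of mixed types.
point = (3, 4)                   # a 2-tuple (ordered pair)
row = ("Alice", 1984, 72.5)      # a 3-tuple mixing str, int, and float

# Tuples are destructured by position.
name, year, weight = row
x, y = point
print(name, year, x + y)         # -> Alice 1984 7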


Papers
Book
01 Jun 1994
TL;DR: This book offers a perspective on ML and SML/NJ, covers programming with datatypes, and provides solutions to selected exercises.
Abstract: Contents: 0. A Perspective on ML and SML/NJ.
I. INTRODUCTION TO PROGRAMMING IN ML: 1. Expressions. 2. Type Consistency. 3. Variables and Environments. 4. Tuples and Lists. 5. It's Easy; It's "fun". 6. Patterns in Function Definitions. 7. Local Environments Using "let". 8. Exceptions. 9. Side Effects: Input and Output.
II. ADVANCED FEATURES OF ML: 10. Polymorphic Functions. 11. Higher-Order Functions. 12. Defining New Types. 13. Programming with Datatypes. 14. The ML Module System. 15. Software Design Using Modules. 16. Arrays. 17. References.
III. ADDITIONAL DETAILS AND FEATURES: 18. Record Structures. 19. Matches and Patterns. 20. More About Exceptions. 21. Counting with Functions as Values. 22. More About Input and Output. 23. Creating Executable Files. 24. Controlling Operator Grouping. 25. Built-In Functions of SML/NJ. 26. Summary of ML Syntax.
Solutions to Selected Exercises. Index.

206 citations

Journal Article
TL;DR: This paper presents an attribute clustering method which is able to group genes based on their interdependence so as to mine meaningful patterns from the gene expression data, and introduces a methodology to solving it.
Abstract: This paper presents an attribute clustering method that groups genes based on their interdependence so as to mine meaningful patterns from gene expression data. It can be used for gene grouping, selection, and classification. The partitioning of a relational table into attribute subgroups allows a small number of attributes within or across the groups to be selected for analysis. By clustering attributes, the search dimension of a data mining algorithm is reduced. The reduction of search dimension is especially important to data mining in gene expression data because such data typically consist of a huge number of genes (attributes) and a small number of gene expression profiles (tuples). Most data mining algorithms are developed and optimized to scale to the number of tuples instead of the number of attributes. The situation becomes even worse when the number of attributes overwhelms the number of tuples, in which case the likelihood of reporting patterns that are actually irrelevant due to chance becomes rather high. It is for the aforementioned reasons that gene grouping and selection are important preprocessing steps for many data mining algorithms to be effective when applied to gene expression data. This paper defines the problem of attribute clustering and introduces a methodology for solving it. Our proposed method groups interdependent attributes into clusters by optimizing a criterion function derived from an information measure that reflects the interdependence between attributes. By applying our algorithm to gene expression data, meaningful clusters of genes are discovered. The grouping of genes based on attribute interdependence within each group helps to capture different aspects of gene association patterns in each group. Significant genes selected from each group then contain useful information for gene expression classification and identification. To evaluate the performance of the proposed approach, we applied it to two well-known gene expression data sets and compared our results with those obtained by other methods. Our experiments show that the proposed method is able to find meaningful clusters of genes. By selecting a subset of genes that have high multiple interdependence with others within clusters, significant classification information can be obtained. Thus, a small pool of selected genes can be used to build classifiers with a very high classification rate. From the pool, gene expressions of different categories can be identified.
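A rough sketch of the idea in Python, assuming the attribute columns are already discretized and using normalized mutual information as a stand-in for the paper's interdependence measure (the function names and the k-modes-style loop are illustrative choices, not the authors' algorithm):

import math
import random
from collections import Counter

def entropy(xs):
    n = len(xs)
    return -sum((c / n) * math.log(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    # I(X;Y) for two equally long sequences of discrete values.
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def interdependence(xs, ys):
    # Mutual information normalized by joint entropy (one common normalization).
    h = entropy(list(zip(xs, ys)))
    return mutual_information(xs, ys) / h if h > 0 else 0.0

def cluster_attributes(columns, k, iterations=20, seed=0):
    # k-modes-style grouping: each cluster is represented by a "mode" attribute,
    # and every attribute joins the mode it is most interdependent with.
    rng = random.Random(seed)
    modes = rng.sample(range(len(columns)), k)
    for _ in range(iterations):
        clusters = {m: [] for m in modes}
        for a in range(len(columns)):
            best = max(modes, key=lambda m: interdependence(columns[a], columns[m]))
            clusters[best].append(a)
        new_modes = []
        for m, members in clusters.items():
            if not members:                  # keep an emptied cluster's old mode
                new_modes.append(m)
                continue
            # The new mode is the member most interdependent with the rest.
            new_modes.append(max(members, key=lambda a: sum(
                interdependence(columns[a], columns[b]) for b in members if b != a)))
        if set(new_modes) == set(modes):
            break
        modes = new_modes
    return clusters

Significant attributes would then be picked per cluster (for instance, the modes themselves) before feeding a classifier, in the spirit of the gene-selection step the abstract describes.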

205 citations

Book Chapter
31 Aug 2004
TL;DR: It is shown formally that neither approximation can be addressed effectively for a sliding-window join of arbitrary input streams, and a broad class of applications for which an age-based model of stream arrival is more appropriate is pointed out.
Abstract: We address the problem of computing approximate answers to continuous sliding-window joins over data streams when the available memory may be insufficient to keep the entire join state. One approximation scenario is to provide a maximum subset of the result, with the objective of losing as few result tuples as possible. An alternative scenario is to provide a random sample of the join result, e.g., if the output of the join is being aggregated. We show formally that neither approximation can be addressed effectively for a sliding-window join of arbitrary input streams. Previous work has addressed only the maximum-subset problem, and has implicitly used a frequency-based model of stream arrival. We address the sampling problem for this model. More importantly, we point out a broad class of applications for which an age-based model of stream arrival is more appropriate, and we address both approximation scenarios under this new model. Finally, for the case of multiple joins being executed with an overall memory constraint, we provide an algorithm for memory allocation across the joins that optimizes a combined measure of approximation in all scenarios considered. All of our algorithms are implemented and experimental results demonstrate their effectiveness.
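A minimal sketch of the setting in Python: a symmetric sliding-window join whose per-side state is capped, with random eviction standing in for an approximation policy (this illustrates the problem the paper studies, not its frequency- or age-based algorithms; the class and field names are made up for the example):

import random
from collections import deque

class SlidingWindowJoin:
    # Joins two streams of (timestamp, key, payload) tuples on key, keeping only
    # tuples inside the sliding window and at most max_state tuples per side.
    def __init__(self, window, max_state, seed=0):
        self.window = window
        self.max_state = max_state
        self.state = {0: deque(), 1: deque()}
        self.rng = random.Random(seed)

    def _expire(self, side, now):
        buf = self.state[side]
        while buf and buf[0][0] <= now - self.window:
            buf.popleft()

    def insert(self, side, tup):
        # Insert a tuple arriving on stream `side` (0 or 1); return the matches
        # it produces against the retained state of the other stream.
        now, key, _ = tup
        self._expire(0, now)
        self._expire(1, now)
        matches = [(tup, other) for other in self.state[1 - side] if other[1] == key]
        buf = self.state[side]
        buf.append(tup)
        if len(buf) > self.max_state:
            # Under memory pressure something must be dropped. Random eviction is
            # the simplest policy; the paper analyzes when a maximum subset or a
            # uniform sample of the join result is (and is not) achievable.
            del buf[self.rng.randrange(len(buf))]
        return matches

A caller would construct, say, SlidingWindowJoin(window=60, max_state=1000) and feed each arrival to insert(), collecting whatever result tuples the bounded state can still produce.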

204 citations

Proceedings Article
05 Apr 2005
TL;DR: This work proposes two novel criteria that enable characterization of fuzzy duplicates more accurately than is possible with existing techniques, and proposes a novel framework for the fuzzy duplicate elimination problem.
Abstract: Detecting and eliminating fuzzy duplicates is a critical data cleaning task that is required by many applications. Fuzzy duplicates are multiple seemingly distinct tuples, which represent the same real-world entity. We propose two novel criteria that enable characterization of fuzzy duplicates more accurately than is possible with existing techniques. Using these criteria, we propose a novel framework for the fuzzy duplicate elimination problem. We show that solutions within the new framework result in better accuracy than earlier approaches. We present an efficient algorithm for solving instantiations within the framework. We evaluate it on real datasets to demonstrate the accuracy and scalability of our algorithm.
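As a baseline illustration only (not the paper's two criteria or its elimination framework), the sketch below links tuples whose string similarity crosses a threshold and reports connected components as candidate duplicate groups; the similarity function and threshold are arbitrary choices:

from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    # Rough string similarity in [0, 1] over the concatenated fields.
    return SequenceMatcher(None, " ".join(a).lower(), " ".join(b).lower()).ratio()

def duplicate_groups(tuples, threshold=0.85):
    # Union-find over tuples; an edge links any pair above the threshold.
    parent = list(range(len(tuples)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    def union(i, j):
        parent[find(i)] = find(j)
    for i, j in combinations(range(len(tuples)), 2):
        if similarity(tuples[i], tuples[j]) >= threshold:
            union(i, j)
    groups = {}
    for i in range(len(tuples)):
        groups.setdefault(find(i), []).append(tuples[i])
    return [g for g in groups.values() if len(g) > 1]

# Two spellings of the same customer end up in one candidate group.
rows = [("Jon Smith", "12 Oak St"), ("John Smith", "12 Oak Street"), ("Ann Lee", "5 Elm Rd")]
print(duplicate_groups(rows, threshold=0.7))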

203 citations

Proceedings Article
03 Apr 2006
TL;DR: This work rewrite queries over a database containing duplicates to return each answer with the probability that the answer is in the clean database, and experimentally study the performance of the rewritten queries.
Abstract: The detection of duplicate tuples, corresponding to the same real-world entity, is an important task in data integration and cleaning. While many techniques exist to identify such tuples, the merging or elimination of duplicates can be a difficult task that relies on ad-hoc and often manual solutions. We propose a complementary approach that permits declarative query answering over duplicated data, where each duplicate is associated with a probability of being in the clean database. We rewrite queries over a database containing duplicates to return each answer with the probability that the answer is in the clean database. Our rewritten queries are sensitive to the semantics of duplication and help a user understand which query answers are most likely to be present in the clean database. The semantics that we adopt is independent of the way the probabilities are produced, but is able to effectively exploit them during query answering. In the absence of external knowledge that associates each database tuple with a probability, we offer a technique, based on tuple summaries, that automates this task. We experimentally study the performance of our rewritten queries. Our studies show that the rewriting does not introduce a significant overhead in query execution time. This work is done in the context of the ConQuer project at the University of Toronto, which focuses on the efficient management of inconsistent and dirty databases.
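One way to read the semantics, sketched in Python under the assumption that each duplicate cluster contributes exactly one tuple to the clean database and that per-tuple probabilities within a cluster sum to 1 (this illustrates the probability calculation for a simple selection query, not ConQuer's SQL rewriting):

from collections import defaultdict

def probabilistic_select(clusters, predicate, answer_of):
    # clusters: list of [(tuple, probability), ...], one list per duplicate cluster.
    # Returns {answer: probability that the answer is in the clean query result},
    # treating clusters as independent choices of one clean tuple each:
    # P(answer) = 1 - product over clusters of (1 - P(cluster's clean tuple yields answer)).
    miss = defaultdict(lambda: 1.0)
    for cluster in clusters:
        per_answer = defaultdict(float)
        for tup, p in cluster:
            if predicate(tup):
                per_answer[answer_of(tup)] += p
        for ans, p in per_answer.items():
            miss[ans] *= (1.0 - p)
    return {ans: 1.0 - m for ans, m in miss.items()}

# Example: two conflicting records for customer c1, one clean record for c2.
clusters = [
    [(("c1", "Toronto", 500), 0.7), (("c1", "Ottawa", 500), 0.3)],
    [(("c2", "Toronto", 900), 1.0)],
]
result = probabilistic_select(
    clusters,
    predicate=lambda t: t[2] > 400,   # WHERE balance > 400
    answer_of=lambda t: t[1],         # SELECT city
)
print({ans: round(p, 3) for ans, p in result.items()})   # {'Toronto': 1.0, 'Ottawa': 0.3}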

200 citations


Network Information
Related Topics (5)
Graph (abstract data type): 69.9K papers, 1.2M citations, 86% related
Time complexity: 36K papers, 879.5K citations, 85% related
Server: 79.5K papers, 1.4M citations, 83% related
Scalability: 50.9K papers, 931.6K citations, 83% related
Polynomial: 52.6K papers, 853.1K citations, 81% related
Performance
Metrics
No. of papers in the topic in previous years
Year: Papers
2023: 203
2022: 459
2021: 210
2020: 285
2019: 306
2018: 266