Open Access, Proceedings Article

Tell me something I don't know: randomization strategies for iterative data mining

TLDR
This paper considers the problem of randomizing data so that previously discovered patterns or models are taken into account. The results indicate that in many cases the results of, e.g., clustering actually imply the results of, say, frequent pattern discovery.
Abstract
There is a wide variety of data mining methods available, and it is generally useful in exploratory data analysis to use many different methods for the same dataset. This, however, leads to the problem of whether the results found by one method are a reflection of the phenomenon shown by the results of another method, or whether the results depict in some sense unrelated properties of the data. For example, using clustering can give an indication of a clear cluster structure, and computing correlations between variables can show that there are many significant correlations in the data. However, it can be the case that the correlations are actually determined by the cluster structure.

In this paper, we consider the problem of randomizing data so that previously discovered patterns or models are taken into account. The randomization methods can be used in iterative data mining. At each step in the data mining process, the randomization produces random samples from the set of data matrices satisfying the already discovered patterns or models. That is, given a data set and some statistics (e.g., cluster centers or co-occurrence counts) of the data, the randomization methods sample data sets whose values of the given statistics are similar to those of the original data set. We use Metropolis sampling based on local swaps to achieve this. We describe experiments on real data that demonstrate the usefulness of our approach. Our results indicate that in many cases, the results of, e.g., clustering actually imply the results of, say, frequent pattern discovery.
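The abstract names the sampling technique but gives no details of the swap or the acceptance rule. As a rough illustration only, the sketch below assumes binary data, a local swap that preserves row and column sums, and a Metropolis acceptance rule that penalizes drift in a single co-occurrence count. The function names, the `weight` parameter, and the toy matrix are all hypothetical and are not taken from the paper.

```python
import math
import random


def cooccurrence(matrix, col_a, col_b):
    """Count the rows in which both columns contain a 1."""
    return sum(1 for row in matrix if row[col_a] and row[col_b])


def metropolis_swap_sample(data, col_a, col_b, steps=5000, weight=2.0, seed=0):
    """Sample a 0/1 matrix with the same row and column sums as `data`,
    favoring matrices whose co-occurrence count for (col_a, col_b) stays
    close to the original. Assumed setup, not the paper's exact method."""
    rng = random.Random(seed)
    current = [row[:] for row in data]  # work on a copy
    n_rows, n_cols = len(current), len(current[0])
    target = cooccurrence(data, col_a, col_b)
    energy = 0  # |statistic(current) - statistic(original)|

    for _ in range(steps):
        r1, r2 = rng.sample(range(n_rows), 2)
        c1, c2 = rng.sample(range(n_cols), 2)
        # A swap is possible only if the 2x2 submatrix is a checkerboard.
        swappable = (current[r1][c1] == current[r2][c2] == 1
                     and current[r1][c2] == current[r2][c1] == 0)
        if not swappable:
            continue  # counts as a step in which the chain stays put

        # Propose the local swap (row and column sums are unchanged).
        current[r1][c1] = current[r2][c2] = 0
        current[r1][c2] = current[r2][c1] = 1
        new_energy = abs(cooccurrence(current, col_a, col_b) - target)

        # Metropolis rule: always accept improvements, otherwise accept
        # with probability exp(-weight * increase in energy).
        delta = new_energy - energy
        if delta <= 0 or rng.random() < math.exp(-weight * delta):
            energy = new_energy
        else:
            # Reject: undo the swap.
            current[r1][c1] = current[r2][c2] = 1
            current[r1][c2] = current[r2][c1] = 0
    return current


# Toy data: two loose column groups (hypothetical example).
data = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]
sample = metropolis_swap_sample(data, col_a=0, col_b=1)
```

A larger `weight` keeps the sampled matrices closer to the original statistic, while `weight=0` reduces this sketch to plain swap randomization that preserves only the margins. Comparing a pattern's score on the original data against its scores on many such samples gives an empirical indication of whether the pattern is already explained by the constrained statistic.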


Citations
Journal Article

Maximum entropy models and subjective interestingness: an application to tiles in binary databases

TL;DR: It is shown how the MaxEnt model can be computed remarkably efficiently in this situation, and how it can be used for the same purpose as swap randomizations but computationally more efficiently.
Proceedings Article

Tell me what I need to know: succinctly summarizing data with itemsets

TL;DR: In this paper, a probabilistic maximum entropy model is used to find the most interesting itemset, and in turn update the model of the data accordingly, so that the summary is guaranteed to be both descriptive and non-redundant.
Journal Article

A peek into the black box: exploring classifiers by randomization

TL;DR: An efficient iterative algorithm to find the attributes and dependencies used by any classifier when making predictions is proposed and the empirical investigation shows that the novel algorithm is indeed able to find groupings of interacting attributes exploited by the different classifiers.
Proceedings Article

An information theoretic framework for data mining

TL;DR: The proposed framework can be used to help in designing new data mining algorithms that maximize the efficiency of the information exchange from the algorithm to the data miner.
Journal Article

A Unifying Framework for Mining Approximate Top-k Binary Patterns

TL;DR: This work reviews several greedy algorithms, and discusses PANDA+, an algorithmic framework able to optimize different cost functions generalized into a unifying formulation, and evaluates the goodness of the algorithm by measuring the quality of the extracted patterns.
References
Journal Article

Controlling the false discovery rate: a practical and powerful approach to multiple testing

TL;DR: In this paper, a different approach to problems of multiple significance testing is presented, which calls for controlling the expected proportion of falsely rejected hypotheses (the false discovery rate). This rate is equivalent to the FWER when all hypotheses are true but is smaller otherwise.
Journal Article

Equation of state calculations by fast computing machines

TL;DR: In this article, a modified Monte Carlo integration over configuration space is used to investigate the properties of a two-dimensional rigid-sphere system with a set of interacting individual molecules, and the results are compared to free volume equations of state and a four-term virial coefficient expansion.
Journal Article

A Simple Sequentially Rejective Multiple Test Procedure

TL;DR: In this paper, a simple and widely applicable multiple test procedure of the sequentially rejective type is presented, i.e., hypotheses are rejected one at a time until no further rejections can be made.
Journal Article

Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment.

TL;DR: Resampling-Based Adjustments: Basic Concepts and Practical Applications.
Book

Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses

TL;DR: This book provides a step-by-step manual on the application of permutation tests in biology, medicine, science, and engineering and shows how the problems of missing and censored data, nonresponders, after-the-fact covariates, and outliers may be handled.