Tell me something I don't know: randomization strategies for iterative data mining

doi:10.1145/1557019.1557065

Open Access, Proceedings Article

pp. 379-388

TLDR

The paper considers the problem of randomizing data so that previously discovered patterns or models are taken into account. The results indicate that in many cases the results of, e.g., clustering actually imply the results of, say, frequent pattern discovery.

Abstract:

There is a wide variety of data mining methods available, and it is generally useful in exploratory data analysis to use many different methods for the same dataset. This, however, leads to the problem of whether the results found by one method are a reflection of the phenomenon shown by the results of another method, or whether the results depict in some sense unrelated properties of the data. For example, clustering can indicate a clear cluster structure, and computing correlations between variables can show that there are many significant correlations in the data. However, it can be the case that the correlations are actually determined by the cluster structure.

In this paper, we consider the problem of randomizing data so that previously discovered patterns or models are taken into account. The randomization methods can be used in iterative data mining. At each step in the data mining process, the randomization produces random samples from the set of data matrices satisfying the already discovered patterns or models. That is, given a data set and some statistics (e.g., cluster centers or co-occurrence counts) of the data, the randomization methods sample data sets whose values of the given statistics are similar to those of the original data set. We use Metropolis sampling based on local swaps to achieve this.

We describe experiments on real data that demonstrate the usefulness of our approach. Our results indicate that in many cases, the results of, e.g., clustering actually imply the results of, say, frequent pattern discovery.
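The abstract names the technique (Metropolis sampling based on local swaps) but gives no details, so the following is only a minimal sketch of the general idea and not the authors' implementation. It assumes a binary data matrix, a fixed cluster labelling, and a hypothetical statistic (per-cluster column sums). The function names, the L1 "energy", and the `weight` parameter are all illustrative assumptions. Each proposed swap preserves row and column sums, and a Metropolis test favours matrices whose statistic stays close to that of the original data.

```python
import math
import random


def cluster_column_sums(matrix, labels):
    """Hypothetical statistic: column sums computed separately per cluster."""
    n_cols = len(matrix[0])
    sums = {}
    for row, label in zip(matrix, labels):
        acc = sums.setdefault(label, [0] * n_cols)
        for c, value in enumerate(row):
            acc[c] += value
    return sums


def l1_distance(stat_a, stat_b):
    """L1 distance between two statistics that share the same cluster keys."""
    return sum(abs(x - y) for k in stat_a for x, y in zip(stat_a[k], stat_b[k]))


def metropolis_swap_sample(data, labels, n_steps=10000, weight=2.0, seed=0):
    """Sketch of Metropolis sampling with local swaps (assumed details).

    A local swap turns the 2x2 pattern [[1, 0], [0, 1]] into [[0, 1], [1, 0]],
    which keeps every row sum and column sum unchanged. A swap is accepted
    with probability min(1, exp(-weight * delta)), where delta is the change
    in L1 distance between the candidate's statistic and the original's.
    """
    rng = random.Random(seed)
    current = [row[:] for row in data]  # work on a copy, leave the input intact
    target = cluster_column_sums(data, labels)
    energy = 0  # the chain starts at the original data
    n_rows, n_cols = len(data), len(data[0])
    for _ in range(n_steps):
        i, j = rng.randrange(n_rows), rng.randrange(n_rows)
        a, b = rng.randrange(n_cols), rng.randrange(n_cols)
        swappable = (current[i][a] == 1 and current[j][b] == 1
                     and current[i][b] == 0 and current[j][a] == 0)
        if not swappable:
            continue  # invalid proposal: the chain stays where it is
        # Apply the swap tentatively.
        current[i][a], current[i][b] = 0, 1
        current[j][a], current[j][b] = 1, 0
        new_energy = l1_distance(cluster_column_sums(current, labels), target)
        delta = new_energy - energy
        if delta <= 0 or rng.random() < math.exp(-weight * delta):
            energy = new_energy  # accept
        else:
            # Reject: undo the swap.
            current[i][a], current[i][b] = 1, 0
            current[j][a], current[j][b] = 0, 1
    return current


# Toy example (made-up data): two clusters of two rows each.
data = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 1], [0, 1, 0, 1]]
labels = [0, 0, 1, 1]
sample = metropolis_swap_sample(data, labels, n_steps=2000)
```

In this sketch, a larger `weight` keeps samples closer to the original cluster statistic, while `weight=0` reduces to plain swap randomization that only preserves the margins. A pattern found in the original data could then be compared against its value across many such samples to judge whether the cluster structure already explains it.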
