Author

Niko Vuokko

Bio: Niko Vuokko is an academic researcher from Helsinki University of Technology. The author has contributed to research in the topics of Sample (statistics) and Cluster analysis, has an h-index of 9, and has co-authored 12 publications receiving 299 citations. Previous affiliations of Niko Vuokko include the Helsinki Institute for Information Technology and Aalto University.

Papers
Proceedings ArticleDOI
28 Jun 2009
TL;DR: This paper considers the problem of randomizing data so that previously discovered patterns or models are taken into account; the results indicate that in many cases the results of, e.g., clustering actually imply the results of, say, frequent pattern discovery.
Abstract: There is a wide variety of data mining methods available, and it is generally useful in exploratory data analysis to use many different methods for the same dataset. This, however, leads to the problem of whether the results found by one method are a reflection of the phenomenon shown by the results of another method, or whether the results depict in some sense unrelated properties of the data. For example, using clustering can give indication of a clear cluster structure, and computing correlations between variables can show that there are many significant correlations in the data. However, it can be the case that the correlations are actually determined by the cluster structure. In this paper, we consider the problem of randomizing data so that previously discovered patterns or models are taken into account. The randomization methods can be used in iterative data mining. At each step in the data mining process, the randomization produces random samples from the set of data matrices satisfying the already discovered patterns or models. That is, given a data set and some statistics (e.g., cluster centers or co-occurrence counts) of the data, the randomization methods sample data sets having similar values of the given statistics as the original data set. We use Metropolis sampling based on local swaps to achieve this. We describe experiments on real data that demonstrate the usefulness of our approach. Our results indicate that in many cases, the results of, e.g., clustering actually imply the results of, say, frequent pattern discovery.
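The sampler the abstract describes is easy to prototype. Below is a minimal sketch of a Metropolis randomizer of this general flavor, assuming a 0/1 data matrix and a user-supplied statistic to preserve; the within-column swap move and the quadratic penalty on the statistic are illustrative choices, not the exact ones from the paper.

```python
# Minimal sketch: constrained randomization via Metropolis sampling on local swaps.
import numpy as np

def metropolis_randomize(X, statistic, n_steps=100_000, w=10.0, rng=None):
    """Sample a matrix that stays close to X w.r.t. `statistic`.

    X         : 0/1 data matrix (n x m), as a numpy array
    statistic : function mapping a matrix to a real-valued summary vector
    w         : weight of the penalty for drifting away from the target statistic
    """
    rng = rng or np.random.default_rng()
    X = X.copy()
    target = statistic(X)              # values to stay close to
    energy = 0.0                       # w * squared deviation from target
    n, m = X.shape
    for _ in range(n_steps):
        # Local swap: exchange two entries within one column, which keeps
        # the column sums exact while shuffling the row structure.
        i, j = rng.integers(n, size=2)
        k = rng.integers(m)
        if X[i, k] == X[j, k]:
            continue                   # swap would change nothing
        X[i, k], X[j, k] = X[j, k], X[i, k]
        new_energy = w * np.sum((statistic(X) - target) ** 2)
        # Metropolis step: always accept downhill moves, sometimes uphill ones.
        if new_energy <= energy or rng.random() < np.exp(energy - new_energy):
            energy = new_energy
        else:
            X[i, k], X[j, k] = X[j, k], X[i, k]   # reject: undo the swap
    return X
```

For example, statistic=lambda X: X.mean(axis=1) would keep the row densities approximately fixed, while the column sums stay exact by construction of the move.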

81 citations

Proceedings Article
01 Jan 2008
TL;DR: Three alternative algorithms based on local transformations and Metropolis sampling are described, and an evaluation on both real and generated data shows that they are efficient and usable in practice.
Abstract: Randomization is an important technique for assessing the significance of data mining results. Given an input data set, a randomization method samples at random from some class of datasets that share certain characteristics with the original data. The measure of interest on the original data is then compared to the measure on the samples to assess its significance. For certain types of data, e.g., gene expression matrices, it is useful to be able to sample datasets that share row and column means and variances. Testing whether the results of a data mining algorithm on such randomized datasets differ from the results on the true dataset tells us whether the results on the true data were an artifact of the row and column means and variances, or due to some more interesting phenomena in the data. In this paper, we study the problem of generating such randomized datasets. We describe three alternative algorithms based on local transformations and Metropolis sampling, and show that the methods are efficient and usable in practice. We evaluate the performance of the methods both on real and generated data. The results indicate that the methods work efficiently and solve the defined problem.
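As a rough illustration, the sketch below uses the simplest conceivable move, swapping two arbitrary cells (which preserves the overall value distribution exactly), and lets a Metropolis acceptance step keep the row and column means and variances near their original values. It is a sketch of the general idea under these assumptions only; the paper's actual local transformations and their tuning differ.

```python
# Minimal sketch: randomize a real-valued matrix while approximately
# preserving row and column means and variances.
import numpy as np

def row_col_stats(X):
    """Concatenated row/column means and variances of X."""
    return np.concatenate([X.mean(0), X.mean(1), X.var(0), X.var(1)])

def randomize_preserving_stats(X, n_steps=200_000, w=100.0, rng=None):
    rng = rng or np.random.default_rng()
    X = X.copy()
    target = row_col_stats(X)
    energy = 0.0
    n, m = X.shape
    for _ in range(n_steps):
        i1, i2 = rng.integers(n, size=2)
        j1, j2 = rng.integers(m, size=2)
        X[i1, j1], X[i2, j2] = X[i2, j2], X[i1, j1]        # propose a swap
        new_energy = w * np.sum((row_col_stats(X) - target) ** 2)
        if new_energy <= energy or rng.random() < np.exp(energy - new_energy):
            energy = new_energy                             # accept
        else:
            X[i1, j1], X[i2, j2] = X[i2, j2], X[i1, j1]     # reject: undo
    return X
```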

27 citations

Proceedings Article
01 Dec 2010
TL;DR: This work addresses the problem of reconstructing the original network and set of features given their randomized counterparts G′ and F′ and knowledge of the randomization model, and identifies the cases in which the original network G and feature vectors F can be reconstructed in polynomial time.
Abstract: In social networks, nodes correspond to entities and edges to links between them. In most of the cases, nodes are also associated with a set of features. Noise, missing values or efforts to preserve privacy in the network may transform the original network G and its feature vectors F. This transformation can be modeled as a randomization method. Here, we address the problem of reconstructing the original network and set of features given their randomized counterparts G′ and F′ and knowledge of the randomization model. We identify the cases in which the original network G and feature vectors F can be reconstructed in polynomial time. Finally, we illustrate the efficacy of our methods using both generated and real datasets.
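As a toy illustration of what "knowledge of the randomization model" buys, suppose each adjacency entry was flipped independently with a known probability p; this model is an assumption for illustration, not the one analyzed in the paper. For p < 0.5 the entry-wise maximum-likelihood reconstruction is simply the observed matrix, the kind of easy polynomial-time case the analysis separates from the genuinely hard ones.

```python
# Toy sketch: ML reconstruction under an assumed independent-flip model.
import numpy as np

def ml_reconstruct(A_observed, p):
    """Entry-wise maximum-likelihood estimate of the original adjacency
    matrix when each entry was flipped independently with probability p."""
    if not 0.0 <= p < 0.5:
        raise ValueError("observation is only informative for 0 <= p < 0.5")
    # P(original = a | observed = a) = 1 - p > p, so keeping each observed
    # entry maximizes the likelihood entry by entry.
    return np.asarray(A_observed).copy()
```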

26 citations

Journal ArticleDOI
TL;DR: Methods based on local transformations and Metropolis sampling are described, and it is shown that they are efficient and usable in practice for significance testing of data mining results on real-valued matrices.
Abstract: Randomization is an important technique for assessing the significance of data analysis results. Given an input dataset, a randomization method samples at random from some class of datasets that share certain characteristics with the original data. The measure of interest on the original data is then compared to the measure on the samples to assess its significance. For certain types of data, e.g., gene expression matrices, it is useful to be able to sample datasets that have the same row and column distributions of values as the original dataset. Testing whether the results of a data mining algorithm on such randomized datasets differ from the results on the true dataset tells us whether the results on the true data were an artifact of the row and column statistics, or due to some more interesting phenomena in the data. We study the problem of generating such randomized datasets. We describe methods based on local transformations and Metropolis sampling, and show that the methods are efficient and usable in practice. We evaluate the performance of the methods both on real and generated data. We also show how our methods can be applied to a real data analysis scenario on DNA microarray data. The results indicate that the methods work efficiently and are usable in significance testing of data mining results on real-valued matrices.
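The significance-testing loop that motivates these samplers is itself tiny. Below is a minimal sketch of the empirical p-value computation, assuming randomize draws one constrained random dataset (e.g., one of the samplers sketched above) and measure evaluates the data mining result of interest; both names are illustrative.

```python
# Minimal sketch of randomization-based significance testing.
import numpy as np

def empirical_p_value(X, measure, randomize, n_samples=999, rng=None):
    """One-sided empirical p-value: how often does a constrained random
    dataset score at least as high as the original? (Flip the inequality
    if smaller values of `measure` are the interesting ones.)"""
    rng = rng or np.random.default_rng()
    observed = measure(X)
    hits = sum(measure(randomize(X, rng=rng)) >= observed
               for _ in range(n_samples))
    # The +1 terms make the estimate conservative and never exactly zero.
    return (hits + 1) / (n_samples + 1)
```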

25 citations


Cited by
Proceedings ArticleDOI
22 Jan 2006
TL;DR: Some of the major results in random graphs and some of the more challenging open problems are reviewed, covering algorithmic and structural questions and touching on newer models, including those related to the WWW.
Abstract: We will review some of the major results in random graphs and some of the more challenging open problems. We will cover algorithmic and structural questions. We will touch on newer models, including those related to the WWW.

7,116 citations

Journal ArticleDOI
TL;DR: The analysis shows that studying the classification error via permutation tests is effective; in particular, the restricted permutation test clearly reveals whether the classifier exploits the interdependency between the features in the data.
Abstract: We explore the framework of permutation-based p-values for assessing the performance of classifiers. In this paper we study two simple permutation tests. The first test assesses whether the classifier has found a real class structure in the data; the corresponding null distribution is estimated by permuting the labels in the data. This test has been used extensively in classification problems in computational biology. The second test studies whether the classifier is exploiting the dependency between the features in classification; the corresponding null distribution is estimated by permuting the features within classes, inspired by restricted randomization techniques traditionally used in statistics. This new test can serve to identify descriptive features which can be valuable information in improving the classifier performance. We study the properties of these tests and present an extensive empirical evaluation on real and synthetic data. Our analysis shows that studying the classifier performance via permutation tests is effective. In particular, the restricted permutation test clearly reveals whether the classifier exploits the interdependency between the features in the data.
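Both tests are small to implement. Below is a minimal sketch, assuming a score(clf, X, y) function that returns, say, a cross-validated accuracy for some classifier; all names are illustrative rather than taken from the paper.

```python
# Minimal sketch of the two permutation tests for classifier performance.
import numpy as np

def label_permutation_test(score, clf, X, y, n_perm=999, rng=None):
    """Test 1: is there any real class structure? Permute the labels."""
    rng = rng or np.random.default_rng()
    observed = score(clf, X, y)
    hits = sum(score(clf, X, rng.permutation(y)) >= observed
               for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)

def restricted_permutation_test(score, clf, X, y, n_perm=999, rng=None):
    """Test 2: does the classifier exploit dependencies between features?
    Permute each feature column independently within each class, preserving
    class-conditional marginals while destroying feature interdependence."""
    rng = rng or np.random.default_rng()
    observed = score(clf, X, y)
    hits = 0
    for _ in range(n_perm):
        Xp = X.copy()
        for c in np.unique(y):
            idx = np.flatnonzero(y == c)
            for j in range(X.shape[1]):
                Xp[idx, j] = X[rng.permutation(idx), j]
        hits += score(clf, Xp, y) >= observed
    return (hits + 1) / (n_perm + 1)
```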

436 citations

Proceedings ArticleDOI
06 Dec 2009
TL;DR: The authors explore the framework of permutation-based p-values for assessing the behavior of the classification error, studying two simple permutation tests: the first estimates the null distribution by permuting the labels in the data, as used extensively in classification problems in computational biology; the second permutes the features within classes, inspired by restricted randomization techniques traditionally used in statistics.
Abstract: We explore the framework of permutation-based p-values for assessing the behavior of the classification error. In this paper we study two simple permutation tests. The first test estimates the null distribution by permuting the labels in the data; this has been used extensively in classification problems in computational biology. The second test produces permutations of the features within classes, inspired by restricted randomization techniques traditionally used in statistics. We study the properties of these tests and present an extensive empirical evaluation on real and synthetic data. Our analysis shows that studying the classification error via permutation tests is effective; in particular, the restricted permutation test clearly reveals whether the classifier exploits the interdependency between the features in the data.

392 citations

01 Jan 1993
TL;DR: In this paper, it is shown that 1/f processes are optimally represented in terms of orthonormal wavelet bases, and the wavelet expansion's role as a Karhunen-Loeve-type expansion is developed.
Abstract: The 1/f family of fractal random processes models a truly extraordinary range of natural and man-made phenomena, many of which arise in a variety of signal processing scenarios. Yet despite their apparent importance, the lack of convenient representations for 1/f processes has, at least until recently, strongly limited their popularity. In this paper, we demonstrate that 1/f processes are, in a broad sense, optimally represented in terms of orthonormal wavelet bases. Specifically, via a useful frequency domain characterization for 1/f processes, we develop the wavelet expansion's role as a Karhunen-Loeve-type expansion for 1/f processes. As an illustration of potential, we show that wavelet-based representations naturally lead to highly efficient solutions to some fundamental detection and estimation problems involving 1/f processes.
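The Karhunen-Loeve-like property suggests a simple synthesis recipe: draw independent wavelet coefficients whose variance grows geometrically toward coarser scales, then invert the transform. The sketch below assumes the PyWavelets package and a Haar basis (so the coefficient lengths line up exactly) and generates an approximately 1/f^gamma signal; the parameterization is illustrative.

```python
# Minimal sketch: synthesize approximately 1/f^gamma noise from wavelet
# coefficients with geometrically growing variance across scales.
import numpy as np
import pywt  # assumes the PyWavelets package is installed

def synthesize_one_over_f(n_levels=10, gamma=1.0, rng=None):
    rng = rng or np.random.default_rng()
    # Coefficient list for pywt.waverec: [coarsest approximation,
    # coarsest details, ..., finest details]. Coarser scales get larger
    # standard deviation 2**(gamma * level / 2), i.e. variance 2**(gamma * level).
    coeffs = [rng.standard_normal(1) * 2 ** (gamma * n_levels / 2)]
    for j in range(n_levels):
        level = n_levels - j                     # n_levels = coarsest, 1 = finest
        std = 2 ** (gamma * level / 2)
        coeffs.append(rng.standard_normal(2 ** j) * std)
    return pywt.waverec(coeffs, "haar")          # signal of length 2**n_levels
```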

314 citations

Journal ArticleDOI
TL;DR: It is shown how the MaxEnt model can be computed remarkably efficiently in this situation, and how it can serve the same purpose as swap randomizations at a lower computational cost.
Abstract: Recent research has highlighted the practical benefits of subjective interestingness measures, which quantify the novelty or unexpectedness of a pattern when contrasted with any prior information of the data miner (Silberschatz and Tuzhilin, Proceedings of the 1st ACM SIGKDD international conference on Knowledge discovery and data mining (KDD95), 1995; Geng and Hamilton, ACM Comput Surv 38(3):9, 2006). A key challenge here is the formalization of this prior information in a way that lends itself to the definition of a subjective interestingness measure that is both meaningful and practical. In this paper, we outline a general strategy of how this could be achieved, before working out the details for a use case that is important in its own right. Our general strategy is based on considering prior information as constraints on a probabilistic model representing the uncertainty about the data. More specifically, we represent the prior information by the maximum entropy (MaxEnt) distribution subject to these constraints. We briefly outline various measures that could subsequently be used to contrast patterns with this MaxEnt model, thus quantifying their subjective interestingness. We demonstrate this strategy for rectangular databases with knowledge of the row and column sums. This situation has been considered before using computation-intensive approaches based on swap randomizations, allowing for the computation of empirical p-values as interestingness measures (Gionis et al., ACM Trans Knowl Discov Data 1(3):14, 2007). We show how the MaxEnt model can be computed remarkably efficiently in this situation, and how it can be used for the same purpose as swap randomizations but computationally more efficiently. More importantly, being an explicitly represented distribution, the MaxEnt model can additionally be used to define analytically computable interestingness measures, as we demonstrate for tiles (Geerts et al., Proceedings of the 7th international conference on Discovery science (DS04), 2004) in binary databases.
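With row and column sums as the prior information, the MaxEnt model takes a particularly clean form: independent Bernoulli cells with success probability sigma(lambda_i + mu_j). The sketch below, which fits the dual parameters by plain gradient ascent on numpy arrays of row and column sums, only illustrates this form; the paper's actual fitting procedure is more refined.

```python
# Minimal sketch: MaxEnt model of a binary matrix with given row/column sums.
import numpy as np

def fit_maxent(row_sums, col_sums, n_iter=5000, lr=0.1):
    """Fit P[i, j] = sigma(lam[i] + mu[j]) so that expected row and column
    sums match the given ones (row_sums, col_sums: numpy arrays)."""
    n, m = len(row_sums), len(col_sums)
    lam, mu = np.zeros(n), np.zeros(m)
    for _ in range(n_iter):
        P = 1.0 / (1.0 + np.exp(-(lam[:, None] + mu[None, :])))
        lam += lr * (row_sums - P.sum(axis=1)) / m   # match expected row sums
        mu += lr * (col_sums - P.sum(axis=0)) / n    # match expected column sums
    return P  # P[i, j] = MaxEnt probability that cell (i, j) equals 1
```

Being explicitly represented, P immediately yields analytically computable interestingness values; for instance, the probability that a given tile is all ones is just the product of P over its cells.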

162 citations