Author

Niko Vuokko

Bio: Niko Vuokko is an academic researcher from Helsinki University of Technology. The author has contributed to research in the topics of Sample (statistics) and Cluster analysis, has an h-index of 9, and has co-authored 12 publications receiving 299 citations. Previous affiliations of Niko Vuokko include the Helsinki Institute for Information Technology and Aalto University.

Papers
Proceedings ArticleDOI
28 Jun 2009
TL;DR: This paper considers the problem of randomizing data so that previously discovered patterns or models are taken into account; the results indicate that in many cases the results of, e.g., clustering actually imply the results of, say, frequent pattern discovery.
Abstract: There is a wide variety of data mining methods available, and it is generally useful in exploratory data analysis to use many different methods for the same dataset. This, however, leads to the problem of whether the results found by one method are a reflection of the phenomenon shown by the results of another method, or whether the results depict in some sense unrelated properties of the data. For example, using clustering can give indication of a clear cluster structure, and computing correlations between variables can show that there are many significant correlations in the data. However, it can be the case that the correlations are actually determined by the cluster structure. In this paper, we consider the problem of randomizing data so that previously discovered patterns or models are taken into account. The randomization methods can be used in iterative data mining. At each step in the data mining process, the randomization produces random samples from the set of data matrices satisfying the already discovered patterns or models. That is, given a data set and some statistics (e.g., cluster centers or co-occurrence counts) of the data, the randomization methods sample data sets having similar values of the given statistics as the original data set. We use Metropolis sampling based on local swaps to achieve this. We describe experiments on real data that demonstrate the usefulness of our approach. Our results indicate that in many cases, the results of, e.g., clustering actually imply the results of, say, frequent pattern discovery.
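The sampler the abstract describes is easy to prototype. Below is a minimal sketch of a Metropolis randomizer of this general flavor, assuming a 0/1 data matrix and a user-supplied statistic to preserve; the within-column swap move and the quadratic penalty on the statistic are illustrative choices, not the exact ones from the paper.

```python
# Minimal sketch: constrained randomization via Metropolis sampling on local swaps.
import numpy as np

def metropolis_randomize(X, statistic, n_steps=100_000, w=10.0, rng=None):
    """Sample a matrix that stays close to X w.r.t. `statistic`.

    X         : 0/1 data matrix (n x m), as a numpy array
    statistic : function mapping a matrix to a real-valued summary vector
    w         : weight of the penalty for drifting away from the target statistic
    """
    rng = rng or np.random.default_rng()
    X = X.copy()
    target = statistic(X)              # values to stay close to
    energy = 0.0                       # w * squared deviation from target
    n, m = X.shape
    for _ in range(n_steps):
        # Local swap: exchange two entries within one column, which keeps
        # the column sums exact while shuffling the row structure.
        i, j = rng.integers(n, size=2)
        k = rng.integers(m)
        if X[i, k] == X[j, k]:
            continue                   # swap would change nothing
        X[i, k], X[j, k] = X[j, k], X[i, k]
        new_energy = w * np.sum((statistic(X) - target) ** 2)
        # Metropolis step: always accept downhill moves, sometimes uphill ones.
        if new_energy <= energy or rng.random() < np.exp(energy - new_energy):
            energy = new_energy
        else:
            X[i, k], X[j, k] = X[j, k], X[i, k]   # reject: undo the swap
    return X
```

For example, statistic=lambda X: X.mean(axis=1) would keep the row densities approximately fixed, while the column sums stay exact by construction of the move.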

81 citations

Proceedings Article
01 Jan 2008
TL;DR: Three alternative algorithms based on local transformations and Metropolis sampling are described, and an evaluation on both real and generated data shows that they are efficient and usable in practice.
Abstract: Randomization is an important technique for assessing the significance of data mining results. Given an input data set, a randomization method samples at random from some class of datasets that share certain characteristics with the original data. The measure of interest on the original data is then compared to the measure on the samples to assess its significance. For certain types of data, e.g., gene expression matrices, it is useful to be able to sample datasets that share row and column means and variances. Testing whether the results of a data mining algorithm on such randomized datasets differ from the results on the true dataset tells us whether the results on the true data were an artifact of the row and column means and variances, or due to some more interesting phenomena in the data. In this paper, we study the problem of generating such randomized datasets. We describe three alternative algorithms based on local transformations and Metropolis sampling, and show that the methods are efficient and usable in practice. We evaluate the performance of the methods both on real and generated data. The results indicate that the methods work efficiently and solve the defined problem.
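As a rough illustration, the sketch below uses the simplest conceivable move, swapping two arbitrary cells (which preserves the overall value distribution exactly), and lets a Metropolis acceptance step keep the row and column means and variances near their original values. It is a sketch of the general idea under these assumptions only; the paper's actual local transformations and their tuning differ.

```python
# Minimal sketch: randomize a real-valued matrix while approximately
# preserving row and column means and variances.
import numpy as np

def row_col_stats(X):
    """Concatenated row/column means and variances of X."""
    return np.concatenate([X.mean(0), X.mean(1), X.var(0), X.var(1)])

def randomize_preserving_stats(X, n_steps=200_000, w=100.0, rng=None):
    rng = rng or np.random.default_rng()
    X = X.copy()
    target = row_col_stats(X)
    energy = 0.0
    n, m = X.shape
    for _ in range(n_steps):
        i1, i2 = rng.integers(n, size=2)
        j1, j2 = rng.integers(m, size=2)
        X[i1, j1], X[i2, j2] = X[i2, j2], X[i1, j1]        # propose a swap
        new_energy = w * np.sum((row_col_stats(X) - target) ** 2)
        if new_energy <= energy or rng.random() < np.exp(energy - new_energy):
            energy = new_energy                             # accept
        else:
            X[i1, j1], X[i2, j2] = X[i2, j2], X[i1, j1]     # reject: undo
    return X
```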

27 citations

Proceedings Article
01 Dec 2010
TL;DR: This work addresses the problem of reconstructing the original network and set of features given their randomized counterparts G′ and F′ and knowledge of the randomization model, and identifies the cases in which the original network G and feature vectors F can be reconstructed in polynomial time.
Abstract: In social networks, nodes correspond to entities and edges to links between them. In most of the cases, nodes are also associated with a set of features. Noise, missing values or efforts to preserve privacy in the network may transform the original network G and its feature vectors F. This transformation can be modeled as a randomization method. Here, we address the problem of reconstructing the original network and set of features given their randomized counterparts G′ and F′ and knowledge of the randomization model. We identify the cases in which the original network G and feature vectors F can be reconstructed in polynomial time. Finally, we illustrate the efficacy of our methods using both generated and real datasets.
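As a toy illustration of what "knowledge of the randomization model" buys, suppose each adjacency entry was flipped independently with a known probability p; this model is an assumption for illustration, not the one analyzed in the paper. For p < 0.5 the entry-wise maximum-likelihood reconstruction is simply the observed matrix, the kind of easy polynomial-time case the analysis separates from the genuinely hard ones.

```python
# Toy sketch: ML reconstruction under an assumed independent-flip model.
import numpy as np

def ml_reconstruct(A_observed, p):
    """Entry-wise maximum-likelihood estimate of the original adjacency
    matrix when each entry was flipped independently with probability p."""
    if not 0.0 <= p < 0.5:
        raise ValueError("observation is only informative for 0 <= p < 0.5")
    # P(original = a | observed = a) = 1 - p > p, so keeping each observed
    # entry maximizes the likelihood entry by entry.
    return np.asarray(A_observed).copy()
```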

26 citations

Journal ArticleDOI
TL;DR: Methods based on local transformations and Metropolis sampling are described, and it is shown that they are efficient and usable in practice for significance testing of data mining results on real-valued matrices.
Abstract: Randomization is an important technique for assessing the significance of data analysis results. Given an input dataset, a randomization method samples at random from some class of datasets that share certain characteristics with the original data. The measure of interest on the original data is then compared to the measure on the samples to assess its significance. For certain types of data, e.g., gene expression matrices, it is useful to be able to sample datasets that have the same row and column distributions of values as the original dataset. Testing whether the results of a data mining algorithm on such randomized datasets differ from the results on the true dataset tells us whether the results on the true data were an artifact of the row and column statistics, or due to some more interesting phenomena in the data. We study the problem of generating such randomized datasets. We describe methods based on local transformations and Metropolis sampling, and show that the methods are efficient and usable in practice. We evaluate the performance of the methods both on real and generated data. We also show how our methods can be applied to a real data analysis scenario on DNA microarray data. The results indicate that the methods work efficiently and are usable in significance testing of data mining results on real-valued matrices.
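The significance-testing loop that motivates these samplers is itself tiny. Below is a minimal sketch of the empirical p-value computation, assuming randomize draws one constrained random dataset (e.g., one of the samplers sketched above) and measure evaluates the data mining result of interest; both names are illustrative.

```python
# Minimal sketch of randomization-based significance testing.
import numpy as np

def empirical_p_value(X, measure, randomize, n_samples=999, rng=None):
    """One-sided empirical p-value: how often does a constrained random
    dataset score at least as high as the original? (Flip the inequality
    if smaller values of `measure` are the interesting ones.)"""
    rng = rng or np.random.default_rng()
    observed = measure(X)
    hits = sum(measure(randomize(X, rng=rng)) >= observed
               for _ in range(n_samples))
    # The +1 terms make the estimate conservative and never exactly zero.
    return (hits + 1) / (n_samples + 1)
```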

25 citations


Cited by
Proceedings ArticleDOI
22 Jan 2006
TL;DR: Some of the major results in random graphs and some of the more challenging open problems are reviewed, covering algorithmic and structural questions and touching on newer models, including those related to the WWW.
Abstract: We will review some of the major results in random graphs and some of the more challenging open problems. We will cover algorithmic and structural questions. We will touch on newer models, including those related to the WWW.

7,116 citations

Journal ArticleDOI
TL;DR: The analysis shows that studying the classification error via permutation tests is effective; in particular, the restricted permutation test clearly reveals whether the classifier exploits the interdependency between the features in the data.
Abstract: We explore the framework of permutation-based p-values for assessing the performance of classifiers. In this paper we study two simple permutation tests. The first test assesses whether the classifier has found a real class structure in the data; the corresponding null distribution is estimated by permuting the labels in the data. This test has been used extensively in classification problems in computational biology. The second test studies whether the classifier is exploiting the dependency between the features in classification; the corresponding null distribution is estimated by permuting the features within classes, inspired by restricted randomization techniques traditionally used in statistics. This new test can serve to identify descriptive features which can be valuable information in improving the classifier performance. We study the properties of these tests and present an extensive empirical evaluation on real and synthetic data. Our analysis shows that studying the classifier performance via permutation tests is effective. In particular, the restricted permutation test clearly reveals whether the classifier exploits the interdependency between the features in the data.
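Both tests are small to implement. Below is a minimal sketch, assuming a score(clf, X, y) function that returns, say, a cross-validated accuracy for some classifier; all names are illustrative rather than taken from the paper.

```python
# Minimal sketch of the two permutation tests for classifier performance.
import numpy as np

def label_permutation_test(score, clf, X, y, n_perm=999, rng=None):
    """Test 1: is there any real class structure? Permute the labels."""
    rng = rng or np.random.default_rng()
    observed = score(clf, X, y)
    hits = sum(score(clf, X, rng.permutation(y)) >= observed
               for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)

def restricted_permutation_test(score, clf, X, y, n_perm=999, rng=None):
    """Test 2: does the classifier exploit dependencies between features?
    Permute each feature column independently within each class, preserving
    class-conditional marginals while destroying feature interdependence."""
    rng = rng or np.random.default_rng()
    observed = score(clf, X, y)
    hits = 0
    for _ in range(n_perm):
        Xp = X.copy()
        for c in np.unique(y):
            idx = np.flatnonzero(y == c)
            for j in range(X.shape[1]):
                Xp[idx, j] = X[rng.permutation(idx), j]
        hits += score(clf, Xp, y) >= observed
    return (hits + 1) / (n_perm + 1)
```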

436 citations

Proceedings ArticleDOI
06 Dec 2009
TL;DR: The authors explore the framework of permutation-based p-values for assessing the behavior of the classification error, studying two simple permutation tests: the first estimates the null distribution by permuting the labels in the data, as used extensively in classification problems in computational biology; the second permutes the features within classes, inspired by restricted randomization techniques traditionally used in statistics.
Abstract: We explore the framework of permutation-based p-values for assessing the behavior of the classification error. In this paper we study two simple permutation tests. The first test estimates the null distribution by permuting the labels in the data; this has been used extensively in classification problems in computational biology. The second test produces permutations of the features within classes, inspired by restricted randomization techniques traditionally used in statistics. We study the properties of these tests and present an extensive empirical evaluation on real and synthetic data. Our analysis shows that studying the classification error via permutation tests is effective; in particular, the restricted permutation test clearly reveals whether the classifier exploits the interdependency between the features in the data.

392 citations

01 Jan 1993
TL;DR: In this paper, it is shown that 1/f processes are optimally represented in terms of orthonormal wavelet bases, and the wavelet expansion's role as a Karhunen-Loeve-type expansion is developed.
Abstract: The 1/f family of fractal random processes models a truly extraordinary range of natural and man-made phenomena, many of which arise in a variety of signal processing scenarios. Yet despite their apparent importance, the lack of convenient representations for 1/f processes has, at least until recently, strongly limited their popularity. In this paper, we demonstrate that 1/f processes are, in a broad sense, optimally represented in terms of orthonormal wavelet bases. Specifically, via a useful frequency domain characterization for 1/f processes, we develop the wavelet expansion's role as a Karhunen-Loeve-type expansion for 1/f processes. As an illustration of potential, we show that wavelet-based representations naturally lead to highly efficient solutions to some fundamental detection and estimation problems involving 1/f processes.
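The Karhunen-Loeve-like property suggests a simple synthesis recipe: draw independent wavelet coefficients whose variance grows geometrically toward coarser scales, then invert the transform. The sketch below assumes the PyWavelets package and a Haar basis (so the coefficient lengths line up exactly) and generates an approximately 1/f^gamma signal; the parameterization is illustrative.

```python
# Minimal sketch: synthesize approximately 1/f^gamma noise from wavelet
# coefficients with geometrically growing variance across scales.
import numpy as np
import pywt  # assumes the PyWavelets package is installed

def synthesize_one_over_f(n_levels=10, gamma=1.0, rng=None):
    rng = rng or np.random.default_rng()
    # Coefficient list for pywt.waverec: [coarsest approximation,
    # coarsest details, ..., finest details]. Coarser scales get larger
    # standard deviation 2**(gamma * level / 2), i.e. variance 2**(gamma * level).
    coeffs = [rng.standard_normal(1) * 2 ** (gamma * n_levels / 2)]
    for j in range(n_levels):
        level = n_levels - j                     # n_levels = coarsest, 1 = finest
        std = 2 ** (gamma * level / 2)
        coeffs.append(rng.standard_normal(2 ** j) * std)
    return pywt.waverec(coeffs, "haar")          # signal of length 2**n_levels
```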

314 citations

Journal ArticleDOI
TL;DR: It is shown how the MaxEnt model can be computed remarkably efficiently in this situation, and how it can serve the same purpose as swap randomizations at a lower computational cost.
Abstract: Recent research has highlighted the practical benefits of subjective interestingness measures, which quantify the novelty or unexpectedness of a pattern when contrasted with any prior information of the data miner (Silberschatz and Tuzhilin, Proceedings of the 1st ACM SIGKDD international conference on Knowledge discovery and data mining (KDD95), 1995; Geng and Hamilton, ACM Comput Surv 38(3):9, 2006). A key challenge here is the formalization of this prior information in a way that lends itself to the definition of a subjective interestingness measure that is both meaningful and practical. In this paper, we outline a general strategy of how this could be achieved, before working out the details for a use case that is important in its own right. Our general strategy is based on considering prior information as constraints on a probabilistic model representing the uncertainty about the data. More specifically, we represent the prior information by the maximum entropy (MaxEnt) distribution subject to these constraints. We briefly outline various measures that could subsequently be used to contrast patterns with this MaxEnt model, thus quantifying their subjective interestingness. We demonstrate this strategy for rectangular databases with knowledge of the row and column sums. This situation has been considered before using computation-intensive approaches based on swap randomizations, allowing for the computation of empirical p-values as interestingness measures (Gionis et al., ACM Trans Knowl Discov Data 1(3):14, 2007). We show how the MaxEnt model can be computed remarkably efficiently in this situation, and how it can be used for the same purpose as swap randomizations but computationally more efficiently. More importantly, being an explicitly represented distribution, the MaxEnt model can additionally be used to define analytically computable interestingness measures, as we demonstrate for tiles (Geerts et al., Proceedings of the 7th international conference on Discovery science (DS04), 2004) in binary databases.
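With row and column sums as the prior information, the MaxEnt model takes a particularly clean form: independent Bernoulli cells with success probability sigma(lambda_i + mu_j). The sketch below, which fits the dual parameters by plain gradient ascent on numpy arrays of row and column sums, only illustrates this form; the paper's actual fitting procedure is more refined.

```python
# Minimal sketch: MaxEnt model of a binary matrix with given row/column sums.
import numpy as np

def fit_maxent(row_sums, col_sums, n_iter=5000, lr=0.1):
    """Fit P[i, j] = sigma(lam[i] + mu[j]) so that expected row and column
    sums match the given ones (row_sums, col_sums: numpy arrays)."""
    n, m = len(row_sums), len(col_sums)
    lam, mu = np.zeros(n), np.zeros(m)
    for _ in range(n_iter):
        P = 1.0 / (1.0 + np.exp(-(lam[:, None] + mu[None, :])))
        lam += lr * (row_sums - P.sum(axis=1)) / m   # match expected row sums
        mu += lr * (col_sums - P.sum(axis=0)) / n    # match expected column sums
    return P  # P[i, j] = MaxEnt probability that cell (i, j) equals 1
```

Being explicitly represented, P immediately yields analytically computable interestingness values; for instance, the probability that a given tile is all ones is just the product of P over its cells.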

162 citations