Showing papers by "Kai Puolamäki" published in 2009

Proceedings Article

Sami Hanhijärvi¹, Gemma C. Garriga², Kai Puolamäki³•Institutions (3)

Aalto University¹, Helsinki University of Technology², University of Helsinki³

01 Jan 2009

TL;DR: This paper focuses on randomization techniques for unweighted undirected graphs for graph mining within the framework of statistical hypothesis testing, and describes three alternative algorithms based on local edge swapping and Metropolis sampling.

Abstract: Mining graph data is an active research area. Several data mining methods and algorithms have been proposed to identify structures from graphs; still, the evaluation of those results is lacking. Within the framework of statistical hypothesis testing, we focus in this paper on randomization techniques for unweighted undirected graphs. Randomization is an important approach to assess the statistical significance of data mining results. Given an input graph, our randomization method will sample data from the class of graphs that share certain structural properties with the input graph. Here we describe three alternative algorithms based on local edge swapping and Metropolis sampling. We test our framework with various graph data sets and mining algorithms for two applications, namely graph clustering and frequent subgraph mining.
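
Illustration: the abstract does not spell out its three algorithms, so the following is only a minimal sketch of the general idea of local edge swapping, assuming a simple undirected graph given as a list of node pairs. The function name `edge_swap_randomize` and its parameters are hypothetical and not taken from the paper. Each accepted swap replaces edges (a, b) and (c, d) with (a, d) and (c, b), which leaves every node's degree unchanged.

```python
import random


def edge_swap_randomize(edges, n_swaps, seed=0):
    """Randomize a simple undirected graph while preserving node degrees.

    Hypothetical helper for illustration; assumes no self-loops in `edges`.
    """
    rng = random.Random(seed)
    # Store each undirected edge once, in canonical (sorted) form.
    edge_list = sorted({tuple(sorted(e)) for e in edges})
    edge_set = set(edge_list)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        # Randomly orient the second edge so both rewirings are possible.
        if rng.random() < 0.5:
            c, d = d, c
        # Reject proposals that would create a self-loop.
        if a == d or c == b:
            continue
        new1 = tuple(sorted((a, d)))
        new2 = tuple(sorted((c, b)))
        # Reject proposals that would create a duplicate edge.
        if new1 in edge_set or new2 in edge_set:
            continue
        edge_set.discard(edge_list[i])
        edge_set.discard(edge_list[j])
        edge_set.update((new1, new2))
        edge_list[i], edge_list[j] = new1, new2
    return edge_list
```

A significance test in this spirit would compare a statistic computed on the input graph with the same statistic computed on many such randomized graphs.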

104 citations

Proceedings Article

Tell me something I don't know: randomization strategies for iterative data mining

Sami Hanhijärvi¹, Markus Ojala¹, Niko Vuokko¹, Kai Puolamäki¹, Nikolaj Tatti¹, Heikki Mannila¹•Institutions (1)

Helsinki University of Technology¹

28 Jun 2009

TL;DR: This paper considers the problem of randomizing data so that previously discovered patterns or models are taken into account; the results indicate that in many cases the results of, e.g., clustering actually imply the results of, say, frequent pattern discovery.

Abstract: There is a wide variety of data mining methods available, and it is generally useful in exploratory data analysis to use many different methods for the same dataset. This, however, leads to the problem of whether the results found by one method are a reflection of the phenomenon shown by the results of another method, or whether the results depict in some sense unrelated properties of the data. For example, using clustering can give an indication of a clear cluster structure, and computing correlations between variables can show that there are many significant correlations in the data. However, it can be the case that the correlations are actually determined by the cluster structure. In this paper, we consider the problem of randomizing data so that previously discovered patterns or models are taken into account. The randomization methods can be used in iterative data mining. At each step in the data mining process, the randomization produces random samples from the set of data matrices satisfying the already discovered patterns or models. That is, given a data set and some statistics (e.g., cluster centers or co-occurrence counts) of the data, the randomization methods sample data sets having similar values of the given statistics as the original data set. We use Metropolis sampling based on local swaps to achieve this. We describe experiments on real data that demonstrate the usefulness of our approach. Our results indicate that in many cases, the results of, e.g., clustering actually imply the results of, say, frequent pattern discovery.
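
Illustration: a minimal sketch of Metropolis sampling with local swaps, under assumptions that are not taken from the paper. Here a local swap exchanges two values within one column, the constraint is a user-supplied statistic `stat` returning a list of numbers, and the energy is the squared distance between the statistic of the candidate matrix and that of the original. The names `metropolis_swap_sample` and `weight` are hypothetical.

```python
import math
import random


def metropolis_swap_sample(data, stat, n_steps, weight=10.0, seed=0):
    """Return a randomized copy of `data` whose statistic stays near stat(data).

    Hypothetical sketch: `data` is a list of equal-length rows of numbers.
    """
    rng = random.Random(seed)
    x = [row[:] for row in data]  # work on a copy, leave the input untouched
    target = stat(data)

    def energy(matrix):
        # Squared distance between the candidate statistic and the target.
        return sum((s - t) ** 2 for s, t in zip(stat(matrix), target))

    e = energy(x)
    n_rows, n_cols = len(x), len(x[0])
    for _ in range(n_steps):
        col = rng.randrange(n_cols)
        r1, r2 = rng.sample(range(n_rows), 2)
        # Propose a local swap within one column (a symmetric proposal).
        x[r1][col], x[r2][col] = x[r2][col], x[r1][col]
        e_new = energy(x)
        # Metropolis rule: always accept improvements, otherwise accept
        # with probability exp(-weight * increase in energy).
        if e_new <= e or rng.random() < math.exp(-weight * (e_new - e)):
            e = e_new
        else:
            # Rejected: undo the swap.
            x[r1][col], x[r2][col] = x[r2][col], x[r1][col]
    return x
```

Because swaps stay within a column, each column keeps its multiset of values, while a larger `weight` keeps the sampled matrices closer to the original statistic.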

81 citations

Journal Article

Can eyes reveal interest? Implicit queries from gaze patterns

Antti Ajanki¹, David R. Hardoon², Samuel Kaski¹, Kai Puolamäki¹, John Shawe-Taylor²•Institutions (2)

Helsinki Institute for Information Technology¹, University College London²

01 Oct 2009-User Modeling and User-adapted Interaction

TL;DR: This work constructed a controlled experimental setting to show that when the system has no prior information as to what the user is searching, the eye movements help significantly in the search.

Abstract: We study a new research problem, where an implicit information retrieval query is inferred from eye movements measured when the user is reading, and used to retrieve new documents. In the training phase, the user's interest is known, and we learn a mapping from how the user looks at a term to the role of the term in the implicit query. Assuming the mapping is universal, that is, the same for all queries in a given domain, we can use it to construct queries even for new topics for which no learning data is available. We constructed a controlled experimental setting to show that when the system has no prior information as to what the user is searching, the eye movements help significantly in the search. This is the case in a proactive search, for instance, where the system monitors the reading behaviour of the user in a new topic. In contrast, during a search or reading session where the set of inspected documents is biased towards being relevant, a stronger strategy is to search for content-wise similar documents than to use the eye movements.

71 citations

Book Chapter

Bayesian Solutions to the Label Switching Problem

Kai Puolamäki¹, Samuel Kaski¹•Institutions (1)

Helsinki Institute for Information Technology¹

27 Aug 2009

TL;DR: A fully Bayesian treatment of the permutations which performs better than alternatives and can even be used to compute summaries of the posterior samples for nonparametric Bayesian methods, for which no good solutions exist so far.

Abstract: The label switching problem, the unidentifiability of the permutation of clusters or more generally latent variables, makes interpretation of results computed with MCMC sampling difficult. We introduce a fully Bayesian treatment of the permutations which performs better than alternatives. The method can even be used to compute summaries of the posterior samples for nonparametric Bayesian methods, for which no good solutions exist so far. Although being approximative in that case, the results are very promising. The summaries are intuitively appealing: A summarized cluster is defined as a set of points for which the likelihood of being in the same cluster is maximized.

29 citations

Journal Article

A randomized approximation algorithm for computing bucket orders

Antti Ukkonen¹, Kai Puolamäki¹, Aristides Gionis², Heikki Mannila¹•Institutions (2)

Helsinki Institute for Information Technology¹, Yahoo!²

01 Mar 2009-Information Processing Letters

TL;DR: It is shown that a simple randomized algorithm has an expected constant factor approximation guarantee for fitting bucket orders to a set of pairwise preferences.

28 citations

Ubiquitous Contextual Information Access with Proactive Retrieval and Augmentation

Antti Ajanki, Mark Billinghurst, Melih Kandemir, Samuel Kaski, Markus Koskela, Mikko Kurimo, Jorma Laaksonen, Kai Puolamäki, Timo Tossavainen

01 Jan 2009

TL;DR: A prototype platform for accessing abstract information in real-world pervasive computing environments through Augmented Reality displays and the first use of the platform to develop a pilot application, a virtual laboratory guide, and early evaluation results are described.

Abstract: In this paper we report on a prototype platform for accessing abstract information in real-world pervasive computing environments through Augmented Reality displays. Objects, people, and the environment serve as contextual channels to more information. Adaptive models will infer from eye movement patterns and other implicit feedback signals the interests of users with respect to the environment, and results of proactive context-sensitive information retrieval are augmented onto the view of data glasses or other see-through displays. The augmented information becomes part of the context, and if it is relevant the system detects it and zooms progressively further. In this paper we describe the first use of the platform to develop a pilot application, a virtual laboratory guide, and early evaluation results.

24 citations

Journal Article

Latent grouping models for user preference prediction

Eerika Savia¹, Kai Puolamäki¹, Samuel Kaski¹•Institutions (1)

Helsinki Institute for Information Technology¹

01 Jan 2009-Machine Learning

TL;DR: A probabilistic latent grouping model for predicting the relevance of a document to a user and compares it against a state-of-the-art method, the User Rating Profile model, where only the users have a latent group structure.

Abstract: We tackle the problem of new users or documents in collaborative filtering. Generalization over users by grouping them into user groups is beneficial when a rating is to be predicted for a relatively new document having only few observed ratings. Analogously, generalization over documents improves predictions in the case of new users. We show that if either users and documents or both are new, two-way generalization becomes necessary. We demonstrate the benefits of grouping of users, grouping of documents, and two-way grouping, with artificial data and in two case studies with real data. We have introduced a probabilistic latent grouping model for predicting the relevance of a document to a user. The model assumes a latent group structure for both users and items. We compare the model against a state-of-the-art method, the User Rating Profile model, where only the users have a latent group structure. We compute the posterior of both models by Gibbs sampling. The Two-Way Model predicts relevance more accurately when the target consists of both new documents and new users. The reason is that generalization over documents becomes beneficial for new documents and at the same time generalization over users is needed for new users.

14 citations

Visual Analytics: Final report

Paula Järvinen, Kai Puolamäki, Pekka Siltanen, Markus Ylikerälä

01 Jan 2009

9 citations

Book Chapter

Two-Way Grouping by One-Way Topic Models

Eerika Savia¹, Kai Puolamäki¹, Samuel Kaski¹•Institutions (1)

Helsinki Institute for Information Technology¹

27 Aug 2009

TL;DR: This work suggests approximating the Two-Way Model with two URP models, one that groups users and one that groups documents, which achieves even better prediction performance than the original Two-Way Model.

Abstract: We tackle the problem of new users or documents in collaborative filtering. Generalization over users by grouping them into user groups is beneficial when a rating is to be predicted for a relatively new document having only few observed ratings. The same applies for documents in the case of new users. We have shown earlier that if there are both new users and new documents, two-way generalization becomes necessary, and introduced a probabilistic Two-Way Model for the task. The task of finding a two-way grouping is a non-trivial combinatorial problem, which makes it computationally difficult. We suggest approximating the Two-Way Model with two URP models; one that groups users and one that groups documents. Their two predictions are combined using a product of experts model. This combination of two one-way models achieves even better prediction performance than the original Two-Way Model.
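
Illustration: the abstract does not give the exact form of its combination, so this is only a sketch of a product of experts for a binary relevance prediction, assuming each one-way model outputs a probability that the document is relevant. The function name `product_of_experts` is hypothetical. The two probabilities are multiplied and renormalized over the two outcomes (relevant, not relevant).

```python
def product_of_experts(p_user_model, p_doc_model):
    """Combine two binary relevance probabilities by a normalized product.

    Hypothetical helper; inputs are probabilities in [0, 1].
    """
    relevant = p_user_model * p_doc_model
    not_relevant = (1.0 - p_user_model) * (1.0 - p_doc_model)
    total = relevant + not_relevant
    if total == 0.0:
        # The experts contradict each other with full certainty.
        raise ValueError("experts assign zero probability to both outcomes")
    return relevant / total
```

An expert that is undecided (probability 0.5) leaves the other expert's prediction unchanged, while two experts that agree reinforce each other.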

7 citations

Posted Content

Multiple Hypothesis Testing in Pattern Discovery

Sami Hanhijärvi, Kai Puolamäki, Gemma C. Garriga

29 Jun 2009-arXiv: Machine Learning

TL;DR: In this paper, the authors extend the multiple hypothesis framework to be used with a generic data mining algorithm, and provide a method that provably controls the family-wise error rate (FWER, the probability of at least one false positive) in the strong sense.

Abstract: The problem of multiple hypothesis testing arises when there is more than one hypothesis to be tested simultaneously for statistical significance. This is a very common situation in many data mining applications. For instance, assessing simultaneously the significance of all frequent itemsets of a single dataset entails a host of hypotheses, one for each itemset. A multiple hypothesis testing method is needed to control the number of false positives (Type I error). Our contribution in this paper is to extend the multiple hypothesis framework to be used with a generic data mining algorithm. We provide a method that provably controls the family-wise error rate (FWER, the probability of at least one false positive) in the strong sense. We evaluate the performance of our solution on both real and generated data. The results show that our method controls the FWER while maintaining the power of the test.
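
Illustration: the paper's own method is not described in enough detail here to reproduce, so the sketch below shows a different, standard procedure, the Holm-Bonferroni step-down method, which also controls the FWER in the strong sense. It is included only to make the notion of FWER control concrete. The function name `holm_reject` is hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni step-down procedure (not the paper's method).

    Returns a list of booleans: True where the hypothesis is rejected.
    """
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared
        # against alpha / (m - k + 1).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first non-rejection
    return reject
```

With p-values 0.01, 0.04, 0.03 and 0.005 at alpha = 0.05, the thresholds are 0.0125, 0.0167 and 0.025, so only the hypotheses with p-values 0.005 and 0.01 are rejected.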

5 citations

Proceedings Article

Proceedings of the ACM SIGKDD Workshop on Visual Analytics and Knowledge Discovery: Integrating Automated Analysis with Interactive Exploration
