
Showing papers by "Kai Puolamäki" published in 2009


Proceedings Article
01 Jan 2009
TL;DR: This paper focuses on randomization techniques for unweighted undirected graphs for graph mining within the framework of statistical hypothesis testing, and describes three alternative algorithms based on local edge swapping and Metropolis sampling.
Abstract: Mining graph data is an active research area. Several data mining methods and algorithms have been proposed to identify structures from graphs; still, the evaluation of those results is lacking. Within the framework of statistical hypothesis testing, we focus in this paper on randomization techniques for unweighted undirected graphs. Randomization is an important approach to assess the statistical significance of data mining results. Given an input graph, our randomization method will sample data from the class of graphs that share certain structural properties with the input graph. Here we describe three alternative algorithms based on local edge swapping and Metropolis sampling. We test our framework with various graph data sets and mining algorithms for two applications, namely graph clustering and frequent subgraph mining.

104 citations
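
The local edge swapping named in the abstract above can be illustrated with a short Python sketch that randomizes an unweighted undirected graph while keeping every node's degree fixed. This is a minimal sketch under stated assumptions, not the paper's algorithm: the names `swap_randomize` and `degrees`, the edge-list representation, and the choice of node degree as the preserved property are all introduced here for illustration, and the Metropolis step is left out. The input is assumed to be a simple graph (no self-loops, no duplicate edges).

```python
import random


def degrees(edges):
    """Return a dict mapping each node to its degree."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


def swap_randomize(edges, n_swaps, seed=None):
    """Attempt n_swaps local edge swaps; node degrees never change.

    A swap takes edges (a, b) and (c, d) and proposes (a, d) and (c, b).
    Proposals that would create a self-loop or a duplicate edge are skipped.
    """
    rng = random.Random(seed)
    edge_list = [tuple(sorted(e)) for e in edges]
    edge_set = set(edge_list)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # consider both orientations of the second edge
        if a == d or c == b:
            continue  # would create a self-loop
        new1 = tuple(sorted((a, d)))
        new2 = tuple(sorted((c, b)))
        if new1 in edge_set or new2 in edge_set:
            continue  # would create a duplicate edge
        edge_set.discard(edge_list[i])
        edge_set.discard(edge_list[j])
        edge_set.add(new1)
        edge_set.add(new2)
        edge_list[i], edge_list[j] = new1, new2
    return edge_list
```

In a significance test, a statistic computed on the input graph (for example a clustering score) would be compared against the same statistic computed on many such randomized graphs.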


Proceedings ArticleDOI
28 Jun 2009
TL;DR: This paper considers the problem of randomizing data so that previously discovered patterns or models are taken into account; the results indicate that in many cases the results of, e.g., clustering actually imply the results of, say, frequent pattern discovery.
Abstract: There is a wide variety of data mining methods available, and it is generally useful in exploratory data analysis to use many different methods for the same dataset. This, however, leads to the problem of whether the results found by one method are a reflection of the phenomenon shown by the results of another method, or whether the results depict in some sense unrelated properties of the data. For example, using clustering can give indication of a clear cluster structure, and computing correlations between variables can show that there are many significant correlations in the data. However, it can be the case that the correlations are actually determined by the cluster structure. In this paper, we consider the problem of randomizing data so that previously discovered patterns or models are taken into account. The randomization methods can be used in iterative data mining. At each step in the data mining process, the randomization produces random samples from the set of data matrices satisfying the already discovered patterns or models. That is, given a data set and some statistics (e.g., cluster centers or co-occurrence counts) of the data, the randomization methods sample data sets having similar values of the given statistics as the original data set. We use Metropolis sampling based on local swaps to achieve this. We describe experiments on real data that demonstrate the usefulness of our approach. Our results indicate that in many cases, the results of, e.g., clustering actually imply the results of, say, frequent pattern discovery.

81 citations
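
The idea in the abstract above, Metropolis sampling based on local swaps that keeps given statistics close to their original values, can be sketched for a 0/1 data matrix as follows. This is a hedged illustration, not the authors' implementation: the function name `metropolis_swap`, the squared-distance energy, the weight `w`, and the restriction to binary matrices are assumptions made here. Each local swap preserves all row and column sums; the Metropolis rule then accepts or rejects the swap depending on how far the user-supplied `statistic` drifts from its value on the original data.

```python
import math
import random


def metropolis_swap(matrix, statistic, n_steps, w=1.0, seed=None):
    """Sample a 0/1 matrix with the same margins and a similar statistic.

    statistic(m) must return a sequence of numbers (hypothetical interface).
    """
    rng = random.Random(seed)
    m = [row[:] for row in matrix]  # work on a copy
    target = statistic(matrix)

    def energy(x):
        # Assumed energy: squared distance to the original statistic.
        return sum((s - t) ** 2 for s, t in zip(statistic(x), target))

    e_old = energy(m)
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(n_steps):
        r1, r2 = rng.sample(range(n_rows), 2)
        c1, c2 = rng.sample(range(n_cols), 2)
        # A swap needs the 2x2 pattern [[1, 0], [0, 1]].
        if not (m[r1][c1] == m[r2][c2] == 1 and m[r1][c2] == m[r2][c1] == 0):
            continue
        m[r1][c1] = m[r2][c2] = 0
        m[r1][c2] = m[r2][c1] = 1
        e_new = energy(m)
        # Metropolis rule: always accept improvements, otherwise
        # accept with probability exp(-w * increase in energy).
        if e_new <= e_old or rng.random() < math.exp(-w * (e_new - e_old)):
            e_old = e_new
        else:
            m[r1][c1] = m[r2][c2] = 1  # undo the rejected swap
            m[r1][c2] = m[r2][c1] = 0
    return m
```

A larger `w` keeps the samples closer to the original statistic; `w = 0` reduces to plain margin-preserving swap randomization.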


Journal ArticleDOI
TL;DR: This work constructed a controlled experimental setting to show that when the system has no prior information as to what the user is searching, the eye movements help significantly in the search.
Abstract: We study a new research problem, where an implicit information retrieval query is inferred from eye movements measured when the user is reading, and used to retrieve new documents. In the training phase, the user's interest is known, and we learn a mapping from how the user looks at a term to the role of the term in the implicit query. Assuming the mapping is universal, that is, the same for all queries in a given domain, we can use it to construct queries even for new topics for which no learning data is available. We constructed a controlled experimental setting to show that when the system has no prior information as to what the user is searching, the eye movements help significantly in the search. This is the case in a proactive search, for instance, where the system monitors the reading behaviour of the user in a new topic. In contrast, during a search or reading session where the set of inspected documents is biased towards being relevant, a stronger strategy is to search for content-wise similar documents than to use the eye movements.

71 citations


Book ChapterDOI
27 Aug 2009
TL;DR: A fully Bayesian treatment of the permutations which performs better than alternatives and can even be used to compute summaries of the posterior samples for nonparametric Bayesian methods, for which no good solutions exist so far.
Abstract: The label switching problem, the unidentifiability of the permutation of clusters or more generally latent variables, makes interpretation of results computed with MCMC sampling difficult. We introduce a fully Bayesian treatment of the permutations which performs better than alternatives. The method can even be used to compute summaries of the posterior samples for nonparametric Bayesian methods, for which no good solutions exist so far. Although the method is approximate in that case, the results are very promising. The summaries are intuitively appealing: a summarized cluster is defined as a set of points for which the likelihood of being in the same cluster is maximized.

29 citations


Journal ArticleDOI
TL;DR: It is shown that a simple randomized algorithm has an expected constant factor approximation guarantee for fitting bucket orders to a set of pairwise preferences.

28 citations


01 Jan 2009
TL;DR: A prototype platform for accessing abstract information in real-world pervasive computing environments through Augmented Reality displays and the first use of the platform to develop a pilot application, a virtual laboratory guide, and early evaluation results are described.
Abstract: In this paper we report on a prototype platform for accessing abstract information in real-world pervasive computing environments through Augmented Reality displays. Objects, people, and the environment serve as contextual channels to more information. Adaptive models will infer from eye movement patterns and other implicit feedback signals the interests of users with respect to the environment, and results of proactive context-sensitive information retrieval are augmented onto the view of data glasses or other see-through displays. The augmented information becomes part of the context, and if it is relevant the system detects it and zooms progressively further. In this paper we describe the first use of the platform to develop a pilot application, a virtual laboratory guide, and early evaluation results.

24 citations


Journal ArticleDOI
TL;DR: A probabilistic latent grouping model for predicting the relevance of a document to a user and compares it against a state-of-the-art method, the User Rating Profile model, where only the users have a latent group structure.
Abstract: We tackle the problem of new users or documents in collaborative filtering. Generalization over users by grouping them into user groups is beneficial when a rating is to be predicted for a relatively new document having only few observed ratings. Analogously, generalization over documents improves predictions in the case of new users. We show that if either users and documents or both are new, two-way generalization becomes necessary. We demonstrate the benefits of grouping of users, grouping of documents, and two-way grouping, with artificial data and in two case studies with real data. We have introduced a probabilistic latent grouping model for predicting the relevance of a document to a user. The model assumes a latent group structure for both users and items. We compare the model against a state-of-the-art method, the User Rating Profile model, where only the users have a latent group structure. We compute the posterior of both models by Gibbs sampling. The Two-Way Model predicts relevance more accurately when the target consists of both new documents and new users. The reason is that generalization over documents becomes beneficial for new documents and at the same time generalization over users is needed for new users.

14 citations


01 Jan 2009
TL;DR: In this article, the authors present a case study of meteorological services in South Eastern Europe with potential benefits in Albania, Bosnia-Herzegovina, FYR Macedonia, Moldova and Montenegro.
Abstract:
102 Stephen Fox. Ontological uncertainty and semantic uncertainty in global network organizations. 2008. 122 p.
103 Kati Tillander, Helena Jarnstrom, Tuula Hakkarainen, Juha Laitinen, Mauri Makela, & Panu Oksa. Palokohteiden savu-, noki- ja kemikaalijaamat ja niiden vaikutukset tyoturvallisuuteen. Polttokokeet ja altistumisen arviointi. 2008. 67 s.
104 Eija Kupi, Sanna-Kaisa Ilomaki, Virpi Sillanpaa, Heli Talja & Antti Lonnqvist. Aineettoman paaoman riskienhallinta. Riskit ja riskienhallinnan kaytannot yrityksissa. 2008. 44 s.
105 Teemu Mutanen, Joni Niemi, Sami Nousiainen, Lauri Seitsonen & Teppo Veijonen. Cultural Event Recommendations. A Case Study. 2008. 17 p.
106 Hannele Holttinen. Tuulivoiman tuotantotilastot. Vuosiraportti 2007. 2008. 44 s. + liitt. 8 s.
107 Kari Keinanen, Jarkko Leino & Jani Suomalainen. Developing Keyboard Service for NoTA. 2008. 17 p. + app. 2 p.
108 Hannele Antikainen, Asta Back & Pirjo Nakki. Sosiaalisen median hyodyntaminen paikallisissa mediapalveluissa. 2008. 64 s.
109 Raine Hautala, Pekka Leviakangas, Jukka Rasanen, Risto Oorni, Sanna Sonninen, Pasi Vahanne, Martti Hekkanen, Mikael Ohlstrom, Bengt Tammelin, Seppo Saku & Ari Venalainen. Benefits of meteorological services in South Eastern Europe. An assessment of potential benefits in Albania, Bosnia-Herzegovina, FYR Macedonia, Moldova and Montenegro. 2008. 63 p. + app. 35 p.
110 Jaana Leikas. Ikaantyvat, teknologia ja etiikka. Nakokulmia ihmisen ja teknologian vuorovaikutustutkimukseen ja -suunnitteluun. 2008. 155 s.
111 Tomi J. Lindroos. Sectoral Approaches in the Case of the Iron and Steel Industry. 2008. 58 p. + app. 11 p.

9 citations


Book ChapterDOI
27 Aug 2009
TL;DR: This work suggests approximating the Two-Way Model with two URP models; one that groups users and one that groups documents, which achieves even better prediction performance than the original Two-Way Model.
Abstract: We tackle the problem of new users or documents in collaborative filtering. Generalization over users by grouping them into user groups is beneficial when a rating is to be predicted for a relatively new document having only few observed ratings. The same applies for documents in the case of new users. We have shown earlier that if there are both new users and new documents, two-way generalization becomes necessary, and introduced a probabilistic Two-Way Model for the task. The task of finding a two-way grouping is a non-trivial combinatorial problem, which makes it computationally difficult. We suggest approximating the Two-Way Model with two URP models; one that groups users and one that groups documents. Their two predictions are combined using a product of experts model. This combination of two one-way models achieves even better prediction performance than the original Two-Way Model.

7 citations
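
The abstract above combines the predictions of two one-way models with a product of experts. A minimal sketch of that combination step, assuming each model outputs a discrete probability distribution over the same set of relevance values, is given below. The function name `product_of_experts` and the list-based interface are hypothetical; how the two URP models are fitted is not shown.

```python
def product_of_experts(p_user_model, p_doc_model):
    """Combine two discrete predictive distributions.

    The combined probability of each outcome is proportional to the
    product of the two experts' probabilities for that outcome.
    """
    if len(p_user_model) != len(p_doc_model):
        raise ValueError("both experts must predict over the same outcomes")
    joint = [a * b for a, b in zip(p_user_model, p_doc_model)]
    total = sum(joint)
    if total == 0:
        raise ValueError("experts assign zero joint probability everywhere")
    return [x / total for x in joint]
```

For example, an uninformative expert `[0.5, 0.5]` leaves the other expert's prediction unchanged, while two experts that agree produce a sharper combined prediction than either alone.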


Posted Content
TL;DR: In this paper, the authors extend the multiple hypothesis framework to be used with a generic data mining algorithm, and provide a method that provably controls the family-wise error rate (FWER, the probability of at least one false positive) in the strong sense.
Abstract: The problem of multiple hypothesis testing arises when there is more than one hypothesis to be tested simultaneously for statistical significance. This is a very common situation in many data mining applications. For instance, assessing simultaneously the significance of all frequent itemsets of a single dataset entails a host of hypotheses, one for each itemset. A multiple hypothesis testing method is needed to control the number of false positives (Type I error). Our contribution in this paper is to extend the multiple hypothesis framework to be used with a generic data mining algorithm. We provide a method that provably controls the family-wise error rate (FWER, the probability of at least one false positive) in the strong sense. We evaluate the performance of our solution on both real and generated data. The results show that our method controls the FWER while maintaining the power of the test.

5 citations
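
To make the notion of FWER control in the abstract above concrete, the sketch below implements the standard Holm step-down procedure, which controls the family-wise error rate in the strong sense for a fixed set of hypotheses. It is shown only as a familiar baseline and is not the method proposed in the paper; the function name `holm_reject` and the example p-values are assumptions made for illustration.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure.

    Returns a list of booleans, True where the hypothesis is rejected.
    The k-th smallest p-value (k = 0, 1, ...) is compared against
    alpha / (m - k); testing stops at the first failure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # all larger p-values are retained as well
    return reject
```

With p-values `[0.01, 0.04, 0.03, 0.005]` and `alpha = 0.05`, the thresholds are 0.0125, 0.0167, 0.025 and 0.05, so the first and last hypotheses are rejected and the other two are retained.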