Author

Kai Puolamäki

Bio: Kai Puolamäki is an academic researcher from the University of Helsinki. The author has contributed to research on the topics of supersymmetry and exploratory data analysis, has an h-index of 26, and has co-authored 122 publications receiving 2259 citations. Previous affiliations of Kai Puolamäki include the Helsinki Institute of Physics and the Helsinki Institute for Information Technology.


Papers
Journal Article
TL;DR: The early Miocene retained the overall humid conditions of the late Paleogene, while the late Miocene as a whole was a time of large changes, and there was continent-wide restructuring of the distribution of environments.
Abstract: Background: We developed a method to estimate precipitation using mammalian ecomorphology, specifically the relative height of the molars of herbivores (see companion paper, this issue). Question: If we apply the new method to paleoenvironments, do the results agree with previous results from fossil mammals and paleobotanical proxies? Data: Large herbivorous fossil mammals of Eurasia. Data from the NOW database covers 23–22 Ma and is Eurasia-wide. Method: We apply the new precipitation estimation method (based on present-day mammalian ecomorphology) to fossil assemblages from different localities. Conclusions: The early Miocene retained the overall humid conditions of the late Paleogene. A shift to more arid conditions began during the middle Miocene. The late Miocene as a whole was a time of large changes, and there was continent-wide restructuring of the distribution of environments. Our new results agree with previous investigations and the mammal proxy data are in good agreement with palaeovegetation data. Mammals and vegetation produce similar precipitation values and large-scale patterns.

123 citations

Journal Article
TL;DR: The methods unravelled the complex relationships between the environment and the characteristics of mammalian communities and provide a reasonably accurate estimate of precipitation values for today’s world.
Abstract: Question: How can mammalian community characteristics be used to estimate regional precipitation? Data: Global distribution data of large mammals and their ecomorphology; global climate data. Research methods: Non-linear regression-tree analysis and linear regression. Conclusions: The methods unravelled the complex relationships between the environment and the characteristics of mammalian communities. The regression trees described here provide a reasonably accurate estimate of precipitation values for today’s world. The strongest correlations are for annual precipitation versus diet (R² = 0.665), precipitation versus tooth crown height (R² = 0.658), and precipitation versus diet and tooth crown height combined (R² = 0.742).

102 citations

Proceedings Article
01 Jan 2009
TL;DR: This paper focuses on randomization techniques for unweighted undirected graphs for graph mining within the framework of statistical hypothesis testing, and describes three alternative algorithms based on local edge swapping and Metropolis sampling.
Abstract: Mining graph data is an active research area. Several data mining methods and algorithms have been proposed to identify structures from graphs; still, the evaluation of those results is lacking. Within the framework of statistical hypothesis testing, we focus in this paper on randomization techniques for unweighted undirected graphs. Randomization is an important approach to assess the statistical significance of data mining results. Given an input graph, our randomization method will sample data from the class of graphs that share certain structural properties with the input graph. Here we describe three alternative algorithms based on local edge swapping and Metropolis sampling. We test our framework with various graph data sets and mining algorithms for two applications, namely graph clustering and frequent subgraph mining.

98 citations
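
The local edge swapping mentioned in the abstract above can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes the preserved structural property is the degree of every node, and the function names `edge_swap_randomize` and `degrees` are hypothetical, chosen only for this illustration.

```python
import random
from collections import Counter


def degrees(edges):
    """Count how many edges touch each node."""
    counts = Counter()
    for edge in edges:
        for node in edge:
            counts[node] += 1
    return counts


def edge_swap_randomize(edges, n_swaps, seed=0):
    """Randomize a simple undirected graph while keeping all node degrees.

    Assumption for this sketch: node degrees are the property to preserve.
    Each step picks two edges (a, b) and (c, d) and tries to rewire them
    to (a, d) and (c, b); swaps that would create a self-loop or a
    duplicate edge are skipped.
    """
    rng = random.Random(seed)
    edge_set = {frozenset(e) for e in edges}
    edge_list = [tuple(e) for e in edge_set]
    if len(edge_list) < 2:
        return edge_set
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # try both orientations of the second edge
        if a == d or c == b:
            continue  # would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in edge_set or new2 in edge_set:
            continue  # would create a duplicate edge
        edge_set.remove(frozenset((a, b)))
        edge_set.remove(frozenset((c, d)))
        edge_set.add(new1)
        edge_set.add(new2)
        edge_list[i] = (a, d)
        edge_list[j] = (c, b)
    return edge_set


original = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)]
randomized = edge_swap_randomize(original, n_swaps=200, seed=1)
```

A statistic computed on the input graph can then be compared with its distribution over many such randomized graphs to gauge significance.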

Proceedings Article
15 Aug 2005
TL;DR: The best prediction accuracy still leaves room for improvement but shows that proactive information retrieval and combination of many sources of relevance feedback is feasible.
Abstract: We study a new task, proactive information retrieval by combining implicit relevance feedback and collaborative filtering. We have constructed a controlled experimental setting, a prototype application, in which the users try to find interesting scientific articles by browsing their titles. Implicit feedback is inferred from eye movement signals, with discriminative hidden Markov models estimated from existing data in which explicit relevance feedback is available. Collaborative filtering is carried out using the User Rating Profile model, a state-of-the-art probabilistic latent variable model, computed using Markov Chain Monte Carlo techniques. For new document titles the prediction accuracy with eye movements, collaborative filtering, and their combination was significantly better than by chance. The best prediction accuracy still leaves room for improvement but shows that proactive information retrieval and combination of many sources of relevance feedback is feasible.

96 citations

Proceedings Article
28 Jun 2009
TL;DR: This paper considers the problem of randomizing data so that previously discovered patterns or models are taken into account; the results indicate that in many cases the results of, e.g., clustering actually imply the results of, say, frequent pattern discovery.
Abstract: There is a wide variety of data mining methods available, and it is generally useful in exploratory data analysis to use many different methods for the same dataset. This, however, leads to the problem of whether the results found by one method are a reflection of the phenomenon shown by the results of another method, or whether the results depict in some sense unrelated properties of the data. For example, using clustering can give an indication of a clear cluster structure, and computing correlations between variables can show that there are many significant correlations in the data. However, it can be the case that the correlations are actually determined by the cluster structure. In this paper, we consider the problem of randomizing data so that previously discovered patterns or models are taken into account. The randomization methods can be used in iterative data mining. At each step in the data mining process, the randomization produces random samples from the set of data matrices satisfying the already discovered patterns or models. That is, given a data set and some statistics (e.g., cluster centers or co-occurrence counts) of the data, the randomization methods sample data sets having similar values of the given statistics as the original data set. We use Metropolis sampling based on local swaps to achieve this. We describe experiments on real data that demonstrate the usefulness of our approach. Our results indicate that in many cases, the results of, e.g., clustering actually imply the results of, say, frequent pattern discovery.

80 citations
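
The idea of Metropolis sampling based on local swaps, as described in the abstract above, can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's method: the data is assumed to be a 0/1 matrix, the local swap is assumed to be a 2x2 "checkerboard" swap that keeps row and column sums, and the names `metropolis_swap`, `cooccurrence`, and the `weight` parameter are hypothetical.

```python
import math
import random


def cooccurrence(matrix, c1=0, c2=1):
    """Example statistic: number of rows with a 1 in both columns."""
    return sum(1 for row in matrix if row[c1] and row[c2])


def metropolis_swap(matrix, stat, n_steps, weight=2.0, seed=0):
    """Sample a 0/1 matrix whose statistic stays close to the original's.

    Assumptions for this sketch: the proposal turns a 2x2 submatrix
    [[1, 0], [0, 1]] into [[0, 1], [1, 0]], and a proposal is accepted
    with probability min(1, exp(-weight * increase in error)), where
    error = |stat(sample) - stat(original)|.
    """
    rng = random.Random(seed)
    m = [list(row) for row in matrix]
    target = stat(m)
    err = 0.0
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(n_steps):
        r1, r2 = rng.sample(range(n_rows), 2)
        c1, c2 = rng.sample(range(n_cols), 2)
        if not (m[r1][c1] == 1 and m[r2][c2] == 1
                and m[r1][c2] == 0 and m[r2][c1] == 0):
            continue  # not a swappable checkerboard
        m[r1][c1] = m[r2][c2] = 0
        m[r1][c2] = m[r2][c1] = 1
        new_err = abs(stat(m) - target)
        if new_err <= err or rng.random() < math.exp(-weight * (new_err - err)):
            err = new_err  # accept the swap
        else:
            # Reject: undo the swap.
            m[r1][c1] = m[r2][c2] = 1
            m[r1][c2] = m[r2][c1] = 0
    return m


data = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
]
sample = metropolis_swap(data, cooccurrence, n_steps=500, seed=3)
```

A larger `weight` keeps samples closer to the original statistic; with `weight=0` every swap is accepted and only the row and column sums are constrained.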


Cited by
Journal Article
TL;DR: Table of contents of the Princeton Landmarks in Biology edition, from the importance of islands through colonization, biotic exchange, and evolutionary changes following colonization.
Abstract: Preface to the Princeton Landmarks in Biology Edition; Preface; Symbols Used; 1. The Importance of Islands; 2. Area and Number of Species; 3. Further Explanations of the Area-Diversity Pattern; 4. The Strategy of Colonization; 5. Invasibility and the Variable Niche; 6. Stepping Stones and Biotic Exchange; 7. Evolutionary Changes Following Colonization; 8. Prospect; Glossary; References; Index.

14,169 citations

Journal Article
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. 
Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

12,323 citations

Christopher M. Bishop
01 Jan 2006
TL;DR: The book covers probability distributions, linear models for regression and classification, neural networks, kernel methods, graphical models, approximate inference, sampling methods, and combining models.
Abstract: Probability Distributions; Linear Models for Regression; Linear Models for Classification; Neural Networks; Kernel Methods; Sparse Kernel Machines; Graphical Models; Mixture Models and EM; Approximate Inference; Sampling Methods; Continuous Latent Variables; Sequential Data; Combining Models.

10,141 citations

Book
24 Aug 2012
TL;DR: This textbook offers a comprehensive and self-contained introduction to the field of machine learning, based on a unified, probabilistic approach, and is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students.
Abstract: Today's Web-enabled deluge of electronic data calls for automated methods of data analysis. Machine learning provides these, developing methods that can automatically detect patterns in data and then use the uncovered patterns to predict future data. This textbook offers a comprehensive and self-contained introduction to the field of machine learning, based on a unified, probabilistic approach. The coverage combines breadth and depth, offering necessary background material on such topics as probability, optimization, and linear algebra as well as discussion of recent developments in the field, including conditional random fields, L1 regularization, and deep learning. The book is written in an informal, accessible style, complete with pseudo-code for the most important algorithms. All topics are copiously illustrated with color images and worked examples drawn from such application domains as biology, text processing, computer vision, and robotics. Rather than providing a cookbook of different heuristic methods, the book stresses a principled model-based approach, often using the language of graphical models to specify models in a concise and intuitive way. Almost all the models described have been implemented in a MATLAB software package--PMTK (probabilistic modeling toolkit)--that is freely available online. The book is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students.

7,045 citations

01 Jan 2004
TL;DR: Comprehensive and up-to-date, this book includes essential topics that either reflect practical significance or are of theoretical importance and describes numerous important application areas such as image based rendering and digital libraries.
Abstract: From the Publisher: The accessible presentation of this book gives both a general view of the entire computer vision enterprise and also offers sufficient detail to be able to build useful applications. Users learn techniques that have proven to be useful by first-hand experience and a wide range of mathematical methods. A CD-ROM with every copy of the text contains source code for programming practice, color images, and illustrative movies. Comprehensive and up-to-date, this book includes essential topics that either reflect practical significance or are of theoretical importance. Topics are discussed in substantial and increasing depth. Application surveys describe numerous important application areas such as image based rendering and digital libraries. Many important algorithms are broken down and illustrated in pseudocode. Appropriate for use by engineers as a comprehensive reference to the computer vision enterprise.

3,492 citations