Journal ArticleDOI

The Statistical Analysis of Compositional Data

01 Jul 1987 - Vol. 150, Iss. 4, p. 396
About: The article was published on 1987-07-01 and has received 4,051 citations to date. It focuses on the topic of compositional data.
Citations
Book
24 Aug 2012
TL;DR: This textbook offers a comprehensive and self-contained introduction to the field of machine learning, based on a unified, probabilistic approach, and is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students.
Abstract: Today's Web-enabled deluge of electronic data calls for automated methods of data analysis. Machine learning provides these, developing methods that can automatically detect patterns in data and then use the uncovered patterns to predict future data. This textbook offers a comprehensive and self-contained introduction to the field of machine learning, based on a unified, probabilistic approach. The coverage combines breadth and depth, offering necessary background material on such topics as probability, optimization, and linear algebra as well as discussion of recent developments in the field, including conditional random fields, L1 regularization, and deep learning. The book is written in an informal, accessible style, complete with pseudo-code for the most important algorithms. All topics are copiously illustrated with color images and worked examples drawn from such application domains as biology, text processing, computer vision, and robotics. Rather than providing a cookbook of different heuristic methods, the book stresses a principled model-based approach, often using the language of graphical models to specify models in a concise and intuitive way. Almost all the models described have been implemented in a MATLAB software package--PMTK (probabilistic modeling toolkit)--that is freely available online. The book is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students.
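Purely as an illustration of one topic the abstract lists (L1 regularization), and not of the book's own PMTK MATLAB toolkit, here is a minimal scikit-learn sketch; the data and settings are made up.

```python
# Illustrative only: L1-regularised (sparse) logistic regression, one of the
# topics the abstract mentions. Not the book's PMTK toolkit.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))          # 20 features, only 3 truly informative
y = (X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(size=200) > 0).astype(int)

# The L1 penalty tends to drive uninformative coefficients to exactly zero.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
print("non-zero coefficients:", int(np.sum(clf.coef_ != 0)))
```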

8,059 citations

Journal ArticleDOI
TL;DR: Methods specifically designed for collinearity, such as latent variable methods and tree-based models, did not outperform the traditional GLM and threshold-based pre-selection; the results highlight the value of GLM in combination with penalised methods and threshold-based pre-selection when omitted variables are considered in the final interpretation.
Abstract: Collinearity refers to the non-independence of predictor variables, usually in a regression-type analysis. It is a common feature of any descriptive ecological data set and can be a problem for parameter estimation because it inflates the variance of regression parameters and hence potentially leads to the wrong identification of relevant predictors in a statistical model. Collinearity is a severe problem when a model is trained on data from one region or time and used to predict to another with a different or unknown structure of collinearity. To demonstrate the reach of the problem of collinearity in ecology, we show how relationships among predictors differ between biomes, change over spatial scales and through time. Across disciplines, different approaches to addressing collinearity problems have been developed, ranging from clustering of predictors and threshold-based pre-selection, through latent variable methods, to shrinkage and regularisation. Using simulated data with five predictor-response relationships of increasing complexity and eight levels of collinearity, we compared ways to address collinearity with standard multiple regression and machine-learning approaches. We assessed the performance of each approach by testing its impact on prediction to new data. In the extreme, we tested whether the methods were able to identify the true underlying relationship in a training dataset with strong collinearity by evaluating their performance on a test dataset without any collinearity. We found that methods specifically designed for collinearity, such as latent variable methods and tree-based models, did not outperform the traditional GLM and threshold-based pre-selection. Our results highlight the value of GLM in combination with penalised methods (particularly ridge) and threshold-based pre-selection when omitted variables are considered in the final interpretation. However, all approaches tested yielded degraded predictions under a change in collinearity structure, and the 'folklore' threshold of |r| > 0.7 for correlation between predictor variables was an appropriate indicator of when collinearity begins to severely distort model estimation and subsequent prediction. The use of ecological understanding of the system in pre-analysis variable selection and the choice of the least sensitive statistical approaches reduce the problems of collinearity but cannot ultimately solve them.
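As a toy illustration of two ideas from this abstract, the |r| > 0.7 screening rule and ridge regression as a penalised alternative to ordinary least squares, here is a short Python sketch on simulated collinear data; all numbers are made up and not from the paper.

```python
# Sketch of the |r| > 0.7 screening rule and a ridge fit on simulated
# collinear predictors; data and settings are illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)     # strongly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2 * x1 + 0.5 * x3 + rng.normal(size=n)

# Flag predictor pairs beyond the 'folklore' threshold |r| > 0.7
r = np.corrcoef(X, rowvar=False)
pairs = [(i, j) for i in range(3) for j in range(i + 1, 3) if abs(r[i, j]) > 0.7]
print("collinear pairs:", pairs)

# Compare ordinary least squares with a ridge (penalised) fit
print("OLS coefficients:  ", LinearRegression().fit(X, y).coef_)
print("Ridge coefficients:", Ridge(alpha=10.0).fit(X, y).coef_)
```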

6,199 citations

Journal ArticleDOI
TL;DR: The Discriminant Analysis of Principal Components (DAPC), a multivariate method designed to identify and describe clusters of genetically related individuals, is introduced and shown to perform generally better than STRUCTURE at characterizing population subdivision.
Abstract: The dramatic progress in sequencing technologies offers unprecedented prospects for deciphering the organization of natural populations in space and time. However, the size of the datasets generated also poses some daunting challenges. In particular, Bayesian clustering algorithms based on pre-defined population genetics models, such as the STRUCTURE or BAPS software, may not be able to cope with this unprecedented amount of data. Thus, there is a need for less computer-intensive approaches. Multivariate analyses seem particularly appealing as they are specifically devoted to extracting information from large datasets. Unfortunately, currently available multivariate methods still lack some essential features needed to study the genetic structure of natural populations. We introduce the Discriminant Analysis of Principal Components (DAPC), a multivariate method designed to identify and describe clusters of genetically related individuals. When group priors are lacking, DAPC uses sequential K-means and model selection to infer genetic clusters. Our approach allows rich information to be extracted from genetic data, providing assignment of individuals to groups, a visual assessment of between-population differentiation, and the contribution of individual alleles to population structuring. We evaluate the performance of our method using simulated data, which were also analyzed using STRUCTURE as a benchmark. Additionally, we illustrate the method by analyzing microsatellite polymorphism in worldwide human populations and hemagglutinin gene sequence variation in seasonal influenza. Analysis of simulated data revealed that our approach performs generally better than STRUCTURE at characterizing population subdivision. The tools implemented in DAPC for the identification of clusters and graphical representation of between-group structures make it possible to unravel complex population structures. Our approach is also faster than Bayesian clustering algorithms by several orders of magnitude, and may be applicable to a wider range of datasets.
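A rough sketch of the pipeline the abstract describes (dimension reduction by PCA, K-means to infer groups when priors are lacking, then discriminant analysis on the retained components), using scikit-learn as a stand-in; this is not the authors' software, and the allele data and all settings below are simulated assumptions.

```python
# Rough sketch of the DAPC idea: PCA, then K-means for group inference,
# then discriminant analysis on the retained principal components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Simulated allele-dosage matrix: 3 populations x 50 individuals x 100 loci
centers = rng.uniform(0.1, 0.9, size=(3, 100))
X = np.vstack([rng.binomial(2, c, size=(50, 100)) for c in centers]) / 2.0

# Step 1: reduce dimensionality with PCA (DAPC retains a chosen number of PCs)
pcs = PCA(n_components=10).fit_transform(X)

# Step 2: infer group labels with K-means when no group priors are given
groups = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(pcs)

# Step 3: discriminant analysis on the retained PCs to describe the clusters
lda = LinearDiscriminantAnalysis(n_components=2).fit(pcs, groups)
print("discriminant coordinates shape:", lda.transform(pcs).shape)
```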

3,770 citations

Proceedings ArticleDOI
25 Jun 2006
TL;DR: A family of probabilistic time series models is developed to analyze the time evolution of topics in large document collections, and dynamic topic models provide a qualitative window into the contents of a large document collection.
Abstract: A family of probabilistic time series models is developed to analyze the time evolution of topics in large document collections. The approach is to use state space models on the natural parameters of the multinomial distributions that represent the topics. Variational approximations based on Kalman filters and nonparametric wavelet regression are developed to carry out approximate posterior inference over the latent topics. In addition to giving quantitative, predictive models of a sequential corpus, dynamic topic models provide a qualitative window into the contents of a large document collection. The models are demonstrated by analyzing the OCR'ed archives of the journal Science from 1880 through 2000.
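The toy sketch below illustrates only the generative idea in this abstract, a Gaussian random walk on the natural parameters of a topic's multinomial, mapped through a softmax at each time slice; the variational Kalman-filter and wavelet-regression inference described in the paper is not attempted, and all sizes are invented.

```python
# Toy illustration of the dynamic-topic generative idea: the natural parameters
# of one topic's word distribution drift as a Gaussian random walk over time.
import numpy as np

def softmax(beta):
    e = np.exp(beta - beta.max())
    return e / e.sum()

rng = np.random.default_rng(3)
vocab_size, num_slices, sigma = 8, 5, 0.5

beta = rng.normal(size=vocab_size)            # topic at time 0 (natural params)
for t in range(num_slices):
    word_dist = softmax(beta)                 # multinomial over the vocabulary
    print(f"t={t}, most probable word id = {word_dist.argmax()}")
    beta = beta + sigma * rng.normal(size=vocab_size)   # drift to next slice
```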

2,410 citations

References
Book
01 Jan 1932
TL;DR: This detailed study of the different rates of growth of parts of the body relative to the body as a whole represents Sir Julian Huxley's great contribution to analytical morphology, and it is still a basis for modern investigations in morphometrics and evolutionary biology.
Abstract: This detailed study of the different rates of growth of parts of the body relative to the body as a whole represents Sir Julian Huxley's great contribution to analytical morphology, and it is still a basis for modern investigations in morphometrics and evolutionary biology. Huxley was the first to put the concept of relative growth - or allometry - upon a firm mathematical foundation, and since publication of this book in 1932, his work has been found to have greater implications than even he imagined. Problems of Relative Growth is at once a formulation of the basic principles of allometry and a survey of its many and various occurrences and applications. Examples are taken from such widely divergent areas as the development of the large claw in male fiddler-crabs, the size and number of points of deer antlers, heterogony in neuter social insects, the disproportionate growth of the human head from infancy to adulthood, and the formation of spiral shapes in certain mollusk shells and of the curved shape of the rhinoceros' horn. Starting from the fact of obvious disharmonic growth, Huxley formulates his first and fundamental law - that of the Constant Differential Growth Ratio. He then demonstrates that the distribution of growth potential occurs in an orderly and systematic way - that there are growth-gradients culminating in growth-centers. Other topics treated include multiplicative and accretionary kinds of growth, the role of hormones and mutations, and the relevance of the entire investigation to the problems of orthogenesis, recapitulation, vestigial organs, the existence of nonadaptive characters, physiological genetics, comparative physiology, and systematics. In their introduction to this unabridged facsimile republication of the original 1932 edition, Frederick B. Churchill and Richard E. Strauss place Huxley's work in the context of modern research in history and biology.
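Huxley's law of the constant differential growth ratio is usually expressed as the allometric equation y = b * x^k, which is linear on log scales. The short sketch below estimates the exponent k by log-log least squares; the measurements are made up for illustration.

```python
# The allometric equation y = b * x**k becomes linear on log scales:
# log y = log b + k * log x, so k can be estimated by least squares.
import numpy as np

rng = np.random.default_rng(4)
body = np.exp(rng.uniform(0, 3, size=50))                    # hypothetical body sizes
claw = 0.2 * body**1.6 * np.exp(0.1 * rng.normal(size=50))   # disproportionately growing part

k, log_b = np.polyfit(np.log(body), np.log(claw), 1)
print(f"estimated allometric exponent k = {k:.2f}, b = {np.exp(log_b):.2f}")
```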

2,198 citations

Book
01 Jan 1981
TL;DR: In this book, the original mixture problem is described and designs and models for exploring the entire simplex factor space are presented, along with supporting material on matrix algebra, least squares, and the analysis of variance.
Abstract: Preface to the Third Edition. Preface to the Second Edition. Introduction. The Original Mixture Problem: Designs and Models for Exploring the Entire Simplex Factor Space. The Use of Independent Variables. Multiple Constraints on the Component Proportions. The Analysis of Mixture Data. Other Mixture Model Forms. The Inclusion of Process Variables in Mixture Experiments. Additional Topics. Matrix Algebra, Least Squares, and the Analysis of Variance. Data Sets from Mixture Experiments with Partial Solutions. Bibliography and Index of Authors. Answers to Selected Questions. Appendix. Index.
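The chapters above concern designs and models for mixtures whose component proportions sum to one. As a minimal sketch, assuming the standard Scheffé quadratic form (linear blending terms plus pairwise interactions, no intercept) rather than anything specific from the book, here is a least-squares fit to simulated three-component blends.

```python
# Sketch: fitting a Scheffe-type quadratic mixture model by least squares on
# simulated three-component blends (proportions sum to 1). Illustrative only.
import numpy as np

rng = np.random.default_rng(5)
# Random points in the simplex x1 + x2 + x3 = 1
X = rng.dirichlet(alpha=[1, 1, 1], size=60)
x1, x2, x3 = X.T
y = 3 * x1 + 5 * x2 + 2 * x3 + 8 * x1 * x2 + rng.normal(scale=0.2, size=60)

# Design matrix: linear blending terms plus pairwise interactions, no intercept
D = np.column_stack([x1, x2, x3, x1 * x2, x1 * x3, x2 * x3])
coef, *_ = np.linalg.lstsq(D, y, rcond=None)
print("estimated Scheffe coefficients:", np.round(coef, 2))
```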

964 citations

Journal ArticleDOI

479 citations

01 Jan 1979
TL;DR: In this paper, Mosimann et al. presented new statistical methods for the study of size and shape, and used these methods to study the morphology of red-winged blackbirds, Agelaius phoeniceus breeding in Florida.
Abstract: Allometry, the association of size and shape in populations of organisms, is the subject of an extensive literature (Reeve and Huxley, 1945; Gould, 1966, 1975; Spielman, 1973; Sprent, 1972; Thorpe, 1976). In this paper we present new statistical methods for the study of size and shape, and use these methods to study the morphology of red-winged blackbirds, Agelaius phoeniceus, breeding in Florida. We use definitions of size and shape variables that permit the study of the entire statistical distribution of a given variable. We offer these methods as an alternative to classical methods based on the allometric equation, in which relations are summarized in single coefficients reflecting at best the mean trend of shape with some size variables. Our alternative goal is to determine visually and geometrically meaningful size and shape variables which permit the direct study of the association of size and shape. This paper is in two parts. In the first part we discuss definitions of size and shape variables and summarize statistical results which include exact tests for size-shape associations (Mosimann, 1970, 1975a, 1975b). In the second part we use these methods to study geographic variation in red-winged blackbirds in Florida. Throughout we consider the following situation: N individuals are independently sampled from some population. For each individual there are k positive measurements, all in the same units, say millimeters. For the i-th individual we then
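The abstract above is truncated in this extract. As a rough illustration of Mosimann-style size and shape variables, the sketch below uses the geometric mean as the size variable and the measurement vector divided by that size as the shape vector; this particular choice and the measurements are assumptions for illustration, not taken from the paper.

```python
# Sketch of Mosimann-style size and shape variables: size as the geometric mean
# of the k measurements, shape as the measurement vector divided by that size.
# The "blackbird" measurements below are hypothetical.
import numpy as np

rng = np.random.default_rng(6)
# Hypothetical wing, tail, tarsus and bill lengths (mm) for 5 birds
measurements = rng.uniform(20, 120, size=(5, 4))

size = np.exp(np.log(measurements).mean(axis=1))   # geometric-mean size variable
shape = measurements / size[:, None]               # dimensionless shape vectors
print("size:", np.round(size, 1))
print("first shape vector:", np.round(shape[0], 2))
```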

395 citations

Journal ArticleDOI
TL;DR: A series of approximate tests of significance is given for hypotheses about the precision constant κ and the polar vector of the spherical density exp(κ cos θ′), where θ′ is the angle between the polar and observation vectors.
Abstract: Summary. The probability density on a sphere, exp(κ cos θ′), where θ′ is the angle between the polar and observation vectors, has recently been suggested by Fisher for the analysis of palaeomagnetic data. In this paper, a series of approximate tests of significance is given for hypotheses about the precision constant κ and the polar vector.
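For reference, the density quoted in the abstract can be written out with a normalising constant; the abstract itself gives only the exponential kernel, and the constant shown here is the usual one for the Fisher distribution on the unit sphere (per unit surface area).

```latex
% Fisher density on the unit sphere, per unit surface area; the normalising
% constant kappa / (4 pi sinh kappa) is standard but not quoted in the abstract.
\[
  f(\mathbf{x} \mid \boldsymbol{\mu}, \kappa)
    = \frac{\kappa}{4\pi \sinh \kappa}\,
      \exp\!\bigl(\kappa \cos\theta'\bigr),
  \qquad \cos\theta' = \boldsymbol{\mu}^{\top}\mathbf{x},\; \kappa > 0 .
\]
```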

256 citations