Author

Peter Filzmoser

Other affiliations: University of Vienna
Bio: Peter Filzmoser is an academic researcher at Vienna University of Technology. The author has contributed to research on the topics Outlier and Estimator. The author has an h-index of 60 and has co-authored 345 publications receiving 13,760 citations. Previous affiliations of Peter Filzmoser include the University of Vienna.


Papers
Book
17 Feb 2009
TL;DR: Introduction: Chemoinformatics-Chemometrics-Statistics; This Book; Historical Remarks about Chemometrics; Bibliography; Starting Examples; Univariate Statistics-A Reminder. Multivariate Data: Definitions; Basic Preprocessing; Covariance and Correlation; Distances and Similarities; Multivariate Outlier Identification; Linear Latent Variables.
Abstract: Introduction: Chemoinformatics-Chemometrics-Statistics; This Book; Historical Remarks about Chemometrics; Bibliography; Starting Examples; Univariate Statistics-A Reminder. Multivariate Data: Definitions; Basic Preprocessing; Covariance and Correlation; Distances and Similarities; Multivariate Outlier Identification; Linear Latent Variables; Summary. Principal Component Analysis (PCA): Concepts; Number of PCA Components; Centering and Scaling; Outliers and Data Distribution; Robust PCA; Algorithms for PCA; Evaluation and Diagnostics; Complementary Methods for Exploratory Data Analysis; Examples; Summary. Calibration: Concepts; Performance of Regression Models; Ordinary Least Squares Regression; Robust Regression; Variable Selection; Principal Component Regression; Partial Least Squares Regression; Related Methods; Examples; Summary. Classification: Concepts; Linear Classification Methods; Kernel and Prototype Methods; Classification Trees; Artificial Neural Networks; Support Vector Machine; Evaluation; Examples; Summary. Cluster Analysis: Concepts; Distance and Similarity Measures; Partitioning Methods; Hierarchical Clustering Methods; Fuzzy Clustering; Model-Based Clustering; Cluster Validity and Clustering Tendency Measures; Examples; Summary. Preprocessing: Concepts; Smoothing and Differentiation; Multiplicative Signal Correction; Mass Spectral Features. Appendix 1: Symbols and Abbreviations. Appendix 2: Matrix Algebra. Appendix 3: Introduction to R. Index. References appear at the end of each chapter.

1,003 citations

Journal ArticleDOI
TL;DR: There is no good reason to continue to use the [mean ± 2 sdev] rule, originally proposed as a 'filter' to identify approximately 2½% of the data at each extreme for further inspection at a time when computers to do the drudgery of numerical operations were not widely available and no other practical methods existed.

686 citations
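
The [mean ± 2 sdev] rule in the entry above can be illustrated with a minimal sketch. The data values are hypothetical, and the comparison rule, [median ± 2 MAD], is one commonly used robust alternative chosen here for illustration; the entry above does not name it. The helper names `mean_sd_limits`, `median_mad_limits` and `flag_outside` are likewise assumptions of this sketch.

```python
import statistics


def mean_sd_limits(values, k=2.0):
    """Classical limits: mean ± k * sample standard deviation."""
    centre = statistics.mean(values)
    spread = statistics.stdev(values)
    return centre - k * spread, centre + k * spread


def median_mad_limits(values, k=2.0):
    """Robust limits: median ± k * MAD.

    The MAD is multiplied by 1.4826 so that it estimates the standard
    deviation when the data are normally distributed.
    """
    centre = statistics.median(values)
    deviations = [abs(v - centre) for v in values]
    spread = 1.4826 * statistics.median(deviations)
    return centre - k * spread, centre + k * spread


def flag_outside(values, limits):
    """Return the values that fall outside the (lower, upper) limits."""
    lower, upper = limits
    return [v for v in values if v < lower or v > upper]


# Hypothetical measurements: ten background values and three extreme ones.
values = [10, 11, 12, 12, 13, 13, 14, 14, 15, 16, 250, 260, 270]

classical_flags = flag_outside(values, mean_sd_limits(values))
robust_flags = flag_outside(values, median_mad_limits(values))

print("mean ± 2 sdev flags: ", classical_flags)   # []
print("median ± 2 MAD flags:", robust_flags)      # [250, 260, 270]
```

In this made-up sample the three extreme values inflate both the mean and the standard deviation so much that the classical limits enclose them, while the median-based limits do not.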

Journal ArticleDOI
TL;DR: The use of cluster analysis as an exploratory data analysis tool requires a powerful program system to test different data preparation, processing and clustering methods, including the ability to present the results in a number of easy to grasp graphics.

676 citations

Journal ArticleDOI
TL;DR: In this paper, all variables of several large data sets from regional geochemical and environmental surveys were tested for a normal or lognormal data distribution, and almost all variables (up to more than 50 analysed chemical elements per data set) showed neither a normal nor a lognormal data distribution.
Abstract: All variables of several large data sets from regional geochemical and environmental surveys were tested for a normal or lognormal data distribution. As a general rule, almost all variables (up to more than 50 analysed chemical elements per data set) show neither a normal nor a lognormal data distribution. Even when different transformation methods are used, more than 70% of all variables in every single data set do not approach a normal distribution. Distributions are usually skewed, have outliers and originate from more than one process. When dealing with regional geochemical or environmental data, normal and/or lognormal distributions are the exception and not the rule. This observation has serious consequences for the further statistical treatment of geochemical and environmental data. The most widely used statistical methods are all based on the assumption that the studied data show a normal or lognormal distribution. Neglecting that geochemical and environmental data show neither a normal nor a lognormal distribution will lead to biased or faulty results when such techniques are used.

573 citations

01 Jan 2008
TL;DR: This book covers statistical methods to identify extreme values and data outliers, their detection in the ECDF- or CP-plot, subset creation as a tool in graphical data analysis, and some common mistakes in geochemical mapping.
Abstract: Preface. Acknowledgements. About the Authors. 1. Introduction. 1.1 The Kola Ecogeochemistry Project. 2. Preparing the Data for Use in R and DAS+R. 2.1 Required data format for import into R and DAS+R. 2.2 The detection limit problem. 2.3 Missing Values. 2.4 Some "typical" problems encountered when editing a laboratory data report file to a DAS+R file. 2.5 Appending and linking data files. 2.6 Requirements for a geochemical database. 2.7 Summary. 3. Graphics to Display the Data Distribution. 3.1 The one-dimensional scatterplot. 3.2 The histogram. 3.3 The density trace. 3.4 Plots of the distribution function. 3.5 Boxplots. 3.6 Combination of histogram, density trace, one-dimensional scatterplot, boxplot, and ECDF-plot. 3.7 Combination of histogram, boxplot or box-and-whisker plot, ECDF-plot, and CP-plot. 3.8 Summary. 4. Statistical Distribution Measures. 4.1 Central value. 4.2 Measures of spread. 4.3 Quartiles, quantiles and percentiles. 4.4 Skewness. 4.5 Kurtosis. 4.6 Summary table of statistical distribution measures. 4.7 Summary. 5. Mapping Spatial Data. 5.1 Map coordinate systems (map projection). 5.2 Map scale. 5.3 Choice of the base map for geochemical mapping. 5.4 Mapping geochemical data with proportional dots. 5.5 Mapping geochemical data using classes. 5.6 Surface maps constructed with smoothing techniques. 5.7 Surface maps constructed with kriging. 5.8 Colour maps. 5.9 Some common mistakes in geochemical mapping. 5.10 Summary. 6. Further Graphics for Exploratory Data Analysis. 6.1 Scatterplots (xy-plots). 6.2 Linear regression lines. 6.3 Time trends. 6.4 Spatial trends. 6.5 Spatial distance plot. 6.6 Spiderplots (normalized multi-element diagrams). 6.7 Scatterplot matrix. 6.8 Ternary plots. 6.9 Summary. 7. Defining Background and Threshold, Identification of Data Outliers and Element Sources. 7.1 Statistical methods to identify extreme values and data outliers. 7.2 Detecting outliers and extreme values in the ECDF- or CP-plot. 7.3 Including the spatial distribution in the definition of background. 7.4 Methods to distinguish geogenic from anthropogenic element sources. 7.5 Summary. 8. Comparing Data in Tables and Graphics. 8.1 Comparing data in tables. 8.2 Graphical comparison of the data distributions of several data sets. 8.3 Comparing the spatial data structure. 8.4 Subset creation - a mighty tool in graphical data analysis. 8.5 Data subsets in scatterplots. 8.6 Data subsets in time and spatial trend diagrams. 8.7 Data subsets in ternary plots. 8.8 Data subsets in the scatterplot matrix. 8.9 Data subsets in maps. 8.10 Summary. 9. Comparing Data Using Statistical Tests. 9.1 Tests for distribution (Kolmogorov-Smirnov and Shapiro-Wilk tests). 9.2 The one-sample t-test (test for the central value). 9.3 Wilcoxon signed-rank test. 9.4 Comparing two central values of the distributions of independent data groups. 9.5 Comparing two central values of matched pairs of data. 9.6 Comparing the variance of two test. 9.7 Comparing several central values. 9.8 Comparing the variance of several data groups. 9.9 Comparing several central values of dependent groups. 9.10 Summary. 10. Improving Data Behaviour for Statistical Analysis: Ranking and Transformations. 10.1 Ranking/sorting. 10.2 Non-linear transformations. 10.3 Linear transformations. 10.4 Preparing a data set for multivariate data analysis. 10.5 Transformations for closed number systems. 10.6 Summary. 11. Correlation. 11.1 Pearson correlation. 11.2 Spearman rank correlation. 11.3 Kendall-tau correlation. 11.4 Robust correlation coefficients. 11.5 When is a correlation coefficient significant? 11.6 Working with many variables. 11.7 Correlation analysis and inhomogeneous data. 11.8 Correlation results following additive logratio or central logratio transformations. 11.9 Summary. 12. Multivariate Graphics. 12.1 Profiles. 12.2 Stars. 12.3 Segments. 12.4 Boxes. 12.5 Castles and trees. 12.6 Parallel coordinates plot. 12.7 Summary. 13. Multivariate Outlier Detection. 13.1 Univariate versus multivariate outlier detection. 13.2 Robust versus non-robust outlier detection. 13.3 The chi-square plot. 13.4 Automated multivariate outlier detection and visualization. 13.5 Other graphical approaches for identifying outliers and groups. 13.6 Summary. 14. Principal Component Analysis (PCA) and Factor Analysis (FA). 14.1 Conditioning the data for PCA and FA. 14.2 Principal component analysis (PCA). 14.3 Factor Analysis. 14.4 Summary. 15. Cluster Analysis. 15.1 Possible data problems in the context of cluster analysis. 15.2 Distance measures. 15.3 Clustering samples. 15.4 Clustering variables. 15.5 Evaluation of cluster validity. 15.6 Selection of variables for cluster analysis. 15.7 Summary. 16. Regression Analysis (RA). 16.1 Data requirements for regression analysis. 16.2 Multiple regression. 16.3 Classical least squares (LS) regression. 16.4 Robust regression. 16.5 Model selection in regression analysis. 16.6 Other regression methods. 16.7 Summary. 17. Discriminant Analysis (DA) and Other Knowledge-Based Classification Methods. 17.1 Methods for discriminant analysis. 17.2 Data requirements for discriminant analysis. 17.3 Visualisation of the discriminant function. 17.4 Prediction with discriminant analysis. 17.5 Exploring for similar data structures. 17.6 Other knowledge-based classification methods. 17.7 Summary. 18. Quality Control (QC). 18.1 Randomised samples. 18.2 Trueness. 18.3 Accuracy. 18.4 Precision. 18.5 Analysis of variance (ANOVA). 18.6 Using Maps to assess data quality. 18.7 Variables analysed by two different analytical techniques. 18.8 Working with censored data - a practical example. 18.9 Summary. 19. Introduction to R and Structure of the DAS+R Graphical User Interface. 19.1 R. 19.2 R-scripts. 19.3 A brief overview of relevant R commands. 19.4 DAS+R. 19.5 Summary. References. Index.

506 citations


Cited by
Journal ArticleDOI
TL;DR: mice adds new functionality for imputing multilevel data, automatic predictor selection, data handling, post-processing imputed values, specialized pooling routines, model selection tools, and diagnostic graphs.
Abstract: The R package mice imputes incomplete multivariate data by chained equations. The software mice 1.0 appeared in the year 2000 as an S-PLUS library, and in 2001 as an R package. mice 1.0 introduced predictor selection, passive imputation and automatic pooling. This article documents mice, which extends the functionality of mice 1.0 in several ways. In mice, the analysis of imputed data is made completely general, whereas the range of models under which pooling works is substantially extended. mice adds new functionality for imputing multilevel data, automatic predictor selection, data handling, post-processing imputed values, specialized pooling routines, model selection tools, and diagnostic graphs. Imputation of categorical data is improved in order to bypass problems caused by perfect prediction. Special attention is paid to transformations, sum scores, indices and interactions using passive imputation, and to the proper setup of the predictor matrix. mice can be downloaded from the Comprehensive R Archive Network. This article provides a hands-on, stepwise approach to solve applied incomplete data problems.

10,234 citations
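
The idea of imputation by chained equations mentioned in the entry above can be sketched in a few lines. This is not the mice package or its API: it is a deterministic, single-imputation toy for two variables, with plain least-squares fits and made-up data, whereas mice works in R, draws imputations with random variation and produces multiple imputed data sets. The function names `fit_line` and `chained_imputation` are assumptions of this sketch.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def chained_imputation(x, y, n_iter=20):
    """Fill None entries in x and y by cycling through one regression per variable."""
    missing_x = {i for i, v in enumerate(x) if v is None}
    missing_y = {i for i, v in enumerate(y) if v is None}

    # Step 0: start from a crude fill, the mean of the observed values.
    mean_x = sum(v for v in x if v is not None) / (len(x) - len(missing_x))
    mean_y = sum(v for v in y if v is not None) / (len(y) - len(missing_y))
    x_fill = [mean_x if v is None else float(v) for v in x]
    y_fill = [mean_y if v is None else float(v) for v in y]

    for _ in range(n_iter):
        # Model y given x on rows where y was observed, then refresh missing y.
        rows = [i for i in range(len(y)) if i not in missing_y]
        a, b = fit_line([x_fill[i] for i in rows], [y_fill[i] for i in rows])
        for i in missing_y:
            y_fill[i] = a + b * x_fill[i]

        # Model x given y on rows where x was observed, then refresh missing x.
        rows = [i for i in range(len(x)) if i not in missing_x]
        a, b = fit_line([y_fill[i] for i in rows], [x_fill[i] for i in rows])
        for i in missing_x:
            x_fill[i] = a + b * y_fill[i]

    return x_fill, y_fill


# Hypothetical data lying on y = 2x + 1, with one value missing in each variable.
x = [1, 2, 3, 4, None, 6]
y = [3, 5, None, 9, 11, 13]

x_filled, y_filled = chained_imputation(x, y)
print("imputed x[4]:", round(x_filled[4], 3))
print("imputed y[2]:", round(y_filled[2], 3))
```

Each pass uses the current fill of the other variable as a predictor, which is why the procedure is iterated rather than run once. Because the toy data are exactly linear, the imputed values should settle close to the values that were removed (5 and 7).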

Journal ArticleDOI

6,278 citations

01 Jan 2016
TL;DR: Modern Applied Statistics with S is available in a digital library with public online access.

5,249 citations