
Showing papers on "Outlier published in 2011"


Journal ArticleDOI
TL;DR: This paper presents a unified framework for the rigid and nonrigid point set registration problem in the presence of significant amounts of noise and outliers, and shows that the popular iterative closest point (ICP) method and several existing point set registration methods are closely related and can be reinterpreted meaningfully in this general framework.
Abstract: In this paper, we present a unified framework for the rigid and nonrigid point set registration problem in the presence of significant amounts of noise and outliers. The key idea of this registration framework is to represent the input point sets using Gaussian mixture models. Then, the problem of point set registration is reformulated as the problem of aligning two Gaussian mixtures such that a statistical discrepancy measure between the two corresponding mixtures is minimized. We show that the popular iterative closest point (ICP) method and several existing point set registration methods in the field are closely related and can be reinterpreted meaningfully in our general framework. Our instantiation of this general framework is based on the L2 distance between two Gaussian mixtures, which has a closed-form expression and in turn leads to a computationally efficient registration algorithm. The resulting registration algorithm exhibits inherent statistical robustness, has an intuitive interpretation, and is simple to implement. We also provide theoretical and experimental comparisons with other robust methods for point set registration.
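
The closed-form L2 distance exploited here follows from the Gaussian product integral, ∫N(x; m1, S1)N(x; m2, S2)dx = N(m1 - m2; 0, S1 + S2). A minimal numpy/scipy sketch of that computation (illustrative only, not the authors' code; a registration algorithm would minimize this quantity over transformation parameters applied to one mixture):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_cross_term(mus1, covs1, w1, mus2, covs2, w2):
    """Integral of the product of two Gaussian mixtures, via the identity
    int N(x; m1, S1) N(x; m2, S2) dx = N(m1 - m2; 0, S1 + S2)."""
    total = 0.0
    for m1, S1, a in zip(mus1, covs1, w1):
        for m2, S2, b in zip(mus2, covs2, w2):
            total += a * b * multivariate_normal.pdf(
                m1 - m2, mean=np.zeros(len(m1)), cov=S1 + S2)
    return total

def gmm_l2_distance_sq(mus1, covs1, w1, mus2, covs2, w2):
    # ||f - g||_2^2 = <f,f> - 2<f,g> + <g,g>, each term in closed form
    return (gmm_cross_term(mus1, covs1, w1, mus1, covs1, w1)
            - 2.0 * gmm_cross_term(mus1, covs1, w1, mus2, covs2, w2)
            + gmm_cross_term(mus2, covs2, w2, mus2, covs2, w2))
```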

909 citations


Journal ArticleDOI
TL;DR: An overview of several robust methods and outlier detection tools for univariate, low‐dimensional, and high‐dimensional data such as estimation of location and scatter, linear regression, principal component analysis, and classification are presented.
Abstract: When analyzing data, outlying observations cause problems because they may strongly influence the result. Robust statistics aims at detecting the outliers by searching for the model fitted by the majority of the data. We present an overview of several robust methods and outlier detection tools. We discuss robust procedures for univariate, low-dimensional, and high-dimensional data such as estimation of location and scatter, linear regression, principal component analysis, and classification. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011, 1, 73-79. DOI: 10.1002/widm.2
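
The simplest instance of the robust location/scatter idea surveyed here is replacing mean/SD outlier scores with median/MAD scores, so the score itself is not dragged toward the outliers it should flag. A minimal sketch (not from the paper):

```python
import numpy as np

def robust_zscores(x):
    """Robust z-scores: median for location, MAD for scale."""
    med = np.median(x)
    mad = 1.4826 * np.median(np.abs(x - med))  # consistent with SD at the normal
    return (x - med) / mad

x = np.array([9.8, 10.1, 9.9, 10.0, 10.2, 25.0])
print(np.abs(robust_zscores(x)) > 3.5)  # flags only the 25.0
```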

533 citations


Journal ArticleDOI
TL;DR: It is important that outlier loci are interpreted cautiously and error rates of various methods are taken into consideration in studies of adaptive molecular variation, especially when hierarchical structure is included.
Abstract: Genome scans with many genetic markers provide the opportunity to investigate local adaptation in natural populations and identify candidate genes under selection. In particular, SNPs are dense throughout the genome of most organisms and are commonly observed in functional genes, making them ideal markers to study adaptive molecular variation. This approach has become commonly employed in ecological and population genetics studies to detect outlier loci that are putatively under selection. However, there are several challenges to address with outlier approaches, including genotyping errors, underlying population structure and false positives, variation in mutation rate, and limited sensitivity (false negatives). In this study, we evaluated multiple outlier tests and their type I (false positive) and type II (false negative) error rates in a series of simulated data sets. Comparisons included simulation procedures (FDIST2, ARLEQUIN v.3.5 and BAYESCAN) as well as more conventional tools such as global F(ST) histograms. Of the three simulation methods, FDIST2 and BAYESCAN typically had the lowest type II error, BAYESCAN had the least type I error, and Arlequin had the highest type I and II error. High error rates in Arlequin with a hierarchical approach were partially because of confounding scenarios where patterns of adaptive variation were contrary to neutral structure; however, Arlequin consistently had the highest type I and type II error in all four simulation scenarios tested in this study. Given these results, it is important that outlier loci are interpreted cautiously and that the error rates of the various methods are taken into consideration in studies of adaptive molecular variation, especially when hierarchical structure is included.

459 citations


Journal ArticleDOI
TL;DR: Numerical results demonstrate that the proposed method can outperform robust rotational-invariant PCAs based on the L1 norm when outliers occur; it requires no assumption that the data have zero mean and can estimate the data mean during optimization.
Abstract: Principal component analysis (PCA) minimizes the mean square error (MSE) and is sensitive to outliers. In this paper, we present a new rotational-invariant PCA based on the maximum correntropy criterion (MCC). A half-quadratic optimization algorithm is adopted to compute the correntropy objective. At each iteration, the complex optimization problem is reduced to a quadratic problem that can be efficiently solved by a standard optimization method. The proposed method exhibits the following benefits: 1) it is robust to outliers through the mechanism of MCC, which is more theoretically solid than a heuristic rule based on MSE; 2) it requires no assumption that the data have zero mean and can estimate the data mean during optimization; and 3) its optimal solution consists of principal eigenvectors of a robust covariance matrix corresponding to the largest eigenvalues. In addition, kernel techniques are further introduced in the proposed method to deal with nonlinearly distributed data. Numerical results demonstrate that the proposed method can outperform robust rotational-invariant PCAs based on the L1 norm when outliers occur.
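
To convey the half-quadratic flavor of this approach, the sketch below alternates Gaussian (correntropy-induced) weights on per-sample reconstruction errors with a weighted mean/covariance eigenproblem. This is an illustrative reweighting scheme, not the paper's exact algorithm:

```python
import numpy as np

def mcc_pca(X, n_components, sigma=1.0, n_iter=20):
    """Half-quadratic-style robust PCA sketch with correntropy weights."""
    n, d = X.shape
    w = np.ones(n)
    for _ in range(n_iter):
        mu = (w[:, None] * X).sum(0) / w.sum()      # weighted mean (estimated, not assumed zero)
        Xc = X - mu
        C = (w[:, None] * Xc).T @ Xc / w.sum()      # weighted covariance
        vals, vecs = np.linalg.eigh(C)
        V = vecs[:, -n_components:]                 # eigenvectors of the largest eigenvalues
        err = ((Xc - Xc @ V @ V.T) ** 2).sum(1)     # per-sample reconstruction error
        w = np.exp(-err / (2 * sigma ** 2))         # outliers get exponentially small weight
    return mu, V
```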

327 citations


Book
22 Jun 2011
TL;DR: A monograph on minimum-distance statistical inference, covering statistical distances, robust estimation in discrete and continuous models, disparity-based hypothesis testing, inlier modification, weighted likelihood estimation, multinomial goodness-of-fit testing, and the density power divergence.
Abstract (table of contents):
- Introduction: General Notation; Illustrative Examples; Some Background and Relevant Definitions; Parametric Inference Based on the Maximum Likelihood Method; Hypothesis Testing by Likelihood Methods; Statistical Functionals and Influence Function; Outline of the Book
- Statistical Distances: Introduction; Distances Based on Distribution Functions; Density-Based Distances; Minimum Hellinger Distance Estimation: Discrete Models; Minimum Distance Estimation Based on Disparities: Discrete Models; Some Examples
- Continuous Models: Introduction; Minimum Hellinger Distance Estimation; Estimation of Multivariate Location and Covariance; A General Structure; The Basu-Lindsay Approach for Continuous Data; Examples
- Measures of Robustness and Computational Issues: The Residual Adjustment Function; The Graphical Interpretation of Robustness; The Generalized Hellinger Distance; Higher Order Influence Analysis; Higher Order Influence Analysis: Continuous Models; Asymptotic Breakdown Properties; The alpha-Influence Function; Outlier Stability of Minimum Distance Estimators; Contamination Envelopes; The Iteratively Reweighted Least Squares (IRLS)
- The Hypothesis Testing Problem: Disparity Difference Test: Hellinger Distance Case; Disparity Difference Tests in Discrete Models; Disparity Difference Tests: The Continuous Case; Power Breakdown of Disparity Difference Tests; Outlier Stability of Hypothesis Tests; The Two Sample Problem
- Techniques for Inlier Modification: Minimum Distance Estimation: Inlier Correction in Small Samples; Penalized Distances; Combined Distances; o-Combined Distances; Coupled Distances; The Inlier-Shrunk Distances; Numerical Simulations and Examples
- Weighted Likelihood Estimation: The Discrete Case; The Continuous Case; Examples; Hypothesis Testing; Further Reading
- Multinomial Goodness-of-Fit Testing: Introduction; Asymptotic Distribution of the Goodness-of-Fit Statistics; Exact Power Comparisons in Small Samples; Choosing a Disparity to Minimize the Correction Terms; Small Sample Comparisons of the Test Statistics; Inlier Modified Statistics; An Application: Kappa Statistics
- The Density Power Divergence: The Minimum L2 Distance Estimator; The Minimum Density Power Divergence Estimator; A Related Divergence Measure; The Censored Survival Data Problem; The Normal Mixture Model Problem; Selection of Tuning Parameters; Other Applications of the Density Power Divergence
- Other Applications: Censored Data; Minimum Hellinger Distance Methods in Mixture Models; Minimum Distance Estimation Based on Grouped Data; Semiparametric Problems; Other Miscellaneous Topics
- Distance Measures in Information and Engineering: Introduction; Entropies and Divergences; Csiszar's f-Divergence; The Bregman Divergence; Extended f-Divergences; Additional Remarks
- Applications to Other Models: Introduction; Preliminaries for Other Models; Neural Networks; Fuzzy Theory; Phase Retrieval; Summary

311 citations


Book ChapterDOI
28 Jul 2011
TL;DR: The R-package robCompositions (Templ et al., 2009) contains functions for robust statistical methods designed for compositional data, like principal component analysis, factor analysis, and discriminant analysis.
Abstract: Compositional data are data that contain only relative information (see, e.g., Aitchison 1986). Typical examples are data describing expenditures of persons on certain goods, or environmental data like the concentration of chemical elements in the soil. If all the compositional parts were available, they would sum up to a total, such as 100% in the case of geochemical concentrations. Frequently, practical data sets include outliers, and thus a robust analysis is desirable. The R-package robCompositions (Templ et al., 2009) contains functions for robust statistical methods designed for compositional data, like principal component analysis (Filzmoser et al., 2009a) (including the robust compositional biplot), factor analysis (Filzmoser et al., 2009b), and discriminant analysis (Filzmoser et al., 2009c). Furthermore, methods to improve the quality of compositional data sets are implemented, like outlier detection (Filzmoser et al., 2008) and imputation of missing values (Hron et al., 2010). The latter, based on a modified k-nearest neighbor algorithm and a model-based imputation, is also supported with measures of imputation quality and diagnostic plots. The usage of the package is illustrated with practical examples.
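
The standard first step behind such methods is to move compositions from the simplex to unconstrained coordinates with a log-ratio transform (Aitchison 1986), after which ordinary robust tools apply. A minimal sketch of the centered log-ratio (clr) transform (illustrative; the package itself works in R):

```python
import numpy as np

def clr(X):
    """Centered log-ratio transform: log of each part over the row's geometric mean.
    Rows of the result sum to zero; robust PCA etc. can then be applied."""
    X = np.asarray(X, dtype=float)
    g = np.exp(np.mean(np.log(X), axis=1, keepdims=True))  # geometric mean per row
    return np.log(X / g)

parts = np.array([[0.2, 0.3, 0.5],
                  [0.1, 0.1, 0.8]])
print(clr(parts))
```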

298 citations


Proceedings Article
01 Jan 2011
TL;DR: A unification of outlier scores provided by various outlier models is proposed, translating the arbitrary "outlier factors" to values in the range [0, 1] interpretable as the probability of a data object being an outlier; this unification is shown to facilitate enhanced ensembles for outlier detection.
Abstract: Outlier scores provided by different outlier models differ widely in their meaning, range, and contrast between different outlier models and, hence, are not easily comparable or interpretable. We propose a unification of outlier scores provided by various outlier models and a translation of the arbitrary "outlier factors" to values in the range [0, 1] interpretable as values describing the probability of a data object being an outlier. As an application, we show that this unification facilitates enhanced ensembles for outlier detection.
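
One normalization in this spirit treats the observed score sample as roughly Gaussian and maps each score through the error function to a [0, 1] "outlier probability", clipping the inlier half to zero. A minimal sketch under that assumption (the paper proposes several such transformations):

```python
import numpy as np
from math import erf, sqrt

def gaussian_unify(scores):
    """Map raw outlier scores to [0, 1] via Gaussian scaling."""
    s = np.asarray(scores, dtype=float)
    mu, sigma = s.mean(), s.std()
    z = (s - mu) / (sigma * sqrt(2))
    return np.array([max(0.0, erf(v)) for v in z])  # below-average scores map to 0

lof_like = [1.0, 1.1, 0.9, 1.0, 5.0]
print(gaussian_unify(lof_like))  # only the 5.0 gets a probability near 1
```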

255 citations


Journal ArticleDOI
TL;DR: In this paper, a thresholding-based iterative procedure for outlier detection (Θ-IPOD) is proposed that both identifies outliers and estimates regression coefficients, with variants based on hard and soft thresholding.
Abstract: This article studies the outlier detection problem from the standpoint of penalized regression. In the regression model, we add one mean shift parameter for each of the n data points. We then apply a regularization favoring a sparse vector of mean shift parameters. The usual L1 penalty yields a convex criterion, but fails to deliver a robust estimator. The L1 penalty corresponds to soft thresholding. We introduce a thresholding (denoted by Θ) based iterative procedure for outlier detection (Θ–IPOD). A version based on hard thresholding correctly identifies outliers on some hard test problems. We describe the connection between Θ–IPOD and M-estimators. Our proposed method has one tuning parameter with which to both identify outliers and estimate regression coefficients. A data-dependent choice can be made based on the Bayes information criterion. The tuned Θ–IPOD shows outstanding performance in identifying outliers in various situations compared with other existing approaches. In addition, Θ–IPOD is much ...
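
The core iteration is easy to state: alternate thresholding of the per-observation mean-shift parameters with least squares on the corrected response. A minimal sketch of this alternation (illustrative, not the authors' implementation; the paper tunes λ via BIC):

```python
import numpy as np

def ipod(X, y, lam, hard=True, n_iter=100):
    """Theta-IPOD-style iteration: gamma holds one mean-shift per observation."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    gamma = np.zeros_like(y)
    for _ in range(n_iter):
        r = y - X @ beta
        if hard:
            gamma = np.where(np.abs(r) > lam, r, 0.0)            # hard thresholding
        else:
            gamma = np.sign(r) * np.maximum(np.abs(r) - lam, 0)  # soft thresholding
        beta = np.linalg.lstsq(X, y - gamma, rcond=None)[0]      # refit on corrected y
    return beta, gamma  # nonzero gamma entries flag outliers
```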

250 citations


Journal ArticleDOI
TL;DR: Replacing compact subsets by measures, a notion of distance function to a probability distribution in ℝ^d is introduced and it is shown that it is possible to reconstruct offsets of sampled shapes with topological guarantees even in the presence of outliers.
Abstract: Data often comes in the form of a point cloud sampled from an unknown compact subset of Euclidean space. The general goal of geometric inference is then to recover geometric and topological features (e.g., Betti numbers, normals) of this subset from the approximating point cloud data. It appears that the study of distance functions allows one to address many of these questions successfully. However, one of the main limitations of this framework is that it does not cope well with outliers or with background noise. In this paper, we show how to extend the framework of distance functions to overcome this problem. Replacing compact subsets by measures, we introduce a notion of distance function to a probability distribution in ℝ^d. These functions share many properties with classical distance functions, which make them suitable for inference purposes. In particular, by considering appropriate level sets of these distance functions, we show that it is possible to reconstruct offsets of sampled shapes with topological guarantees even in the presence of outliers. Moreover, in settings where empirical measures are considered, these functions can be easily evaluated, making them of particular practical interest.
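
For an empirical measure, the distance to measure at a query point reduces to the root mean of squared distances to its k = ⌈mn⌉ nearest sample points, which is why it is easy to evaluate. A minimal sketch under that standard formula (not the authors' code):

```python
import numpy as np
from scipy.spatial import cKDTree

def distance_to_measure(cloud, queries, m=0.05):
    """Empirical distance-to-measure with mass parameter m: a few outliers
    cannot drag the value down, unlike the raw nearest-neighbor distance."""
    cloud, queries = np.atleast_2d(cloud), np.atleast_2d(queries)
    k = max(1, int(np.ceil(m * len(cloud))))
    dists, _ = cKDTree(cloud).query(queries, k=k)
    dists = np.asarray(dists).reshape(len(queries), k)
    return np.sqrt((dists ** 2).mean(axis=1))
```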

236 citations


Journal ArticleDOI
TL;DR: By exploiting the nonholonomic constraints of wheeled vehicles it is possible to use a restrictive motion model which allows us to parameterize the motion with only 1 point correspondence and results in the two most efficient algorithms for removing outliers: 1-point RANSAC and histogram voting.
Abstract: This paper presents a new method to estimate the relative motion of a vehicle from images of a single camera. The computational cost of the algorithm is limited only by the feature extraction and matching process, as the outlier removal and motion estimation steps take less than a fraction of a millisecond on a normal laptop computer. The biggest problem in visual motion estimation is data association; matched points contain many outliers that must be detected and removed for the motion to be accurately estimated. In the last few years, a well-established method for removing outliers has been the "5-point RANSAC" algorithm, which needs a minimum of 5 point correspondences to estimate the model hypotheses. Because of this, however, it can require up to several hundred iterations to find a set of points free of outliers. In this paper, we show that by exploiting the nonholonomic constraints of wheeled vehicles it is possible to use a restrictive motion model which allows us to parameterize the motion with only 1 point correspondence. Using a single feature correspondence for motion estimation is the lowest model parameterization possible and results in the two most efficient algorithms for removing outliers: 1-point RANSAC and histogram voting. To support our method we run many experiments on both synthetic and real data and compare the performance with a state-of-the-art approach. Finally, we show an application of our method to visual odometry by recovering a 3 km trajectory in a cluttered urban environment in real time.
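
The histogram-voting variant is especially simple: each correspondence votes for one yaw angle under the circular motion model, and the histogram mode gives the rotation while far-away voters are discarded as outliers. A sketch of that loop; the per-match angle expression below is an assumed form of the paper's circular-motion derivation and should be verified against the original before use:

```python
import numpy as np

def yaw_histogram_voting(pts1, pts2, n_bins=360, tol_deg=1.0):
    """1-point histogram voting: pts1, pts2 are (n, 2) arrays of matched
    normalized image coordinates; returns the yaw mode and an inlier mask."""
    # Assumed one-parameter estimate per correspondence (placeholder formula)
    thetas = -2.0 * np.arctan2(pts2[:, 1] - pts1[:, 1], pts2[:, 0] + pts1[:, 0])
    hist, edges = np.histogram(thetas, bins=n_bins, range=(-np.pi, np.pi))
    mode = 0.5 * (edges[np.argmax(hist)] + edges[np.argmax(hist) + 1])
    inliers = np.abs(thetas - mode) < np.deg2rad(tol_deg)  # tolerance is a free parameter
    return mode, inliers
```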

223 citations


Journal ArticleDOI
TL;DR: A comprehensive survey of well-known distance-based, density-based, and other techniques for outlier detection is presented and the techniques are compared; definitions of outliers are provided, and their detection based on supervised and unsupervised learning is discussed in the context of network anomaly detection.
Abstract: The detection of outliers has gained considerable interest in data mining with the realization that outliers can be the key discovery to be made from very large databases. Outliers arise due to various reasons such as mechanical faults, changes in system behavior, fraudulent behavior, human error and instrument error. Indeed, for many applications the discovery of outliers leads to more interesting and useful results than the discovery of inliers. Detection of outliers can lead to identification of system faults so that administrators can take preventive measures before they escalate. It is possible that anomaly detection may enable detection of new attacks. Outlier detection is an important anomaly detection approach. In this paper, we present a comprehensive survey of well-known distance-based, density-based and other techniques for outlier detection and compare them. We provide definitions of outliers and discuss their detection based on supervised and unsupervised learning in the context of network anomaly detection.

Proceedings ArticleDOI
11 Apr 2011
TL;DR: First results on the problem of structural outlier detection in massive network streams are provided, using a structural connectivity model in order to define outliers in graph streams and designing a reservoir sampling method to maintain structural summaries of the underlying network.
Abstract: A number of applications in social networks, telecommunications, and mobile computing create massive streams of graphs. In many such applications, it is useful to detect structural abnormalities which are different from the "typical" behavior of the underlying network. In this paper, we will provide first results on the problem of structural outlier detection in massive network streams. Such problems are inherently challenging because of the high volume of the underlying network stream, which also increases the computational burden on any approach. We use a structural connectivity model in order to define outliers in graph streams. In order to handle the sparsity problem of massive networks, we dynamically partition the network in order to construct statistically robust models of the connectivity behavior. We design a reservoir sampling method in order to maintain structural summaries of the underlying network. These structural summaries are designed in order to create robust, dynamic and efficient models for outlier detection in graph streams. We present experimental results illustrating the effectiveness and efficiency of our approach.
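
The reservoir-sampling building block referenced here is standard (Algorithm R): keep a uniform sample of k edges from a stream of unknown length in O(1) time per arrival. A minimal sketch; how the paper turns such samples into structural summaries is its own contribution:

```python
import random

def reservoir_edges(stream, k):
    """Uniform sample of k edges from a graph stream of unknown length."""
    reservoir = []
    for t, edge in enumerate(stream):   # edge = (u, v)
        if t < k:
            reservoir.append(edge)
        else:
            j = random.randint(0, t)    # inclusive draw in [0, t]
            if j < k:
                reservoir[j] = edge     # replace with probability k/(t+1)
    return reservoir
```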

Journal ArticleDOI
TL;DR: A new statistical approach to the problem of inlier-based outlier detection, i.e., finding outliers in the test set based on the training set consisting only of inliers, using the ratio of training and test data densities as an outlier score is proposed.
Abstract: We propose a new statistical approach to the problem of inlier-based outlier detection, i.e., finding outliers in the test set based on the training set consisting only of inliers. Our key idea is to use the ratio of training and test data densities as an outlier score. This approach is expected to have better performance even in high-dimensional problems since methods for directly estimating the density ratio without going through density estimation are available. Among various density ratio estimation methods, we employ the method called unconstrained least-squares importance fitting (uLSIF) since it is equipped with natural cross-validation procedures, allowing us to objectively optimize the value of tuning parameters such as the regularization parameter and the kernel width. Furthermore, uLSIF offers a closed-form solution as well as a closed-form formula for the leave-one-out error, so it is computationally very efficient and is scalable to massive datasets. Simulations with benchmark and real-world datasets illustrate the usefulness of the proposed approach.
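
The closed form mentioned in the abstract is θ = (H + λI)⁻¹h for a linear model of the density ratio over Gaussian kernels. A minimal numpy sketch under the usual choice of kernel centers at the training (inlier) points; the paper additionally tunes σ and λ with its closed-form leave-one-out criterion, omitted here:

```python
import numpy as np

def ulsif_outlier_scores(X_train, X_test, sigma=1.0, lam=0.1):
    """uLSIF sketch: fit w(x) ~ p_train(x)/p_test(x); a small estimated
    ratio on a test point marks it as an outlier."""
    def kernel(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    Phi_te = kernel(X_test, X_train)     # basis on test (denominator) samples
    Phi_tr = kernel(X_train, X_train)    # basis on training (numerator) samples
    H = Phi_te.T @ Phi_te / len(X_test)
    h = Phi_tr.mean(axis=0)
    theta = np.linalg.solve(H + lam * np.eye(len(X_train)), h)
    return Phi_te @ theta                # low score = likely outlier
```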

Proceedings ArticleDOI
11 Apr 2011
TL;DR: New algorithms for continuous outlier monitoring in data streams, based on sliding windows, are proposed; they reduce the required storage overhead, run faster than previously proposed techniques, and offer significant flexibility.
Abstract: Anomaly detection is considered an important data mining task, aiming at the discovery of elements (also known as outliers) that show significant diversion from the expected case. More specifically, given a set of objects the problem is to return the suspicious objects that deviate significantly from the typical behavior. As in the case of clustering, the application of different criteria leads to different definitions for an outlier. In this work, we focus on distance-based outliers: an object x is an outlier if there are fewer than k objects lying at distance at most R from x. The problem offers significant challenges when a stream-based environment is considered, where data arrive continuously and outliers must be detected on-the-fly. There are a few research works studying the problem of continuous outlier detection. However, none of these proposals meets the requirements of modern stream-based applications for the following reasons: (i) they demand a significant storage overhead, (ii) their efficiency is limited and (iii) they lack flexibility. In this work, we propose new algorithms for continuous outlier monitoring in data streams, based on sliding windows. Our techniques are able to reduce the required storage overhead, run faster than previously proposed techniques and offer significant flexibility. Experiments performed on real-life as well as synthetic data sets verify our theoretical study.
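
A naive reference implementation of the definition used here makes the baseline concrete: a point is a distance-based outlier if fewer than k other points of the current window lie within radius R. This brute-force sketch costs O(W²) per query; the paper's algorithms return the same answers incrementally with far less work and storage:

```python
import numpy as np
from collections import deque

class SlidingWindowOutliers:
    """Brute-force (R, k) distance-based outliers over a sliding window."""
    def __init__(self, W, R, k):
        self.window, self.R, self.k = deque(maxlen=W), R, k

    def insert(self, x):
        self.window.append(np.asarray(x, dtype=float))

    def outliers(self):
        pts = np.stack(self.window)
        d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        counts = (d <= self.R).sum(axis=1) - 1   # exclude the zero self-distance
        return np.where(counts < self.k)[0]      # indices of current outliers
```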

Journal ArticleDOI
TL;DR: An efficient and robust load forecasting method for prediction up to one day ahead is proposed; to deal with heteroscedasticity, a simple novel multivariate model that improves the quality of the forecast is also introduced.
Abstract: In this paper, the stochastic characteristics of the electric consumption in France are analyzed. It is shown that the load time series exhibit lasting abrupt changes in the stochastic pattern, termed breaks. The goal is to propose an efficient and robust load forecasting method for prediction up to one day ahead. To this end, two new robust procedures for outlier identification and suppression are developed. They are termed the multivariate ratio-of-medians-based estimator (RME) and the multivariate minimum-Hellinger-distance-based estimator (MHDE). The performance of the proposed methods has been evaluated on the French electric load time series in terms of execution times, ability to detect and suppress outliers, and forecasting accuracy. Their performances are compared with those of the robust methods proposed in the literature to estimate the parameters of SARIMA models and of the multiplicative double seasonal exponential smoothing. A new robust version of the latter is proposed as well. It is found that the RME approach outperforms all the other methods for "normal days" and presents several interesting properties such as good robustness, fast execution, simplicity, and easy online implementation. Finally, to deal with heteroscedasticity, we propose a simple novel multivariate modeling that improves the quality of the forecast.

Journal ArticleDOI
TL;DR: This work presents integer programming formulations of Vapnik's support vector machine (SVM) with the ramp loss and hard margin loss, and shows that SVM with these loss functions is a consistent estimator when used with certain kernel functions.
Abstract: In the interest of deriving classifiers that are robust to outlier observations, we present integer programming formulations of Vapnik's support vector machine (SVM) with the ramp loss and hard margin loss. The ramp loss allows a maximum error of 2 for each training observation, while the hard margin loss calculates error by counting the number of training observations that are in the margin or misclassified outside of the margin. SVM with these loss functions is shown to be a consistent estimator when used with certain kernel functions. In computational studies with simulated and real-world data, SVM with the robust loss functions ignores outlier observations effectively, providing an advantage over SVM with the traditional hinge loss when using the linear kernel. Despite the fact that training SVM with the robust loss functions requires the solution of a quadratic mixed-integer program (QMIP) and is NP-hard, while traditional SVM requires only the solution of a continuous quadratic program (QP), we are able to find good solutions and prove optimality for instances with up to 500 observations. Solution methods are presented for the new formulations that improve computational performance over industry-standard integer programming solvers alone.
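
To make the QMIP concrete, a sketch of the standard big-M mixed-integer form of the ramp-loss SVM (the paper's exact formulation may differ in details); setting z_i = 1 "writes off" observation i at fixed cost 2C, which is what caps any single outlier's influence:

```latex
\begin{aligned}
\min_{w,\,b,\,\xi,\,z}\quad & \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\bigl(\xi_i + 2 z_i\bigr)\\
\text{s.t.}\quad & y_i\,(w^{\top} x_i + b) \;\ge\; 1 - \xi_i - M z_i, \quad i = 1,\dots,n,\\
& 0 \le \xi_i \le 2, \qquad z_i \in \{0,1\}, \quad i = 1,\dots,n,
\end{aligned}
```

where M is a sufficiently large constant.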

Proceedings ArticleDOI
09 May 2011
TL;DR: The outlier-robust Kalman filter proposed is a discrete-time model for sequential data corrupted with non-Gaussian, heavy-tailed noise; efficient filtering and smoothing algorithms are presented which are straightforward modifications of the standard Kalman filter and Rauch-Tung-Striebel recursions and yet are much more robust to outliers and anomalous observations.
Abstract: We introduce a novel approach for processing sequential data in the presence of outliers. The outlier-robust Kalman filter we propose is a discrete-time model for sequential data corrupted with non-Gaussian and heavy-tailed noise. We present efficient filtering and smoothing algorithms which are straightforward modifications of the standard Kalman filter and Rauch-Tung-Striebel recursions and yet are much more robust to outliers and anomalous observations. Additionally, we present an algorithm for learning all of the parameters of our outlier-robust Kalman filter in a completely unsupervised manner. The potential of our approach is borne out in experiments with synthetic and real data.
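
To convey the intuition (though not the paper's learned-weight approach), a generic t-flavored robustification of the measurement update: a large Mahalanobis residual shrinks a weight, which inflates the effective measurement noise and hence damps the Kalman gain. A minimal sketch under that swapped-in scheme:

```python
import numpy as np

def robust_kf_update(x, P, y, H, R, nu=4.0):
    """One robustified measurement update; x: (n,), P: (n,n), y: (m,), H: (m,n)."""
    r = y - H @ x
    S = H @ P @ H.T + R
    d2 = float(r @ np.linalg.solve(S, r))   # squared Mahalanobis residual
    w = (nu + len(y)) / (nu + d2)           # weight < 1 for outlying residuals
    S_w = H @ P @ H.T + R / w               # inflate noise for suspect measurements
    K = P @ H.T @ np.linalg.inv(S_w)
    return x + K @ r, (np.eye(len(x)) - K @ H) @ P
```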

Journal ArticleDOI
TL;DR: The aim is to propose an automatic algorithm called IRMI for iterative model-based imputation using robust methods, accounting for the challenges mentioned, and to provide a software tool in R implementing the algorithm.

Proceedings ArticleDOI
11 Apr 2011
TL;DR: This work proposes a novel outlier ranking based on each object's deviation in a statistically selected set of relevant subspace projections, providing a selection of subspaces with high contrast to tackle the general challenge of detecting outliers hidden in subspaces of the data.
Abstract: Outlier mining is an important data analysis task to distinguish exceptional outliers from regular objects. For outlier mining in the full data space, there are well established methods which are successful in measuring the degree of deviation for outlier ranking. However, in recent applications traditional outlier mining approaches miss outliers that are hidden in subspace projections. In particular, outlier ranking approaches that measure deviation on all available attributes miss outliers deviating from their local neighborhood only in subsets of the attributes. In this work, we propose a novel outlier ranking based on each object's deviation in a statistically selected set of relevant subspace projections. This ensures that objects deviating in multiple relevant subspaces are found, while irrelevant projections showing no clear contrast between outliers and the remaining objects are excluded. Thus, we tackle the general challenge of detecting outliers hidden in subspaces of the data. We provide a selection of subspaces with high contrast and propose a novel ranking based on an adaptive degree of deviation in arbitrary subspaces. In thorough experiments on real and synthetic data we show that our approach outperforms competing outlier ranking approaches by detecting outliers in arbitrary subspace projections.

Journal ArticleDOI
TL;DR: A review of several statistical methods currently in use for outlier identification is presented, and their performances are compared theoretically for typical statistical distributions of experimental data, considering values derived from the distribution of extreme order statistics as reference terms.
Abstract: A review of several statistical methods that are currently in use for outlier identification is presented, and their performances are compared theoretically for typical statistical distributions of experimental data, considering values derived from the distribution of extreme order statistics as reference terms. A simple modification of a popular, broadly used method based upon the box-plot is introduced, in order to overcome a major limitation concerning sample size. Examples are presented applying the methods considered to two data sets: a historical one concerning the evaluation of an astronomical constant performed by a number of leading observatories, and a substantial database pertaining to an ongoing investigation on absolute measurement of gravity acceleration, exhibiting peculiar aspects concerning outliers. Some problems related to outlier treatment are examined, and the requirement of both statistical analysis and expert opinion for proper outlier management is underlined.
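
The box-plot rule that the paper modifies is Tukey's fences: flag points beyond [Q1 - k·IQR, Q3 + k·IQR] with k = 1.5. A minimal sketch of the classical rule (the paper's contribution is adapting it to the sample size, which the fixed k ignores):

```python
import numpy as np

def tukey_fences(x, k=1.5):
    """Classical box-plot outlier rule with multiplier k."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)  # True = flagged as outlier
```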

Journal ArticleDOI
TL;DR: An LS approach to generate IF-THEN rules for causal databases is proposed, considering both type-1 and interval type-2 fuzzy sets; the degree of reliability is especially valuable for finding the most reliable and representative rules.
Abstract: Linguistic summarization (LS) is a data mining or knowledge discovery approach to extract patterns from databases. Many authors have used this technique to generate summaries like "Most senior workers have high salary," which can be used to better understand and communicate about data; however, few of them have used it to generate IF-THEN rules like "IF X is large and Y is medium, THEN Z is small," which not only facilitate understanding and communication of data but can also be used in decision-making. In this paper, an LS approach to generate IF-THEN rules for causal databases is proposed. Both type-1 and interval type-2 fuzzy sets are considered. Five quality measures (the degrees of truth, sufficient coverage, reliability, outlier, and simplicity) are defined. Among them, the degree of reliability is especially valuable for finding the most reliable and representative rules, and the degree of outlier can be used to identify outlier rules and data for close-up investigation. An improved parallel coordinates approach for visualizing the IF-THEN rules is also proposed. Experiments on two datasets demonstrate our LS and rule visualization approaches. Finally, the relationships between our LS approach and the Wang-Mendel (WM) method, perceptual reasoning, and granular computing are pointed out.

Proceedings ArticleDOI
03 Oct 2011
TL;DR: A filtering method called PRISM is introduced that identifies and removes instances that should be misclassified; it achieves higher classification accuracy than the outlier detection methods and compares favorably with the noise reduction method.
Abstract: Appropriately handling noise and outliers is an important issue in data mining. In this paper we examine how noise and outliers are handled by learning algorithms. We introduce a filtering method called PRISM that identifies and removes instances that should be misclassified. We refer to the set of removed instances as ISMs (instances that should be misclassified). We examine PRISM and compare it against 3 existing outlier detection methods and 1 noise reduction technique on 48 data sets using 9 learning algorithms. Using PRISM, the classification accuracy increases from 78.5% to 79.8% on a set of 53 data sets and is statistically significant. In addition, the accuracy on the non-outlier instances increases from 82.8% to 84.7%. PRISM achieves a higher classification accuracy than the outlier detection methods and compares favorably with the noise reduction method.

Journal ArticleDOI
TL;DR: Novel fixed-lag and fixed-interval smoothing algorithms are developed that are robust to outliers simultaneously present in the measurements and in the state dynamics, relying on coordinate descent and the alternating direction method of multipliers.
Abstract: Coping with outliers contaminating dynamical processes is of major importance in various applications because mismatches from nominal models are not uncommon in practice. In this context, the present paper develops novel fixed-lag and fixed-interval smoothing algorithms that are robust to outliers simultaneously present in the measurements and in the state dynamics. Outliers are handled through auxiliary unknown variables that are jointly estimated along with the state based on the least-squares criterion that is regularized with the l1-norm of the outliers in order to effect sparsity control. The resultant iterative estimators rely on coordinate descent and the alternating direction method of multipliers, are expressed in closed form per iteration, and are provably convergent. Additional attractive features of the novel doubly robust smoother include: i) ability to handle both types of outliers; ii) universality to unknown nominal noise and outlier distributions; iii) flexibility to encompass maximum a posteriori optimal estimators with reliable performance under nominal conditions; and iv) improved performance relative to competing alternatives at comparable complexity, as corroborated via simulated tests.

Journal ArticleDOI
TL;DR: In this paper, the authors study the simultaneous recovery of the K fixed subspaces by minimizing the l_p-averaged distances of the sampled data points from any K subspaces.
Abstract: We assume i.i.d. data sampled from a mixture distribution with K components along fixed d-dimensional linear subspaces and an additional outlier component. For p>0, we study the simultaneous recovery of the K fixed subspaces by minimizing the l_p-averaged distances of the sampled data points from any K subspaces. Under some conditions, we show that if 0 < p <= 1, then the K subspaces can be recovered, or nearly recovered, whereas if K > 1 and p > 1, then the underlying subspaces cannot be recovered or even nearly recovered by l_p minimization. The results of this paper partially explain the successes and failures of the basic approach of l_p energy minimization for modeling data by multiple subspaces.

Proceedings ArticleDOI
21 Aug 2011
TL;DR: By combining simple but effective indexing and disk block accessing techniques, a sequential algorithm iOrca is developed that is up to an order of magnitude faster than the state-of-the-art.
Abstract: The problem of distance-based outlier detection is difficult to solve efficiently in very large datasets because of potential quadratic time complexity. We address this problem and develop sequential and distributed algorithms that are significantly more efficient than state-of-the-art methods while still guaranteeing the same outliers. By combining simple but effective indexing and disk block accessing techniques, we have developed a sequential algorithm iOrca that is up to an order of magnitude faster than the state-of-the-art. The indexing scheme is based on sorting the data points in order of increasing distance from a fixed reference point and then accessing those points based on this sorted order. To speed up the basic outlier detection technique, we develop two distributed algorithms (DOoR and iDOoR) for modern distributed multi-core clusters of machines, connected on a ring topology. The first algorithm passes data blocks from each machine around the ring, incrementally updating the nearest neighbors of the points passed. By maintaining a cutoff threshold, it is able to prune a large number of points in a distributed fashion. The second distributed algorithm extends this basic idea with the indexing scheme discussed earlier. In our experiments, both distributed algorithms exhibit significant improvements compared to the state-of-the-art distributed method [13].
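
The reference-point index works because |d(x, ref) - d(q, ref)| lower-bounds d(x, q) by the triangle inequality, so a neighbor search can walk the reference-sorted order outward and stop early. A sketch of that inner step (illustrative, not the authors' code; a driver would scan points in the same sorted order, keep the top-n scores, and pass their minimum as `cutoff`):

```python
import heapq
import numpy as np

def knn_score(Xs, ds, i, k, cutoff):
    """k-NN distance of point i. Xs: points sorted by distance-to-reference;
    ds: those reference distances. Returns None if the score provably
    cannot exceed cutoff (point pruned). Assumes len(Xs) > k."""
    heap = []                          # max-heap (negated) of the k best distances
    lo, hi = i - 1, i + 1
    while lo >= 0 or hi < len(Xs):
        # expand toward the side whose reference-distance gap (lower bound) is smaller
        if hi >= len(Xs) or (lo >= 0 and ds[i] - ds[lo] <= ds[hi] - ds[i]):
            j, lo = lo, lo - 1
        else:
            j, hi = hi, hi + 1
        bound = abs(ds[j] - ds[i])
        if len(heap) == k and bound >= -heap[0]:
            break                      # no remaining point can be closer
        d = np.linalg.norm(Xs[j] - Xs[i])
        if len(heap) < k:
            heapq.heappush(heap, -d)
        elif d < -heap[0]:
            heapq.heapreplace(heap, -d)
        if len(heap) == k and -heap[0] <= cutoff:
            return None                # provably not a top-n outlier
    return -heap[0]
```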

Proceedings ArticleDOI
19 Feb 2011
TL;DR: A clustering-based method to capture outliers is proposed: the K-means algorithm divides the data set into clusters, points near cluster centroids are pruned, a distance-based outlier score is computed for the remaining points, and the top n points with the highest scores are declared outliers.
Abstract: In this paper we propose a clustering-based method to capture outliers. We apply the K-means clustering algorithm to divide the data set into clusters. The points which are lying near the centroid of the cluster are not probable candidates for outliers, and we can prune such points from each cluster. Next, we calculate a distance-based outlier score for the remaining points. The computation needed to calculate the outlier score reduces considerably due to this pruning. Based on the outlier score, we declare the top n points with the highest score as outliers. The experimental results using real data sets demonstrate that, even though the number of computations is smaller, the proposed method performs better than the existing method.
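
A compact sketch of this two-phase idea (illustrative assumptions: k = 5 for the score, and a fixed fraction of centroid-near points pruned; the paper's pruning rule may differ):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_pruned_outliers(X, n_clusters=5, prune_frac=0.7, top_n=10, k=5):
    """Phase 1: prune points nearest their centroid. Phase 2: score survivors
    by k-th nearest-neighbor distance and return the top_n indices."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
    d_cent = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    keep = np.argsort(d_cent)[int(prune_frac * len(X)):]  # survivors: far from centroids
    scores = []
    for i in keep:
        d = np.linalg.norm(X - X[i], axis=1)
        scores.append(np.sort(d)[k])   # index k skips the zero self-distance
    return keep[np.argsort(scores)[-top_n:]]
```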

Proceedings Article
01 Dec 2011
TL;DR: An efficient missing value imputation technique called DMI is presented, which makes use of a decision tree and an expectation maximization (EM) algorithm, based on the observation that the correlations among attributes within a horizontal partition of a data set can be higher than the correlations over the whole data set.
Abstract: Data pre-processing plays a vital role in data mining for ensuring good quality of data. In general, data pre-processing tasks include imputation of missing values, identification of outliers, smoothing of noisy data and correction of inconsistent data. In this paper, we present an efficient missing value imputation technique called DMI, which makes use of a decision tree and an expectation maximization (EM) algorithm. We argue that the correlations among attributes within a horizontal partition of a data set can be higher than the correlations over the whole data set. For some existing algorithms such as EM-based imputation (EMI), the accuracy of imputation is expected to be better for a data set having higher correlations than for a data set having lower correlations. Therefore, our technique (DMI) applies EMI on various horizontal segments (of a data set) where correlations among attributes are high. We evaluate DMI on two publicly available natural data sets by comparing its performance with the performance of EMI. We use various patterns of missing values, each having different missing ratios of up to 10%. Several evaluation criteria such as coefficient of determination (R2), index of agreement (d2) and root mean squared error (RMSE) are used. Our initial experimental results indicate that DMI performs significantly better than EMI.

Journal ArticleDOI
TL;DR: A censoring scheme that iteratively updates the outlier/target maps for target detection in synthetic aperture radar (SAR) images is proposed, and its effectiveness is successfully demonstrated.
Abstract: In this letter, a censoring scheme that iteratively updates the outlier/target maps for target detection in synthetic aperture radar (SAR) images is proposed. For each iteration, any pixels that are indicated by the outlier map as outliers are rejected (censored out) from the clutter estimation. The resulting detected target map is then used as the new outlier map for the next iteration. This procedure is continued until there is no change to the target map, which is then output as the final detection result. The proposed scheme is generically applicable for target detection in both single-channel and multichannel SAR images. In our experiment, in particular, we tested the proposed method on both single-channel and polarimetric SAR data, and its effectiveness was successfully demonstrated.

Journal ArticleDOI
TL;DR: A generative model-based Bayesian hierarchical clustering algorithm for microarray time series that employs Gaussian process regression to capture the structure of the data and highlights the importance of including replicate information, which is found to enable the discrimination of additional distinct expression profiles.
Abstract: Background: Post-genomic molecular biology has resulted in an explosion of data, providing measurements for large numbers of genes, proteins and metabolites. Time series experiments have become increasingly common, necessitating the development of novel analysis tools that capture the resulting data structure. Outlier measurements at one or more time points present a significant challenge, while potentially valuable replicate information is often ignored by existing techniques.

Journal ArticleDOI
TL;DR: In this article, robust estimators for principal components are considered by adapting the projection pursuit approach to the functional data setting, which combines robust projection-pursuit with different smoothing methods.
Abstract: In many situations, data are recorded over a period of time and may be regarded as realizations of a stochastic process. In this paper, robust estimators for the principal components are considered by adapting the projection-pursuit approach to the functional data setting. Our approach combines robust projection-pursuit with different smoothing methods. Consistency of the estimators is shown under mild assumptions. The performance of the classical and robust procedures is compared in a simulation study under different contamination schemes.
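
In the classical finite-dimensional analogue of this setting, the first robust principal component maximizes a robust scale of the projected data over candidate directions. A minimal sketch in the style of Croux and Ruiz-Gazen, using the MAD as the robust scale and directions through the centered data points (illustrative assumptions; the paper works with functional data and adds smoothing):

```python
import numpy as np

def first_robust_pc(X):
    """Projection-pursuit robust PCA: return the direction maximizing the MAD
    of the projections. Assumes no data point coincides with the center."""
    center = np.median(X, axis=0)          # coordinatewise median as a simple center
    Z = X - center
    candidates = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    best, best_scale = None, -np.inf
    for a in candidates:
        proj = Z @ a
        scale = np.median(np.abs(proj - np.median(proj)))  # MAD of projections
        if scale > best_scale:
            best, best_scale = a, scale
    return best
```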