
Showing papers on "Outlier published in 2011"


Journal ArticleDOI
TL;DR: This paper presents a unified framework for the rigid and nonrigid point set registration problem in the presence of significant amounts of noise and outliers, and shows that the popular iterative closest point (ICP) method and several existing point set registration methods are closely related and can be reinterpreted meaningfully in this general framework.
Abstract: In this paper, we present a unified framework for the rigid and nonrigid point set registration problem in the presence of significant amounts of noise and outliers. The key idea of this registration framework is to represent the input point sets using Gaussian mixture models. Then, the problem of point set registration is reformulated as the problem of aligning two Gaussian mixtures such that a statistical discrepancy measure between the two corresponding mixtures is minimized. We show that the popular iterative closest point (ICP) method and several existing point set registration methods in the field are closely related and can be reinterpreted meaningfully in our general framework. Our instantiation of this general framework is based on the L2 distance between two Gaussian mixtures, which has a closed-form expression and in turn leads to a computationally efficient registration algorithm. The resulting registration algorithm exhibits inherent statistical robustness, has an intuitive interpretation, and is simple to implement. We also provide theoretical and experimental comparisons with other robust methods for point set registration.
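
The closed-form L2 distance exploited here follows from the Gaussian product integral, ∫N(x; m1, S1)N(x; m2, S2)dx = N(m1 - m2; 0, S1 + S2). A minimal numpy/scipy sketch of that computation (illustrative only, not the authors' code; a registration algorithm would minimize this quantity over transformation parameters applied to one mixture):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_cross_term(mus1, covs1, w1, mus2, covs2, w2):
    """Integral of the product of two Gaussian mixtures, via the identity
    int N(x; m1, S1) N(x; m2, S2) dx = N(m1 - m2; 0, S1 + S2)."""
    total = 0.0
    for m1, S1, a in zip(mus1, covs1, w1):
        for m2, S2, b in zip(mus2, covs2, w2):
            total += a * b * multivariate_normal.pdf(
                m1 - m2, mean=np.zeros(len(m1)), cov=S1 + S2)
    return total

def gmm_l2_distance_sq(mus1, covs1, w1, mus2, covs2, w2):
    # ||f - g||_2^2 = <f,f> - 2<f,g> + <g,g>, each term in closed form
    return (gmm_cross_term(mus1, covs1, w1, mus1, covs1, w1)
            - 2.0 * gmm_cross_term(mus1, covs1, w1, mus2, covs2, w2)
            + gmm_cross_term(mus2, covs2, w2, mus2, covs2, w2))
```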

909 citations


Journal ArticleDOI
TL;DR: An overview of several robust methods and outlier detection tools for univariate, low‐dimensional, and high‐dimensional data such as estimation of location and scatter, linear regression, principal component analysis, and classification are presented.
Abstract: When analyzing data, outlying observations cause problems because they may strongly influence the result. Robust statistics aims at detecting the outliers by searching for the model fitted by the majority of the data. We present an overview of several robust methods and outlier detection tools. We discuss robust procedures for univariate, low-dimensional, and high-dimensional data such as estimation of location and scatter, linear regression, principal component analysis, and classification. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011, 1, 73-79. DOI: 10.1002/widm.2
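
The simplest instance of the robust location/scatter idea surveyed here is replacing mean/SD outlier scores with median/MAD scores, so the score itself is not dragged toward the outliers it should flag. A minimal sketch (not from the paper):

```python
import numpy as np

def robust_zscores(x):
    """Robust z-scores: median for location, MAD for scale."""
    med = np.median(x)
    mad = 1.4826 * np.median(np.abs(x - med))  # consistent with SD at the normal
    return (x - med) / mad

x = np.array([9.8, 10.1, 9.9, 10.0, 10.2, 25.0])
print(np.abs(robust_zscores(x)) > 3.5)  # flags only the 25.0
```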

533 citations


Journal ArticleDOI
TL;DR: It is important that outlier loci are interpreted cautiously and error rates of various methods are taken into consideration in studies of adaptive molecular variation, especially when hierarchical structure is included.
Abstract: Genome scans with many genetic markers provide the opportunity to investigate local adaptation in natural populations and identify candidate genes under selection. In particular, SNPs are dense throughout the genome of most organisms and are commonly observed in functional genes, making them ideal markers to study adaptive molecular variation. This approach has become commonly employed in ecological and population genetics studies to detect outlier loci that are putatively under selection. However, there are several challenges to address with outlier approaches, including genotyping errors, underlying population structure and false positives, variation in mutation rate, and limited sensitivity (false negatives). In this study, we evaluated multiple outlier tests and their type I (false positive) and type II (false negative) error rates in a series of simulated data sets. Comparisons included simulation procedures (FDIST2, ARLEQUIN v.3.5 and BAYESCAN) as well as more conventional tools such as global F(ST) histograms. Of the three simulation methods, FDIST2 and BAYESCAN typically had the lowest type II error, BAYESCAN had the least type I error, and Arlequin had the highest type I and II error. High error rates in Arlequin with a hierarchical approach were partially because of confounding scenarios where patterns of adaptive variation were contrary to neutral structure; however, Arlequin consistently had the highest type I and type II error in all four simulation scenarios tested in this study. Given these results, it is important that outlier loci are interpreted cautiously and that the error rates of the various methods are taken into consideration in studies of adaptive molecular variation, especially when hierarchical structure is included.

459 citations


Journal ArticleDOI
TL;DR: Numerical results demonstrate that the proposed method can outperform robust rotational-invariant PCAs based on the L1 norm when outliers occur; it requires no assumption that the data have zero mean and can estimate the data mean during optimization.
Abstract: Principal component analysis (PCA) minimizes the mean square error (MSE) and is sensitive to outliers. In this paper, we present a new rotational-invariant PCA based on the maximum correntropy criterion (MCC). A half-quadratic optimization algorithm is adopted to compute the correntropy objective. At each iteration, the complex optimization problem is reduced to a quadratic problem that can be efficiently solved by a standard optimization method. The proposed method exhibits the following benefits: 1) it is robust to outliers through the mechanism of MCC, which is more theoretically solid than a heuristic rule based on MSE; 2) it requires no assumption that the data have zero mean and can estimate the data mean during optimization; and 3) its optimal solution consists of principal eigenvectors of a robust covariance matrix corresponding to the largest eigenvalues. In addition, kernel techniques are further introduced in the proposed method to deal with nonlinearly distributed data. Numerical results demonstrate that the proposed method can outperform robust rotational-invariant PCAs based on the L1 norm when outliers occur.
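
To convey the half-quadratic flavor of this approach, the sketch below alternates Gaussian (correntropy-induced) weights on per-sample reconstruction errors with a weighted mean/covariance eigenproblem. This is an illustrative reweighting scheme, not the paper's exact algorithm:

```python
import numpy as np

def mcc_pca(X, n_components, sigma=1.0, n_iter=20):
    """Half-quadratic-style robust PCA sketch with correntropy weights."""
    n, d = X.shape
    w = np.ones(n)
    for _ in range(n_iter):
        mu = (w[:, None] * X).sum(0) / w.sum()      # weighted mean (estimated, not assumed zero)
        Xc = X - mu
        C = (w[:, None] * Xc).T @ Xc / w.sum()      # weighted covariance
        vals, vecs = np.linalg.eigh(C)
        V = vecs[:, -n_components:]                 # eigenvectors of the largest eigenvalues
        err = ((Xc - Xc @ V @ V.T) ** 2).sum(1)     # per-sample reconstruction error
        w = np.exp(-err / (2 * sigma ** 2))         # outliers get exponentially small weight
    return mu, V
```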

327 citations


Book
22 Jun 2011
TL;DR: A monograph on minimum-distance statistical inference, covering statistical distances, robust estimation in discrete and continuous models, disparity-based hypothesis testing, inlier modification, weighted likelihood estimation, multinomial goodness-of-fit testing, and the density power divergence.
Abstract (table of contents):
- Introduction: General Notation; Illustrative Examples; Some Background and Relevant Definitions; Parametric Inference Based on the Maximum Likelihood Method; Hypothesis Testing by Likelihood Methods; Statistical Functionals and Influence Function; Outline of the Book
- Statistical Distances: Introduction; Distances Based on Distribution Functions; Density-Based Distances; Minimum Hellinger Distance Estimation: Discrete Models; Minimum Distance Estimation Based on Disparities: Discrete Models; Some Examples
- Continuous Models: Introduction; Minimum Hellinger Distance Estimation; Estimation of Multivariate Location and Covariance; A General Structure; The Basu-Lindsay Approach for Continuous Data; Examples
- Measures of Robustness and Computational Issues: The Residual Adjustment Function; The Graphical Interpretation of Robustness; The Generalized Hellinger Distance; Higher Order Influence Analysis; Higher Order Influence Analysis: Continuous Models; Asymptotic Breakdown Properties; The alpha-Influence Function; Outlier Stability of Minimum Distance Estimators; Contamination Envelopes; The Iteratively Reweighted Least Squares (IRLS)
- The Hypothesis Testing Problem: Disparity Difference Test: Hellinger Distance Case; Disparity Difference Tests in Discrete Models; Disparity Difference Tests: The Continuous Case; Power Breakdown of Disparity Difference Tests; Outlier Stability of Hypothesis Tests; The Two Sample Problem
- Techniques for Inlier Modification: Minimum Distance Estimation: Inlier Correction in Small Samples; Penalized Distances; Combined Distances; o-Combined Distances; Coupled Distances; The Inlier-Shrunk Distances; Numerical Simulations and Examples
- Weighted Likelihood Estimation: The Discrete Case; The Continuous Case; Examples; Hypothesis Testing; Further Reading
- Multinomial Goodness-of-Fit Testing: Introduction; Asymptotic Distribution of the Goodness-of-Fit Statistics; Exact Power Comparisons in Small Samples; Choosing a Disparity to Minimize the Correction Terms; Small Sample Comparisons of the Test Statistics; Inlier Modified Statistics; An Application: Kappa Statistics
- The Density Power Divergence: The Minimum L2 Distance Estimator; The Minimum Density Power Divergence Estimator; A Related Divergence Measure; The Censored Survival Data Problem; The Normal Mixture Model Problem; Selection of Tuning Parameters; Other Applications of the Density Power Divergence
- Other Applications: Censored Data; Minimum Hellinger Distance Methods in Mixture Models; Minimum Distance Estimation Based on Grouped Data; Semiparametric Problems; Other Miscellaneous Topics
- Distance Measures in Information and Engineering: Introduction; Entropies and Divergences; Csiszar's f-Divergence; The Bregman Divergence; Extended f-Divergences; Additional Remarks
- Applications to Other Models: Introduction; Preliminaries for Other Models; Neural Networks; Fuzzy Theory; Phase Retrieval; Summary

311 citations


Book ChapterDOI
28 Jul 2011
TL;DR: The R-package robCompositions (Templ et al., 2009) contains functions for robust statistical methods designed for compositional data, like principal component analysis, factor analysis, and discriminant analysis.
Abstract: Compositional data are data that contain only relative information (see, e.g., Aitchison 1986). Typical examples are data describing expenditures of persons on certain goods, or environmental data like the concentration of chemical elements in the soil. If all the compositional parts were available, they would sum up to a total, such as 100% in the case of geochemical concentrations. Frequently, practical data sets include outliers, and thus a robust analysis is desirable. The R-package robCompositions (Templ et al., 2009) contains functions for robust statistical methods designed for compositional data, like principal component analysis (Filzmoser et al., 2009a) (including the robust compositional biplot), factor analysis (Filzmoser et al., 2009b), and discriminant analysis (Filzmoser et al., 2009c). Furthermore, methods to improve the quality of compositional data sets are implemented, like outlier detection (Filzmoser et al., 2008) and imputation of missing values (Hron et al., 2010). The latter, based on a modified k-nearest neighbor algorithm and a model-based imputation, is also supported with measures of imputation quality and diagnostic plots. The usage of the package is illustrated with practical examples.
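
The standard first step behind such methods is to move compositions from the simplex to unconstrained coordinates with a log-ratio transform (Aitchison 1986), after which ordinary robust tools apply. A minimal sketch of the centered log-ratio (clr) transform (illustrative; the package itself works in R):

```python
import numpy as np

def clr(X):
    """Centered log-ratio transform: log of each part over the row's geometric mean.
    Rows of the result sum to zero; robust PCA etc. can then be applied."""
    X = np.asarray(X, dtype=float)
    g = np.exp(np.mean(np.log(X), axis=1, keepdims=True))  # geometric mean per row
    return np.log(X / g)

parts = np.array([[0.2, 0.3, 0.5],
                  [0.1, 0.1, 0.8]])
print(clr(parts))
```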

298 citations


Proceedings Article
01 Jan 2011
TL;DR: A unification of outlier scores provided by various outlier models is proposed, translating the arbitrary "outlier factors" to values in the range [0, 1] interpretable as the probability of a data object being an outlier; this unification is shown to facilitate enhanced ensembles for outlier detection.
Abstract: Outlier scores provided by different outlier models differ widely in their meaning, range, and contrast between different outlier models and, hence, are not easily comparable or interpretable. We propose a unification of outlier scores provided by various outlier models and a translation of the arbitrary "outlier factors" to values in the range [0, 1] interpretable as values describing the probability of a data object being an outlier. As an application, we show that this unification facilitates enhanced ensembles for outlier detection.
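
One normalization in this spirit treats the observed score sample as roughly Gaussian and maps each score through the error function to a [0, 1] "outlier probability", clipping the inlier half to zero. A minimal sketch under that assumption (the paper proposes several such transformations):

```python
import numpy as np
from math import erf, sqrt

def gaussian_unify(scores):
    """Map raw outlier scores to [0, 1] via Gaussian scaling."""
    s = np.asarray(scores, dtype=float)
    mu, sigma = s.mean(), s.std()
    z = (s - mu) / (sigma * sqrt(2))
    return np.array([max(0.0, erf(v)) for v in z])  # below-average scores map to 0

lof_like = [1.0, 1.1, 0.9, 1.0, 5.0]
print(gaussian_unify(lof_like))  # only the 5.0 gets a probability near 1
```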

255 citations


Journal ArticleDOI
TL;DR: In this paper, a thresholding-based iterative procedure for outlier detection (Θ-IPOD) is proposed that both identifies outliers and estimates regression coefficients, with variants based on hard and soft thresholding.
Abstract: This article studies the outlier detection problem from the standpoint of penalized regression. In the regression model, we add one mean shift parameter for each of the n data points. We then apply a regularization favoring a sparse vector of mean shift parameters. The usual L1 penalty yields a convex criterion, but fails to deliver a robust estimator. The L1 penalty corresponds to soft thresholding. We introduce a thresholding (denoted by Θ) based iterative procedure for outlier detection (Θ–IPOD). A version based on hard thresholding correctly identifies outliers on some hard test problems. We describe the connection between Θ–IPOD and M-estimators. Our proposed method has one tuning parameter with which to both identify outliers and estimate regression coefficients. A data-dependent choice can be made based on the Bayes information criterion. The tuned Θ–IPOD shows outstanding performance in identifying outliers in various situations compared with other existing approaches. In addition, Θ–IPOD is much ...
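
The core iteration is easy to state: alternate thresholding of the per-observation mean-shift parameters with least squares on the corrected response. A minimal sketch of this alternation (illustrative, not the authors' implementation; the paper tunes λ via BIC):

```python
import numpy as np

def ipod(X, y, lam, hard=True, n_iter=100):
    """Theta-IPOD-style iteration: gamma holds one mean-shift per observation."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    gamma = np.zeros_like(y)
    for _ in range(n_iter):
        r = y - X @ beta
        if hard:
            gamma = np.where(np.abs(r) > lam, r, 0.0)            # hard thresholding
        else:
            gamma = np.sign(r) * np.maximum(np.abs(r) - lam, 0)  # soft thresholding
        beta = np.linalg.lstsq(X, y - gamma, rcond=None)[0]      # refit on corrected y
    return beta, gamma  # nonzero gamma entries flag outliers
```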

250 citations


Journal ArticleDOI
TL;DR: Replacing compact subsets by measures, a notion of distance function to a probability distribution in ℝ^d is introduced and it is shown that it is possible to reconstruct offsets of sampled shapes with topological guarantees even in the presence of outliers.
Abstract: Data often comes in the form of a point cloud sampled from an unknown compact subset of Euclidean space. The general goal of geometric inference is then to recover geometric and topological features (e.g., Betti numbers, normals) of this subset from the approximating point cloud data. It appears that the study of distance functions allows one to address many of these questions successfully. However, one of the main limitations of this framework is that it does not cope well with outliers or with background noise. In this paper, we show how to extend the framework of distance functions to overcome this problem. Replacing compact subsets by measures, we introduce a notion of distance function to a probability distribution in ℝ^d. These functions share many properties with classical distance functions, which make them suitable for inference purposes. In particular, by considering appropriate level sets of these distance functions, we show that it is possible to reconstruct offsets of sampled shapes with topological guarantees even in the presence of outliers. Moreover, in settings where empirical measures are considered, these functions can be easily evaluated, making them of particular practical interest.
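
For an empirical measure, the distance to measure at a query point reduces to the root mean of squared distances to its k = ⌈mn⌉ nearest sample points, which is why it is easy to evaluate. A minimal sketch under that standard formula (not the authors' code):

```python
import numpy as np
from scipy.spatial import cKDTree

def distance_to_measure(cloud, queries, m=0.05):
    """Empirical distance-to-measure with mass parameter m: a few outliers
    cannot drag the value down, unlike the raw nearest-neighbor distance."""
    cloud, queries = np.atleast_2d(cloud), np.atleast_2d(queries)
    k = max(1, int(np.ceil(m * len(cloud))))
    dists, _ = cKDTree(cloud).query(queries, k=k)
    dists = np.asarray(dists).reshape(len(queries), k)
    return np.sqrt((dists ** 2).mean(axis=1))
```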

236 citations


Journal ArticleDOI
TL;DR: By exploiting the nonholonomic constraints of wheeled vehicles it is possible to use a restrictive motion model which allows us to parameterize the motion with only 1 point correspondence and results in the two most efficient algorithms for removing outliers: 1-point RANSAC and histogram voting.
Abstract: This paper presents a new method to estimate the relative motion of a vehicle from images of a single camera. The computational cost of the algorithm is limited only by the feature extraction and matching process, as the outlier removal and motion estimation steps take less than a fraction of a millisecond on a normal laptop computer. The biggest problem in visual motion estimation is data association; matched points contain many outliers that must be detected and removed for the motion to be accurately estimated. In the last few years, a well-established method for removing outliers has been the "5-point RANSAC" algorithm, which needs a minimum of 5 point correspondences to estimate the model hypotheses. Because of this, however, it can require up to several hundred iterations to find a set of points free of outliers. In this paper, we show that by exploiting the nonholonomic constraints of wheeled vehicles it is possible to use a restrictive motion model which allows us to parameterize the motion with only 1 point correspondence. Using a single feature correspondence for motion estimation is the lowest model parameterization possible and results in the two most efficient algorithms for removing outliers: 1-point RANSAC and histogram voting. To support our method we run many experiments on both synthetic and real data and compare the performance with a state-of-the-art approach. Finally, we show an application of our method to visual odometry by recovering a 3 km trajectory in a cluttered urban environment in real time.
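
The histogram-voting variant is especially simple: each correspondence votes for one yaw angle under the circular motion model, and the histogram mode gives the rotation while far-away voters are discarded as outliers. A sketch of that loop; the per-match angle expression below is an assumed form of the paper's circular-motion derivation and should be verified against the original before use:

```python
import numpy as np

def yaw_histogram_voting(pts1, pts2, n_bins=360, tol_deg=1.0):
    """1-point histogram voting: pts1, pts2 are (n, 2) arrays of matched
    normalized image coordinates; returns the yaw mode and an inlier mask."""
    # Assumed one-parameter estimate per correspondence (placeholder formula)
    thetas = -2.0 * np.arctan2(pts2[:, 1] - pts1[:, 1], pts2[:, 0] + pts1[:, 0])
    hist, edges = np.histogram(thetas, bins=n_bins, range=(-np.pi, np.pi))
    mode = 0.5 * (edges[np.argmax(hist)] + edges[np.argmax(hist) + 1])
    inliers = np.abs(thetas - mode) < np.deg2rad(tol_deg)  # tolerance is a free parameter
    return mode, inliers
```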

223 citations


Journal ArticleDOI
TL;DR: A comprehensive survey of well-known distance-based, density-based, and other techniques for outlier detection is presented and the techniques are compared; definitions of outliers are provided, and their detection based on supervised and unsupervised learning is discussed in the context of network anomaly detection.
Abstract: The detection of outliers has gained considerable interest in data mining with the realization that outliers can be the key discovery to be made from very large databases. Outliers arise due to various reasons such as mechanical faults, changes in system behavior, fraudulent behavior, human error and instrument error. Indeed, for many applications the discovery of outliers leads to more interesting and useful results than the discovery of inliers. Detection of outliers can lead to identification of system faults so that administrators can take preventive measures before they escalate. It is possible that anomaly detection may enable detection of new attacks. Outlier detection is an important anomaly detection approach. In this paper, we present a comprehensive survey of well-known distance-based, density-based and other techniques for outlier detection and compare them. We provide definitions of outliers and discuss their detection based on supervised and unsupervised learning in the context of network anomaly detection.

Proceedings ArticleDOI
11 Apr 2011
TL;DR: First results on the problem of structural outlier detection in massive network streams are provided, using a structural connectivity model in order to define outliers in graph streams and designing a reservoir sampling method to maintain structural summaries of the underlying network.
Abstract: A number of applications in social networks, telecommunications, and mobile computing create massive streams of graphs. In many such applications, it is useful to detect structural abnormalities which are different from the "typical" behavior of the underlying network. In this paper, we will provide first results on the problem of structural outlier detection in massive network streams. Such problems are inherently challenging because of the high volume of the underlying network stream, which also increases the computational burden on any approach. We use a structural connectivity model in order to define outliers in graph streams. In order to handle the sparsity problem of massive networks, we dynamically partition the network in order to construct statistically robust models of the connectivity behavior. We design a reservoir sampling method in order to maintain structural summaries of the underlying network. These structural summaries are designed in order to create robust, dynamic and efficient models for outlier detection in graph streams. We present experimental results illustrating the effectiveness and efficiency of our approach.
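
The reservoir-sampling building block referenced here is standard (Algorithm R): keep a uniform sample of k edges from a stream of unknown length in O(1) time per arrival. A minimal sketch; how the paper turns such samples into structural summaries is its own contribution:

```python
import random

def reservoir_edges(stream, k):
    """Uniform sample of k edges from a graph stream of unknown length."""
    reservoir = []
    for t, edge in enumerate(stream):   # edge = (u, v)
        if t < k:
            reservoir.append(edge)
        else:
            j = random.randint(0, t)    # inclusive draw in [0, t]
            if j < k:
                reservoir[j] = edge     # replace with probability k/(t+1)
    return reservoir
```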

Journal ArticleDOI
TL;DR: A new statistical approach to the problem of inlier-based outlier detection, i.e., finding outliers in the test set based on the training set consisting only of inliers, using the ratio of training and test data densities as an outlier score is proposed.
Abstract: We propose a new statistical approach to the problem of inlier-based outlier detection, i.e., finding outliers in the test set based on the training set consisting only of inliers. Our key idea is to use the ratio of training and test data densities as an outlier score. This approach is expected to have better performance even in high-dimensional problems since methods for directly estimating the density ratio without going through density estimation are available. Among various density ratio estimation methods, we employ the method called unconstrained least-squares importance fitting (uLSIF) since it is equipped with natural cross-validation procedures, allowing us to objectively optimize the value of tuning parameters such as the regularization parameter and the kernel width. Furthermore, uLSIF offers a closed-form solution as well as a closed-form formula for the leave-one-out error, so it is computationally very efficient and is scalable to massive datasets. Simulations with benchmark and real-world datasets illustrate the usefulness of the proposed approach.
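
The closed form mentioned in the abstract is θ = (H + λI)⁻¹h for a linear model of the density ratio over Gaussian kernels. A minimal numpy sketch under the usual choice of kernel centers at the training (inlier) points; the paper additionally tunes σ and λ with its closed-form leave-one-out criterion, omitted here:

```python
import numpy as np

def ulsif_outlier_scores(X_train, X_test, sigma=1.0, lam=0.1):
    """uLSIF sketch: fit w(x) ~ p_train(x)/p_test(x); a small estimated
    ratio on a test point marks it as an outlier."""
    def kernel(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    Phi_te = kernel(X_test, X_train)     # basis on test (denominator) samples
    Phi_tr = kernel(X_train, X_train)    # basis on training (numerator) samples
    H = Phi_te.T @ Phi_te / len(X_test)
    h = Phi_tr.mean(axis=0)
    theta = np.linalg.solve(H + lam * np.eye(len(X_train)), h)
    return Phi_te @ theta                # low score = likely outlier
```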

Proceedings ArticleDOI
11 Apr 2011
TL;DR: New algorithms for continuous outlier monitoring in data streams, based on sliding windows, are proposed; they reduce the required storage overhead, run faster than previously proposed techniques, and offer significant flexibility.
Abstract: Anomaly detection is considered an important data mining task, aiming at the discovery of elements (also known as outliers) that show significant diversion from the expected case. More specifically, given a set of objects the problem is to return the suspicious objects that deviate significantly from the typical behavior. As in the case of clustering, the application of different criteria leads to different definitions for an outlier. In this work, we focus on distance-based outliers: an object x is an outlier if there are fewer than k objects lying at distance at most R from x. The problem offers significant challenges when a stream-based environment is considered, where data arrive continuously and outliers must be detected on-the-fly. There are a few research works studying the problem of continuous outlier detection. However, none of these proposals meets the requirements of modern stream-based applications for the following reasons: (i) they demand a significant storage overhead, (ii) their efficiency is limited and (iii) they lack flexibility. In this work, we propose new algorithms for continuous outlier monitoring in data streams, based on sliding windows. Our techniques are able to reduce the required storage overhead, run faster than previously proposed techniques and offer significant flexibility. Experiments performed on real-life as well as synthetic data sets verify our theoretical study.
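
A naive reference implementation of the definition used here makes the baseline concrete: a point is a distance-based outlier if fewer than k other points of the current window lie within radius R. This brute-force sketch costs O(W²) per query; the paper's algorithms return the same answers incrementally with far less work and storage:

```python
import numpy as np
from collections import deque

class SlidingWindowOutliers:
    """Brute-force (R, k) distance-based outliers over a sliding window."""
    def __init__(self, W, R, k):
        self.window, self.R, self.k = deque(maxlen=W), R, k

    def insert(self, x):
        self.window.append(np.asarray(x, dtype=float))

    def outliers(self):
        pts = np.stack(self.window)
        d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        counts = (d <= self.R).sum(axis=1) - 1   # exclude the zero self-distance
        return np.where(counts < self.k)[0]      # indices of current outliers
```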

Journal ArticleDOI
TL;DR: An efficient and robust load forecasting method for prediction up to one day ahead is proposed; to deal with heteroscedasticity, a simple novel multivariate model that improves the quality of the forecast is also introduced.
Abstract: In this paper, the stochastic characteristics of the electric consumption in France are analyzed. It is shown that the load time series exhibit lasting abrupt changes in the stochastic pattern, termed breaks. The goal is to propose an efficient and robust load forecasting method for prediction up to one day ahead. To this end, two new robust procedures for outlier identification and suppression are developed. They are termed the multivariate ratio-of-medians-based estimator (RME) and the multivariate minimum-Hellinger-distance-based estimator (MHDE). The performance of the proposed methods has been evaluated on the French electric load time series in terms of execution times, ability to detect and suppress outliers, and forecasting accuracy. Their performances are compared with those of the robust methods proposed in the literature to estimate the parameters of SARIMA models and of the multiplicative double seasonal exponential smoothing. A new robust version of the latter is proposed as well. It is found that the RME approach outperforms all the other methods for "normal days" and presents several interesting properties such as good robustness, fast execution, simplicity, and easy online implementation. Finally, to deal with heteroscedasticity, we propose a simple novel multivariate modeling that improves the quality of the forecast.

Journal ArticleDOI
TL;DR: This work presents integer programming formulations of Vapnik's support vector machine (SVM) with the ramp loss and hard margin loss, and shows that SVM with these loss functions is a consistent estimator when used with certain kernel functions.
Abstract: In the interest of deriving classifiers that are robust to outlier observations, we present integer programming formulations of Vapnik's support vector machine (SVM) with the ramp loss and hard margin loss. The ramp loss allows a maximum error of 2 for each training observation, while the hard margin loss calculates error by counting the number of training observations that are in the margin or misclassified outside of the margin. SVM with these loss functions is shown to be a consistent estimator when used with certain kernel functions. In computational studies with simulated and real-world data, SVM with the robust loss functions ignores outlier observations effectively, providing an advantage over SVM with the traditional hinge loss when using the linear kernel. Despite the fact that training SVM with the robust loss functions requires the solution of a quadratic mixed-integer program (QMIP) and is NP-hard, while traditional SVM requires only the solution of a continuous quadratic program (QP), we are able to find good solutions and prove optimality for instances with up to 500 observations. Solution methods are presented for the new formulations that improve computational performance over industry-standard integer programming solvers alone.
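
To make the QMIP concrete, a sketch of the standard big-M mixed-integer form of the ramp-loss SVM (the paper's exact formulation may differ in details); setting z_i = 1 "writes off" observation i at fixed cost 2C, which is what caps any single outlier's influence:

```latex
\begin{aligned}
\min_{w,\,b,\,\xi,\,z}\quad & \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\bigl(\xi_i + 2 z_i\bigr)\\
\text{s.t.}\quad & y_i\,(w^{\top} x_i + b) \;\ge\; 1 - \xi_i - M z_i, \quad i = 1,\dots,n,\\
& 0 \le \xi_i \le 2, \qquad z_i \in \{0,1\}, \quad i = 1,\dots,n,
\end{aligned}
```

where M is a sufficiently large constant.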

Proceedings ArticleDOI
09 May 2011
TL;DR: The outlier-robust Kalman filter proposed is a discrete-time model for sequential data corrupted with non-Gaussian, heavy-tailed noise; efficient filtering and smoothing algorithms are presented which are straightforward modifications of the standard Kalman filter and Rauch-Tung-Striebel recursions and yet are much more robust to outliers and anomalous observations.
Abstract: We introduce a novel approach for processing sequential data in the presence of outliers. The outlier-robust Kalman filter we propose is a discrete-time model for sequential data corrupted with non-Gaussian and heavy-tailed noise. We present efficient filtering and smoothing algorithms which are straightforward modifications of the standard Kalman filter and Rauch-Tung-Striebel recursions and yet are much more robust to outliers and anomalous observations. Additionally, we present an algorithm for learning all of the parameters of our outlier-robust Kalman filter in a completely unsupervised manner. The potential of our approach is borne out in experiments with synthetic and real data.
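
To convey the intuition (though not the paper's learned-weight approach), a generic t-flavored robustification of the measurement update: a large Mahalanobis residual shrinks a weight, which inflates the effective measurement noise and hence damps the Kalman gain. A minimal sketch under that swapped-in scheme:

```python
import numpy as np

def robust_kf_update(x, P, y, H, R, nu=4.0):
    """One robustified measurement update; x: (n,), P: (n,n), y: (m,), H: (m,n)."""
    r = y - H @ x
    S = H @ P @ H.T + R
    d2 = float(r @ np.linalg.solve(S, r))   # squared Mahalanobis residual
    w = (nu + len(y)) / (nu + d2)           # weight < 1 for outlying residuals
    S_w = H @ P @ H.T + R / w               # inflate noise for suspect measurements
    K = P @ H.T @ np.linalg.inv(S_w)
    return x + K @ r, (np.eye(len(x)) - K @ H) @ P
```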

Journal ArticleDOI
TL;DR: The aim is to propose an automatic algorithm called IRMI for iterative model-based imputation using robust methods, accounting for the challenges mentioned, and to provide a software tool in R implementing the algorithm.

Proceedings ArticleDOI
11 Apr 2011
TL;DR: This work proposes a novel outlier ranking based on each object's deviation in a statistically selected set of relevant subspace projections, providing a selection of subspaces with high contrast to tackle the general challenge of detecting outliers hidden in subspaces of the data.
Abstract: Outlier mining is an important data analysis task to distinguish exceptional outliers from regular objects. For outlier mining in the full data space, there are well established methods which are successful in measuring the degree of deviation for outlier ranking. However, in recent applications traditional outlier mining approaches miss outliers that are hidden in subspace projections. In particular, outlier ranking approaches that measure deviation on all available attributes miss outliers deviating from their local neighborhood only in subsets of the attributes. In this work, we propose a novel outlier ranking based on each object's deviation in a statistically selected set of relevant subspace projections. This ensures that objects deviating in multiple relevant subspaces are found, while irrelevant projections showing no clear contrast between outliers and the remaining objects are excluded. Thus, we tackle the general challenge of detecting outliers hidden in subspaces of the data. We provide a selection of subspaces with high contrast and propose a novel ranking based on an adaptive degree of deviation in arbitrary subspaces. In thorough experiments on real and synthetic data we show that our approach outperforms competing outlier ranking approaches by detecting outliers in arbitrary subspace projections.

Journal ArticleDOI
TL;DR: A review of several statistical methods currently in use for outlier identification is presented, and their performances are compared theoretically for typical statistical distributions of experimental data, considering values derived from the distribution of extreme order statistics as reference terms.
Abstract: A review of several statistical methods that are currently in use for outlier identification is presented, and their performances are compared theoretically for typical statistical distributions of experimental data, considering values derived from the distribution of extreme order statistics as reference terms. A simple modification of a popular, broadly used method based upon the box-plot is introduced, in order to overcome a major limitation concerning sample size. Examples are presented applying the methods considered to two data sets: a historical one concerning the evaluation of an astronomical constant performed by a number of leading observatories, and a substantial database pertaining to an ongoing investigation on absolute measurement of gravity acceleration, exhibiting peculiar aspects concerning outliers. Some problems related to outlier treatment are examined, and the requirement of both statistical analysis and expert opinion for proper outlier management is underlined.
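
The box-plot rule that the paper modifies is Tukey's fences: flag points beyond [Q1 - k·IQR, Q3 + k·IQR] with k = 1.5. A minimal sketch of the classical rule (the paper's contribution is adapting it to the sample size, which the fixed k ignores):

```python
import numpy as np

def tukey_fences(x, k=1.5):
    """Classical box-plot outlier rule with multiplier k."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)  # True = flagged as outlier
```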

Journal ArticleDOI
TL;DR: An LS approach to generate IF-THEN rules for causal databases is proposed, considering both type-1 and interval type-2 fuzzy sets; the degree of reliability is especially valuable for finding the most reliable and representative rules.
Abstract: Linguistic summarization (LS) is a data mining or knowledge discovery approach to extract patterns from databases. Many authors have used this technique to generate summaries like "Most senior workers have high salary," which can be used to better understand and communicate about data; however, few of them have used it to generate IF-THEN rules like "IF X is large and Y is medium, THEN Z is small," which not only facilitate understanding and communication of data but can also be used in decision-making. In this paper, an LS approach to generate IF-THEN rules for causal databases is proposed. Both type-1 and interval type-2 fuzzy sets are considered. Five quality measures (the degrees of truth, sufficient coverage, reliability, outlier, and simplicity) are defined. Among them, the degree of reliability is especially valuable for finding the most reliable and representative rules, and the degree of outlier can be used to identify outlier rules and data for close-up investigation. An improved parallel coordinates approach for visualizing the IF-THEN rules is also proposed. Experiments on two datasets demonstrate our LS and rule visualization approaches. Finally, the relationships between our LS approach and the Wang-Mendel (WM) method, perceptual reasoning, and granular computing are pointed out.

Proceedings ArticleDOI
03 Oct 2011
TL;DR: A filtering method called PRISM is introduced that identifies and removes instances that should be misclassified; it achieves higher classification accuracy than the outlier detection methods and compares favorably with the noise reduction method.
Abstract: Appropriately handling noise and outliers is an important issue in data mining. In this paper we examine how noise and outliers are handled by learning algorithms. We introduce a filtering method called PRISM that identifies and removes instances that should be misclassified. We refer to the set of removed instances as ISMs (instances that should be misclassified). We examine PRISM and compare it against 3 existing outlier detection methods and 1 noise reduction technique on 48 data sets using 9 learning algorithms. Using PRISM, the classification accuracy increases from 78.5% to 79.8% on a set of 53 data sets and is statistically significant. In addition, the accuracy on the non-outlier instances increases from 82.8% to 84.7%. PRISM achieves a higher classification accuracy than the outlier detection methods and compares favorably with the noise reduction method.

Journal ArticleDOI
TL;DR: Novel fixed-lag and fixed-interval smoothing algorithms are developed that are robust to outliers simultaneously present in the measurements and in the state dynamics, relying on coordinate descent and the alternating direction method of multipliers.
Abstract: Coping with outliers contaminating dynamical processes is of major importance in various applications because mismatches from nominal models are not uncommon in practice. In this context, the present paper develops novel fixed-lag and fixed-interval smoothing algorithms that are robust to outliers simultaneously present in the measurements and in the state dynamics. Outliers are handled through auxiliary unknown variables that are jointly estimated along with the state based on the least-squares criterion that is regularized with the l1-norm of the outliers in order to effect sparsity control. The resultant iterative estimators rely on coordinate descent and the alternating direction method of multipliers, are expressed in closed form per iteration, and are provably convergent. Additional attractive features of the novel doubly robust smoother include: i) ability to handle both types of outliers; ii) universality to unknown nominal noise and outlier distributions; iii) flexibility to encompass maximum a posteriori optimal estimators with reliable performance under nominal conditions; and iv) improved performance relative to competing alternatives at comparable complexity, as corroborated via simulated tests.

Journal ArticleDOI
TL;DR: In this paper, the authors study the simultaneous recovery of the K fixed subspaces by minimizing the l_p-averaged distances of the sampled data points from any K subspaces.
Abstract: We assume i.i.d. data sampled from a mixture distribution with K components along fixed d-dimensional linear subspaces and an additional outlier component. For p>0, we study the simultaneous recovery of the K fixed subspaces by minimizing the l_p-averaged distances of the sampled data points from any K subspaces. Under some conditions, we show that if 0 < p <= 1, then the K subspaces can be recovered, or nearly recovered, whereas if K > 1 and p > 1, then the underlying subspaces cannot be recovered or even nearly recovered by l_p minimization. The results of this paper partially explain the successes and failures of the basic approach of l_p energy minimization for modeling data by multiple subspaces.

Proceedings ArticleDOI
21 Aug 2011
TL;DR: By combining simple but effective indexing and disk block accessing techniques, a sequential algorithm iOrca is developed that is up to an order of magnitude faster than the state-of-the-art.
Abstract: The problem of distance-based outlier detection is difficult to solve efficiently in very large datasets because of potential quadratic time complexity. We address this problem and develop sequential and distributed algorithms that are significantly more efficient than state-of-the-art methods while still guaranteeing the same outliers. By combining simple but effective indexing and disk block accessing techniques, we have developed a sequential algorithm iOrca that is up to an order of magnitude faster than the state-of-the-art. The indexing scheme is based on sorting the data points in order of increasing distance from a fixed reference point and then accessing those points based on this sorted order. To speed up the basic outlier detection technique, we develop two distributed algorithms (DOoR and iDOoR) for modern distributed multi-core clusters of machines, connected on a ring topology. The first algorithm passes data blocks from each machine around the ring, incrementally updating the nearest neighbors of the points passed. By maintaining a cutoff threshold, it is able to prune a large number of points in a distributed fashion. The second distributed algorithm extends this basic idea with the indexing scheme discussed earlier. In our experiments, both distributed algorithms exhibit significant improvements compared to the state-of-the-art distributed method [13].
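
The reference-point index works because |d(x, ref) - d(q, ref)| lower-bounds d(x, q) by the triangle inequality, so a neighbor search can walk the reference-sorted order outward and stop early. A sketch of that inner step (illustrative, not the authors' code; a driver would scan points in the same sorted order, keep the top-n scores, and pass their minimum as `cutoff`):

```python
import heapq
import numpy as np

def knn_score(Xs, ds, i, k, cutoff):
    """k-NN distance of point i. Xs: points sorted by distance-to-reference;
    ds: those reference distances. Returns None if the score provably
    cannot exceed cutoff (point pruned). Assumes len(Xs) > k."""
    heap = []                          # max-heap (negated) of the k best distances
    lo, hi = i - 1, i + 1
    while lo >= 0 or hi < len(Xs):
        # expand toward the side whose reference-distance gap (lower bound) is smaller
        if hi >= len(Xs) or (lo >= 0 and ds[i] - ds[lo] <= ds[hi] - ds[i]):
            j, lo = lo, lo - 1
        else:
            j, hi = hi, hi + 1
        bound = abs(ds[j] - ds[i])
        if len(heap) == k and bound >= -heap[0]:
            break                      # no remaining point can be closer
        d = np.linalg.norm(Xs[j] - Xs[i])
        if len(heap) < k:
            heapq.heappush(heap, -d)
        elif d < -heap[0]:
            heapq.heapreplace(heap, -d)
        if len(heap) == k and -heap[0] <= cutoff:
            return None                # provably not a top-n outlier
    return -heap[0]
```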

Proceedings ArticleDOI
19 Feb 2011
TL;DR: A clustering-based method to capture outliers is proposed: the K-means algorithm divides the data set into clusters, points near cluster centroids are pruned, a distance-based outlier score is computed for the remaining points, and the top n points with the highest scores are declared outliers.
Abstract: In this paper we propose a clustering-based method to capture outliers. We apply the K-means clustering algorithm to divide the data set into clusters. The points which are lying near the centroid of the cluster are not probable candidates for outliers, and we can prune such points from each cluster. Next, we calculate a distance-based outlier score for the remaining points. The computation needed to calculate the outlier score reduces considerably due to this pruning. Based on the outlier score, we declare the top n points with the highest score as outliers. The experimental results using real data sets demonstrate that, even though the number of computations is smaller, the proposed method performs better than the existing method.
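
A compact sketch of this two-phase idea (illustrative assumptions: k = 5 for the score, and a fixed fraction of centroid-near points pruned; the paper's pruning rule may differ):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_pruned_outliers(X, n_clusters=5, prune_frac=0.7, top_n=10, k=5):
    """Phase 1: prune points nearest their centroid. Phase 2: score survivors
    by k-th nearest-neighbor distance and return the top_n indices."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
    d_cent = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    keep = np.argsort(d_cent)[int(prune_frac * len(X)):]  # survivors: far from centroids
    scores = []
    for i in keep:
        d = np.linalg.norm(X - X[i], axis=1)
        scores.append(np.sort(d)[k])   # index k skips the zero self-distance
    return keep[np.argsort(scores)[-top_n:]]
```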

Proceedings Article
01 Dec 2011
TL;DR: An efficient missing value imputation technique called DMI is presented, which makes use of a decision tree and an expectation maximization (EM) algorithm, based on the observation that the correlations among attributes within a horizontal partition of a data set can be higher than the correlations over the whole data set.
Abstract: Data pre-processing plays a vital role in data mining for ensuring good quality of data. In general, data pre-processing tasks include imputation of missing values, identification of outliers, smoothing of noisy data and correction of inconsistent data. In this paper, we present an efficient missing value imputation technique called DMI, which makes use of a decision tree and an expectation maximization (EM) algorithm. We argue that the correlations among attributes within a horizontal partition of a data set can be higher than the correlations over the whole data set. For some existing algorithms such as EM-based imputation (EMI), the accuracy of imputation is expected to be better for a data set having higher correlations than for a data set having lower correlations. Therefore, our technique (DMI) applies EMI on various horizontal segments (of a data set) where correlations among attributes are high. We evaluate DMI on two publicly available natural data sets by comparing its performance with the performance of EMI. We use various patterns of missing values, each having different missing ratios of up to 10%. Several evaluation criteria such as coefficient of determination (R2), index of agreement (d2) and root mean squared error (RMSE) are used. Our initial experimental results indicate that DMI performs significantly better than EMI.

Journal ArticleDOI
TL;DR: A censoring scheme that iteratively updates the outlier/target maps for target detection in synthetic aperture radar (SAR) images is proposed, and its effectiveness is successfully demonstrated.
Abstract: In this letter, a censoring scheme that iteratively updates the outlier/target maps for target detection in synthetic aperture radar (SAR) images is proposed. For each iteration, any pixels that are indicated by the outlier map as outliers are rejected (censored out) from the clutter estimation. The resulting detected target map is then used as the new outlier map for the next iteration. This procedure is continued until there is no change to the target map, which is then output as the final detection result. The proposed scheme is generically applicable for target detection in both single-channel and multichannel SAR images. In our experiment, in particular, we tested the proposed method on both single-channel and polarimetric SAR data, and its effectiveness was successfully demonstrated.

Journal ArticleDOI
TL;DR: A generative model-based Bayesian hierarchical clustering algorithm for microarray time series that employs Gaussian process regression to capture the structure of the data and highlights the importance of including replicate information, which is found to enable the discrimination of additional distinct expression profiles.
Abstract: Background: Post-genomic molecular biology has resulted in an explosion of data, providing measurements for large numbers of genes, proteins and metabolites. Time series experiments have become increasingly common, necessitating the development of novel analysis tools that capture the resulting data structure. Outlier measurements at one or more time points present a significant challenge, while potentially valuable replicate information is often ignored by existing techniques.

Journal ArticleDOI
TL;DR: In this article, robust estimators for principal components are considered by adapting the projection pursuit approach to the functional data setting, which combines robust projection-pursuit with different smoothing methods.
Abstract: In many situations, data are recorded over a period of time and may be regarded as realizations of a stochastic process. In this paper, robust estimators for the principal components are considered by adapting the projection-pursuit approach to the functional data setting. Our approach combines robust projection-pursuit with different smoothing methods. Consistency of the estimators is shown under mild assumptions. The performance of the classical and robust procedures is compared in a simulation study under different contamination schemes.
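
In the classical finite-dimensional analogue of this setting, the first robust principal component maximizes a robust scale of the projected data over candidate directions. A minimal sketch in the style of Croux and Ruiz-Gazen, using the MAD as the robust scale and directions through the centered data points (illustrative assumptions; the paper works with functional data and adds smoothing):

```python
import numpy as np

def first_robust_pc(X):
    """Projection-pursuit robust PCA: return the direction maximizing the MAD
    of the projections. Assumes no data point coincides with the center."""
    center = np.median(X, axis=0)          # coordinatewise median as a simple center
    Z = X - center
    candidates = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    best, best_scale = None, -np.inf
    for a in candidates:
        proj = Z @ a
        scale = np.median(np.abs(proj - np.median(proj)))  # MAD of projections
        if scale > best_scale:
            best, best_scale = a, scale
    return best
```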